Web scraping is a form of data scraping used to extract data from websites.
Applications
Data source:
Used when no data service is available
Crawler & Indexer:
Provides web data search and data integration
Test:
Simulates client behavior for testing
Basic Flow
1 Retrieval: get the web page
2 Analysis: raw HTML file -> extract usable data
Goal
Parse the HTML into structured data, then extract the data we want.
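As a rough illustration of this goal, the sketch below turns raw HTML into structured (href, text) pairs using only Python's standard library. The `LinkExtractor` class and the `RAW_HTML` sample are hypothetical names invented for this example, and the sample stands in for a page returned by the retrieval step.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content standing in for the result of the retrieval step.
RAW_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/book/1">Learning XML</a>
  <a href="/book/2">Everyday Italian</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(RAW_HTML)
print(parser.links)
```

The methods listed next do the same job with less hand-written code.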
Methods
1 XPath: XML Path Language is a query language for selecting nodes from an XML document.
/bookstore/book[price>35.00]/title: selects all title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
//title[@lang='en']: selects all title elements that have a "lang" attribute with a value of "en"
/bookstore/book[position()<3]: selects the first two book elements that are children of the bookstore element
You must know the position or structure of the target nodes to use XPath.
Package in Python: lxml
2 Regular Expression:
Package in Python: import re
3 Beautiful Soup:
Package in Python: from bs4 import BeautifulSoup
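To compare the three methods side by side, here is a minimal sketch that runs each of them on the same small document. The `BOOKSTORE` sample is made up for this example so that the bookstore expressions above have something to match. The third XPath expression is written as `book[position()<3]`, with the predicate attached directly to the step. Beautiful Soup is used here with the standard library's `html.parser` backend, which is one of several parsers it supports.

```python
import re

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample document matching the bookstore expressions above.
BOOKSTORE = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>39.95</price></book>
</bookstore>
""".strip()

# 1 XPath (lxml): select nodes by their position in the document structure.
root = etree.fromstring(BOOKSTORE)
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")
english = root.xpath("//title[@lang='en']/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")

# 2 Regular expression (re): match text patterns, ignoring the structure.
prices = [float(p) for p in re.findall(r"<price>(\d+\.\d+)</price>", BOOKSTORE)]

# 3 Beautiful Soup (bs4): search the parsed tree by tag name and attributes.
soup = BeautifulSoup(BOOKSTORE, "html.parser")
french = [t.get_text() for t in soup.find_all("title", attrs={"lang": "fr"})]

print(expensive, english, first_two, prices, french, sep="\n")
```

XPath and Beautiful Soup both rely on the document structure, while the regular expression only sees text, so it is best kept for small, regular fragments such as the price values here.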
Simple Python Scraper
Basic flow:
1 Request the web server to retrieve the HTML content: requests
2 Parse the HTML into structured data: lxml
3 Use XPath and regex to extract useful information: lxml, re
4 Store the information: pymongo
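The four steps above can be sketched as follows. This is an illustrative outline, not the original scraper: the page layout (`div class="book"`, `h2`, `span class="price"`), the URL, and the database and collection names are all assumptions made up for the example. pymongo is imported inside `run` only so that the parsing functions can be tried without MongoDB installed.

```python
import re

import requests
from lxml import html


def fetch(url):
    """Step 1: request the page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract(page_source):
    """Steps 2 and 3: parse the HTML, then pull fields out with XPath and regex."""
    tree = html.fromstring(page_source)
    records = []
    # Assumed layout: one <div class="book"> per item (hypothetical).
    for node in tree.xpath("//div[@class='book']"):
        title = node.xpath("string(.//h2)").strip()
        price_text = node.xpath("string(.//span[@class='price'])")
        match = re.search(r"\d+(?:\.\d+)?", price_text)  # first number in the text
        price = float(match.group()) if match else None
        records.append({"title": title, "price": price})
    return records


def store(collection, records):
    """Step 4: write the records to a MongoDB collection."""
    if records:
        collection.insert_many(records)


def run(url, mongo_uri="mongodb://localhost:27017"):
    """Run all four steps; needs network access and a running MongoDB server."""
    from pymongo import MongoClient

    client = MongoClient(mongo_uri)
    collection = client["scraper"]["books"]  # hypothetical database/collection
    store(collection, extract(fetch(url)))
```

Calling `run("https://example.com/books")` would execute the whole pipeline; `extract` can also be called on its own with any HTML string.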
Integration with RabbitMQ
Why integrate with a queue?
1 Store scraping tasks temporarily
2 Keep the scraper running continuously
3 Let the scraper feed itself
4 Coordinate multiple scrapers working together
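The idea of a self-feeding scraper can be sketched without a broker. The example below uses Python's in-process `queue.Queue` as a stand-in for a RabbitMQ queue, and a hypothetical in-memory `PAGES` dictionary instead of real HTTP requests. With RabbitMQ, the `put` and `get_nowait` calls would be replaced by publishing and consuming messages through a client library such as pika, and several scraper processes could consume from the same queue.

```python
import queue
import re

# Hypothetical in-memory "site" (URL -> HTML), standing in for real requests.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": "no links here",
    "/c": '<a href="/">home</a>',
}


def scrape(url):
    """Return the links found on a page (fetching is faked via PAGES)."""
    return re.findall(r'href="([^"]+)"', PAGES.get(url, ""))


def worker(tasks, seen, visited):
    """Consume tasks until the queue is empty, queueing newly found URLs."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        visited.append(url)
        for link in scrape(url):
            if link not in seen:   # skip pages that are already scheduled
                seen.add(link)
                tasks.put(link)    # the scraper feeds itself
        tasks.task_done()


tasks = queue.Queue()
seen = {"/"}
visited = []
tasks.put("/")                     # seed task
worker(tasks, seen, visited)
print(visited)
```

The `seen` set keeps the scraper from looping forever on pages that link back to each other; in a multi-process setup that bookkeeping would need to live in shared storage.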
Avoid blocking
Scraping is a gray area:
The data is valuable to its owner;
Scraping puts heavy load on the target website;
Scraping is usually unauthorized.
Methods:
1 Limit the scraping rate
2 Follow the website's robots.txt
3 User Agent: a request header in which the browser tells the website about itself, such as the browser and operating system
4 Proxy
5 Tor (The Onion Router): Packages: socks, socket
Tor directs Internet traffic through a free, worldwide, volunteer network consisting of more than seven thousand relays.
6 Captcha
7 One-click Captcha:
Criteria:
1) Behavior prior to clicking
2) Cursor movement (path/acceleration) - PC
3) Tests against your browser
4) Cookie
5) History
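The first three methods in the list above (rate limiting, robots.txt, User Agent) can be combined in a few lines. The sketch below is only illustrative: the `USER_AGENT` string, the `ROBOTS_TXT` content, and the `RateLimiter` and `fetch` helpers are hypothetical. In practice the robots.txt file would be downloaded from the target site, and `get` would be something like `requests.get`.

```python
import time
import urllib.robotparser

USER_AGENT = "example-scraper/0.1"   # hypothetical identifier sent to the site

# Hypothetical robots.txt content; normally downloaded from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None            # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Use the site's Crawl-delay if it declares one, else fall back to 1 second.
limiter = RateLimiter(robots.crawl_delay(USER_AGENT) or 1.0)


def fetch(url, get):
    """Call get(url, ...) only if robots.txt allows it, at a limited rate."""
    if not robots.can_fetch(USER_AGENT, url):
        return None                  # disallowed by robots.txt
    limiter.wait()
    return get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```

Passing the HTTP function in as `get` keeps the sketch free of third-party imports; `fetch(url, requests.get)` would be a typical call.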