Big Data: Web Scrapers

Web scraping is a form of data scraping used to extract data from websites.

Applications

Data source:
Collects data when a data service is unavailable
Crawler & Indexer:
Supports web data search and data integration
Test:
Simulates client behavior for testing

Basic Flow

1 Retrieval: get the web page
2 Analysis: raw HTML file -> extract usable data

Goal

Parse the HTML into structured data, then extract the data we want.

Methods
1 XPath: XML Path Language is a query language for selecting nodes from an XML document

/bookstore/book[price>35.00]/title: selects all title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00

//title[@lang='en']: selects all title elements that have a "lang" attribute with a value of "en"

/bookstore/book[position()<3]: selects the first two book elements that are children of the bookstore element

You must know the position or the structure of the target elements to use XPath.
Package in Python: from lxml import html
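The three XPath expressions above can be tried out with lxml. This is a minimal sketch: the bookstore XML is made-up sample data, and it uses lxml's `etree` module rather than `html` because the sample is XML, not HTML.

```python
from lxml import etree

# Made-up sample document matching the structure the expressions assume.
XML = """<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>45.00</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# Titles of books whose price is greater than 35.00.
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

# Text of all title elements whose lang attribute is "en".
english_titles = root.xpath("//title[@lang='en']/text()")

# The first two book elements under bookstore.
first_two = root.xpath("/bookstore/book[position()<3]")
first_two_titles = [book.findtext("title") for book in first_two]
```

Appending `/text()` returns the text content instead of the element nodes, which is usually what a scraper wants to store.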

2 Regular Expression:
Package in Python: import re
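As a rough illustration of the regular expression approach, the sketch below pulls links and a price out of an HTML fragment. The fragment and both patterns are invented for this example; patterns like these only work when the markup is very regular.

```python
import re

# Made-up HTML fragment for illustration.
HTML = (
    '<a href="/page/1">First</a> '
    '<a href="/page/2">Second</a> '
    '<span class="price">$12.50</span>'
)

# Capture the href value and the link text of simple anchor tags.
LINK_RE = re.compile(r'<a\s+href="([^"]+)">([^<]*)</a>')

# Capture a dollar amount with optional cents.
PRICE_RE = re.compile(r'\$(\d+(?:\.\d{2})?)')

links = LINK_RE.findall(HTML)
prices = [float(p) for p in PRICE_RE.findall(HTML)]
```

Regular expressions do not need to know the document structure, but they break easily when attributes are reordered or tags are nested, so they are often combined with a parser rather than used alone.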

3 Beautiful Soup:
Package in Python: from bs4 import BeautifulSoup
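A short sketch of how Beautiful Soup might be used is shown here. The HTML, the `headline` id and the `item` class are all invented for the example, and the parser is Python's built-in "html.parser" so that no extra backend is needed.

```python
from bs4 import BeautifulSoup

# Made-up page content for illustration.
HTML = """
<html><body>
  <h1 id="headline">Daily deals</h1>
  <ul>
    <li class="item"><a href="/item/1">Kettle</a></li>
    <li class="item"><a href="/item/2">Toaster</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Look up a single element by tag name and id.
headline = soup.find("h1", id="headline").get_text(strip=True)

# Use a CSS selector to collect (text, href) pairs for every item link.
items = [(a.get_text(strip=True), a["href"]) for a in soup.select("li.item a")]
```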

Simple Python Scraper

Basic flow:
1 Request the HTML content from the web server: requests
2 Parse the HTML into structured data: lxml
3 Use XPath and regex to extract useful information: lxml, re
4 Store the information: pymongo
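The four steps above might be wired together as in the sketch below. The page layout (`div` elements with class `book` containing an `h2` and a `span` with class `price`), the function names and the field names are assumptions made for illustration. The `collection` argument is expected to behave like a pymongo collection, meaning it has an `insert_many` method.

```python
import re

import requests
from lxml import html

# Matches the first integer or decimal number in a string such as "$39.95".
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def parse_books(page_source):
    """Steps 2 and 3: parse the HTML, then extract fields with XPath and regex."""
    tree = html.fromstring(page_source)
    books = []
    for node in tree.xpath("//div[@class='book']"):
        title = node.xpath("string(.//h2)").strip()
        price_text = node.xpath("string(.//span[@class='price'])")
        match = PRICE_RE.search(price_text)
        price = float(match.group(1)) if match else None
        books.append({"title": title, "price": price})
    return books


def scrape(url, collection, session=None):
    """Steps 1 and 4: download the page, then store the extracted records."""
    session = session or requests.Session()
    response = session.get(url, timeout=10)
    response.raise_for_status()
    books = parse_books(response.text)
    if books:
        collection.insert_many(books)
    return books
```

With a MongoDB server running locally, the collection could be obtained with `pymongo.MongoClient()["scraping"]["books"]`, where the database and collection names are again placeholders.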

Integration with RabbitMQ

Why integrate with a queue?
1 Store scraping tasks temporarily
2 Keep the scraper running continuously
3 Let the scraper feed itself with new tasks
4 Coordinate multiple scrapers working together
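The idea of a scraper that feeds itself can be sketched without a broker. The example below uses Python's in-process `queue.Queue` as a stand-in for a RabbitMQ queue, so it only illustrates the control flow. In a real deployment the `get` and `put` calls would be replaced by consuming from and publishing to RabbitMQ through a client library. `run_scraper` and `fetch_links` are hypothetical names.

```python
import queue


def run_scraper(tasks, fetch_links, max_pages=100):
    """Take URLs from the task queue and push newly found links back onto it."""
    seen = set()
    visited = []
    while len(visited) < max_pages:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            break  # No pending tasks left.
        if url in seen:
            continue  # Skip URLs that were queued more than once.
        seen.add(url)
        visited.append(url)
        # The scraper feeds itself: every discovered link becomes a new task.
        for link in fetch_links(url):
            if link not in seen:
                tasks.put(link)
    return visited


# Usage with a made-up link graph instead of real pages.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
tasks = queue.Queue()
tasks.put("a")
order = run_scraper(tasks, lambda url: LINKS.get(url, []))
```

Because the queue lives outside the scraper's logic, several scraper processes could share one task queue, which is the coordination benefit listed above.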

Avoid blocking

Scraping is a gray area:
Data is valuable to its owner;
Scraping puts heavy load on the target website;
Scraping is usually unauthorized.
Methods:
1 Limit scraping rate
2 Follow the website's robots.txt
3 User Agent: a string the browser sends to tell a website about the browser and operating system
4 Proxy
5 TOR (The Onion Router): Tor directs Internet traffic through a free, worldwide, volunteer network consisting of more than seven thousand relays. Packages: socks, socket
6 Captcha
7 One-click Captcha:
Criteria:
1) Behavior prior to clicking
2) Cursor movement (path/acceleration) on PC
3) Tests run against your browser
4) Cookie
5) History
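Methods 1 to 4 in the list above (rate limiting, robots.txt, User Agent and proxy) can be combined in one small helper. This is a sketch under several assumptions: it uses the standard library's `urllib` instead of requests to stay self-contained, and the User Agent string, class name and function names are made up for illustration.

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "example-scraper/0.1"  # Hypothetical identifier.


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def load_robots(robots_txt):
    """Build a robots.txt parser from the file's text content."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def make_opener(proxy_url=None):
    """Create an opener that sends our User Agent, optionally through a proxy."""
    handlers = []
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}
        handlers.append(urllib.request.ProxyHandler(proxies))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def polite_fetch(url, opener, robots, limiter):
    """Fetch a URL only if robots.txt allows it, waiting between requests."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    limiter.wait()
    with opener.open(url, timeout=10) as response:
        return response.read()
```

A caller would typically download the site's robots.txt once, create one `RateLimiter` per target host, and reuse both for every request to that host.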