HTML Scraper
The scraper has three components, http → split → extract, executed in that order. You make an http request to fetch a page, split the page into different sections, and finally extract the data from each section using a custom parser.
The best part about this scraper is that you can chain these actions into whatever sequence of steps you need.
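To make this concrete, a single pass through the pipeline might look like the sketch below. It assumes the chainable API used in the full example at the end of this page; the '.item' selector and the title key are made up for illustration.

```coffeescript
# Minimal single pass: fetch, split, extract.
# The '.item' selector and the `title` key are hypothetical.
Scraper = require 'HTML-Scraper'

Scraper
  .http 'url'                         # fetch the page named by the `url` key
  .split '.item'                      # one section per matching element
  .extract
    title: (doc) -> doc.attr 'title'  # custom parser for each section
  .launch url: 'http://example.com/list.html'
  .then (val) -> console.log val
  .done()
```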
Order of execution
http → split → extract → http → split … and so on.
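A key produced by one extract step can name the URL for the next http step, which is what keeps the chain going. Here is a sketch under that assumption, using the same API as the example below; the link and title keys and the selectors are invented.

```coffeescript
# Two-level chain: the `link` key extracted at level 1 feeds the
# http request at level 2. Selectors and key names are assumptions.
Scraper = require 'HTML-Scraper'

Scraper
  .http 'url'                       # level 1: fetch the listing page
  .split 'ul.results li a'          # hypothetical selector for each result
  .extract
    link: (doc) -> doc.attr 'href'  # collect one URL per section
  .http 'link'                      # level 2: follow each extracted URL
  .extract
    title: ($) -> ($ 'h1').text()   # parse each fetched page
  .launch url: 'http://example.com/results.html'
  .then (val) -> console.log val
  .done()
```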
Example
Say I want to extract all the information about students who got admitted to the University of Southern California, Los Angeles. I would do it as follows:
- Make an http request to this page: http://edulix.com/universityfinder/university_of_southern_california.
- The page consists of multiple anchor tags, each containing a link to a student's profile. Split the page on those anchors, extract the href from each, and follow it with another http request, as in the code below.
```coffeescript
# Standard require
Scraper = require 'HTML-Scraper'

# Specify the key to read urls
scraper = Scraper
  .http 'url'
  .split '.archive a'
  .extract
    href: (doc) -> "http://tusharm.com" + doc.attr 'href'
    text: (doc) -> doc.html()
  .http 'href'
  .extract
    http: ($) -> ($ 'a:nth-child(2)').attr 'href'

# Launch with base params
promise = scraper.launch url: 'http://tusharm.com/projects.html'

# Returns a promise
promise
  .then (val) -> console.log val
  .done()
```
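Reading the chain top to bottom: the first extract records an href (and the section's HTML) for every anchor inside .archive, the second http stage fetches each of those hrefs, and the final extract pulls one more link out of each fetched page. launch seeds the chain with the initial url key and, per the comments above, returns a promise that resolves with the accumulated results once every request has finished.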