sqrap
A configurable web scraper that can map information from a website using a JSON schema.
Installation
npm i sqrap
Usage
The sqrap module exports a function that accepts two parameters: the URL of the resource to extract information from, and a configuration object. The configuration object should contain the custom selectors used to extract values from that resource and, optionally, HTTP options based on the request module.
Selectors
Selectors map elements of a page to properties that you define. For each property you can specify a set of selectors. The names of the properties are up to you.
e.g.
    const selectors = {
      author: [
        { selector: 'span[itemprop="author"] > span[itemprop="name"]', text: true }
      ],
      title: [
        { selector: 'h1', text: 'true' }
      ],
      text: [
        { selector: 'h1', text: true },
        { selector: '.field-name-summary', text: true },
        { selector: 'div[itemprop="articleBody"]', text: true }
      ],
      image: [
        { selector: 'meta[property="og:image"]', attribute: 'content' }
      ],
      htmlText: [
        { selector: 'div.group-left', html: true }
      ]
    };
Every selector item has two properties. The first is always selector, and the second is one of text, attribute and html.
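The rule above can be checked before a scrape runs. The following is a minimal sketch and is not part of sqrap: `validateSelectorItem` is a hypothetical helper name, and it assumes each selector item is a plain object.

```javascript
// Hypothetical helper, not part of sqrap: checks the shape of one selector item.
const MODES = ['text', 'attribute', 'html'];

function validateSelectorItem(item) {
  if (item === null || typeof item !== 'object') {
    return 'item must be an object';
  }
  if (typeof item.selector !== 'string' || item.selector.length === 0) {
    return 'item needs a non-empty "selector" string';
  }
  // Exactly one extraction mode should accompany the selector.
  const present = MODES.filter((mode) => mode in item);
  if (present.length !== 1) {
    return 'item needs exactly one of: ' + MODES.join(', ');
  }
  return null; // null means the item is well formed
}

console.log(validateSelectorItem({ selector: 'h1', text: true }));
console.log(validateSelectorItem({ selector: 'h1' }));
```

Returning a message instead of throwing lets a caller collect every problem in a schema before reporting them.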
text
It will extract all the text included in the selected DOM element.
attribute
It will extract the value of an attribute of the selected DOM element. The value of attribute is the name of the attribute to read, for example 'content'.
html
It will extract all the html included in the selected DOM element.
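To make the three modes concrete, here is a minimal sketch of how a selector item might be turned into a value. It is an illustration under assumptions, not sqrap's actual implementation: `extractValue` is a hypothetical name, and the element is assumed to expose jQuery-style `text()`, `attr(name)` and `html()` methods. A stub element stands in for a real DOM node so the snippet runs on its own.

```javascript
// Hypothetical sketch: picks the extraction mode from a selector item.
// `element` is assumed to expose jQuery-style text(), attr(name) and html().
function extractValue(element, item) {
  if (item.attribute) {
    return element.attr(item.attribute); // value of the named attribute
  }
  if (item.html) {
    return element.html(); // inner markup of the element
  }
  if (item.text) {
    return element.text(); // text content only
  }
  return undefined; // no extraction mode given
}

// Stub standing in for a selected element such as <h1 class="t">Hi <b>there</b></h1>.
const stubElement = {
  text: () => 'Hi there',
  html: () => 'Hi <b>there</b>',
  attr: (name) => ({ class: 't' }[name]),
};

console.log(extractValue(stubElement, { selector: 'h1', text: true }));
console.log(extractValue(stubElement, { selector: 'h1', html: true }));
console.log(extractValue(stubElement, { selector: 'h1', attribute: 'class' }));
```

With the stub above, the text mode drops the inner `<b>` tag while the html mode keeps it, which is the practical difference between the two.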
Example usage
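    'use strict';

    const sqrap = require('sqrap');

    const selectors = {
      logo: [
        { selector: '#hplogo', attribute: 'src' }
      ],
      title: [
        { selector: 'title', text: 'true' }
      ],
      content: [
        { selector: '#SIvCob', html: true }
      ]
    };

    const url = 'http://www.google.com';

    // Call sqrap with the url and a configuration object that holds the selectors.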
Response