spool-scraper
A Spool that makes scraping the web easy by building on Crawler.
Install
$ npm install --save @fabrix/spool-scraper
Configure
// config/main.ts
import { ScraperSpool } from '@fabrix/spool-scraper'
export const main = {
  spools: [
    // ... other spools
    ScraperSpool
  ]
}
Configuration
// config/scraper.ts
export const scraper = {
  max_connections: 10,
  rate_limit: 1000,
  encoding: null,
  jQuery: true,
  force_UTF8: true,
  retries: 3,
  retry_timeout: 10000,
  incoming_encoding: null,
  skip_duplicates: false,
  // Boolean. If true, user_agent should be an array and the crawler rotates through it (default: false)
  rotate_UA: false,
  // String|Array. If rotate_UA is false but user_agent is an array, the crawler uses the first entry
  user_agent: [],
  // String. If truthy, sets the HTTP referer header
  referer: null,
  // Object. Raw key-value pairs of HTTP headers
  headers: null,
  pre_request: (opts, done) => {
    // 'opts' here is not the 'options' you pass to 'c.queue';
    // it is the options object that will be passed to the 'request' module
    console.log(opts)
    // The request starts once done() is called
    done()
  }
}
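The rotate_UA and user_agent comments above describe a simple selection rule. A minimal sketch of that rule follows, assuming round-robin rotation by request index; pickUserAgent and its parameters are hypothetical names for illustration, not part of the spool or Crawler.

```typescript
// Hypothetical helper: mirrors the rotate_UA / user_agent comments in the config,
// not the spool's or Crawler's actual implementation.
type UserAgentSetting = string | string[] | null | undefined

function pickUserAgent(
  userAgent: UserAgentSetting,
  rotate: boolean,
  requestIndex: number
): string | undefined {
  if (userAgent == null) return undefined
  // A plain string is always used as-is
  if (typeof userAgent === 'string') return userAgent
  if (userAgent.length === 0) return undefined
  // With rotation on, cycle through the array (assumed round-robin);
  // otherwise always use the first entry
  return rotate ? userAgent[requestIndex % userAgent.length] : userAgent[0]
}

const agents = ['agent-a', 'agent-b', 'agent-c']
// Four consecutive requests with rotation enabled, then with rotation disabled
const rotated = [0, 1, 2, 3].map(i => pickUserAgent(agents, true, i))
const fixed = [0, 1, 2, 3].map(i => pickUserAgent(agents, false, i))
```

With rotation enabled the fourth request wraps back to the first entry; with it disabled every request gets the first entry.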
For more information about the available options, please see the scraper documentation.
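The pre_request hook in the config can also modify the outgoing options before the request starts. The following is a rough sketch under assumptions: the RequestOptions shape, the Accept-Language header, and the sample URL are illustrative and not taken from the spool.

```typescript
// Assumed shape of the options object handed to pre_request; only `headers` is used here
interface RequestOptions {
  uri?: string
  headers?: Record<string, string>
}

// Hypothetical hook: adds one header, then lets the request start
const preRequest = (opts: RequestOptions, done: () => void): void => {
  // Keep any existing headers and add one more
  opts.headers = { ...(opts.headers || {}), 'Accept-Language': 'en-US' }
  // Calling done() signals that the request may start
  done()
}

// Exercise the hook without a network call
const sampleOpts: RequestOptions = {
  uri: 'https://example.com',
  headers: { Accept: 'text/html' }
}
const events: string[] = []
preRequest(sampleOpts, () => { events.push('started') })
```

A hook like this could be assigned to pre_request in config/scraper.ts, or passed as the preRequest argument when queueing a scrape.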
Usage
For the best results, create a Scrape Class and override the default process method.
import { Scrape } from '@fabrix/spool-scraper'
export class AmazonScrape extends Scrape {
  process(res): Promise<any> {
    const $ = res.$
    const amazon = $('.nav-logo-base').text()
    return Promise.resolve(amazon)
  }
}
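A process override like the one above can be exercised without any network access by handing it a stub response. The sketch below is self-contained, so it does not extend the real Scrape class; LogoScrape, ScrapeResponse, and the stub $ function are hypothetical stand-ins for the spool's types and the jQuery-like object on res.$.

```typescript
// Hypothetical stand-ins: the real Scrape base class and response come from the spool and Crawler
type Selector = (query: string) => { text(): string }
interface ScrapeResponse { $: Selector }

class LogoScrape {
  // Same shape as the process override above: read from res.$ and resolve a value
  process(res: ScrapeResponse): Promise<{ logo: string }> {
    const $ = res.$
    const logo = $('.nav-logo-base').text().trim()
    return Promise.resolve({ logo })
  }
}

// Stub of a jQuery-like function backed by a fixed lookup table
const fakeText: Record<string, string> = { '.nav-logo-base': ' Amazon ' }
const stubResponse: ScrapeResponse = {
  $: (query) => ({ text: () => fakeText[query] || '' })
}

// Run the scrape logic against the stub instead of a live page
const pending = new LogoScrape().process(stubResponse)
```

Resolving a small object rather than a bare string keeps the result easy to extend if more fields are scraped later.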
Then you can either queue your scrape or run it directly:
// Return a result immediately <see config for options>
const direct = this.app.scrapes.AmazonScrape.direct('https://amazon.com', options, preRequest)
// Add this to the queue <see config for options>
this.app.scrapes.AmazonScrape.queue('https://amazon.com', options, preRequest)