# node-distributed-crawler

## Features

- Distributed crawler
- Configurable URL parser and data parser
- jQuery-style selectors using cheerio
- Parsed-data insertion into a MongoDB collection
- Per-domain interval configuration in a distributed environment
- Node 0.8+ support

Note: update to the latest version (0.0.4+); don't use 0.0.1.

I am actively updating this library; feature suggestions and fork/pull requests are welcome :)
## Installation

```
$ npm install dcrawler
```
## Usage

```js
var DCrawler = require("dcrawler");

var options = {
    mongodbUri: "mongodb://0.0.0.0:27017/crawler-data",
    profilePath: __dirname + "/" + "profile"
};

var logs = {
    dbUri: "mongodb://0.0.0.0:27017/crawler-log",
    storeHost: true
};

var dc = new DCrawler(options, logs);
dc.start();
```

Note: the MongoDB connection URIs (`mongodbUri` and `dbUri`) should point to the same server, so that the queueing of URLs stays centralized.
The DCrawler constructor takes `options` and log options:

- `options` object with the following properties *:
    - `mongodbUri`: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler') *
    - `profilePath`: Location of the profile directory which contains the config files. (Eg: /home/crawler/profile) *
- `logs` object to store logs in a centralized location using winston-mongodb, with the following properties:
    - `dbUri`: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler')
    - `storeHost`: Boolean, whether to store each worker's host name in the log collection.

Note: `logs` is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the log options to the DCrawler constructor:

```js
var dc = new DCrawler(options);
```
Create a config file for each domain inside the `profilePath` directory. Check the example profile `example.com`, which contains a config with the following properties:

- `collection`: Name of the collection in which to store parsed data in MongoDB. (Eg: 'products') *
- `url`: URL to start crawling from. String or array of URLs. (Eg: 'http://example.com' or ['http://example.com']) *
- `interval`: Interval between requests in milliseconds. Default is `1000`. (Eg: for a 2-second interval: `interval: 2000`)
- `followUrl`: Boolean, whether to fetch further URLs from each crawled page and crawl those as well.
- `resume`: Boolean, whether to resume crawling from previously crawled data.
- `beforeStart`: Function executed before crawling starts. It receives a `config` param containing the particular profile's config object. Example function (the log message is illustrative):

    ```js
    beforeStart: function (config) {
        console.log("starting crawl of " + config.url);
    }
    ```
- `parseUrl`: Function to extract further URLs from a crawled page. It receives `error`, the `response` object, and `$`, the jQuery (cheerio) object, and returns an array of URL strings. Example function (the body is an illustrative reconstruction; the anchor selector is a placeholder):

    ```js
    parseUrl: function (error, response, $) {
        var _url = [];
        try {
            $("a").each(function () {
                var href = $(this).attr("href");
                if (href) {
                    _url.push(href);
                }
            });
        } catch (e) {
            console.log(e);
        }
        return _url;
    }
    ```
- `parseData`: Function to extract information from a crawled page. It receives `error`, the `response` object, and `$`, the jQuery (cheerio) object, and returns a data object to insert into the collection. Example function (the selectors are illustrative placeholders):

    ```js
    parseData: function (error, response, $) {
        var _data = null;
        try {
            var _id = $("#product-id").text();
            var name = $("#product-name").text();
            var price = $("#product-price").text();
            var url = response.uri;
            _data = {
                _id: _id,
                name: name,
                price: price,
                url: url
            };
        } catch (e) {
            console.log(e);
        }
        return _data;
    }
    ```
- `onComplete`: Function executed when crawling completes. It receives a `config` param containing the particular profile's config object. Example function (the log message is illustrative):

    ```js
    onComplete: function (config) {
        console.log("finished crawling " + config.url);
    }
    ```
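Putting the properties above together, a complete profile config file might look like the following sketch. The file name, selectors, URLs, and log messages are hypothetical placeholders; only the property names and shapes follow the list above.

```javascript
// profile/example.com.js -- illustrative profile config.
// Selector strings, URLs, and messages are hypothetical placeholders.
var profile = {
    collection: "products",        // MongoDB collection for parsed data
    url: "http://example.com",     // start URL (string or array)
    interval: 2000,                // 2 seconds between requests
    followUrl: true,               // also crawl URLs found on each page
    resume: false,                 // do not resume from a previous crawl
    beforeStart: function (config) {
        console.log("starting crawl of " + config.url);
    },
    parseUrl: function (error, response, $) {
        var _url = [];
        try {
            // collect every anchor href on the page
            $("a").each(function () {
                _url.push($(this).attr("href"));
            });
        } catch (e) {
            console.log(e);
        }
        return _url;
    },
    parseData: function (error, response, $) {
        var _data = null;
        try {
            // hypothetical selectors for the fields to store
            _data = {
                _id: $("#product-id").text(),
                name: $("#product-name").text(),
                url: response.uri
            };
        } catch (e) {
            console.log(e);
        }
        return _data;
    },
    onComplete: function (config) {
        console.log("finished crawling " + config.url);
    }
};

module.exports = profile;
```

Because `parseUrl` and `parseData` receive the cheerio object rather than creating it, they stay easy to unit-test with a stub `$` before running a real crawl.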
Chirag (blikenoother -[at]- gmail [dot] com)