# crawlerr

**crawlerr** is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the URLs that match specified wildcard patterns, and it uses a Bloom filter for caching. A browser-like feeling.
- **Simple**: our crawler is simple to use;
- **Elegant**: provides a verbose, Express-like API;
- **MIT Licensed**: free for personal and commercial use;
- **Server-side DOM**: we use JSDOM so pages behave as they would in your browser;
- **Configurable**: pool size, retries, rate limit and more.
## Installation

```bash
$ npm install crawlerr
```
## Usage

`crawlerr(base [, options])`

You can find several examples in the `examples/` directory. Here are some of the most important ones:
### Example 1: Requesting title from a page

A minimal sketch based on the API documented below (the URL is illustrative):

```js
const crawlerr = require("crawlerr");
const spider = crawlerr("http://example.com/");

spider
  .get("/")
  .then(({ req, res, uri }) => console.log(res.document.title))
  .catch(error => console.error(error));
```
### Example 2: Scanning a website for specific links

A minimal sketch using named wildcards (the pattern is illustrative):

```js
const crawlerr = require("crawlerr");
const spider = crawlerr("http://example.com/");

spider
  .when("/users/[all:username]")
  .then(({ req, res, uri }) => console.log(req.param("username")));

spider.start();
```
### Example 3: Server side DOM

A minimal sketch (the URL and selector are illustrative):

```js
const crawlerr = require("crawlerr");
const spider = crawlerr("http://example.com/");

spider.get("/").then(({ req, res, uri }) => {
  const links = res.document.querySelectorAll("a");
  // …
});
```
### Example 4: Setting cookies

A sketch assuming the cookie-jar API inherited from `request.jar()` (see `.request` below); the URL and cookie are illustrative:

```js
const crawlerr = require("crawlerr");

const url = "http://example.com/";
const spider = crawlerr(url);

spider.request.setCookie("session=foobar", url);
spider.start();
```
## API

### crawlerr(base [, options])

Creates a new `Crawlerr` instance for a specific website with custom `options`. All routes will be resolved to `base`.
| Option | Default | Description |
|---|---|---|
| `concurrent` | `10` | How many requests can run simultaneously |
| `interval` | `250` | How often new requests should be sent (in ms) |
| … | `null` | See request defaults for more information |
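For instance, a crawler that keeps at most 5 requests in flight and sends a new one every 500 ms could be configured with an options object like this (the values are illustrative):

```javascript
// Options understood by crawlerr(base [, options]); values are illustrative:
const options = {
  concurrent: 5, // run at most 5 requests simultaneously
  interval: 500  // send a new request every 500 ms
};

// const spider = crawlerr("http://example.com/", options);
```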
### .get(url)

**public** Requests `url`. Returns a `Promise` which resolves with `{ req, res, uri }`, where:

- `req` is the *Request* object;
- `res` is the *Response* object;
- `uri` is the absolute `url` (resolved from `base`).
Example:

```js
spider.get("/").then(({ req, res, uri }) => {
  // …
});
```
### .when(pattern)

**public** Searches the entire website for URLs which match the specified `pattern`. `pattern` can include named wildcards, which can then be retrieved on the request via `req.param`.
Example:

```js
spider.when("/users/[all:username]").then(({ req, res, uri }) => {
  console.log(req.param("username"));
});
```
### .on(event, callback)

**public** Executes `callback` for a given `event`. For more information about which events are emitted, refer to queue-promise.
Example:

```js
// Event names come from queue-promise; "resolve"/"reject" shown for illustration:
spider.on("resolve", () => console.log("Resolved one request"));
spider.on("reject", error => console.error(error));
```
### .start() / .stop()

**public** Starts/stops the crawler.
Example:

```js
spider.start();
// …
spider.stop();
```
### .request

**public** A configured `request` object which is used by `retry-request` when crawling webpages. Extends from `request.jar()`. Can be configured when initializing a new crawler instance through `options`. See the crawler options and the `request` documentation for more information.
Example (a sketch assuming the cookie-jar API from `request.jar()`; the URL and cookie are illustrative):

```js
const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;

request.setCookie("foo=bar", url);
```
## Request

Extends the default Node.js incoming message.
### get(header)

**public** Returns the value of an HTTP `header`. The `Referrer` header field is special-cased: both `Referrer` and `Referer` are interchangeable.
Example:

```js
req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"
```
### is(...types)

**public** Checks whether the incoming request contains the `Content-Type` header field and whether it matches the given mime `type`. Based on type-is.
Example:

```js
// Each of these returns true with "Content-Type: text/html; charset=utf-8":
req.is("html");
req.is("text/html");
req.is("text/*");
```
### param(name [, defaultValue])

**public** Returns the value of param `name` when present, or `defaultValue` otherwise:

- checks route placeholders, e.g. `user/[all:username]`;
- checks body params, e.g. `id=12`, `{"id":12}`;
- checks query string params, e.g. `?id=12`.
Example:

```js
// .when("/users/[all:username]/[digit:someID]")
req.param("username"); // /users/foobar/123456 => foobar
req.param("someID");   // /users/foobar/123456 => 123456
```
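Conceptually, named wildcards like the ones above can be thought of as compiling to a regular expression with one capture group per placeholder. The sketch below illustrates that idea only; it is not crawlerr's internal implementation:

```javascript
// Illustrative only – NOT crawlerr's actual implementation.
// Turns a pattern such as "/users/[all:username]/[digit:someID]"
// into a regex with named capture positions.
function compile(pattern) {
  const names = [];
  const source = pattern.replace(/\[(all|digit):(\w+)\]/g, (_, type, name) => {
    names.push(name);
    return type === "digit" ? "(\\d+)" : "(.+)";
  });
  return { regex: new RegExp("^" + source + "$"), names };
}

const { regex, names } = compile("/users/[all:username]/[digit:someID]");
const match = "/users/foobar/123456".match(regex);
// names[0] => "username", match[1] => "foobar"
// names[1] => "someID",   match[2] => "123456"
```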
## Response

### jsdom

**public** Returns the JSDOM object.

### window

**public** Returns the DOM `window` for the response content. Based on JSDOM.

### document

**public** Returns the DOM `document` for the response content. Based on JSDOM.
Example:

```js
res.document.querySelector("h1");
res.document.querySelectorAll("a");
// …
```
## Tests

```bash
$ npm test
```