Fetch Politely
A library for ensuring polite outgoing HTTP requests that respect robots.txt and aren't made too close to each other
Installation
npm install fetch-politely --save
Usage
Simple:
var FetchPolitely = require('fetch-politely');
var fetchInstance = new FetchPolitely(function (err, url, message) {
  if (err) { return; }
  // The URL has been cleared for fetching – the hostname isn't throttled and robots.txt doesn't ban it
}, {
  // Robots.txt checking requires specification of a User Agent as Robots.txt can contain User Agent specific rules
  // See http://en.wikipedia.org/wiki/User_agent for more info on the format
  userAgent: 'Your-Application-Name/your.app.version (http://example.com/optional/full/app/url)'
});
// When a slot has been reserved the callback sent in the constructor will be called
fetchInstance.requestSlot('http://example.com/'); // placeholder URL
FetchPolitely()
var fetchInstance = new FetchPolitely(callback, [options]);
Parameters
- callback – function (err, url, message, [content]) {}; called for each successful request slot
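As a rough illustration of the callback signature above, the following sketch defines a handler and calls it directly with the documented argument order. The names onSlot and fetched are hypothetical and not part of the library, and the exact arguments passed alongside an error are an assumption.

```javascript
// Hypothetical handler matching the documented callback signature.
// `fetched` is only an illustrative store, not part of fetch-politely.
const fetched = [];

function onSlot(err, url, message, content) {
  if (err) {
    // For example a denied URL; nothing should be fetched in this case.
    console.error('Slot not granted:', err.message);
    return;
  }
  // `content` is only expected when the returnContent option is enabled.
  fetched.push({ url: url, message: message, hasContent: content !== undefined });
}

// Simulated calls using the documented argument order.
onSlot(null, 'http://example.com/page', { depth: 1 });
onSlot(new Error('Denied by robots.txt'), 'http://example.com/private');
```

A handler like this would be passed as the first argument to the FetchPolitely constructor.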
Options
- throttleDuration – how long, in milliseconds, to throttle requests to each hostname. Defaults to 10 seconds (10000 ms).
- returnContent – whether to fetch and return the content with the callback when a URL has received a request slot. Defaults to false.
- logger – a Bunyan-compatible logger. Defaults to bunyan-duckling, which uses console.log() and console.error().
- lookup – an object or class that keeps track of throttled hosts and queued URLs. Defaults to PoliteLookup.
- lookupOptions – an object that defines extra lookup options.
- allowed – a function that checks whether a URL is allowed to be fetched. Defaults to PoliteRobot.allowed().
- robotCache – a cache method used by PoliteRobot to cache fetched robots.txt. Defaults to wrapped lru-cache.
- robotCacheLimit – a limit on the number of items to keep in the default lru-cache of PoliteRobot.
- robotPool – an HTTP agent to use for the request library of PoliteRobot.
- userAgent – required by PoliteRobot and options.returnContent. The User Agent to use for HTTP requests.
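To make the idea behind throttleDuration concrete, here is a minimal sketch of per-hostname throttling. It is not the library's implementation: createThrottle and tryReserve are hypothetical names, and the sketch only reports whether a slot is free, whereas the library's lookup also keeps track of queued URLs.

```javascript
// Illustrative only: a per-hostname throttle with an injectable clock.
// This is not how fetch-politely is implemented internally.
function createThrottle(throttleDuration, now) {
  const nextAllowed = new Map(); // hostname -> earliest time of the next request

  return function tryReserve(url) {
    const hostname = new URL(url).hostname;
    const time = now();
    if ((nextAllowed.get(hostname) || 0) > time) {
      return false; // hostname is still throttled
    }
    nextAllowed.set(hostname, time + throttleDuration);
    return true;
  };
}

// A fake clock keeps the example deterministic.
let clock = 0;
const tryReserve = createThrottle(10000, () => clock);

const first = tryReserve('http://example.com/a');  // slot granted
const second = tryReserve('http://example.com/b'); // same host, still throttled
const other = tryReserve('http://example.org/');   // different host, granted
clock += 10000;
const later = tryReserve('http://example.com/b');  // throttle has expired
```

Throttling is keyed on the hostname rather than the full URL, so two different pages on the same host share one throttle.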
Methods
- requestSlot – tries to reserve a request slot for a URL.
Static
- FetchPolitely.PoliteError – a very polite error object used for e.g. informing about denied URLs
- FetchPolitely.PoliteLookup – defines the interface for keeping track of throttled hosts and queued URLs
- FetchPolitely.PolitePGLookup – alternative lookup that uses PostgreSQL as the backend
- FetchPolitely.PoliteRobot – checks whether URLs are allowed to be fetched according to robots.txt.
fetchInstance.requestSlot()
fetchInstance.requestSlot(url, [message], [options]);
Parameters
- url – the URL to reserve a request slot for
- message – an optional JSON-encodable message containing e.g. instructions for the FetchPolitely callback.
Options
- allow – if set to true the URL will always be allowed and will not be sent to the allowed function.
- allowDuplicates – if set to false no more than one item of every url + message combination will be queued.
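The url + message deduplication described for allowDuplicates could look roughly like the sketch below. It is an illustration and not the library's code: createQueue is a hypothetical name, and treating duplicates as allowed unless the option is explicitly false is an assumption.

```javascript
// Illustrative only: a queue that can refuse duplicate url + message combinations.
function createQueue() {
  const items = [];
  const seen = new Set();

  return {
    items: items,
    add: function (url, message, options) {
      // Assumption: duplicates are allowed unless allowDuplicates is explicitly false.
      const allowDuplicates = !options || options.allowDuplicates !== false;
      // Serialising [url, message] as JSON gives an unambiguous composite key.
      const key = JSON.stringify([url, message === undefined ? null : message]);
      if (!allowDuplicates && seen.has(key)) {
        return false; // this combination is already queued
      }
      seen.add(key);
      items.push({ url: url, message: message });
      return true;
    }
  };
}

const queue = createQueue();
const opts = { allowDuplicates: false };
const a = queue.add('http://example.com/', { type: 'feed' }, opts); // queued
const b = queue.add('http://example.com/', { type: 'feed' }, opts); // duplicate, skipped
const c = queue.add('http://example.com/', { type: 'page' }, opts); // new message, queued
```

Note that JSON.stringify is sensitive to property order, so two messages with the same keys in a different order would count as distinct in this sketch.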
PoliteLookup
The simplest of simple implementations for keeping track of throttled hosts and queued URLs. Handles it all in memory. The same interface can be used to build a database backend.
PolitePGLookup
A PostgreSQL + Knex-driven lookup that throttles hosts and queues URLs using database tables.
Use by setting up the tables in pglookup.sql and include by setting the FetchPolitely options to:
{
  lookup: FetchPolitely.PolitePGLookup,
  lookupOptions: {
    knex: knexInstance
  }
}
Pull Requests are welcome if someone wants to pull out the Knex dependency. Most projects where this has been used with Postgres have been using Knex, so it got used here as well.
lookupOptions
- knex – required – the database connection to use, provided through a Knex object.
- purgeWindow – the minimum interval in milliseconds between two host purges. Defaults to 500 ms.
- concurrentReleases – how many parallel database lookups to perform to check for released URLs. Defaults to 2.
- releasesPerBatch – how many URLs to fetch in each database lookup. Defaults to 5.
- onlyDeduplicateMessages – boolean that, if set, will only deduplicate URLs with the same message when deduplicating. Defaults to false.
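As a sketch of how these lookupOptions fit together, the snippet below merges caller overrides over the documented defaults. buildLookupOptions and fakeKnex are hypothetical names for illustration only, and the library applies its own defaults internally, so a helper like this is not needed in practice.

```javascript
// Defaults as documented above; `knex` has no default because it is required.
const documentedDefaults = {
  purgeWindow: 500,
  concurrentReleases: 2,
  releasesPerBatch: 5,
  onlyDeduplicateMessages: false
};

// Hypothetical helper, not part of fetch-politely.
function buildLookupOptions(knexInstance, overrides) {
  if (!knexInstance) {
    throw new Error('lookupOptions.knex is required');
  }
  return Object.assign({}, documentedDefaults, overrides, { knex: knexInstance });
}

// A stand-in object; in a real project this would be a configured Knex instance.
const fakeKnex = {};
const lookupOptions = buildLookupOptions(fakeKnex, { releasesPerBatch: 20 });
```

The resulting object would be passed as lookupOptions in the FetchPolitely options, together with lookup set to FetchPolitely.PolitePGLookup.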
Lint / Test
npm test
or, to watch for changes, install grunt-cli and then run grunt watch