MicroFrontier
MicroFrontier is a web crawler frontier implemented in TypeScript and backed by Redis. It is scalable and distributed, and is built on Redis queues.
- [x] Fast Ingestion & High throughput
- [x] Multiple priority queues
- [x] Custom priority strategy
- [x] Per-Hostname crawl rate limit or default delay fallback
- [x] Easy to use HTTP Microservice
- [x] Multi-processing support
Example of a Mercator frontier [1]
Usage
MicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker deployment.
Command Line
Install microfrontier with:
npm i -g microfrontier
Run microfrontier:
microfrontier --port 3035 --redis:host localhost # see Configuration for other parameters
As a package
Npm:
npm i microfrontier
Yarn:
yarn add microfrontier
Docker
docker pull adileo/microfrontier
Configuration
ENV VAR | CLI PARAMS | Description |
---|---|---|
host | --host | Host name on which to start the microservice HTTP server. Default value: 127.0.0.1 |
port | --port | Port on which to start the microservice HTTP server. Default value: 8090 |
redis_host | --redis:host | Redis server host. Default value: 127.0.0.1 |
redis_port | --redis:port | Redis server port. Default value: 6379 |
redis_* | --redis:* | Parameters are interpreted by nconf and passed to ioredis as the client config. |
config_frontierName | --config:frontierName | Prefix used for Redis keys. |
config_* | --config:* | Parameters are interpreted by nconf. Default value below. |
{
frontierName: 'frontier',
priorities: {
'high': {probability: 0.6},
'normal': {probability: 0.3},
'low': {probability: 0.1},
},
defaultCrawlDelay: 1000
}
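The probability values suggest that each dequeue picks one of the priority queues at random, weighted by these numbers. This README does not show how MicroFrontier does that internally, so the following is only a minimal sketch of weighted selection. `pickPriority` and its injectable `rand` parameter are hypothetical names used for illustration and are not part of the library.

```typescript
// Hypothetical sketch of probability-weighted queue selection.
// Not MicroFrontier's actual code; all names are illustrative.
type Priorities = Record<string, { probability: number }>;

const priorities: Priorities = {
  high: { probability: 0.6 },
  normal: { probability: 0.3 },
  low: { probability: 0.1 },
};

function pickPriority(p: Priorities, rand: () => number = Math.random): string {
  const entries = Object.entries(p);
  if (entries.length === 0) throw new Error("no priorities configured");
  // Scale by the total so the weights do not have to sum exactly to 1.
  const total = entries.reduce((sum, [, v]) => sum + v.probability, 0);
  let threshold = rand() * total;
  let chosen = "";
  for (const [name, { probability }] of entries) {
    chosen = name;
    threshold -= probability;
    if (threshold < 0) break;
  }
  // If rounding kept threshold at or above 0, the last entry is returned.
  return chosen;
}
```

With the default values, roughly 60% of calls would return `high`, 30% `normal`, and 10% `low`. Passing a fixed `rand` makes the choice deterministic, which helps when testing.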
How to
Adding a URL to the frontier
Via HTTP
curl --location --request POST 'http://127.0.0.1:8090/frontier' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "http://www.example.com",
"priority": "normal",
"meta": {
"foo": "bar"
}
}'
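The same request can be sent from TypeScript. The sketch below assumes a runtime with a global `fetch` (for example Node.js 18 or later). `buildAddRequest` and `addUrl` are hypothetical helpers, not part of the microfrontier package, and since the response body is not documented here, only the HTTP status is checked.

```typescript
// Hypothetical helpers mirroring the curl call above; not part of microfrontier.
interface AddPayload {
  url: string;
  priority: string;
  meta?: Record<string, unknown>;
}

interface AddRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the fetch() arguments for POST /frontier without sending anything.
function buildAddRequest(baseUrl: string, payload: AddPayload): AddRequest {
  return {
    // Strip trailing slashes so the path is not doubled.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/frontier`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Sends the request and rejects on a non-2xx status.
async function addUrl(baseUrl: string, payload: AddPayload): Promise<void> {
  const { endpoint, init } = buildAddRequest(baseUrl, payload);
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`Frontier returned HTTP ${res.status}`);
}
```

A call such as `addUrl("http://127.0.0.1:8090", { url: "http://www.example.com", priority: "normal", meta: { foo: "bar" } })` would then correspond to the curl command above.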
Via SDK
import { URLFrontier } from "microfrontier"
const frontier = new URLFrontier(config)
frontier.add("http://www.example.com", "normal", {"foo": "bar"}).then(() => {
console.log('URL added')
})
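When seeding a crawl, several URLs usually need to be added at once. The sketch below wraps the `add(url, priority, meta)` call shown above in a loop that keeps going when a single URL fails. `FrontierAdder`, `Seed`, and `addSeeds` are hypothetical names introduced for this example. The interface only assumes that `add` returns a promise, as the `.then()` call above indicates.

```typescript
// Minimal shape of the frontier, based on the add() call shown above.
interface FrontierAdder {
  add(url: string, priority: string, meta?: Record<string, unknown>): Promise<unknown>;
}

interface Seed {
  url: string;
  priority: string;
  meta?: Record<string, unknown>;
}

// Adds seeds one after another and collects the URLs that failed
// instead of stopping at the first error.
async function addSeeds(
  frontier: FrontierAdder,
  seeds: Seed[],
): Promise<{ added: number; failed: string[] }> {
  let added = 0;
  const failed: string[] = [];
  for (const seed of seeds) {
    try {
      await frontier.add(seed.url, seed.priority, seed.meta);
      added += 1;
    } catch {
      failed.push(seed.url);
    }
  }
  return { added, failed };
}
```

Adding sequentially keeps the example simple. Whether concurrent calls would be faster or safe against the same Redis instance is not covered by this README.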
Getting a URL from the frontier
Via HTTP
curl --location --request GET 'http://127.0.0.1:8090/frontier'
Via SDK
import { URLFrontier } from "microfrontier"
const frontier = new URLFrontier(config)
frontier.get().then((item) => {
// {url: "http://www.example.com", meta: {"foo":"bar"}}
})
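A crawler worker typically calls `get()` in a loop. The sketch below shows one way such a loop might look. It assumes, without confirmation from this README, that `get()` resolves to an empty value (`null` or `undefined`) when no URL is currently available. `FrontierLike`, `FrontierItem`, `drain`, and the option names are all hypothetical.

```typescript
// Minimal shape of the frontier, based on the get() call shown above.
interface FrontierItem {
  url: string;
  meta?: Record<string, unknown>;
}

interface FrontierLike {
  get(): Promise<FrontierItem | null | undefined>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls the frontier until maxItems URLs have been handled or
// maxIdlePolls consecutive polls returned nothing.
async function drain(
  frontier: FrontierLike,
  handle: (item: FrontierItem) => Promise<void>,
  opts = { maxItems: 100, maxIdlePolls: 3, idleMs: 500 },
): Promise<number> {
  let handled = 0;
  let idle = 0;
  while (handled < opts.maxItems && idle < opts.maxIdlePolls) {
    const item = await frontier.get();
    if (!item) {
      // Assumption: an empty result means nothing is ready yet.
      idle += 1;
      await sleep(opts.idleMs);
      continue;
    }
    idle = 0;
    await handle(item);
    handled += 1;
  }
  return handled;
}
```

The `handle` callback is where a fetcher would download `item.url`. Bounding the loop with `maxItems` and `maxIdlePolls` keeps the sketch from spinning forever on an empty frontier.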
Citations
[1]: High-Performance Web Crawling - Marc Najork, Allan Heydon