crwlr
A minimal Puppeteer crawler API
Huh?
- crwlr:
  - handles the boring boilerplate work of actually crawling a site
- You provide:
  - <String> url to start from
  - <Puppeteer Browser> browser instance, created with your own .launch(options)
  - <Object> pageOptions as you wish:
    - <Object> goto to be provided as options to page.goto(url, options)
    - <Function> prepare(page) binds event handlers and/or sets properties for every new page
    - <Function> resolved(response, page) fires after every page.goto() has resolved
Installation
$ npm install --save crwlr
Usage
Basic Example - Without Any Options
'use strict';

const puppeteer = require('puppeteer');
const crwlr = require('crwlr');

const site = 'https://buster.neocities.org/crwlr/';

// *** Basic Example Without Any Options *** //
(async () => {
  const browser = await puppeteer.launch();
  // Argument order reconstructed from the "You provide" list above: url, then browser.
  let crawledPages = await crwlr(site, browser);
  console.log(crawledPages);
})();

/*
[ 'https://buster.neocities.org/crwlr/',
  'https://buster.neocities.org/crwlr/other.html',
  'https://buster.neocities.org/crwlr/mixed-content.html',
  'https://buster.neocities.org/crwlr/missing.html',
  'https://buster.neocities.org/crwlr/dummy.pdf' ]
*/
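The sample output above is a plain array of URL strings, so it can be post-processed with ordinary JavaScript. As a minimal sketch, the helper below groups crawled URLs by file extension, for example to separate HTML pages from PDFs. Note that groupByExtension is a hypothetical name introduced for illustration and is not part of crwlr.

```javascript
'use strict';

const path = require('path');

// Hypothetical helper (not part of crwlr): groups URL strings by the file
// extension of their path. URLs without an extension go under the '' key.
function groupByExtension(urls) {
  const groups = {};
  for (const href of urls) {
    // Parse the URL so query strings and hashes do not affect the extension.
    const ext = path.posix.extname(new URL(href).pathname);
    (groups[ext] = groups[ext] || []).push(href);
  }
  return groups;
}
```

Calling groupByExtension(crawledPages) on the result of the basic example would give one list per extension, such as '.html' and '.pdf'.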
Advanced Example - With Options
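'use strict';

const puppeteer = require('puppeteer');
const crwlr = require('crwlr');

const site = 'https://buster.neocities.org/crwlr/';

// *** Advanced Example With Options *** //
(async () => {
  const browser = await puppeteer.launch();

  const pageOptions = {
    prepare: (page) => {
      // The handler was lost in extraction. This is a reconstruction (assumption)
      // that matches the sample output below: log requests that leave the crawled site.
      page.on('request', (request) => {
        if (!request.url().startsWith(site)) {
          console.log(`${page.url()} => requested: ${request.url()}`);
        }
      });
    },
    goto: { waitUntil: 'networkidle2' },
    resolved: (response, page) => {
      // Log format reconstructed from the sample output below.
      console.log(`=> resolved: ${response.status()} ${response.url()}`);
    }
  };

  // Argument order reconstructed from the "You provide" list above: url, browser, pageOptions.
  await crwlr(site, browser, pageOptions);
})();

/*
=> resolved: 200 https://buster.neocities.org/crwlr/
=> resolved: 200 https://buster.neocities.org/crwlr/other.html
https://buster.neocities.org/crwlr/mixed-content.html => requested: https://mixed-script.badssl.com/nonsecure.js
=> resolved: 200 https://buster.neocities.org/crwlr/mixed-content.html
=> resolved: 404 https://buster.neocities.org/crwlr/missing.html
=> resolved: 200 https://buster.neocities.org/crwlr/dummy.pdf
*/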
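Because resolved(response, page) fires after every page.goto(), it is a natural place to collect results rather than only log them. The following is a minimal sketch of a broken-link collector built on that hook. createBrokenLinkCollector is a hypothetical helper, not part of crwlr, and it assumes the response argument is a Puppeteer response object exposing status() and url().

```javascript
'use strict';

// Hypothetical helper (not part of crwlr): returns a pageOptions object whose
// resolved() hook records every response with an HTTP status of 400 or above.
function createBrokenLinkCollector() {
  const broken = [];
  const pageOptions = {
    // Assumption: crwlr calls resolved(response, page) after each page.goto(),
    // as described in the option list above.
    resolved(response, page) {
      // Puppeteer's page.goto() can resolve with null (for example on a
      // same-URL hash navigation), so guard before reading the status.
      if (response && response.status() >= 400) {
        broken.push({ url: response.url(), status: response.status() });
      }
    }
  };
  return { pageOptions, broken };
}
```

To use it, one would pass the returned pageOptions to crwlr as the pageOptions argument and read the broken array once the crawl has finished. For the sample site above, the 404 for missing.html is the kind of entry it would be expected to hold.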
License
ISC © Buster Collings