A command-line crawler built on Puppeteer
To crawl a site and save the pages to a local ./temp directory
$ arachne crawl http://www.thecoffeefaq.com/ -d ./temp
To also save markdown and schema.org FAQs
$ arachne crawl http://www.thecoffeefaq.com/ -a -t markdown -d ./temp
With a file of whitelisted URL patterns
$ arachne crawl http://www.thecoffeefaq.com/ -a -t markdown -d ./temp -w ./temp/whitelist.md
With a settling period
$ arachne crawl http://www.thecoffeefaq.com/ -d ./temp -b 5000 -o 9000
Follow the instructions here to set up: https://github.com/puppeteer/puppeteer/issues/1837#issuecomment-689006806
Before running the CLI, start XLaunch, select "Multiple windows", choose "Start no client", and disable access control.
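The linked issue covers running a browser from WSL against an X server on the Windows host. A common companion step (an assumption here, not part of the original instructions) is pointing DISPLAY at the host, whose address WSL2 exposes as the nameserver in /etc/resolv.conf:

```shell
# Assumption: WSL2 with XLaunch (VcXsrv) running on the Windows host.
# The host's address is the nameserver entry in /etc/resolv.conf.
export DISPLAY=$(awk '/nameserver/ {print $2; exit}' /etc/resolv.conf):0
echo "$DISPLAY"
```

Add this to your shell profile if you want it set for every session.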
Alternatively, add the -h flag to run headless (no browser window is launched).
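For example, combining the -h flag with the earlier sample crawl (same URL and output directory as above):

$ arachne crawl http://www.thecoffeefaq.com/ -d ./temp -h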
If the commands above don't work, you may need to pass the browser's executablePath (-e) and run headless (-h).
$ arachne crawl http://www.thecoffeefaq.com/ -e /usr/bin/google-chrome -h