aljazeera-crawler

1.0.3 • Public • Published

aljazeera-crawler is a command line application that helps crawl the https://www.aljazeera.net/ website.

Installation

Either install the tool globally so that it is available on your system path:

npm install -g aljazeera-crawler

Or run it directly with npx:

npx aljazeera-crawler [options]

Usage

For CLI options, use the -h (or --help) argument:

aljazeera-crawler -h

Al Jazeera Crawler
Usage: aljazeera-crawler [options]

Options:
      --version    Show version number  [boolean]
  -t, --threshold  the minimum number of words to be crawled  [number] [default: 1000]
  -d, --domain     the domain to crawl  [string] [required] [choices: "politics", "economy", "culture", "sport", "art", "technology", "heritage"]
  -h, --help       Show help  [boolean]

Suppose we want to crawl at least 100,000 words in the technology domain.

We will use either:

aljazeera-crawler -t 100000 -d technology

Or:

aljazeera-crawler --threshold 100000 --domain technology

When the crawl finishes, a file named output-technology-100000.txt is created.
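To sanity-check a run, one option is to count the words in the resulting file. The following is a minimal sketch, not part of aljazeera-crawler: the helper names output_file_name and count_words are hypothetical, the file name pattern output-<domain>-<threshold>.txt is only inferred from the single example above, and the file is assumed to be plain text written to the current directory.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not provided by aljazeera-crawler.

# Build the expected output file name for a domain and threshold.
# Assumption: the pattern output-<domain>-<threshold>.txt, inferred from the example above.
output_file_name() {
  printf 'output-%s-%s.txt\n' "$1" "$2"
}

# Print the number of whitespace-separated words in a file.
count_words() {
  wc -w < "$1" | tr -d '[:space:]'
}

# Example usage after a crawl (left commented out so the helpers can be sourced safely):
#   file=$(output_file_name technology 100000)
#   echo "$(count_words "$file") words in $file"
```

Note that wc -w counts whitespace-delimited tokens, which may not match exactly how the crawler itself counts words.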

Domains

The domains that can currently be crawled are:

| Category | Arabic name | Link |
| --- | --- | --- |
| politics | سياسة | https://www.aljazeera.net/news/politics/ |
| economy | اقتصاد | https://www.aljazeera.net/news/ebusiness/ |
| culture | ثقافة | https://www.aljazeera.net/news/cultureandart/ |
| sport | رياضة | https://www.aljazeera.net/sport/ |
| art | فن | https://www.aljazeera.net/news/arts/ |
| technology | تكنولوجيا | https://www.aljazeera.net/news/scienceandtechnology/ |
| heritage | تراث | https://www.aljazeera.net/turath/ |
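Since the tool takes one domain per invocation, crawling every domain means running it once per entry in the list above. A minimal sketch of such a loop follows. The function name crawl_all_domains and the CRAWLER override variable are hypothetical and exist only for this example; only the --threshold and --domain options documented above are used.

```shell
#!/bin/sh
# Assumption: CRAWLER is a hypothetical override used by this sketch only;
# by default it points at the installed aljazeera-crawler command.
CRAWLER="${CRAWLER:-aljazeera-crawler}"

# The domains listed above.
DOMAINS="politics economy culture sport art technology heritage"

# Run the crawler once per domain with the given threshold.
# The threshold defaults to 1000, matching the CLI default shown in the help text.
crawl_all_domains() {
  threshold="${1:-1000}"
  for domain in $DOMAINS; do
    # Stop at the first failing crawl and report failure to the caller.
    $CRAWLER --threshold "$threshold" --domain "$domain" || return 1
  done
}

# Example usage (commented out so sourcing this file does not start a crawl):
#   crawl_all_domains 100000
```

Running the domains sequentially keeps the load on the site modest; a parallel variant is possible but is not shown here.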

Licence

MIT

Unpacked Size

10.2 kB

Total Files

8

Collaborators

  • artpumpkin