discovery-web-crawler

Version 1.2.1 • License: ISC • Includes built-in TypeScript type declarations

Crawls a website and populates a Watson Discovery Collection.

Install

npm install discovery-web-crawler

Usage

The following snippet will gather Watson stories from the IBM website and index them in Watson Discovery.

const DiscoveryWebCrawler = require('discovery-web-crawler')

const crawler = new DiscoveryWebCrawler({
    // Watson Discovery credentials and target collection
    serviceUrl: 'YOUR_SERVICE_URL',
    apikey: 'YOUR_APIKEY',
    environmentId: 'YOUR_ENVIRONMENT_ID',
    collectionId: 'YOUR_COLLECTION_ID',

    // Starting point URL
    url: 'https://www.ibm.com/watson/stories/',
    // Max crawler depth
    maxDepth: 3,
    // Condition to crawl this URL
    fetchCondition: queueItem => queueItem.path.startsWith('/watson/'),
    // Condition to index this URL
    urlCondition: url => !url.match('/list'),
    // Cheerio API to extract JSON from HTML content
    parse: async $ => ({ text: $('main').text().replace(/\s+/g, ' ').trim() }),
})
crawler.start()
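
To see what the two conditions in the snippet above accept or reject, it can help to run them on a few inputs before starting a real crawl. The sketch below copies the predicates and the whitespace cleanup from the usage example and applies them to hypothetical sample data. The sample objects, the `normalize` helper name, and the assumption that a queue item only needs a `path` field here are illustrative and are not taken from the package itself.

```javascript
// Predicates copied from the usage example so they can be tried in isolation.
const fetchCondition = queueItem => queueItem.path.startsWith('/watson/')
const urlCondition = url => !url.match('/list')

// Hypothetical helper: the same whitespace cleanup used inside `parse`.
const normalize = text => text.replace(/\s+/g, ' ').trim()

// Hypothetical sample inputs; only `path` is assumed to exist on a queue item.
const samples = [
    { path: '/watson/stories/', url: 'https://www.ibm.com/watson/stories/' },
    { path: '/watson/stories/list', url: 'https://www.ibm.com/watson/stories/list' },
    { path: '/cloud/', url: 'https://www.ibm.com/cloud/' },
]

// Report each predicate separately for every sample.
const report = samples.map(sample => ({
    url: sample.url,
    passesFetchCondition: fetchCondition(sample),
    passesUrlCondition: urlCondition(sample.url),
}))

console.log(report)
console.log(normalize('  Watson \n\n  stories\t here  '))
```

Note that `url.match('/list')` treats the string as a regular expression pattern, so any URL containing `/list` anywhere is excluded from indexing, not only URLs that end with it.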

Run tests

npm run test

Author

👤 Marco Cardoso

Show your support

Give a ⭐️ if this project helped you!
