@skypilot/scraper
1.0.0-alpha.23 • Public • Published

npm latest downloads license: ISC

Node-based scriptable web scraper

How to use

  1. Create a database adapter
const dbFilePath = 'tmp/demo.json';
const database = new LowDb(dbFilePath);
  2. Create a scraper that uses the database
import { PlaywrightScraper } from './src/PlaywrightScraper';
const scraper = new PlaywrightScraper({ database });
  3. Use ScriptBuilder to build a script:
import { ScriptBuilder } from './src/ScriptBuilder';
const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({ // Runs the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({ // gather this data for each iteration of the elements matching the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // writes to the database
  });
  4. Pass the script into the scraper's run method:
const result = scraper.run(builder);

Query

There are two ways to write a query:

1. A Query or ShorthandQuery object

A Query object is the standard way to write a selector:

interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}

A ShorthandQuery is the same as a Query object, but uses shorter names for some of the keys:

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
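
The correspondence between the two interfaces can be sketched as a small conversion function. This is only an illustration: `expandShorthand` is a hypothetical helper, not a documented export of the package, and `Integer` is assumed here to be an alias for `number`.

```typescript
// Assumption: the package's `Integer` type behaves like a plain number.
type Integer = number;

interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nthOfType?: Integer;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}

// Hypothetical helper: maps each shorthand key to its full-length counterpart.
function expandShorthand(shorthand: ShorthandQuery): Query {
  const query: Query = { selector: shorthand.sel };
  // Copy optional keys only when they were provided.
  if (shorthand.attr !== undefined) query.attributeName = shorthand.attr;
  if (shorthand.scope !== undefined) query.scope = shorthand.scope;
  if (shorthand.limit !== undefined) query.limit = shorthand.limit;
  if (shorthand.nth !== undefined) query.nthOfType = shorthand.nth;
  return query;
}

console.log(expandShorthand({ sel: 'a.link', attr: 'href', nth: 2 }));
// { selector: 'a.link', attributeName: 'href', nthOfType: 2 }
```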

Selectors can be written in CSS or XPath syntax. Support for text selectors will be added soon.
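
The README does not say how the two selector syntaxes are told apart. As a rough sketch, the check below follows the convention Playwright itself documents, where a selector starting with `//` or `..` is treated as XPath. Whether this package relies on that rule is an assumption, and `selectorKind` is a hypothetical helper.

```typescript
type SelectorKind = 'css' | 'xpath';

// Hypothetical helper; assumes Playwright's convention that selectors
// beginning with `//` or `..` are XPath and everything else is CSS.
function selectorKind(selector: string): SelectorKind {
  const trimmed = selector.trim();
  return trimmed.startsWith('//') || trimmed.startsWith('..') ? 'xpath' : 'css';
}

console.log(selectorKind('head > title')); // css
console.log(selectorKind('//h2[contains(text(), "Technical Contact")]/following-sibling::b')); // xpath
```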

By default, a query selects the first element that matches the selector, with two exceptions:

  • When the query is used with runOnAll or has scope: 'all', it selects all matching elements, up to the limit (if any)
  • When nthOfType is set, it selects the nth matching element
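
These selection rules can be sketched as a function over a list of already-matched elements. Several details are assumptions rather than documented behavior: `applyScope` is a hypothetical helper, `nthOfType` is taken to be 1-based, and `nthOfType` is given precedence over `scope` when both are set.

```typescript
// Assumption: the package's `Integer` type behaves like a plain number.
type Integer = number;

interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nthOfType?: Integer;
}

// Hypothetical helper: narrows `matches` according to the query's options.
function applyScope<T>(matches: T[], query: Query, insideRunOnAll = false): T[] {
  if (query.nthOfType !== undefined) {
    // Assumption: counting starts at 1, so nthOfType 1 is the first match.
    const index = query.nthOfType - 1;
    return index < 0 ? [] : matches.slice(index, index + 1);
  }
  if (insideRunOnAll || query.scope === 'all') {
    // All matches, capped at `limit` when one is given.
    return query.limit === undefined ? matches.slice() : matches.slice(0, query.limit);
  }
  // Default: only the first match.
  return matches.slice(0, 1);
}

const cells = ['a', 'b', 'c'];
console.log(applyScope(cells, { selector: 'td' })); // [ 'a' ]
console.log(applyScope(cells, { selector: 'td', scope: 'all', limit: 2 })); // [ 'a', 'b' ]
console.log(applyScope(cells, { selector: 'td', nthOfType: 3 })); // [ 'c' ]
```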

2. A string query

When a string value is used as the query, that value is treated as the query's selector.

E.g., if the argument is 'h2', it is understood to mean { selector: 'h2' }.
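
A minimal sketch of that normalization, assuming a hypothetical `normalizeQuery` helper (the package's internal function name is not given here) and a trimmed-down `Query` type:

```typescript
// Reduced to the two keys needed for this illustration.
interface Query {
  selector: string;
  attributeName?: string;
}

// Hypothetical helper: wraps a bare string as `{ selector }` and
// passes query objects through unchanged.
function normalizeQuery(query: string | Query): Query {
  return typeof query === 'string' ? { selector: query } : query;
}

console.log(normalizeQuery('h2')); // { selector: 'h2' }
console.log(normalizeQuery({ selector: 'a', attributeName: 'href' }));
// { selector: 'a', attributeName: 'href' }
```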

Install

npm i @skypilot/scraper

License

MIT

Collaborators

  • williamthorsen