@letsscrapedata/scraper

You can use the free LetsScrapeData App if you want to scrape web data without programming.

For help or to discuss how to scrape a website, join the Discord server, where you can get a quick response. For better tracking, please submit issues on GitHub.

Features

  1. Template-driven web scraping
  • You can quickly design templates for scraping different websites.
  • The templates are intuitive and easy to maintain.
  2. Browser operations supported by the controller package
  • Same interface for playwright, puppeteer, and cheerio (more to come): easy to switch between them
  • Web browsing automation: goto (open) / click / input / hover / select / scroll
  • State data management: cookies, localStorage, HTTP headers, custom session data
  • Request and response interception management: data and HTTP headers
  • Element selection by CSS selectors or XPath, inside frames or not
  • Automatic file saving: screenshots, PDFs, MHTML, and downloads (direct or triggered by clicking)
  3. API requests
  • Browser and API can be used at the same time, sharing cookies and headers.
  4. Fingerprint management
  • Automatically generate fingerprints of the latest common browsers
  • Intercepted HTTP headers
  • Manual management
  5. Simple rate limits: automatic flow control, such as interval / concurrency / times per period
  6. Simple proxy management: multiple "static" proxies to increase concurrency
  7. Subtasks: complex tasks can be split into multiple simple subtasks for easier maintenance and increased concurrency
  8. Data export

Install

npm install @letsscrapedata/scraper

Examples

  1. Example with default ScraperConfig:
import { scraper, TemplateTasks } from "@letsscrapedata/scraper";

/**
 * tid: ID of the template to execute, such as the template that scrapes one list of the example page "https://www.letsscrapedata.com/pages/listexample1.html"
 * parasstrs: input parameters of the tasks, such as "1"
 * This example executes five tasks using template 2000007; each task scrapes the data of one page.
 */
const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1", "2", "3", "4", "5"] }];

// The following line does the same thing using subtasks, scraping the data of the first five pages:
// const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["5"] }];

await scraper(newTasks);
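
Note that top-level await, as written above, only works in an ES module context. In a CommonJS setup, wrap the call in an async entry point; a minimal sketch:

import { scraper, TemplateTasks } from "@letsscrapedata/scraper";

async function main(): Promise<void> {
  const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1"] }];
  await scraper(newTasks);
}

// report any scraping error instead of failing silently
main().catch(console.error);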
  2. Example with a custom ScraperConfig:
import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

const scraperConfig: ScraperConfig = {
  // browserControllerType: "playwright",
  // browserType: "chromium",
  browserConfigs: [
    // launch a chromium browser using puppeteer, without a proxy
    { browserControllerType: "puppeteer", proxyUrl: "" },
    // launch a firefox browser using playwright, behind a proxy
    { browserType: "firefox", proxyUrl: "http://proxyIp:port" },
    // connect to an already-running browser using playwright
    { browserUrl: "http://localhost:9222/" },
  ],
};

const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["9"] }];

await scraper(newTasks, scraperConfig);

ScraperConfig

Common configurations:

  • Proxies and browsers: browserConfigs; by default a browser is launched using browserControllerType/browserType, without a proxy
  • Default browser controller: browserControllerType, default "playwright"
  • Default browser: browserType, default "chromium"
  • File format of scraped data: dataFileFormat, default "tsv"
  • Template location: templateDir, default "" which means templates are fetched from the network
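
For instance, a minimal sketch that overrides these common options (the directory path is illustrative):

import { scraper, ScraperConfig, TemplateTasks } from "@letsscrapedata/scraper";

const config: ScraperConfig = {
  browserControllerType: "playwright", // controller for launched browsers
  browserType: "firefox",              // browser for launched browsers
  dataFileFormat: "tsv",               // file format of scraped data
  templateDir: "/absolute/path/to/templates/", // "" (default) fetches templates from the network
};

const tasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1"] }];
await scraper(tasks, config);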

Complete configurations:

export interface ScraperConfig {
  /**
   * @default true
   */
  exitWhenCompleted?: boolean;
  /**
   * whether to use the parasstr in the XML if a task's parasstr is ""
   * @default false
   */
  useParasstrInXmlIfNeeded?: boolean;
  ////////////////////////////////////////////////////////////////////////////    directory
  /**
   * @default "", which will use current directory of process + "/data/"
   * if not empty, baseDir must be an absolute path, and the directory must exist and have read and write permissions.
   */
  baseDir?: string;
  /**
   * @default "", which fetches templates from the network
   */
  templateDir?: string;
  /**
   * for security, filenames in action_setvar_get/get_file must include inputFileDirPart.
   * @default "LetsScrapeData"
   */
  inputFileDirPart?: string;
  ////////////////////////////////////////////////////////////////////////////    browser
  /**
   * whether to use puppeteer-extra-plugin-stealth
   * @default false
   */
  useStealthPlugin?: boolean;
  /**
   * default browserControllerType of BrowserConfig
   * @default "playwright"
   */
  browserControllerType?: BrowserControllerType;
  /**
   * default browserType of BrowserConfig
   * @default "chromium"
   */
  browserType?: LsdBrowserType;
  /**
   * @default {}
   */
  lsdLaunchOptions?: LsdLaunchOptions;
  /**
   * @default {browserUrl: ""}
   */
  lsdConnectOptions?: LsdConnectOptions;
  /**
   * A headless browser will be launched if browserConfigs is [].
   * @default []
   */
  browserConfigs?: BrowserConfig[];
  ////////////////////////////////////////////////////////////////////////////    template
  templateUrl?: string;
  /**
   * the default maximum number of concurrent tasks that can execute the same template in a browserContext
   * @default 1
   */
  maxConcurrency?: number;
  /**
   * @default ""
   */
  readCode?: string;
  /**
   * @default []
   */
  templateParas?: TemplatePara[];
  ////////////////////////////////////////////////////////////////////////////    scheduler
  /**
   * @default 10
   */
  totalMaxConcurrency?: number;
  /**
   * minimum number of milliseconds between two tasks of the same template
   * @default 2000
   */
  minMiliseconds?: number;
  ////////////////////////////////////////////////////////////////////////////    data
  /**
   * whether to move all dat_*.csv files into a new directory "yyyyMMddHHmmss"
   * @default false
   */
  moveDataWhenStart?: boolean;
  /**
   * @default "tsv"
   */
  dataFileFormat?: DataFileFormat;
  /**
   * @default "::"
   * valid only when dataFileFormat is "txt"
   */
  columnSeperator?: string;
}
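
For example, a hedged sketch combining the scheduler and data options above (all values are illustrative):

import { ScraperConfig } from "@letsscrapedata/scraper";

const config: ScraperConfig = {
  totalMaxConcurrency: 5,  // at most 5 concurrent tasks overall (default 10)
  maxConcurrency: 2,       // at most 2 concurrent tasks per template in a browserContext (default 1)
  minMiliseconds: 5000,    // wait at least 5000 ms between two tasks of the same template (default 2000)
  moveDataWhenStart: true, // move existing dat_*.csv files into a new "yyyyMMddHHmmss" directory
  dataFileFormat: "txt",   // plain-text output...
  columnSeperator: "::",   // ...with "::" between columns (only valid for "txt")
};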

/**
 * Only one of browserUrl and proxyUrl will take effect, and browserUrl has higher priority.
 */
export interface BrowserConfig {
  browserControllerType?: BrowserControllerType;
  /**
   * browserUrl can be used when you have logged in manually in advance.
   */
  browserUrl?: string;
  proxyUrl?: string;
  /**
   * valid only if !browserUrl
   */
  browserType?: LsdBrowserType;
}
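
A sketch of the priority rule (the URLs are placeholders, and this assumes BrowserConfig is exported from the package root):

import { BrowserConfig } from "@letsscrapedata/scraper";

const browserConfigs: BrowserConfig[] = [
  // browserUrl takes priority: connect to an already-running browser
  // (e.g. one you logged into manually); proxyUrl is ignored here.
  { browserUrl: "http://localhost:9222/", proxyUrl: "http://proxyIp:port" },
  // no browserUrl: launch a new firefox browser behind the proxy.
  { browserType: "firefox", proxyUrl: "http://proxyIp:port" },
];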
