For quick responses, please get help and discuss how to scrape a website on the Discord server. Please submit issues on GitHub for better tracking.
- Template-driven web scraping
- You can quickly design templates for scraping different websites.
- Templates are intuitive and easy to maintain.
- Browser operations supported by the controller package
- Same interface for Playwright, Puppeteer, and Cheerio (more to come): easy to switch between them
- Web browsing automation: goto(open) / click / input / hover / select / scroll
- State data management: cookies, localStorage, HTTP Headers, custom session data
- Request and response interception management: data and HTTP headers
- Element selection by CSS selectors or XPath, whether in frames or not
- Automatic file saving: e.g. screenshots, PDF, MHTML, downloaded directly or by clicking
- API requests
- Both browser and API can be used at the same time, and cookies/headers are shared
- Fingerprint management:
- Automatically generate fingerprints of the latest common browsers
- Intercepted HTTP headers
- Manual management
- Simple rate limits: automatic flow control, such as interval / concurrency / times per period
- Simple proxy management: multiple "static" proxies to increase concurrency
- Subtasks: complex tasks can be split into multiple simple subtasks for better maintenance and increased concurrency
- Data export
npm install @letsscrapedata/scraper
- Example with default ScraperConfig:
import { scraper, TemplateTasks } from "@letsscrapedata/scraper";
/**
 * tid: ID of the template to execute, e.g. the template for scraping one list of examples on the page "https://www.letsscrapedata.com/pages/listexample1.html"
 * parasstrs: input parameters of the tasks, such as "1"
 * This example executes five tasks using template 2000007; each task scrapes the data on one page.
 */
const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1", "2", "3", "4", "5"] }];
// The following line does the same thing using subtasks, scraping the data in the first five pages:
// const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["5"] }];
await scraper(newTasks);
- Example with a custom ScraperConfig:
import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";
const scraperConfig: ScraperConfig = {
  // browserControllerType: "playwright",
  // browserType: "chromium",
  browserConfigs: [
    // launch a chromium browser using puppeteer, no proxy
    { browserControllerType: "puppeteer", proxyUrl: "" },
    // launch a firefox browser using playwright
    { browserType: "firefox", proxyUrl: "http://proxyId:port" },
    // connect to the current browser using playwright
    { browserUrl: "http://localhost:9222/" },
  ],
};
const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["9"] }];
await scraper(newTasks, scraperConfig);
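Note: to connect to an already running browser via browserUrl (the last entry above), the browser has to expose a remote debugging endpoint; for Chromium this typically means starting it with --remote-debugging-port=9222. This is handy when you want to log in manually before scraping.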
Common configurations:
- Proxies and browsers: browserConfigs; by default, a browser is launched using browserControllerType/browserType, without a proxy
- Default browser controller to use: browserControllerType, default "playwright"
- Default browser to use: browserType, default "chromium"
- File format of scraped data: dataFileFormat, default "tsv"
- Where the templates are located: templateDir, default "" which means templates are fetched from the network (see the sketch below)
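For illustration, a minimal ScraperConfig using these common options might look like the following sketch (the values shown are the documented defaults, spelled out explicitly):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// A sketch with the common options spelled out (these values are the defaults):
const scraperConfig: ScraperConfig = {
  browserControllerType: "playwright", // default controller for launched browsers
  browserType: "chromium",             // default browser type for launched browsers
  dataFileFormat: "tsv",               // file format of the scraped data
  templateDir: "",                     // "" fetches templates from the network
};

const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1"] }];
await scraper(newTasks, scraperConfig);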
Complete configurations:
export interface ScraperConfig {
  /**
   * @default true
   */
  exitWhenCompleted?: boolean;
  /**
   * whether to use the parasstr in the XML when the parasstr of a task is ""
   * @default false
   */
  useParasstrInXmlIfNeeded?: boolean;
  //////////////////////////////////////////////////////////////////////////// directory
  /**
   * @default "", which uses the process's current directory + "/data/"
   * If not empty, baseDir must be an absolute path, and the directory must exist and have read and write permissions.
   */
  baseDir?: string;
  /**
   * @default "", which means templates are obtained from the network
   */
  templateDir?: string;
  /**
   * For security, the filename in action_setvar_get/get_file must include inputFileDirPart.
   * @default "LetsScrapeData"
   */
  inputFileDirPart?: string;
  //////////////////////////////////////////////////////////////////////////// browser
  /**
   * whether to use puppeteer-extra-plugin-stealth
   * @default false
   */
  useStealthPlugin?: boolean;
  /**
   * default browserControllerType of BrowserConfig
   * @default "playwright"
   */
  browserControllerType?: BrowserControllerType;
  /**
   * default browserType of BrowserConfig
   * @default "chromium"
   */
  browserType?: LsdBrowserType;
  /**
   * @default {}
   */
  lsdLaunchOptions?: LsdLaunchOptions;
  /**
   * @default {browserUrl: ""}
   */
  lsdConnectOptions?: LsdConnectOptions;
  /**
   * A headless browser will be launched if browserConfigs is [].
   * @default []
   */
  browserConfigs?: BrowserConfig[];
  //////////////////////////////////////////////////////////////////////////// template
  templateUrl?: string;
  /**
   * the default maximum number of concurrent tasks that can execute the same template in a browserContext
   * @default 1
   */
  maxConcurrency?: number;
  /**
   * @default ""
   */
  readCode?: string;
  /**
   * @default []
   */
  templateParas?: TemplatePara[];
  //////////////////////////////////////////////////////////////////////////// scheduler
  /**
   * the maximum number of concurrent tasks across all templates
   * @default 10
   */
  totalMaxConcurrency?: number;
  /**
   * minimum milliseconds between two tasks of the same template
   * @default 2000
   */
  minMiliseconds?: number;
  //////////////////////////////////////////////////////////////////////////// data
  /**
   * whether to move all dat_*.csv files into a new directory "yyyyMMddHHmmss"
   * @default false
   */
  moveDataWhenStart?: boolean;
  /**
   * @default "tsv"
   */
  dataFileFormat?: DataFileFormat;
  /**
   * valid only when dataFileFormat is "txt"
   * @default "::"
   */
  columnSeperator?: string;
}
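For instance, the scheduler fields above can throttle a run; here is a minimal sketch (the tid and the concrete values are illustrative):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// Illustrative throttling: at most 5 tasks in flight overall, at most 2
// concurrent tasks per template, and at least 3000 ms between two tasks
// of the same template.
const scraperConfig: ScraperConfig = {
  totalMaxConcurrency: 5,
  maxConcurrency: 2,
  minMiliseconds: 3000,
};

const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1", "2", "3"] }];
await scraper(newTasks, scraperConfig);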
/**
* Only one of browserUrl and proxyUrl will take effect, and browserUrl has higher priority.
*/
export interface BrowserConfig {
  browserControllerType?: BrowserControllerType;
  /**
   * browserUrl can be used when you have manually logged in in advance.
   */
  browserUrl?: string;
  proxyUrl?: string;
  /**
   * valid only if !browserUrl
   */
  browserType?: LsdBrowserType;
}
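As a sketch of the "multiple static proxies" feature from the feature list, each browserConfigs entry launches its own browser behind its own proxy, so tasks can be spread across proxies to increase concurrency (the proxy URLs below are placeholders):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// Placeholder proxy URLs: one browser is launched per entry, each behind a different proxy.
const scraperConfig: ScraperConfig = {
  browserConfigs: [
    { proxyUrl: "http://proxy1.example.com:8080" },
    { proxyUrl: "http://proxy2.example.com:8080" },
  ],
};

const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["9"] }];
await scraper(newTasks, scraperConfig);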