For quick responses, please get help and discuss how to scrape a website on the Discord server. Please submit issues on GitHub for better tracking.
- Template-driven web scraping
- You can quickly design templates for scraping different websites.
- Templates are intuitive and easy to maintain.
- Browser operations supported by the controller package
- Same interface for Playwright, Puppeteer, and Cheerio (more to come): easy to switch between them
- Web browsing automation: goto(open) / click / input / hover / select / scroll
- State data management: cookies, localStorage, HTTP Headers, custom session data
- Request and response interception management: data and HTTP headers
- Element selection by CSS selectors or XPath, whether in frames or not
- Automatic file saving: e.g. screenshots, PDF, MHTML, downloaded directly or by clicking
- API requests
- Both browser and API can be used at the same time, and cookies/headers are shared
- Fingerprint management:
- Automatically generate fingerprints of the latest common browsers
- Intercepted HTTP headers
- Manual management
- Simple rate limits: automatic flow control, such as interval / concurrency / times per period
- Simple proxy management: multiple "static" proxies to increase concurrency
- Subtasks: complex tasks can be split into multiple simple subtasks for better maintenance and increased concurrency
- Data export
npm install @letsscrapedata/scraper
- Example with default ScraperConfig:
import { scraper, TemplateTasks } from "@letsscrapedata/scraper";
/**
 * tid: ID of the template to execute, e.g. the template for scraping one list of examples on the page "https://www.letsscrapedata.com/pages/listexample1.html"
 * parasstrs: input parameters of the tasks, such as "1"
 * This example executes five tasks using template 2000007; each task scrapes the data on one page.
 */
const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1", "2", "3", "4", "5"] }];
// The following line does the same thing using subtasks, scraping the data in the first five pages:
// const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["5"] }];
await scraper(newTasks);
- Example with a custom ScraperConfig:
import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";
const scraperConfig: ScraperConfig = {
  // browserControllerType: "playwright",
  // browserType: "chromium",
  browserConfigs: [
    // launch a chromium browser using puppeteer, no proxy
    { browserControllerType: "puppeteer", proxyUrl: "" },
    // launch a firefox browser using playwright
    { browserType: "firefox", proxyUrl: "http://proxyId:port" },
    // connect to the current browser using playwright
    { browserUrl: "http://localhost:9222/" },
  ],
};
const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["9"] }];
await scraper(newTasks, scraperConfig);
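Note: to connect to an already running browser via browserUrl (the last entry above), the browser has to expose a remote debugging endpoint; for Chromium this typically means starting it with --remote-debugging-port=9222. This is handy when you want to log in manually before scraping.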
Common configurations:
- Proxies and browsers: browserConfigs; by default, a browser is launched using browserControllerType/browserType, without a proxy
- Default browser controller to use: browserControllerType, default "playwright"
- Default browser to use: browserType, default "chromium"
- File format of scraped data: dataFileFormat, default "tsv"
- Where the templates are located: templateDir, default "" which means templates are fetched from the network (see the sketch below)
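For illustration, a minimal ScraperConfig using these common options might look like the following sketch (the values shown are the documented defaults, spelled out explicitly):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// A sketch with the common options spelled out (these values are the defaults):
const scraperConfig: ScraperConfig = {
  browserControllerType: "playwright", // default controller for launched browsers
  browserType: "chromium",             // default browser type for launched browsers
  dataFileFormat: "tsv",               // file format of the scraped data
  templateDir: "",                     // "" fetches templates from the network
};

const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1"] }];
await scraper(newTasks, scraperConfig);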
Complete configurations:
export interface ScraperConfig {
  /**
   * @default true
   */
  exitWhenCompleted?: boolean;
  /**
   * whether to use the parasstr in the XML when the parasstr of a task is ""
   * @default false
   */
  useParasstrInXmlIfNeeded?: boolean;
  //////////////////////////////////////////////////////////////////////////// directory
  /**
   * @default "", which uses the process's current directory + "/data/"
   * If not empty, baseDir must be an absolute path, and the directory must exist and have read and write permissions.
   */
  baseDir?: string;
  /**
   * @default "", which means templates are obtained from the network
   */
  templateDir?: string;
  /**
   * For security, the filename in action_setvar_get/get_file must include inputFileDirPart.
   * @default "LetsScrapeData"
   */
  inputFileDirPart?: string;
  //////////////////////////////////////////////////////////////////////////// browser
  /**
   * whether to use puppeteer-extra-plugin-stealth
   * @default false
   */
  useStealthPlugin?: boolean;
  /**
   * default browserControllerType of BrowserConfig
   * @default "playwright"
   */
  browserControllerType?: BrowserControllerType;
  /**
   * default browserType of BrowserConfig
   * @default "chromium"
   */
  browserType?: LsdBrowserType;
  /**
   * @default {}
   */
  lsdLaunchOptions?: LsdLaunchOptions;
  /**
   * @default {browserUrl: ""}
   */
  lsdConnectOptions?: LsdConnectOptions;
  /**
   * A headless browser will be launched if browserConfigs is [].
   * @default []
   */
  browserConfigs?: BrowserConfig[];
  //////////////////////////////////////////////////////////////////////////// template
  templateUrl?: string;
  /**
   * the default maximum number of concurrent tasks that can execute the same template in a browserContext
   * @default 1
   */
  maxConcurrency?: number;
  /**
   * @default ""
   */
  readCode?: string;
  /**
   * @default []
   */
  templateParas?: TemplatePara[];
  //////////////////////////////////////////////////////////////////////////// scheduler
  /**
   * the maximum number of concurrent tasks across all templates
   * @default 10
   */
  totalMaxConcurrency?: number;
  /**
   * minimum milliseconds between two tasks of the same template
   * @default 2000
   */
  minMiliseconds?: number;
  //////////////////////////////////////////////////////////////////////////// data
  /**
   * whether to move all dat_*.csv files into a new directory "yyyyMMddHHmmss"
   * @default false
   */
  moveDataWhenStart?: boolean;
  /**
   * @default "tsv"
   */
  dataFileFormat?: DataFileFormat;
  /**
   * valid only when dataFileFormat is "txt"
   * @default "::"
   */
  columnSeperator?: string;
}
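For instance, the scheduler fields above can throttle a run; here is a minimal sketch (the tid and the concrete values are illustrative):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// Illustrative throttling: at most 5 tasks in flight overall, at most 2
// concurrent tasks per template, and at least 3000 ms between two tasks
// of the same template.
const scraperConfig: ScraperConfig = {
  totalMaxConcurrency: 5,
  maxConcurrency: 2,
  minMiliseconds: 3000,
};

const newTasks: TemplateTasks[] = [{ tid: 2000007, parasstrs: ["1", "2", "3"] }];
await scraper(newTasks, scraperConfig);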
/**
* Only one of browserUrl and proxyUrl will take effect, and browserUrl has higher priority.
*/
export interface BrowserConfig {
  browserControllerType?: BrowserControllerType;
  /**
   * browserUrl can be used when you have manually logged in in advance.
   */
  browserUrl?: string;
  proxyUrl?: string;
  /**
   * valid only if !browserUrl
   */
  browserType?: LsdBrowserType;
}
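As a sketch of the "multiple static proxies" feature from the feature list, each browserConfigs entry launches its own browser behind its own proxy, so tasks can be spread across proxies to increase concurrency (the proxy URLs below are placeholders):

import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

// Placeholder proxy URLs: one browser is launched per entry, each behind a different proxy.
const scraperConfig: ScraperConfig = {
  browserConfigs: [
    { proxyUrl: "http://proxy1.example.com:8080" },
    { proxyUrl: "http://proxy2.example.com:8080" },
  ],
};

const newTasks: TemplateTasks[] = [{ tid: 2000008, parasstrs: ["9"] }];
await scraper(newTasks, scraperConfig);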