An extremely simple web crawler, based on puppeteer.
Puppeteer requires Chromium, whose size is typically the limiting factor when trying to run it in a Lambda. This can be overcome with a Lambda Layer, specifically this community-maintained layer.
You can include this layer directly in your Serverless (SLS) framework file or SAM template.
An SLS framework example:
```yaml
functions:
  eventReceiver:
    handler: dist/index.receiver
    layers:
      - "arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31"
```
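For SAM, the equivalent configuration is a sketch along these lines (the resource name `EventReceiver` is illustrative; the `Layers` property is the part that matters):

```yaml
Resources:
  EventReceiver:
    Type: AWS::Serverless::Function
    Properties:
      Handler: dist/index.receiver
      Layers:
        - "arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31"
```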
In your Lambda source:
```typescript
import { Browser, LaunchOptions, BrowserConnectOptions, BrowserLaunchArgumentOptions } from "puppeteer";
import { Arachne, ArachnePage, ArachneRequest, MemoryRequestQueue } from "@xapp/arachne";

// Other imports and code

// The important part...
// Declared as possibly undefined so the code type-checks when the
// layer is missing and the catch branch leaves it unassigned.
let browser: Pick<Browser, "close" | "newPage"> | undefined;
// The try/catch lets you still run this locally if you want, assuming you
// have Chromium installed on your machine.
try {
    log().debug("Looking for chrome-aws-lambda");
    // eslint-disable-next-line @typescript-eslint/no-var-requires
    const chromium = require("@sparticuz/chrome-aws-lambda");
    browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
    });
} catch (e) {
    log().debug("Could not find chrome-aws-lambda layer");
    console.error(e);
}

const queue = new MemoryRequestQueue();
// The default launch timeout of 30 seconds is too long; 5 seconds is plenty.
const launchOptions: LaunchOptions & BrowserConnectOptions & BrowserLaunchArgumentOptions = { timeout: 5000 };

const crawler = Arachne.crawler({
    stealth: true,
    launchOptions,
    queue,
    browser,
    pageHandler: async (page: ArachnePage, request: ArachneRequest) => {
        // ... handle page load
    }
});
```
- Google Chrome for AWS Lambda as a layer
- Serverless Browser Automation with AWS Lambda and Puppeteer
  - NOTE! The source code linked to this article uses `require("chrome-aws-lambda")`, but this is WRONG if you use the layer directly. You need to use `require('@sparticuz/chrome-aws-lambda')`.