An extremely simple web crawler, based on puppeteer.
Puppeteer requires Chromium, whose size is typically the limiting factor when trying to run it in a Lambda. This can be overcome with a Lambda Layer, specifically this community-maintained layer.
You can include this layer directly in your Serverless (SLS) framework file or SAM template.
An SLS framework example:
```yaml
functions:
  eventReceiver:
    handler: dist/index.receiver
    layers:
      - "arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31"
```
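For SAM, the equivalent configuration is a sketch along these lines (the resource name `EventReceiver` is illustrative; the `Layers` property is the part that matters):

```yaml
Resources:
  EventReceiver:
    Type: AWS::Serverless::Function
    Properties:
      Handler: dist/index.receiver
      Layers:
        - "arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31"
```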
In your Lambda source:
```typescript
import { Browser, LaunchOptions, BrowserConnectOptions, BrowserLaunchArgumentOptions } from "puppeteer";
import { Arachne, ArachnePage, ArachneRequest, MemoryRequestQueue } from "@xapp/arachne";

// Other imports and code

// The important part...
// Declared as possibly undefined so the code type-checks when the
// layer is missing and the catch branch leaves it unassigned.
let browser: Pick<Browser, "close" | "newPage"> | undefined;
// The try/catch lets you still run this locally if you want, assuming you
// have Chromium installed on your machine.
try {
    log().debug("Looking for chrome-aws-lambda");
    // eslint-disable-next-line @typescript-eslint/no-var-requires
    const chromium = require("@sparticuz/chrome-aws-lambda");
    browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
    });
} catch (e) {
    log().debug("Could not find chrome-aws-lambda layer");
    console.error(e);
}

const queue = new MemoryRequestQueue();
// The default launch timeout of 30 seconds is too long; 5 seconds is plenty.
const launchOptions: LaunchOptions & BrowserConnectOptions & BrowserLaunchArgumentOptions = { timeout: 5000 };

const crawler = Arachne.crawler({
    stealth: true,
    launchOptions,
    queue,
    browser,
    pageHandler: async (page: ArachnePage, request: ArachneRequest) => {
        // ... handle page load
    }
});
```
- Google Chrome for AWS Lambda as a layer
- Serverless Browser Automation with AWS Lambda and Puppeteer
  - NOTE! The source code linked to this article uses `require("chrome-aws-lambda")`, but this is WRONG if you use the layer directly. You need to use `require('@sparticuz/chrome-aws-lambda')`.