
Enterprise AI Recursive Web Scraper

Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction


✨ Features

  • 🚀 High Performance:
    • Blazing fast multi-threaded scraping with concurrent processing
    • Smart rate limiting to prevent API throttling and server overload
    • Automatic request queuing and retry mechanisms
  • 🤖 AI-Powered: Intelligent content extraction using Groq LLMs
  • 🌐 Multi-Browser: Support for Chromium, Firefox, and WebKit
  • 📊 Smart Extraction:
    • Structured data extraction without LLMs using CSS selectors
    • Topic-based and semantic chunking strategies
    • Cosine similarity clustering for content deduplication (see the sketch after this list)
  • 🎯 Advanced Capabilities:
    • Recursive crawling that respects domain boundaries
    • Intelligent rate limiting with token bucket algorithm
    • Session management for complex multi-page flows
    • Custom JavaScript execution support
    • Enhanced screenshot capture with lazy-load detection
    • iframe content extraction
  • 🔒 Enterprise Ready:
    • Proxy support with authentication
    • Custom headers and user-agent configuration
    • Comprehensive error handling and retry mechanisms
    • Flexible timeout and rate limit management
    • Detailed logging and monitoring
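
As a sketch of the cosine-similarity deduplication mentioned above (an illustration of the technique, not the package's internal code), two content chunks can be compared by the cosine of their embedding vectors and dropped when they are nearly identical:

// Illustration only: embeddings are assumed to come from some upstream model.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep the index of each chunk that is not too similar to an already-kept one.
function deduplicate(embeddings: number[][], threshold = 0.9): number[] {
    const kept: number[] = [];
    for (let i = 0; i < embeddings.length; i++) {
        const isDuplicate = kept.some(
            (j) => cosineSimilarity(embeddings[i], embeddings[j]) >= threshold
        );
        if (!isDuplicate) kept.push(i);
    }
    return kept;
}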

🚀 Quick Start

To install the package, run:

npm install enterprise-ai-recursive-web-scraper

Using the CLI

The enterprise-ai-recursive-web-scraper package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.

Installation

Ensure that the package is installed globally to use the CLI:

npm install -g enterprise-ai-recursive-web-scraper

Running the CLI

Once installed, you can use the web-scraper command to start scraping. Here’s a basic example of how to use it:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output

CLI Options

  • -k, --api-key <key>: (Required) Your Google Gemini API key
  • -u, --url <url>: (Required) The URL of the website to scrape
  • -o, --output <directory>: Output directory for scraped data (default: scraping_output)
  • -d, --depth <number>: Maximum crawl depth (default: 3)
  • -c, --concurrency <number>: Concurrent scraping limit (default: 5)
  • -r, --rate-limit <number>: Requests per second (default: 5)
  • -t, --timeout <number>: Request timeout in milliseconds (default: 30000)
  • -f, --format <type>: Output format: json|csv|markdown (default: json)
  • -v, --verbose: Enable verbose logging
  • --retry-attempts <number>: Number of retry attempts (default: 3)
  • --retry-delay <number>: Delay between retries in ms (default: 1000)

Example usage with rate limiting:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose

🔧 Advanced Usage

Rate Limiting Configuration

Configure rate limiting to respect server limits and prevent throttling:

import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    rateLimiter: new RateLimiter({
        maxTokens: 5,      // Maximum number of tokens
        refillRate: 1,     // Tokens refilled per second
        retryAttempts: 3,  // Number of retry attempts
        retryDelay: 1000   // Delay between retries (ms)
    })
});
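
With this configuration, the limiter behaves as a token bucket: up to maxTokens requests can burst immediately, after which throughput settles at refillRate requests per second. A minimal standalone sketch of the algorithm (illustrative only, not the package's source):

class TokenBucket {
    private tokens: number;
    private lastRefill = Date.now();

    constructor(private maxTokens: number, private refillRate: number) {
        this.tokens = maxTokens;
    }

    // Returns true if a request may proceed now; false means wait and retry.
    tryRemoveToken(): boolean {
        const now = Date.now();
        // Refill in proportion to elapsed time, capped at the bucket size.
        this.tokens = Math.min(
            this.maxTokens,
            this.tokens + ((now - this.lastRefill) / 1000) * this.refillRate
        );
        this.lastRefill = now;
        if (this.tokens >= 1) {
            this.tokens -= 1;
            return true;
        }
        return false;
    }
}

With maxTokens: 5 and refillRate: 1, the first five requests pass immediately and the sixth waits roughly one second.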

Structured Data Extraction

To extract structured data using a JSON schema, you can use the JsonExtractionStrategy:

import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
    baseSelector: "article",
    fields: [
        { name: "title", selector: "h1" },
        { name: "content", selector: ".content" },
        { name: "date", selector: "time", attribute: "datetime" }
    ]
};

const scraper = new WebScraper({
    extractionStrategy: new JsonExtractionStrategy(schema)
});
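
Given this schema, each element matching baseSelector yields one record whose field names come from the schema. The exact result envelope depends on the package, but each record plausibly has this shape:

// Assumed shape of one extracted record for the schema above (illustrative).
interface ArticleRecord {
    title: string;   // text content of the h1 inside each article
    content: string; // text content of the .content element
    date: string;    // value of the datetime attribute on the time element
}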

Custom Browser Session

You can customize the browser session with specific configurations:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    browserConfig: {
        headless: false,
        proxy: "http://proxy.example.com",
        userAgent: "Custom User Agent"
    }
});
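
A hypothetical end-to-end run might then look like this. The scrapeWebsite method name, the outputDir option, and the result shape are assumptions for illustration; check the package's bundled type declarations for the actual API:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    outputDir: "./output", // assumed option, mirroring the CLI's --output flag
    verbose: true          // assumed option, mirroring the CLI's --verbose flag
});

// scrapeWebsite is a hypothetical method name used for illustration.
const results = await scraper.scrapeWebsite("https://example.com");
console.log(results);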

🤝 Contributors

Mike Odnis (💻 code, 🖋 content, 🤔 ideas, 🚇 infrastructure)

📄 License

MIT © Mike Odnis

💙 Built with create-typescript-app
