html-text-extractor

An HTML parsing library for Node.js, designed to extract text sections associated with anchor tags and headings from HTML files in a directory and its subdirectories. The extracted text is structured for indexing in a full-text search engine. The library produces an array of sections, each with properties for the URL (based on the file path), the anchor (if present), the title (based on the following heading tag), and the text.
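As a rough illustration of the section structure described above, the sketch below maps sections to documents that a search engine could ingest. The property names (`url`, `anchor`, `title`, `text`), the sample data, and the `toSearchDocuments` helper are assumptions made for this example and are not confirmed by the library.

```javascript
// Hypothetical sample of the extracted sections; the property names are
// assumed from the description above, not taken from the library's types.
const sampleSections = [
  { url: '/guide/index.html', anchor: 'install', title: 'Installation', text: 'Run the installer.' },
  { url: '/guide/index.html', anchor: undefined, title: 'Guide', text: 'Welcome to the guide.' },
]

// Hypothetical helper: turn each section into a flat search document.
// The link includes the anchor as a fragment only when one is present.
function toSearchDocuments(sections) {
  return sections.map((section, id) => ({
    id,
    href: section.anchor ? `${section.url}#${section.anchor}` : section.url,
    title: section.title,
    body: section.text,
  }))
}

console.log(toSearchDocuments(sampleSections))
```

Keeping the anchor separate from the URL until this step lets a search result link either to the whole page or directly to the matching section.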

Features

✅ Extracts text from HTML files in a folder (and its subfolders)
✅ Available as a simple API
✅ Nano-sized: just 624 bytes (ESM, gzipped)
✅ Tree-shakable and side-effect free
✅ First-class TypeScript support
✅ 100% unit test coverage

Example usage (API, as a library)

Setup

yarn: yarn add html-text-extractor
npm: npm install html-text-extractor

ESM

import { extract } from 'html-text-extractor'

const result = await extract('./dist')

CommonJS

const { extract } = require('html-text-extractor')

// same API as the ESM variant
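
A possible next step after extraction is to feed the result into a search index. The sketch below builds a tiny in-memory word index. It is only an illustration: `indexDirectory` is a hypothetical helper, `extractFn` stands in for the library's `extract` function so that the snippet runs on its own, and the `url` and `text` property names are assumptions.

```javascript
// Hypothetical helper: call the extractor, then map each lowercased word
// to the set of URLs whose section text contains it.
// `extractFn` is expected to behave like `extract` (directory in, promise
// of an array of sections out); `url` and `text` are assumed property names.
async function indexDirectory(extractFn, dir) {
  const sections = await extractFn(dir)
  const index = new Map()
  for (const section of sections) {
    // Split on runs of non-word characters and drop empty strings.
    const words = section.text.toLowerCase().split(/\W+/).filter(Boolean)
    for (const word of words) {
      if (!index.has(word)) index.set(word, new Set())
      index.get(word).add(section.url)
    }
  }
  return index
}
```

With the real package installed, this could be called as `indexDirectory(extract, './dist')`. A production setup would more likely hand the sections to a dedicated full-text search engine instead of a hand-rolled map.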
