mdream
TypeScript icon, indicating that this package has built-in type declarations

0.9.1 • Public • Published

mdream

npm version npm downloads license

Ultra-performant HTML to Markdown Convertor Optimized for LLMs. Generate llms.txt artifacts using CLI, GitHub Actions, Vite Plugin and more.

mdream logo

Made possible by my Sponsor Program 💖
Follow me @harlan_zw 🐦 • Join Discord for help

Features

  • 🧠 Optimized HTML To Markdown Conversion (~50% fewer tokens with Minimal preset)
  • 🔍 Generates GitHub Flavored Markdown: Frontmatter, Nested & HTML markup support.
  • 🚀 Fast: Stream 1.4MB of HTML to markdown in ~50ms.
  • ⚡ Tiny: 5kB gzip, zero dependency core.
  • ⚙️ Run anywhere: CLI, edge workers, browsers, Node, etc.
  • 🔌 Extensible: Plugin system for customizing and extending functionality.

What is Mdream?

Traditional HTML to Markdown converters were not built for LLMs or humans. They tend to be slow and bloated and produce output that's poorly suited for LLMs token usage or for human readability.

Other LLM specific convertors focus on supporting all document formats, resulting in larger bundles and lower quality Markdown output.

Mdream produces high-quality Markdown for LLMs efficiently with no core dependencies. It includes a plugin system to customize the conversion process, allowing you to parse, extract, transform, and filter as needed.

Installation

pnpm add mdream

CLI Usage

Mdream provides a CLI designed to work exclusively with Unix pipes, providing flexibility and freedom to integrate with other tools.

Pipe Site to Markdown

Fetches the Markdown Wikipedia page and converts it to Markdown preserving the original links and images.

curl -s https://en.wikipedia.org/wiki/Markdown \
 | npx mdream --origin https://en.wikipedia.org --preset minimal \
  | tee streaming.md

Tip: The --origin flag will fix relative image and link paths

Local File to Markdown

Converts a local HTML file to a Markdown file, using tee to write the output to a file and display it in the terminal.

cat index.html \
 | npx mdream --preset minimal \
  | tee streaming.md

CLI Options

  • --origin <url>: Base URL for resolving relative links and images
  • --preset <preset>: Conversion presets: minimal
  • --help: Display help information
  • --version: Display version information

API Usage

Mdream provides two main functions for working with HTML:

  • htmlToMarkdown: Useful if you already have the entire HTML payload you want to convert.
  • streamHtmlToMarkdown: Best practice if you are fetching or reading from a local file.

Convert existing HTML

import { htmlToMarkdown } from 'mdream'

// Simple conversion
const markdown = htmlToMarkdown('<h1>Hello World</h1>')
console.log(markdown) // # Hello World

Convert from Fetch

import { streamHtmlToMarkdown } from 'mdream'

// Using fetch with streaming
const response = await fetch('https://example.com')
const htmlStream = response.body
const markdownGenerator = streamHtmlToMarkdown(htmlStream, {
  origin: 'https://example.com'
})

// Process chunks as they arrive
for await (const chunk of markdownGenerator) {
  console.log(chunk)
}

Pure HTML Parser

If you only need to parse HTML into a DOM-like AST without converting to Markdown, use parseHtml:

import { parseHtml } from 'mdream'

const html = '<div><h1>Title</h1><p>Content</p></div>'
const { events, remainingHtml } = parseHtml(html)

// Process the parsed events
events.forEach((event) => {
  if (event.type === 'enter' && event.node.type === 'element') {
    console.log('Entering element:', event.node.tagName)
  }
})

The parseHtml function provides:

  • Pure AST parsing - No markdown generation overhead
  • DOM events - Enter/exit events for each element and text node
  • Plugin support - Can apply plugins during parsing
  • Streaming compatible - Works with the same plugin system

Presets

Presets are pre-configured combinations of plugins for common use cases.

Minimal Preset

The minimal preset optimizes for token reduction and cleaner output by removing non-essential content:

import { withMinimalPreset } from 'mdream/preset/minimal'

const options = withMinimalPreset({
  origin: 'https://example.com'
})

Plugins included:

  • isolateMainPlugin() - Extracts main content area
  • frontmatterPlugin() - Generates YAML frontmatter from meta tags
  • tailwindPlugin() - Converts Tailwind classes to Markdown
  • filterPlugin() - Excludes forms, navigation, buttons, footers, and other non-content elements

CLI Usage:

curl -s https://example.com | npx mdream --preset minimal --origin https://example.com

Plugin System

The plugin system allows you to customize HTML to Markdown conversion by hooking into the processing pipeline. Plugins can filter content, extract data, transform nodes, or add custom behavior.

Built-in Plugins

Mdream includes several built-in plugins that can be used individually or combined:

  • extractionPlugin: Extract specific elements using CSS selectors for data analysis
  • filterPlugin: Include or exclude elements based on CSS selectors or tag IDs
  • frontmatterPlugin: Generate YAML frontmatter from HTML head elements (title, meta tags)
  • isolateMainPlugin: Isolate main content using <main> elements or header-to-footer boundaries
  • tailwindPlugin: Convert Tailwind CSS classes to Markdown formatting (bold, italic, etc.)
  • readabilityPlugin: Content scoring and extraction (experimental)
import { filterPlugin, frontmatterPlugin, isolateMainPlugin } from 'mdream/plugins'

const markdown = htmlToMarkdown(html, {
  plugins: [
    isolateMainPlugin(),
    frontmatterPlugin(),
    filterPlugin({ exclude: ['nav', '.sidebar', '#footer'] })
  ]
})

Plugin Hooks

  • beforeNodeProcess: Called before any node processing, can skip nodes
  • onNodeEnter: Called when entering an element node
  • onNodeExit: Called when exiting an element node
  • processTextNode: Called for each text node
  • processAttributes: Called to process element attributes

Creating a Plugin

Use createPlugin() to create a plugin with type safety:

import type { ElementNode, TextNode } from 'mdream'
import { htmlToMarkdown } from 'mdream'
import { createPlugin } from 'mdream/plugins'

const myPlugin = createPlugin({
  onNodeEnter(node: ElementNode) {
    if (node.name === 'h1') {
      return '🔥 '
    }
  },

  processTextNode(textNode: TextNode) {
    // Transform text content
    if (textNode.parent?.attributes?.id === 'highlight') {
      return {
        content: `**${textNode.value}**`,
        skip: false
      }
    }
  }
})

// Use the plugin
const html: string = '<div id="highlight">Important text</div>'
const markdown: string = htmlToMarkdown(html, { plugins: [myPlugin] })

Example: Content Filter Plugin

import type { ElementNode, NodeEvent } from 'mdream'
import { ELEMENT_NODE } from 'mdream'
import { createPlugin } from 'mdream/plugins'

const adBlockPlugin = createPlugin({
  beforeNodeProcess(event: NodeEvent) {
    const { node } = event

    if (node.type === ELEMENT_NODE && node.name === 'div') {
      const element = node as ElementNode
      // Skip ads and promotional content
      if (element.attributes?.class?.includes('ad')
        || element.attributes?.id?.includes('promo')) {
        return { skip: true }
      }
    }
  }
})

Extraction Plugin

Extract specific elements and their content during HTML processing for data analysis or content discovery:

import { extractionPlugin, htmlToMarkdown } from 'mdream'

const html: string = `
  <article>
    <h2>Getting Started</h2>
    <p>This is a tutorial about web scraping.</p>
    <img src="/hero.jpg" alt="Hero image" />
  </article>
`

// Extract elements using CSS selectors
const plugin = extractionPlugin({
  'h2': (element: ExtractedElement, state: MdreamRuntimeState) => {
    console.log('Heading:', element.textContent) // "Getting Started"
    console.log('Depth:', state.depth) // Current nesting depth
  },
  'img[alt]': (element: ExtractedElement, state: MdreamRuntimeState) => {
    console.log('Image:', element.attributes.src, element.attributes.alt)
    // "Image: /hero.jpg Hero image"
    console.log('Context:', state.options) // Access to conversion options
  }
})

htmlToMarkdown(html, { plugins: [plugin] })

The extraction plugin provides memory-efficient element extraction with full text content and attributes, perfect for SEO analysis, content discovery, and data mining.

Credits

License

Licensed under the MIT license.

Readme

Keywords

none

Package Sidebar

Install

npm i mdream

Weekly Downloads

42

Version

0.9.1

License

MIT

Unpacked Size

122 kB

Total Files

18

Last publish

Collaborators

  • harlan_zw