llm-text-splitter
0.1.0 • Public • Published

A lightweight TypeScript library for splitting text into chunks based on various criteria, ideal for preparing text for Large Language Models (LLMs) and for RAG applications.

Features

  • Multiple Splitting Methods: Split text by sentences, paragraphs, markdown headings, or custom regular expressions.
  • Size Control: Define minimum and maximum chunk lengths.
  • Overlap: Introduce overlapping text between chunks to maintain context.
  • Extra Space Removal: Optionally remove extra whitespace within chunks.
  • Robust API: A class-based API for better configuration and error handling.
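To make the size and overlap options concrete, here is a minimal sketch of character-based chunking with overlap. It is not the library's implementation: `chunkWithOverlap` is a hypothetical helper, and it cuts at fixed character offsets rather than at sentence or paragraph boundaries.

```typescript
// Hypothetical helper, not part of llm-text-splitter: fixed-size character
// windows where each chunk repeats the last `overlap` characters of the
// previous chunk.
function chunkWithOverlap(text: string, maxLength: number, overlap: number): string[] {
    if (maxLength <= 0) {
        throw new Error('maxLength must be positive');
    }
    if (overlap < 0 || overlap >= maxLength) {
        throw new Error('overlap must be in the range [0, maxLength)');
    }
    const step = maxLength - overlap; // how far each window advances
    const chunks: string[] = [];
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + maxLength));
        if (start + maxLength >= text.length) {
            break; // this window already reached the end of the text
        }
    }
    return chunks;
}

const overlapDemo = chunkWithOverlap('abcdefghij', 4, 1);
console.log(overlapDemo); // ["abcd", "defg", "ghij"]
```

Each chunk after the first starts with the last character of the previous one, which is the kind of shared context the overlap option is meant to preserve.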

Installation

npm install llm-text-splitter

Usage

Creating a Splitter Instance

To use the Splitter, first create an instance with your desired options:

import { Splitter } from 'llm-text-splitter';

const splitter = new Splitter({
    maxLength: 30,
    overlap: 5,
    splitter: 'sentence',
});

Splitting Text by Sentences with Overlap

const text =
    'This is the first sentence. This is the second, slightly longer sentence. And a final short one.';
const chunks = splitter.split(text);

console.log(chunks);
// Expected output (may vary slightly depending on overlap handling):
// [
//   "This is the first sentence.",
//   "sentence.  This is the",
//   "the second, slightly longer",
//   "longer  sentence.  And a",
//   "And a final short one."
// ]

Splitting Text by Paragraphs

const text2 =
    'This is the first paragraph.\n\nThis is the second paragraph, which is a bit longer.\n\nShort paragraph.';
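// Per-call options override the instance settings. maxLength and overlap are
// reset here because the expected chunks are longer than the 30 set above.
const chunks2 = splitter.split(text2, { splitter: 'paragraph', maxLength: 5000, overlap: 0 });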
console.log(chunks2);
// Expected output:
// [
//   "This is the first paragraph.",
//   "This is the second paragraph, which is a bit longer.",
//   "Short paragraph."
// ]

Splitting Text by Markdown Headings

const text3 =
    '# Heading 1\nThis is some text under heading 1.\n\n## Heading 2\nMore text here.\n\n### Heading 3\nAnd even more text.';
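// maxLength and overlap are reset so each heading section fits in one chunk.
const chunks3 = splitter.split(text3, { splitter: 'markdown', maxLength: 5000, overlap: 0 });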
console.log(chunks3);
// Expected output:
// [
//   "# Heading 1\nThis is some text under heading 1.",
//   "## Heading 2\nMore text here.",
//   "### Heading 3\nAnd even more text."
// ]
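For intuition about what heading-based splitting involves, the following sketch uses plain `String.prototype.split` with a lookahead so each heading stays attached to its section. `splitAtHeadings` is a hypothetical helper written for illustration; the library's own markdown handling may differ.

```typescript
// Hypothetical helper, not part of llm-text-splitter: split before each
// ATX-style heading (one to six '#' characters followed by a space).
function splitAtHeadings(markdown: string): string[] {
    return markdown
        .split(/\n+(?=#{1,6} )/) // consume the newlines, keep the heading itself
        .map((section) => section.trim())
        .filter((section) => section.length > 0);
}

const headingDemo = splitAtHeadings(
    '# Heading 1\nThis is some text under heading 1.\n\n## Heading 2\nMore text here.\n\n### Heading 3\nAnd even more text.'
);
console.log(headingDemo);
// [
//   "# Heading 1\nThis is some text under heading 1.",
//   "## Heading 2\nMore text here.",
//   "### Heading 3\nAnd even more text."
// ]
```

This simple pattern would also split on lines starting with `#` inside fenced code blocks, so real markdown input may need a proper parser.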

Custom Regex Splitter

const text4 = 'Item 1; Item 2; Item 3; Item 4';
// overlap is reset to 0 so the chunks do not repeat text from their neighbors.
const chunks4 = splitter.split(text4, { regex: /[;]/, overlap: 0 });
console.log(chunks4);
// Expected output:
// ["Item 1", " Item 2", " Item 3", " Item 4"]
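Note that the expected output above keeps the space that follows each semicolon. If that is unwanted, two common approaches are trimming each piece afterwards or letting the pattern absorb the whitespace. The sketch below shows both with plain `String.prototype.split`; whether passing `/;\s*/` as the library's `regex` option behaves the same way is an assumption that has not been checked here.

```typescript
const itemsText = 'Item 1; Item 2; Item 3; Item 4';

// Option 1: split on the delimiter, then trim each piece.
const trimmedItems = itemsText.split(/;/).map((part) => part.trim());

// Option 2: let the pattern consume any whitespace after the delimiter.
const patternItems = itemsText.split(/;\s*/);

console.log(trimmedItems); // ["Item 1", "Item 2", "Item 3", "Item 4"]
console.log(patternItems); // ["Item 1", "Item 2", "Item 3", "Item 4"]
```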

Removing Extra Spaces

const text5 =
    'This  is   a   string   with   extra    spaces.    This  is   another   string   with   extra    spaces.';
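// maxLength and overlap are reset so each sentence fits in one chunk.
const chunks5 = splitter.split(text5, { removeExtraSpaces: true, maxLength: 5000, overlap: 0 });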
console.log(chunks5);
// Expected output:
// [
//   "This is a string with extra spaces.",
//   "This is another string with extra spaces."
// ]
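The effect on a single chunk can be approximated with a one-line regular expression. This is a sketch of the idea, not the library's code: `collapseSpaces` is a hypothetical helper, and it assumes "extra spaces" means any run of whitespace, including tabs and newlines.

```typescript
// Hypothetical helper, not part of llm-text-splitter: collapse every run of
// whitespace to a single space and strip leading and trailing whitespace.
function collapseSpaces(chunk: string): string {
    return chunk.replace(/\s+/g, ' ').trim();
}

console.log(collapseSpaces('This  is   a   string   with   extra    spaces.'));
// "This is a string with extra spaces."
```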

API

Splitter

The main class for splitting text.

Constructor

constructor(options: SplitOptions = {})
  • options: An optional object with the following properties:
    • minLength: The minimum length of a chunk (default: 0).
    • maxLength: The maximum length of a chunk (default: 5000).
    • overlap: The number of overlapping characters between chunks (default: 0).
    • splitter: The splitting method ('sentence', 'paragraph', 'markdown') (default: 'sentence').
    • regex: A custom regular expression for splitting. Overrides splitter if provided.
    • removeExtraSpaces: Whether to remove extra spaces within chunks (default: false).

Method

split(text: string, options: SplitOptions = {}): string[]
  • text: The input text string.
  • options: An optional object to override the default options for this specific split.
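One way to picture how constructor options and per-call options combine is a shallow merge in which later sources win. The sketch below is an assumption about the precedence, based only on the descriptions above; `SketchSplitOptions`, `documentedDefaults`, and `resolveOptions` are illustrative names, not exports of the package.

```typescript
type SplitMethod = 'sentence' | 'paragraph' | 'markdown';

// Illustrative type built from the option list above, not the package's own SplitOptions.
interface SketchSplitOptions {
    minLength?: number;
    maxLength?: number;
    overlap?: number;
    splitter?: SplitMethod;
    regex?: RegExp;
    removeExtraSpaces?: boolean;
}

// Defaults as documented in the constructor section.
const documentedDefaults = {
    minLength: 0,
    maxLength: 5000,
    overlap: 0,
    splitter: 'sentence' as SplitMethod,
    removeExtraSpaces: false,
};

// Assumed precedence: defaults, then constructor options, then per-call options.
function resolveOptions(instance: SketchSplitOptions, perCall: SketchSplitOptions = {}) {
    return { ...documentedDefaults, ...instance, ...perCall };
}

const resolvedOptions = resolveOptions({ maxLength: 30, overlap: 5 }, { splitter: 'paragraph' });
console.log(resolvedOptions.maxLength, resolvedOptions.overlap, resolvedOptions.splitter);
// 30 5 paragraph
```

Under this reading, a per-call override changes only the keys it names, so settings such as `maxLength` from the constructor still apply unless they are overridden too.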

Return Value

An array of strings, where each string is a chunk of the input text.

Error Handling

The Splitter constructor throws an error if minLength is greater than maxLength.
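Callers that build options from user input may want to guard against this case. The following sketch mirrors the documented rule with a stand-in check and shows the usual `try`/`catch` pattern. The function names, the error type (`RangeError`), and the message are assumptions; the library's actual error may differ.

```typescript
// Stand-in for the documented rule, not the library's own check.
function assertValidLengths(minLength: number, maxLength: number): void {
    if (minLength > maxLength) {
        throw new RangeError(
            `minLength (${minLength}) must not exceed maxLength (${maxLength})`
        );
    }
}

// Returns 'ok' when the lengths are valid, otherwise the error message.
function describeLengthCheck(minLength: number, maxLength: number): string {
    try {
        assertValidLengths(minLength, maxLength);
        return 'ok';
    } catch (err) {
        return err instanceof Error ? err.message : String(err);
    }
}

console.log(describeLengthCheck(0, 5000)); // "ok"
console.log(describeLengthCheck(10, 5)); // "minLength (10) must not exceed maxLength (5)"
```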

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT

Collaborators

  • hfranc