
🍱 semantic-chunking

NPM package for semantically creating chunks from large texts. Useful for workflows involving large language models (LLMs).

Features

  • Semantic chunking based on sentence similarity
  • Dynamic similarity thresholds
  • Configurable chunk sizes
  • Multiple embedding model options
  • Quantized model support
  • Chunk prefix support for RAG workflows
  • Web UI for experimenting with settings
  • Dependency injection for model reuse and flexibility

Semantic Chunking Workflow


  1. Model Initialization: An embedding model is initialized once and can be reused across multiple operations.
  2. Sentence Splitting: The input text is split into an array of sentences.
  3. Embedding Generation: A vector is created for each sentence using the specified ONNX model.
  4. Similarity Calculation: Cosine similarity scores are calculated for each sentence pair (see the sketch after this list).
  5. Chunk Formation: Sentences are grouped into chunks based on the similarity threshold and max token size.
  6. Chunk Rebalancing: Optionally, similar adjacent chunks are combined into larger ones up to the max token size.
  7. Output: The final chunks are returned as an array of objects, each containing the properties described in the Output section below.
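
For intuition, here is a minimal sketch of the similarity test behind steps 4 and 5. It is an illustration with a hypothetical helper, not the library's internal code:

// Cosine similarity between two embedding vectors (hypothetical helper)
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for two sentence embeddings
const a = [0.1, 0.9, 0.3];
const b = [0.2, 0.8, 0.4];

// Sentences whose score clears the similarity threshold land in the same chunk
console.log(cosineSimilarity(a, b) >= 0.5); // true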

Installation

npm install @elpassion/semantic-chunking

Usage

All functions now require a pre-initialized model instance for better performance and flexibility. You can choose between local embedding models and OpenAI's API:

Option 1: Local Embedding Model

import { LocalEmbeddingModel, chunkit } from '@elpassion/semantic-chunking';
import { env, pipeline, AutoTokenizer } from '@huggingface/transformers';

// Create transformers object for dependency injection
const transformers = { env, pipeline, AutoTokenizer };

// Initialize the model once
const model = new LocalEmbeddingModel(transformers);
await model.initialize('Xenova/all-MiniLM-L6-v2');

const documents = [
    { document_name: "document1", document_text: "contents of document 1..." },
    { document_name: "document2", document_text: "contents of document 2..." },
    // ...more documents
];

const myChunks = await chunkit(documents, model, {
    maxTokenSize: 500,
    similarityThreshold: 0.5
});

Option 2: OpenAI Embedding Model

import { OpenAIEmbedding, chunkit } from '@elpassion/semantic-chunking';
import OpenAI from 'openai';

// Initialize OpenAI client and model
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const model = new OpenAIEmbedding(openai);
await model.initialize('text-embedding-3-small');

const documents = [
    { document_name: "document1", document_text: "contents of document 1..." },
    { document_name: "document2", document_text: "contents of document 2..." },
    // ...more documents
];

const myChunks = await chunkit(documents, model, {
    maxTokenSize: 500,
    similarityThreshold: 0.5
});

NOTE 🚨 The embedding model is downloaded to your specified cache directory the first time it is run (file size depends on the specified model; see the Curated ONNX Embedding Models table below).

LocalEmbeddingModel Class

The LocalEmbeddingModel class manages model initialization and provides embedding/tokenization functionality. It now uses dependency injection for the transformers library:

import { LocalEmbeddingModel } from "@elpassion/semantic-chunking";
import { env, pipeline, AutoTokenizer } from "@huggingface/transformers";

// Create transformers object for dependency injection
const transformers = { env, pipeline, AutoTokenizer };

// Create and initialize the model
const model = new LocalEmbeddingModel(transformers);
await model.initialize(
  "Xenova/all-MiniLM-L6-v2", // Model name
  "q8", // Data type (fp32, fp16, q8, q4)
  "./models", // Local model path (optional)
  "./models" // Model cache directory (optional)
);

// Get model information
console.log(model.getModelInfo()); // { modelName: '...', dtype: '...' }

// Use the model for embeddings
const embedding = await model.createEmbedding("sample text");

// Use the model for tokenization
const tokens = await model.tokenize("sample text", { padding: true });

OpenAIEmbedding Class

The OpenAIEmbedding class provides an alternative to local models by using OpenAI's embedding API. It implements the same interface as LocalEmbeddingModel for seamless dependency injection:

import { OpenAIEmbedding } from "@elpassion/semantic-chunking";
import OpenAI from "openai";

// Create OpenAI client
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY, // Make sure to set your API key
});

// Create and initialize the OpenAI embedding model
const model = new OpenAIEmbedding(openai);
await model.initialize("text-embedding-3-small"); // or "text-embedding-3-large"

// Get model information
console.log(model.getModelInfo()); // { modelName: 'text-embedding-3-small', dtype: 'api' }

// Use the model for embeddings (same interface as LocalEmbeddingModel)
const embedding = await model.createEmbedding("sample text");

// Use the model for tokenization (approximate token count)
const tokens = await model.tokenize("sample text");

OpenAI Model Options

  • text-embedding-3-small: Faster and more cost-effective, 1536 dimensions
  • text-embedding-3-large: Higher quality embeddings, 3072 dimensions
  • text-embedding-ada-002: Legacy model (still supported)

Requirements:

  • Install the OpenAI package: npm install openai
  • Set your OPENAI_API_KEY environment variable
  • A valid OpenAI API account with embedding usage enabled

Note: The tokenize method provides an approximate token count, since OpenAI doesn't expose its tokenizer directly. For more accurate tokenization, consider using the tiktoken library.
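
For illustration, a minimal sketch using the js-tiktoken package (an assumed extra dependency, installed separately; any tiktoken port works the same way):

import { getEncoding } from "js-tiktoken";

// cl100k_base is the encoding used by the text-embedding-3-* and ada-002 models
const enc = getEncoding("cl100k_base");
console.log(enc.encode("sample text").length); // exact token count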

Parameters

chunkit(documents, model, options)

  • documents: Array of documents. Each document is an object containing document_name and document_text.

  • model: An initialized LocalEmbeddingModel or OpenAIEmbedding instance.

  • options: Configuration object with the following properties (a complete example follows this list):

    • logging: Boolean (optional, default false) - Enables logging of detailed processing steps.
    • maxTokenSize: Integer (optional, default 500) - Maximum token size for each chunk.
    • similarityThreshold: Float (optional, default 0.5) - Threshold to determine if sentences are similar enough to be in the same chunk. A higher value demands higher similarity.
    • dynamicThresholdLowerBound: Float (optional, default 0.4) - Minimum possible dynamic similarity threshold.
    • dynamicThresholdUpperBound: Float (optional, default 0.8) - Maximum possible dynamic similarity threshold.
    • numSimilaritySentencesLookahead: Integer (optional, default 3) - Number of sentences to look ahead for calculating similarity.
    • combineChunks: Boolean (optional, default true) - Determines whether to rebalance and combine chunks into larger ones up to the max token limit.
    • combineChunksSimilarityThreshold: Float (optional, default 0.5) - Threshold for combining chunks based on similarity during the rebalance and combining phase.
    • returnEmbedding: Boolean (optional, default false) - If set to true, each chunk will include an embedding vector.
    • returnTokenLength: Boolean (optional, default true) - If set to true, each chunk will include the token length.
    • chunkPrefix: String (optional, default null) - A prefix to add to each chunk (e.g., "search_document: ").
    • excludeChunkPrefixInResults: Boolean (optional, default false) - If set to true, the chunk prefix will be removed from the results.
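
Putting it together, here is a call that sets every option explicitly; the values shown are the defaults listed above:

const myChunks = await chunkit(documents, model, {
    logging: false,
    maxTokenSize: 500,
    similarityThreshold: 0.5,
    dynamicThresholdLowerBound: 0.4,
    dynamicThresholdUpperBound: 0.8,
    numSimilaritySentencesLookahead: 3,
    combineChunks: true,
    combineChunksSimilarityThreshold: 0.5,
    returnEmbedding: false,
    returnTokenLength: true,
    chunkPrefix: null,
    excludeChunkPrefixInResults: false
});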

Output

The output is an array of chunks, each containing the following properties (a representative chunk object is shown after the list):

  • document_id: Integer - A unique identifier for the document (current timestamp in milliseconds).
  • document_name: String - The name of the document being chunked (if provided).
  • number_of_chunks: Integer - The total number of final chunks returned from the input text.
  • chunk_number: Integer - The number of the current chunk.
  • model_name: String - The name of the embedding model used.
  • dtype: String - The precision of the embedding model used (options: fp32, fp16, q8, q4 for local models, api for OpenAI models).
  • text: String - The chunked text.
  • embedding: Array - The embedding vector (if returnEmbedding is true).
  • token_length: Integer - The token length (if returnTokenLength is true).
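
A representative chunk object (values are illustrative; embedding is omitted because returnEmbedding defaults to false):

{
    document_id: 1715000000000,
    document_name: "document1",
    number_of_chunks: 4,
    chunk_number: 1,
    model_name: "Xenova/all-MiniLM-L6-v2",
    dtype: "q8",
    text: "The chunked text...",
    token_length: 312
}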

NOTE 🚨 Every Embedding Model behaves differently!

It is important to understand how the model you choose behaves when chunking your text. It is highly recommended to tweak all the parameters using the Web UI to get the best results for your use case (see the Web UI README).

Examples

Example 1: Basic usage with custom similarity threshold:

import { LocalEmbeddingModel, chunkit } from "semantic-chunking";
import { env, pipeline, AutoTokenizer } from "@huggingface/transformers";
import fs from "fs";

async function main() {
  // Create transformers object for dependency injection
  const transformers = { env, pipeline, AutoTokenizer };

  // Initialize model
  const model = new LocalEmbeddingModel(transformers);
  await model.initialize("Xenova/all-MiniLM-L6-v2");

  const documents = [
    {
      document_name: "test document",
      document_text: await fs.promises.readFile("./test.txt", "utf8"),
    },
  ];

  let myChunks = await chunkit(documents, model, {
    similarityThreshold: 0.4,
  });

  myChunks.forEach((chunk, index) => {
    console.log(`\n-- Chunk ${index + 1} --`);
    console.log(chunk);
  });
}
main();

Example 2: Chunking with a small max token size:

import { LocalEmbeddingModel, chunkit } from "semantic-chunking";
import { env, pipeline, AutoTokenizer } from "@huggingface/transformers";

const frogText =
  'A frog hops into a deli and croaks to the cashier, "I\'ll have a sandwich, please." The cashier, surprised, quickly makes the sandwich and hands it over. The frog takes a big bite, looks around, and then asks, "Do you have any flies to go with this?" The cashier, taken aback, replies, "Sorry, we\'re all out of flies today." The frog shrugs and continues munching on its sandwich, clearly unfazed by the lack of fly toppings. Just another day in the life of a sandwich-loving amphibian! 🐸🥪';

const documents = [
  {
    document_name: "frog document",
    document_text: frogText,
  },
];

async function main() {
  // Create transformers object for dependency injection
  const transformers = { env, pipeline, AutoTokenizer };

  const model = new LocalEmbeddingModel(transformers);
  await model.initialize("Xenova/all-MiniLM-L6-v2");

  let myFrogChunks = await chunkit(documents, model, {
    maxTokenSize: 65,
  });
  console.log("myFrogChunks", myFrogChunks);
}
main();

Example 3: Reusing model across multiple operations:

import { LocalEmbeddingModel, chunkit, cramit } from "@elpassion/semantic-chunking";
import { env, pipeline, AutoTokenizer } from "@huggingface/transformers";

async function processMultipleDocumentSets() {
  // Create transformers object for dependency injection
  const transformers = { env, pipeline, AutoTokenizer };

  // Initialize model once
  const model = new LocalEmbeddingModel(transformers);
  await model.initialize("Xenova/all-MiniLM-L6-v2", "q8");

  // Process first set of documents
  const set1 = [{ document_name: "doc1", document_text: "..." }];
  const chunks1 = await chunkit(set1, model, {
    maxTokenSize: 500,
    similarityThreshold: 0.5,
  });

  // Process second set with different settings, reusing the same model
  const set2 = [{ document_name: "doc2", document_text: "..." }];
  const chunks2 = await cramit(set2, model, {
    maxTokenSize: 300,
  });

  return { chunks1, chunks2 };
}

Example 4: Using OpenAI embeddings:

import { OpenAIEmbedding, chunkit } from "@elpassion/semantic-chunking";
import OpenAI from "openai";

const documents = [
  {
    document_name: "sample document",
    document_text: "Your long text content here...",
  },
];

async function main() {
  // Initialize OpenAI client
  const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  });

  // Initialize OpenAI embedding model
  const model = new OpenAIEmbedding(openai);
  await model.initialize("text-embedding-3-small");

  let myChunks = await chunkit(documents, model, {
    maxTokenSize: 500,
    similarityThreshold: 0.5,
    returnEmbedding: true, // Get embeddings for downstream tasks
  });

  console.log("Chunks created:", myChunks.length);
  console.log("Model info:", model.getModelInfo());
}
main();

Tuning

The behavior of the chunkit function can be finely tuned using several optional parameters in the options object. Understanding how each parameter affects the function can help you optimize the chunking process for your specific requirements.

logging

  • Type: Boolean
  • Default: false
  • Description: Enables detailed debug output during the chunking process. Turning this on can help in diagnosing how chunks are formed or why certain chunks are combined.

maxTokenSize

  • Type: Integer
  • Default: 500
  • Description: Sets the maximum number of tokens allowed in a single chunk. Smaller values result in smaller, more numerous chunks, while larger values can create fewer, larger chunks. It's crucial for maintaining manageable chunk sizes when processing large texts.

similarityThreshold

  • Type: Float
  • Default: 0.5
  • Description: Determines the minimum cosine similarity required for two sentences to be included in the same chunk. Higher thresholds demand greater similarity, potentially leading to more but smaller chunks, whereas lower values might result in fewer, larger chunks.

dynamicThresholdLowerBound

  • Type: Float
  • Default: 0.4
  • Description: The minimum limit for dynamically adjusted similarity thresholds during chunk formation. This ensures that the dynamic threshold does not fall below a certain level, maintaining a baseline similarity among sentences in a chunk.

dynamicThresholdUpperBound

  • Type: Float
  • Default: 0.8
  • Description: The maximum limit for dynamically adjusted similarity thresholds. This cap prevents the threshold from becoming too lenient, which could otherwise lead to overly large chunks with low cohesion.

numSimilaritySentencesLookahead

  • Type: Integer
  • Default: 3
  • Description: Controls how many subsequent sentences are considered for calculating the maximum similarity to the current sentence during chunk formation. A higher value increases the chance of finding a suitable sentence to extend the current chunk but at the cost of increased computational overhead.

combineChunks

  • Type: Boolean
  • Default: true
  • Description: Determines whether to perform a second pass to combine smaller chunks into larger ones, based on their semantic similarity and the maxTokenSize. This can enhance the readability of the output by grouping closely related content more effectively.

combineChunksSimilarityThreshold

  • Type: Float
  • Default: 0.5
  • Description: Used in the second pass of chunk combination to decide if adjacent chunks should be merged, based on their similarity. Similar to similarityThreshold, but specifically for rebalancing existing chunks. Adjusting this parameter can help in fine-tuning the granularity of the final chunks.
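
As one concrete tuning scenario, here is a sketch of a RAG-oriented configuration. It assumes the same setup as the earlier local-model examples; nomic-ai/nomic-embed-text-v1.5 expects task prefixes such as "search_document: ", which is what the chunkPrefix option is for:

import { LocalEmbeddingModel, chunkit } from "@elpassion/semantic-chunking";
import { env, pipeline, AutoTokenizer } from "@huggingface/transformers";

const transformers = { env, pipeline, AutoTokenizer };
const model = new LocalEmbeddingModel(transformers);
await model.initialize("nomic-ai/nomic-embed-text-v1.5", "q8");

const documents = [{ document_name: "doc", document_text: "..." }];

const ragChunks = await chunkit(documents, model, {
    chunkPrefix: "search_document: ",  // prepended to each chunk before embedding
    excludeChunkPrefixInResults: true, // strip the prefix from the returned text
    returnEmbedding: true              // keep vectors for the retrieval index
});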

Curated ONNX Embedding Models

| Model | Precision | Link | Size (per precision) |
| --- | --- | --- | --- |
| nomic-ai/nomic-embed-text-v1.5 | fp32, q8 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 | 548 MB, 138 MB |
| thenlper/gte-base | fp32 | https://huggingface.co/thenlper/gte-base | 436 MB |
| Xenova/all-MiniLM-L6-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-MiniLM-L6-v2 | 90 MB, 45 MB, 23 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2 | 470 MB, 235 MB, 118 MB |
| Xenova/all-distilroberta-v1 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-distilroberta-v1 | 326 MB, 163 MB, 82 MB |
| BAAI/bge-base-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-base-en-v1.5 | 436 MB |
| BAAI/bge-small-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-small-en-v1.5 | 133 MB |
| yashvardhan7/snowflake-arctic-embed-m-onnx | fp32 | https://huggingface.co/yashvardhan7/snowflake-arctic-embed-m-onnx | 436 MB |

Each of these parameters allows you to customize the chunkit function to better fit the text size, content complexity, and performance requirements of your application.


Appreciation

If you enjoy this library, please consider sending me a tip to support my work 😀
