A high-performance Node.js library for text tokenization, providing bindings to the Rust implementation of HuggingFace's Tokenizers.
- Fast and Efficient: Leverages Rust's performance for rapid tokenization.
- Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
- Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
- Customizable: Fine-tune tokenization parameters for your specific use case.
- Production-Ready: Designed for both research and production environments.
Install the package using npm:

```bash
npm install @flexpilot-ai/tokenizers
```
Here's an example demonstrating how to use the `Tokenizer` class:

```typescript
import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";

// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);

// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);

// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);

// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);

// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);
```
`Tokenizer` is the main class for handling tokenization.
`constructor(bytes: Array<number>)`

Creates a new `Tokenizer` instance from a configuration provided as an array of bytes.

- `bytes`: An array of numbers representing the tokenizer configuration.
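Because Node's `fs.readFileSync` returns a `Buffer` rather than a plain array, the file contents must be converted before being passed to the constructor. A minimal runnable sketch of that conversion, using an in-memory buffer in place of a real file (the JSON literal is a placeholder, not an actual tokenizer configuration):

```typescript
// Placeholder JSON standing in for the result of fs.readFileSync("tokenizer.json").
const fileBuffer = Buffer.from('{"version":"1.0"}', "utf8");

// Array.from converts the Buffer into the Array<number> the constructor
// expects — one number (0–255) per byte of the configuration.
const byteArray = Array.from(fileBuffer);
console.log(byteArray.length); // 17 — one entry per byte
```

With a real configuration file, the whole loading step is `new Tokenizer(Array.from(fs.readFileSync(path)))`.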
`encode(input: string, addSpecialTokens: boolean): Array<number>`

Encodes the input text into token IDs.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
`decode(ids: Array<number>, skipSpecialTokens: boolean): string`

Decodes the token IDs back into text.

- `ids`: An array of numbers representing the token IDs.
- `skipSpecialTokens`: Whether to skip special tokens during decoding.
- Returns: The decoded text as a string.
`encodeFast(input: string, addSpecialTokens: boolean): Array<number>`

A faster version of the `encode` method for tokenizing text.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
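The three methods compose into a simple round-trip contract: `addSpecialTokens` controls whether marker IDs are inserted during encoding, and `skipSpecialTokens` controls whether `decode` strips them again. A runnable sketch of these call signatures, using a trivial hypothetical stand-in class (not the real Rust-backed `Tokenizer`, which requires a `tokenizer.json` configuration) that maps each character to its code point:

```typescript
// ToyTokenizer mirrors the Tokenizer method signatures for illustration only.
class ToyTokenizer {
  encode(input: string, addSpecialTokens: boolean): Array<number> {
    const ids = Array.from(input, (ch) => ch.codePointAt(0)!);
    // 0 and 1 play the role of special-token IDs (e.g. [CLS]/[SEP]).
    return addSpecialTokens ? [0, ...ids, 1] : ids;
  }

  decode(ids: Array<number>, skipSpecialTokens: boolean): string {
    const kept = skipSpecialTokens ? ids.filter((id) => id > 1) : ids;
    return String.fromCodePoint(...kept);
  }

  encodeFast(input: string, addSpecialTokens: boolean): Array<number> {
    return this.encode(input, addSpecialTokens);
  }
}

const toy = new ToyTokenizer();
const ids = toy.encode("hi", true);  // [0, 104, 105, 1]
const text = toy.decode(ids, true);  // "hi" — special IDs stripped
```

The same encode → decode round trip holds for the real class, except that IDs come from the trained vocabulary rather than code points.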
We welcome contributions! Please see our Contributing Guide for more details.
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
- This library is based on the HuggingFace Tokenizers Rust implementation.
- Special thanks to the Rust and Node.js communities for their invaluable resources and support.