A high-performance Node.js library for text tokenization, providing bindings to the Rust implementation of HuggingFace's Tokenizers.
- Fast and Efficient: Leverages Rust's performance for rapid tokenization.
- Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
- Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
- Customizable: Fine-tune tokenization parameters for your specific use case.
- Production-Ready: Designed for both research and production environments.
Install the package using npm:

```bash
npm install @flexpilot-ai/tokenizers
```
Here's an example demonstrating how to use the `Tokenizer` class:

```typescript
import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";

// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);

// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);

// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);

// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);

// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);
```
`Tokenizer` is the main class for handling tokenization.
`constructor(bytes: Array<number>)`

Creates a new `Tokenizer` instance from a configuration provided as an array of bytes.

- `bytes`: An array of numbers representing the tokenizer configuration.
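Because Node's `fs.readFileSync` returns a `Buffer` rather than a plain array, the file contents must be converted before being passed to the constructor. A minimal runnable sketch of that conversion, using an in-memory buffer in place of a real file (the JSON literal is a placeholder, not an actual tokenizer configuration):

```typescript
// Placeholder JSON standing in for the result of fs.readFileSync("tokenizer.json").
const fileBuffer = Buffer.from('{"version":"1.0"}', "utf8");

// Array.from converts the Buffer into the Array<number> the constructor
// expects — one number (0–255) per byte of the configuration.
const byteArray = Array.from(fileBuffer);
console.log(byteArray.length); // 17 — one entry per byte
```

With a real configuration file, the whole loading step is `new Tokenizer(Array.from(fs.readFileSync(path)))`.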
`encode(input: string, addSpecialTokens: boolean): Array<number>`

Encodes the input text into token IDs.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
`decode(ids: Array<number>, skipSpecialTokens: boolean): string`

Decodes the token IDs back into text.

- `ids`: An array of numbers representing the token IDs.
- `skipSpecialTokens`: Whether to skip special tokens during decoding.
- Returns: The decoded text as a string.
`encodeFast(input: string, addSpecialTokens: boolean): Array<number>`

A faster version of the `encode` method for tokenizing text.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
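The three methods compose into a simple round-trip contract: `addSpecialTokens` controls whether marker IDs are inserted during encoding, and `skipSpecialTokens` controls whether `decode` strips them again. A runnable sketch of these call signatures, using a trivial hypothetical stand-in class (not the real Rust-backed `Tokenizer`, which requires a `tokenizer.json` configuration) that maps each character to its code point:

```typescript
// ToyTokenizer mirrors the Tokenizer method signatures for illustration only.
class ToyTokenizer {
  encode(input: string, addSpecialTokens: boolean): Array<number> {
    const ids = Array.from(input, (ch) => ch.codePointAt(0)!);
    // 0 and 1 play the role of special-token IDs (e.g. [CLS]/[SEP]).
    return addSpecialTokens ? [0, ...ids, 1] : ids;
  }

  decode(ids: Array<number>, skipSpecialTokens: boolean): string {
    const kept = skipSpecialTokens ? ids.filter((id) => id > 1) : ids;
    return String.fromCodePoint(...kept);
  }

  encodeFast(input: string, addSpecialTokens: boolean): Array<number> {
    return this.encode(input, addSpecialTokens);
  }
}

const toy = new ToyTokenizer();
const ids = toy.encode("hi", true);  // [0, 104, 105, 1]
const text = toy.decode(ids, true);  // "hi" — special IDs stripped
```

The same encode → decode round trip holds for the real class, except that IDs come from the trained vocabulary rather than code points.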
We welcome contributions! Please see our Contributing Guide for more details.
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
- This library is based on the HuggingFace Tokenizers Rust implementation.
- Special thanks to the Rust and Node.js communities for their invaluable resources and support.