@flexpilot-ai/tokenizers

A high-performance Node.js library for text tokenization, providing bindings to the Rust implementation of HuggingFace's Tokenizers.

Main Features

  • Fast and Efficient: Leverages Rust's performance for rapid tokenization.
  • Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
  • Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
  • Customizable: Fine-tune tokenization parameters for your specific use case.
  • Production-Ready: Designed for both research and production environments.

Installation

Install the package using npm:

npm install @flexpilot-ai/tokenizers

Usage Example

Here's an example demonstrating how to use the Tokenizer class:

import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";

// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);

// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);

// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);

// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);

// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);

API Reference

Tokenizer

The main class for handling tokenization.

Constructor

constructor(bytes: Array<number>)

Creates a new Tokenizer instance from a configuration provided as an array of bytes.

  • bytes: An array of numbers representing the tokenizer configuration.
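For illustration, here is one way to produce the required Array<number> from raw bytes using Node.js's built-in Buffer; the JSON string below is a hypothetical stand-in for the contents of a real tokenizer.json file:

```typescript
// Hypothetical stand-in for the contents of a real tokenizer.json file.
const config = '{"version":"1.0"}';

// Buffer -> Array<number>, the shape the Tokenizer constructor expects.
const buf = Buffer.from(config, "utf8");
const byteArray: number[] = Array.from(buf);

console.log(byteArray.length); // 17
console.log(byteArray[0]); // 123, the byte value of "{"
```

The same conversion works on the Buffer returned by fs.readFileSync, as shown in the usage example above.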

Methods

encode
encode(input: string, addSpecialTokens: boolean): Array<number>

Encodes the input text into token IDs.

  • input: The text to tokenize.
  • addSpecialTokens: Whether to add special tokens during encoding.
  • Returns: An array of numbers representing the token IDs.

decode
decode(ids: Array<number>, skipSpecialTokens: boolean): string

Decodes the token IDs back into text.

  • ids: An array of numbers representing the token IDs.
  • skipSpecialTokens: Whether to skip special tokens during decoding.
  • Returns: The decoded text as a string.

encodeFast
encodeFast(input: string, addSpecialTokens: boolean): Array<number>

A faster version of the encode method for tokenizing text.

  • input: The text to tokenize.
  • addSpecialTokens: Whether to add special tokens during encoding.
  • Returns: An array of numbers representing the token IDs.
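The encode/decode contract described above can be sketched with a toy character-level tokenizer. This is not the library's implementation, only an illustration of the round-trip behavior: special tokens may be added during encoding, and skipSpecialTokens controls whether they appear in the decoded text.

```typescript
// Toy character-level tokenizer illustrating the encode/decode contract.
// NOT the library's implementation; BOS is a hypothetical special token.
const BOS = 0;

function encode(input: string, addSpecialTokens: boolean): number[] {
  // Shift code points by 1 so that id 0 stays reserved for BOS.
  const ids = Array.from(input).map((ch) => ch.codePointAt(0)! + 1);
  return addSpecialTokens ? [BOS, ...ids] : ids;
}

function decode(ids: number[], skipSpecialTokens: boolean): string {
  return ids
    .filter((id) => !(skipSpecialTokens && id === BOS))
    .map((id) => (id === BOS ? "<s>" : String.fromCodePoint(id - 1)))
    .join("");
}

const ids = encode("hi", true);
console.log(ids); // [0, 105, 106]
console.log(decode(ids, true)); // "hi"
console.log(decode(ids, false)); // "<s>hi"
```

With a real Tokenizer instance, the same pattern applies: decode(encode(text, true), true) should recover the original text for text the tokenizer can represent losslessly.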

Contributing

We welcome contributions! Please see our Contributing Guide for more details.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

Acknowledgments

  • This library is based on the HuggingFace Tokenizers Rust implementation.
  • Special thanks to the Rust and Node.js communities for their invaluable resources and support.
