hf-dataset

A Node.js library for streaming HuggingFace datasets with support for Parquet, CSV, and JSONL formats.

Installation

npm install hf-dataset

Quick Start

import { HFDataset } from 'hf-dataset';

// Load a dataset and iterate through it
const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text);
  break; // Just show the first row
}

Features

  • Multiple Formats: Supports Parquet, CSV, and JSONL files
  • Gzipped Files: Automatically handles .gz compressed files
  • Streaming: Memory-efficient iteration over large datasets
  • TypeScript: Full TypeScript support with generics
  • Authentication: Support for private/gated datasets with HF tokens

API Reference

HFDataset.create(dataset, options?)

Creates a new dataset instance.

Parameters:

  • dataset (string): HuggingFace dataset identifier (e.g., 'Salesforce/wikitext')
  • options (object, optional):
    • token (string): HuggingFace token for private datasets (defaults to process.env.HF_TOKEN)
    • revision (string): Git revision or tag (defaults to 'main')

Returns: Promise<HFDataset>

// Public dataset
const dataset = await HFDataset.create('Salesforce/wikitext');

// Private dataset with token
const dataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});

// Specific revision
const dataset = await HFDataset.create('Salesforce/wikitext', {
  revision: 'v1.0'
});

Iteration

The dataset implements AsyncIterable, so you can consume it with for await...of loops:

const dataset = await HFDataset.create('Salesforce/wikitext');

// Process all rows
for await (const row of dataset) {
  console.log(row);
}

// Process first N rows
let count = 0;
for await (const row of dataset) {
  console.log(row);
  if (++count >= 100) break;
}
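
Because the dataset is a plain AsyncIterable, generic async-iteration helpers compose with it. As a sketch (not part of this library), a small take() helper that caps iteration at n rows:

async function* take(iterable, n) {
  let count = 0;
  for await (const item of iterable) {
    if (count++ >= n) return; // stops the underlying stream early
    yield item;
  }
}

for await (const row of take(dataset, 100)) {
  console.log(row);
}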

listFiles()

Returns the list of data files discovered in the dataset. Each entry reports the file path, its detected format, and whether it is gzip-compressed.

const dataset = await HFDataset.create('Salesforce/wikitext');
const files = dataset.listFiles();

console.log(files);
// [
//   { path: 'train.parquet', type: 'parquet', gz: false },
//   { path: 'test.csv.gz', type: 'csv', gz: true }
// ]
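
Since listFiles() returns plain objects, the result can be filtered like any array. For example, to keep only the Parquet files:

const dataset = await HFDataset.create('Salesforce/wikitext');

const parquetFiles = dataset.listFiles().filter((file) => file.type === 'parquet');
console.log(parquetFiles.map((file) => file.path));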

Authentication

For private or gated datasets, provide your HuggingFace token:

Environment Variable (Recommended)

export HF_TOKEN=hf_xxxxxxxxxxxxx
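
With HF_TOKEN set, no explicit token option is needed; create() falls back to process.env.HF_TOKEN:

// The token is read from process.env.HF_TOKEN automatically
const dataset = await HFDataset.create('my-org/private-dataset');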

Explicit Token

const dataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});

Examples

Working with Different File Formats

Parquet Files:

const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // Parquet preserves column types
}

CSV Files:

const dataset = await HFDataset.create('lvwerra/red-wine');

for await (const row of dataset) {
  console.log(row); // CSV values are parsed as strings
}

JSONL Files:

const dataset = await HFDataset.create('BeIR/scifact');

for await (const row of dataset) {
  console.log(row._id, row.title); // JSON structure preserved
}

TypeScript Usage

interface WikiTextRow {
  text: string;
}

const dataset = await HFDataset.create<WikiTextRow>('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // TypeScript knows this is a string
}
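
The generic parameter works the same for other formats. A sketch for a CSV dataset, where values arrive as strings (the column name below is assumed; check it against the dataset's actual headers):

interface WineRow {
  quality: string; // assumed column name; CSV values are strings
}

const wine = await HFDataset.create<WineRow>('lvwerra/red-wine');

for await (const row of wine) {
  console.log(Number(row.quality)); // convert from string as needed
}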

Processing Large Datasets

const dataset = await HFDataset.create('large-dataset');

// Stand-in batch handler: replace with your own logic (e.g. a database insert)
async function processBatch(rows) {
  // ...
}

let processedCount = 0;
const batchSize = 1000;
const batch = [];

for await (const row of dataset) {
  batch.push(row);

  if (batch.length === batchSize) {
    await processBatch(batch);
    batch.length = 0; // Clear the batch
    processedCount += batchSize;
    console.log(`Processed ${processedCount} rows`);
  }
}

// Process any remaining rows
if (batch.length > 0) {
  await processBatch(batch);
}

Requirements

  • Node.js >= 24.3.0

License

MIT - see LICENSE file for details.
