A Node.js library for streaming HuggingFace datasets with support for Parquet, CSV, and JSONL formats.
Install via npm:

```bash
npm install hf-dataset
```
Quick start:

```ts
import { HFDataset } from 'hf-dataset';

// Load a dataset and iterate through it
const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text);
  break; // Just show the first row
}
```
- **Multiple Formats**: Supports Parquet, CSV, and JSONL files
- **Gzipped Files**: Automatically handles `.gz` compressed files (see the sketch after this list)
- **Streaming**: Memory-efficient iteration over large datasets
- **TypeScript**: Full TypeScript support with generics
- **Authentication**: Support for private/gated datasets with HF tokens
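Compression handling requires no extra configuration; a minimal sketch, where `'my-org/gzipped-dataset'` is a placeholder id for any dataset whose files end in `.gz`:

```ts
import { HFDataset } from 'hf-dataset';

// 'my-org/gzipped-dataset' is a placeholder; .gz files are
// decompressed transparently during iteration.
const dataset = await HFDataset.create('my-org/gzipped-dataset');

for await (const row of dataset) {
  console.log(row); // rows look the same whether or not the source was gzipped
  break;
}
```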
`HFDataset.create(dataset, options)` creates a new dataset instance.
Parameters:

- `dataset` (string): HuggingFace dataset identifier (e.g., `'Salesforce/wikitext'`)
- `options` (object, optional):
  - `token` (string): HuggingFace token for private datasets (defaults to `process.env.HF_TOKEN`)
  - `revision` (string): Git revision or tag (defaults to `'main'`)

Returns: `Promise<HFDataset>`
```ts
// Public dataset
const publicDataset = await HFDataset.create('Salesforce/wikitext');

// Private dataset with token
const privateDataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});

// Specific revision
const pinnedDataset = await HFDataset.create('Salesforce/wikitext', {
  revision: 'v1.0'
});
```
The dataset implements `AsyncIterable`, so you can use `for await...of` loops:
```ts
const dataset = await HFDataset.create('Salesforce/wikitext');

// Process all rows
for await (const row of dataset) {
  console.log(row);
}

// Process first N rows
let count = 0;
for await (const row of dataset) {
  console.log(row);
  if (++count >= 100) break;
}
```
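If you use the first-N pattern often, it can be wrapped in a small helper; `take` below is a hypothetical utility, not part of the library:

```ts
// Hypothetical helper: yields at most n rows from any async iterable.
async function* take<T>(source: AsyncIterable<T>, n: number): AsyncGenerator<T> {
  let count = 0;
  for await (const item of source) {
    if (count >= n) return;
    yield item;
    count++;
  }
}

const dataset = await HFDataset.create('Salesforce/wikitext');
for await (const row of take(dataset, 100)) {
  console.log(row);
}
```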
`dataset.listFiles()` returns information about the files discovered in the dataset.
```ts
const dataset = await HFDataset.create('Salesforce/wikitext');

const files = dataset.listFiles();
console.log(files);
// [
//   { path: 'train.parquet', type: 'parquet', gz: false },
//   { path: 'test.csv.gz', type: 'csv', gz: true }
// ]
```
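Since the result is an array of plain objects, you can inspect it with ordinary array methods; a sketch assuming the `{ path, type, gz }` shape shown above:

```ts
// Filter the file list by format and compression flag.
const parquetFiles = dataset.listFiles().filter((f) => f.type === 'parquet');
const gzippedPaths = dataset
  .listFiles()
  .filter((f) => f.gz)
  .map((f) => f.path);

console.log(`${parquetFiles.length} parquet file(s)`);
console.log('gzipped paths:', gzippedPaths);
```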
For private or gated datasets, provide your HuggingFace token:
```bash
export HF_TOKEN=hf_xxxxxxxxxxxxx
```

Or pass the token explicitly:

```ts
const dataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});
```
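Because `token` defaults to `process.env.HF_TOKEN`, setting the environment variable is enough; no options are needed in that case:

```ts
// With HF_TOKEN set in the environment, the token option can be omitted.
const dataset = await HFDataset.create('my-org/private-dataset');
```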
**Parquet Files:**

```ts
const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // Parquet preserves column types
}
```
**CSV Files:**

```ts
const dataset = await HFDataset.create('lvwerra/red-wine');

for await (const row of dataset) {
  console.log(row); // CSV columns as string values
}
```
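Since CSV values arrive as strings, convert numeric columns yourself; a minimal sketch, where `alcohol` is an assumed column name (check your dataset's actual headers):

```ts
const dataset = await HFDataset.create('lvwerra/red-wine');

for await (const row of dataset) {
  // 'alcohol' is an assumed column name, used here for illustration.
  const alcohol = Number(row.alcohol);
  if (!Number.isNaN(alcohol)) {
    console.log(alcohol);
  }
  break;
}
```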
**JSONL Files:**

```ts
const dataset = await HFDataset.create('BeIR/scifact');

for await (const row of dataset) {
  console.log(row._id, row.title); // JSON structure preserved
}
```
Row types can be supplied as a generic parameter:

```ts
interface WikiTextRow {
  text: string;
}

const dataset = await HFDataset.create<WikiTextRow>('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // TypeScript knows this is a string
}
```
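The same generic works for any row shape; here is a sketch for the JSONL example above (the field types are assumptions, so verify them against the actual data):

```ts
interface ScifactRow {
  _id: string;   // assumed type
  title: string; // assumed type
}

const dataset = await HFDataset.create<ScifactRow>('BeIR/scifact');

for await (const row of dataset) {
  console.log(row._id, row.title); // both typed as string
  break;
}
```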
To process rows in batches, accumulate them and flush at a fixed size:

```ts
const dataset = await HFDataset.create('large-dataset');

let processedCount = 0;
const batchSize = 1000;
const batch: unknown[] = [];

for await (const row of dataset) {
  batch.push(row);
  if (batch.length === batchSize) {
    await processBatch(batch); // processBatch is your own handler
    batch.length = 0; // Clear batch
    processedCount += batchSize;
    console.log(`Processed ${processedCount} rows`);
  }
}

// Process remaining rows
if (batch.length > 0) {
  await processBatch(batch);
}
```
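`processBatch` above is your own handler; one possible implementation writes each batch as JSONL to a local file (the output path is arbitrary):

```ts
import { appendFile } from 'node:fs/promises';

// Hypothetical handler: append a batch of rows as JSONL to disk.
async function processBatch(batch: unknown[]): Promise<void> {
  const lines = batch.map((row) => JSON.stringify(row)).join('\n') + '\n';
  await appendFile('output.jsonl', lines);
}
```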
Requirements:

- Node.js >= 24.3.0
License: MIT - see the LICENSE file for details.