A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.
- Segment Recognition: Intelligently groups text into logical paragraphs
- Filler Removal: Identifies and removes common speech fillers (uh, umm, etc.)
- Gap Detection: Detects significant pauses to identify paragraph breaks
- Timestamp Formatting: Converts seconds to readable timestamps (H:MM:SS)
- Punctuation Awareness: Uses punctuation to identify natural segment breaks
- Customizable Parameters: Configure minimum words per segment, max segment length, etc.
- Arabic Support: Handles Arabic question marks and other non-Latin punctuation
- Transcript Formatting: Converts raw token streams into readable text with appropriate line breaks
- Ground-Truth Token Mapping: Aligns AI-generated word timestamps to human-edited transcript text using an LCS-based algorithm with intelligent interpolation
npm install paragrafs
or
pnpm install paragrafs
or
yarn add paragrafs
or
bun add paragrafs
import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the estimated segment
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);
console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);
console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
estimateSegmentFromToken
Splits a single token into word-level tokens and estimates timing for each word.
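As an illustration of the idea, here is a minimal sketch of splitting one token into word-level tokens. It is not the library's implementation: the helper name `splitTokenEvenly` is hypothetical, and it assumes every word gets an equal share of the token's duration, which may differ from how the library estimates timing.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical helper for illustration only (not part of paragrafs).
// Assumption: each word takes an equal share of the token's duration.
function splitTokenEvenly(token: Token): Token[] {
    const words = token.text.split(/\s+/).filter((w) => w.length > 0);
    if (words.length === 0) {
        return [];
    }

    const step = (token.end - token.start) / words.length;

    return words.map((text, i) => ({
        start: token.start + i * step,
        end: token.start + (i + 1) * step,
        text,
    }));
}

const wordTokens = splitTokenEvenly({ start: 0, end: 4, text: 'The quick brown fox' });
console.log(wordTokens);
// wordTokens[1] is { start: 1, end: 2, text: 'quick' }
```

An even split is the simplest choice; weighting each word by its character count would be another reasonable option.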
Marks tokens with segment breaks based on fillers, gaps, and punctuation.
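The marking step can be pictured with the rough sketch below. The function name `markBreaks`, the punctuation set, and the choice to drop fillers outright are all assumptions made for the example; the library's own rules may differ.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;

// Hypothetical sketch (not the paragrafs implementation).
// Assumptions: fillers are dropped, a pause longer than gapThreshold
// inserts a break, and sentence-ending punctuation inserts a break.
function markBreaks(tokens: Token[], fillers: string[], gapThreshold: number): MarkedToken[] {
    const fillerSet = new Set(fillers.map((f) => f.toLowerCase()));
    const marked: MarkedToken[] = [];
    let previous: Token | undefined;

    // Adds a break unless the list is empty or already ends with one.
    const pushBreak = (): void => {
        if (marked.length > 0 && marked[marked.length - 1] !== 'SEGMENT_BREAK') {
            marked.push('SEGMENT_BREAK');
        }
    };

    for (const token of tokens) {
        // Strip trailing punctuation so "uh," still counts as a filler.
        const bare = token.text.toLowerCase().replace(/[.,!?\u061F]+$/, '');
        if (fillerSet.has(bare)) {
            continue;
        }
        if (previous !== undefined && token.start - previous.end > gapThreshold) {
            pushBreak();
        }
        marked.push(token);
        // \u061F is the Arabic question mark.
        if (/[.!?\u061F]$/.test(token.text)) {
            pushBreak();
        }
        previous = token;
    }

    return marked;
}

const marked = markBreaks(
    [
        { start: 0, end: 1, text: 'Hello' },
        { start: 1, end: 2, text: 'uh' },
        { start: 2, end: 3, text: 'world.' },
        { start: 8, end: 9, text: 'Next' },
    ],
    ['uh', 'umm'],
    3,
);
console.log(marked.map((m) => (m === 'SEGMENT_BREAK' ? m : m.text)));
// [ 'Hello', 'world.', 'SEGMENT_BREAK', 'Next' ]
```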
groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]
Groups marked tokens into logical segments based on maximum segment length.
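One way to read this step is sketched below. The name `groupByMaxDuration` is hypothetical, and the rule it applies is an assumption: a break closes the current segment only once that segment has lasted at least `maxSeconds`; earlier breaks stay inside the segment as soft breaks.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical sketch (not the paragrafs implementation).
function groupByMaxDuration(marked: MarkedToken[], maxSeconds: number): MarkedSegment[] {
    const segments: MarkedSegment[] = [];
    let current: MarkedSegment | undefined;

    for (const item of marked) {
        if (item === 'SEGMENT_BREAK') {
            if (current === undefined) {
                continue; // nothing to break yet
            }
            if (current.end - current.start >= maxSeconds) {
                segments.push(current); // long enough: close the segment here
                current = undefined;
            } else {
                current.tokens.push(item); // keep as a soft break inside the segment
            }
            continue;
        }

        if (current === undefined) {
            current = { start: item.start, end: item.end, tokens: [item] };
        } else {
            current.tokens.push(item);
            current.end = item.end;
        }
    }

    if (current !== undefined) {
        segments.push(current);
    }
    return segments;
}

const grouped = groupByMaxDuration(
    [
        { start: 0, end: 2, text: 'A' },
        { start: 2, end: 4, text: 'B.' },
        'SEGMENT_BREAK',
        { start: 4, end: 6, text: 'C' },
        { start: 6, end: 11, text: 'D.' },
        'SEGMENT_BREAK',
        { start: 12, end: 13, text: 'E' },
    ],
    10,
);
console.log(grouped.map((s) => [s.start, s.end]));
// [ [ 0, 11 ], [ 12, 13 ] ]
```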
mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]
Merges segments with too few words into the previous segment.
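The merge can be sketched as follows. The name `mergeShort` is hypothetical, and two details are assumptions: a short first segment is kept as-is because it has no predecessor, and no extra break is inserted where two segments are joined.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical sketch (not the paragrafs implementation).
function mergeShort(segments: MarkedSegment[], minWords: number): MarkedSegment[] {
    const result: MarkedSegment[] = [];

    for (const segment of segments) {
        // Count real words only; break markers do not count.
        const wordCount = segment.tokens.filter((t) => t !== 'SEGMENT_BREAK').length;
        const previous = result[result.length - 1];

        if (previous !== undefined && wordCount < minWords) {
            result[result.length - 1] = {
                start: previous.start,
                end: segment.end,
                tokens: [...previous.tokens, ...segment.tokens],
            };
        } else {
            result.push({ ...segment, tokens: [...segment.tokens] });
        }
    }

    return result;
}

// Builds a one-second token, just to keep the example data short.
const word = (text: string, start: number): Token => ({ start, end: start + 1, text });

const merged = mergeShort(
    [
        { start: 0, end: 4, tokens: [word('one', 0), word('two', 1), word('three', 2), word('four', 3)] },
        { start: 5, end: 6, tokens: [word('five', 5)] },
        { start: 7, end: 10, tokens: [word('six', 7), word('seven', 8), word('eight', 9)] },
    ],
    3,
);
console.log(merged.map((s) => [s.start, s.end, s.tokens.length]));
// [ [ 0, 6, 5 ], [ 7, 10, 3 ] ]
```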
mapSegmentsIntoFormattedSegments
Converts marked segments into clean, formatted segments with proper text representation.
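To make "proper text representation" concrete, the sketch below joins word tokens with spaces and turns each break marker into a line break. The name `toFormattedSegment` and the newline convention are assumptions for the example, not confirmed library behavior.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical sketch (not the paragrafs implementation).
// Assumption: SEGMENT_BREAK markers become newlines in the output text.
function toFormattedSegment(segment: MarkedSegment): Token {
    const lines: string[] = [];
    let currentWords: string[] = [];

    for (const item of segment.tokens) {
        if (item === 'SEGMENT_BREAK') {
            if (currentWords.length > 0) {
                lines.push(currentWords.join(' '));
            }
            currentWords = [];
        } else {
            currentWords.push(item.text);
        }
    }
    if (currentWords.length > 0) {
        lines.push(currentWords.join(' '));
    }

    return { start: segment.start, end: segment.end, text: lines.join('\n') };
}

const formatted = toFormattedSegment({
    start: 0,
    end: 9,
    tokens: [
        { start: 0, end: 1, text: 'Hello' },
        { start: 2, end: 3, text: 'world.' },
        'SEGMENT_BREAK',
        { start: 8, end: 9, text: 'Next' },
    ],
});
console.log(JSON.stringify(formatted.text));
// "Hello world.\nNext"
```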
formatSegmentsToTimestampedTranscript
Formats segments into a human-readable transcript with timestamps.
markAndCombineSegments
Combined utility that processes segments through all the necessary steps.
updateSegmentWithGroundTruth
Synchronizes AI-generated word timestamps with the human-edited transcript (segment.text):
- Uses a longest-common-subsequence (LCS) to find matching words and preserve their original timing.
- Evenly interpolates timestamps for runs of missing words (only when two or more are missing).
- Falls back to estimateSegmentFromToken if no matches are found.
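The alignment described above can be approximated with the sketch below. It is an illustration under stated assumptions, not the library's code: `lcsPairs` and `alignToGroundTruth` are hypothetical names, words are compared case-insensitively with basic punctuation stripped, and every run of unmatched words is interpolated evenly (including a run of one, which is simpler than the "two or more" rule listed above). When nothing matches, the whole range is divided evenly instead of calling estimateSegmentFromToken.

```typescript
type Token = { start: number; end: number; text: string };

// Assumption: comparison ignores case and basic Latin/Arabic punctuation.
const normalize = (w: string): string => w.toLowerCase().replace(/[.,!?;:"'\u061F\u060C\u061B]/g, '');

// Returns [aiIndex, truthIndex] pairs for one longest common subsequence.
function lcsPairs(ai: string[], truth: string[]): Array<[number, number]> {
    const rows = ai.length + 1;
    const cols = truth.length + 1;
    const table: number[][] = Array.from({ length: rows }, () => new Array<number>(cols).fill(0));

    for (let i = 1; i < rows; i++) {
        for (let j = 1; j < cols; j++) {
            table[i][j] =
                ai[i - 1] === truth[j - 1]
                    ? table[i - 1][j - 1] + 1
                    : Math.max(table[i - 1][j], table[i][j - 1]);
        }
    }

    // Walk back through the table to recover the matched index pairs.
    const pairs: Array<[number, number]> = [];
    let i = ai.length;
    let j = truth.length;
    while (i > 0 && j > 0) {
        if (ai[i - 1] === truth[j - 1]) {
            pairs.unshift([i - 1, j - 1]);
            i--;
            j--;
        } else if (table[i - 1][j] >= table[i][j - 1]) {
            i--;
        } else {
            j--;
        }
    }
    return pairs;
}

// Hypothetical sketch (not the paragrafs implementation).
function alignToGroundTruth(aiTokens: Token[], groundTruth: string, start: number, end: number): Token[] {
    const truthWords = groundTruth.split(/\s+/).filter((w) => w.length > 0);
    const pairs = lcsPairs(aiTokens.map((t) => normalize(t.text)), truthWords.map((w) => normalize(w)));
    const result: Array<Token | undefined> = truthWords.map(() => undefined);

    // Matched words keep their original timing but take the ground-truth spelling.
    for (const [aiIndex, truthIndex] of pairs) {
        result[truthIndex] = { ...aiTokens[aiIndex], text: truthWords[truthIndex] };
    }

    // Fill each run of unmatched words evenly between its matched neighbours.
    let k = 0;
    while (k < result.length) {
        if (result[k] !== undefined) {
            k++;
            continue;
        }
        let runEnd = k;
        while (runEnd < result.length && result[runEnd] === undefined) {
            runEnd++;
        }
        const from = k > 0 ? result[k - 1]!.end : start;
        const to = runEnd < result.length ? result[runEnd]!.start : end;
        const step = (to - from) / (runEnd - k);
        for (let m = k; m < runEnd; m++) {
            result[m] = { start: from + (m - k) * step, end: from + (m - k + 1) * step, text: truthWords[m] };
        }
        k = runEnd;
    }

    return result as Token[];
}

const alignedTokens = alignToGroundTruth(
    [
        { start: 0, end: 1, text: 'The' },
        { start: 1, end: 2, text: 'Buick' },
        { start: 2, end: 3, text: 'crown' },
        { start: 3, end: 4, text: 'fox' },
        { start: 4, end: 5, text: 'jumps' },
    ],
    'The quick brown fox jumps',
    0,
    5,
);
console.log(alignedTokens);
// 'quick' is placed at 1..2 and 'brown' at 2..3, between the matched 'The' and 'fox'
```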
type Token = {
    start: number; // Start time in seconds
    end: number; // End time in seconds
    text: string; // The transcribed text
};

type Segment = Token & {
    tokens: Token[]; // Word-by-word breakdown with timings
};

type MarkedToken = 'SEGMENT_BREAK' | Token;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};
Checks if the text ends with punctuation (including Arabic punctuation).
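A check of this kind might look like the sketch below. The helper name `endsWithPunctuation` is hypothetical, and the exact character set is an assumption: it covers `.`, `!`, `?`, the Arabic question mark (U+061F), and the Arabic full stop (U+06D4).

```typescript
// Hypothetical sketch (not the paragrafs implementation).
// \u061F is the Arabic question mark, \u06D4 is the Arabic full stop.
const ENDING_PUNCTUATION = /[.!?\u061F\u06D4]$/;

function endsWithPunctuation(text: string): boolean {
    // Trailing whitespace is ignored before testing the last character.
    return ENDING_PUNCTUATION.test(text.trim());
}

console.log(endsWithPunctuation('The quick brown fox!')); // true
console.log(endsWithPunctuation('ما اسمك\u061F')); // true
console.log(endsWithPunctuation('Jumps right over the')); // false
```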
Formats seconds into a human-readable timestamp (H:MM:SS).
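The conversion can be sketched as below. The name `formatTimestamp` is hypothetical, and the rule that the hours part is omitted for times under one hour is an assumption inferred from the `0:00` and `0:08` lines in the sample transcript output earlier in this README.

```typescript
// Hypothetical sketch (not the paragrafs implementation).
// Assumption: fractional seconds are truncated and hours are shown only when non-zero.
function formatTimestamp(totalSeconds: number): string {
    const whole = Math.floor(totalSeconds);
    const hours = Math.floor(whole / 3600);
    const minutes = Math.floor((whole % 3600) / 60);
    const seconds = whole % 60;

    // Two-digit zero padding for the minutes and seconds fields.
    const pad = (n: number): string => (n < 10 ? `0${n}` : String(n));

    return hours > 0 ? `${hours}:${pad(minutes)}:${pad(seconds)}` : `${minutes}:${pad(seconds)}`;
}

console.log(formatTimestamp(8)); // 0:08
console.log(formatTimestamp(75.9)); // 1:15
console.log(formatTimestamp(3725)); // 1:02:05
```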
- Transcript Formatting: Convert raw transcriptions into readable text
- Subtitle Generation: Create properly formatted subtitles from audio transcriptions
- Document Reconstruction: Rebuild properly formatted documents from extracted text
Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.
To get started:
- Fork the repository
- Install dependencies: bun install (requires Bun)
- Make your changes
- Run tests: bun test
- Submit a pull request
paragrafs is released under the MIT License. See the LICENSE.MD file for more details.
Ragaeeb Haq
Built with TypeScript and Bun. Uses ESM module format.