A React hook for speech-to-text using multiple STT providers.
## How It Works

- Audio Recording: Happens in the browser using the Web Audio API
- Audio Processing: Uses FFmpeg WebAssembly (runs entirely in the browser)
- React Hooks: All hooks (`useSTT`, `useFFmpeg`) run on the client side and must be marked with the `'use client'` directive in Next.js 13+
- Transcription API: Only the final transcription request to Whisper API happens on the server
- API Key Management: Sensitive keys are stored and used only on the server
- WebAssembly Loading:

  ```typescript
  // FFmpeg core files are loaded from CDN
  const baseURL = 'https://unpkg.com/@ffmpeg/core@0.12.10/dist/umd';
  await ffmpeg.load({
    coreURL: `${baseURL}/ffmpeg-core.js`,   // JavaScript interface
    wasmURL: `${baseURL}/ffmpeg-core.wasm`  // WebAssembly binary
  });
  ```
- Processing Flow:

  ```typescript
  // 1. Record audio in browser
  // 2. Process with FFmpeg (client-side)
  const processedBlob = await convertAudioToWebM(ffmpeg, audioBlob);
  // 3. Send to server for transcription
  const formData = new FormData();
  formData.append('file', processedBlob);
  ```
## Audio Format Handling

The library automatically handles audio format compatibility across different devices:
- Why Format Conversion?
- iOS devices record in M4A format (not accepted by Whisper API)
- Android devices typically record in WebM format
- Whisper API requires specific formats (WebM, MP3, WAV, etc.)
- Browser compatibility varies across platforms
- Format Strategy:

  ```typescript
  // FFmpeg is always loaded, regardless of device type
  await ffmpeg.load({
    coreURL: `${baseURL}/ffmpeg-core.js`,
    wasmURL: `${baseURL}/ffmpeg-core.wasm`
  });

  // Then check if conversion is needed
  const isWebM = audioBlob.type.includes('webm');
  if (!isWebM) {
    // Convert non-WebM formats (e.g., iOS M4A) to WebM
    processedBlob = await convertAudioToWebM(ffmpeg, audioBlob);
  } else {
    // Android WebM recordings can be used as-is,
    // but FFmpeg is still available for other audio processing
    // like normalization, noise reduction, etc.
    processedBlob = audioBlob;
  }
  ```
- Why WebM?
- Native format for Android recordings (no format conversion needed)
- Excellent compression with Opus codec
- Well-supported by Whisper API
- Good browser compatibility
- Efficient streaming capabilities
- Format Flow:

  ```
  iOS Recording (M4A) ──► FFmpeg Convert ──► WebM/Opus ──┐
                                                         ├──► Optional Processing ──► Whisper API
  Android Recording (WebM) ──────────────────────────────┘     (normalize, denoise, etc.)
  ```
Note: FFmpeg is always loaded because it's needed for audio processing features (normalization, noise reduction, etc.) even when format conversion isn't required. The ~31MB WebAssembly load happens on all devices, but this enables consistent audio processing capabilities across platforms.
## Features

- Unified API for multiple speech-to-text providers
- OpenAI Whisper API (implemented)
- Azure Speech Services (coming soon)
- Google Cloud Speech-to-Text (coming soon)
- Simple React hook interface with TypeScript support
- Real-time audio recording with pause/resume capability
- Advanced audio processing options:
- Audio format conversion (WebM/Opus)
- Bitrate control
- Volume normalization
- Noise reduction
- Voice Activity Detection (VAD)
- Proper cleanup and resource management
- Comprehensive error handling
- Development mode support (React StrictMode compatible)
- Secure API key handling through server actions
## Installation

```bash
npm install use-stt
```
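If you use the FFmpeg-based audio processing shown in the examples below, you will also need the `@ffmpeg/ffmpeg` package (and, depending on your setup, `@ffmpeg/util`) installed alongside `use-stt`.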
## Usage

First, create a server action that keeps your OpenAI API key on the server and forwards the recorded audio to the Whisper API:

```typescript
// app/actions/transcribe.ts
'use server';

export async function transcribe(formData: FormData) {
  if (!process.env.OPENAI_API_KEY) {
    throw new Error('OPENAI_API_KEY is not configured');
  }
  try {
    const file = formData.get('file') as File;
    if (!file) {
      throw new Error('No file provided');
    }
    // Prepare form data for Whisper API
    const whisperFormData = new FormData();
    whisperFormData.append('file', file);
    whisperFormData.append('model', 'whisper-1');
    whisperFormData.append('response_format', 'json');

    const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`
      },
      body: whisperFormData
    });
    if (!response.ok) {
      throw new Error(`Whisper API error ${response.status}: ${await response.text()}`);
    }

    const result = await response.json();
    return {
      transcript: result.text,
      confidence: result.confidence
    };
  } catch (error) {
    console.error('Transcription error:', error);
    throw error;
  }
}
```
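The action reads `OPENAI_API_KEY` from the server environment (for example, from `.env.local` in a Next.js project), so the key is never shipped to the browser.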
Then use the hook in a client component, processing the recorded audio with FFmpeg before sending it to the server action:

```tsx
// components/SpeechToText.tsx
'use client';
import React, { useState, useCallback } from 'react';
import { useSTT } from 'use-stt';
import { FFmpeg } from '@ffmpeg/ffmpeg';
import { transcribe } from '@/app/actions/transcribe';
import type { FFmpegConfig } from 'use-stt';
import { convertAudioToWebM } from 'use-stt/utils';
// Initialize FFmpeg (do this once)
let ffmpeg: FFmpeg | null = null;
// Default audio processing config
const defaultConfig: FFmpegConfig = {
outputSampleRate: 16000,
outputChannels: 1,
bitrate: '24k',
normalize: true,
normalizationLevel: -16,
denoise: false,
vad: false,
vadLevel: 1,
compressionLevel: 10
};
function SpeechToTextDemo() {
// Audio processing configuration state
const [config, setConfig] = useState<FFmpegConfig>(defaultConfig);
// Wrapper function to handle audio processing and transcription
const transcribeAudio = useCallback(async (audioBlob: Blob) => {
try {
// Initialize FFmpeg if needed
if (!ffmpeg) {
console.log('Initializing FFmpeg...');
ffmpeg = new FFmpeg();
await ffmpeg.load();
}
// Process audio with FFmpeg
console.log('Converting audio with config:', config);
const processedBlob = await convertAudioToWebM(ffmpeg, audioBlob, config);
// Send to server for transcription
const formData = new FormData();
formData.append('file', processedBlob, 'audio.webm');
return transcribe(formData);
} catch (error) {
console.error('Audio processing error:', error);
throw error;
}
}, [config]);
const {
isRecording,
isProcessing,
transcript,
error,
startRecording,
stopRecording,
pauseRecording,
resumeRecording
} = useSTT({
provider: 'whisper',
transcribe: transcribeAudio
});
return (
<div>
{/* Audio Processing Configuration */}
<div className="mb-4">
<h3>Audio Processing Options</h3>
<div>
<label>
<input
type="checkbox"
checked={config.normalize}
onChange={(e) => setConfig(prev => ({
...prev,
normalize: e.target.checked
}))}
/>
Normalize Volume
</label>
<label>
<input
type="checkbox"
checked={config.denoise}
onChange={(e) => setConfig(prev => ({
...prev,
denoise: e.target.checked
}))}
/>
Reduce Background Noise
</label>
<select
value={config.bitrate}
onChange={(e) => setConfig(prev => ({
...prev,
bitrate: e.target.value
}))}
>
<option value="16k">16 kbps (Low)</option>
<option value="24k">24 kbps (Default)</option>
<option value="32k">32 kbps (Better)</option>
</select>
</div>
</div>
{/* Recording Controls */}
<div>
<button
onClick={startRecording}
disabled={isRecording || isProcessing}
>
Start Recording
</button>
<button
onClick={stopRecording}
disabled={!isRecording}
>
Stop Recording
</button>
<button
onClick={pauseRecording}
disabled={!isRecording}
>
Pause
</button>
<button
onClick={resumeRecording}
disabled={!isRecording}
>
Resume
</button>
</div>
{isProcessing && <p>Processing audio...</p>}
<div>
<h3>Transcript:</h3>
<p>{transcript || 'No transcript yet'}</p>
</div>
{error && (
<div>
<h3>Error:</h3>
<p>{error.message}</p>
</div>
)}
</div>
);
}

export default SpeechToTextDemo;
```
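Keeping the `FFmpeg` instance at module scope (outside the component) means the WebAssembly payload is downloaded and initialized only once per page, no matter how often the component re-renders.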
## Audio Processing Options
The library supports various audio processing options through the `FFmpegConfig` interface:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| bitrate | string | '24k' | Audio bitrate (e.g., '16k', '24k', '32k') |
| normalize | boolean | true | Enable volume normalization |
| normalizationLevel | number | -16 | Target normalization level in dB |
| denoise | boolean | false | Apply noise reduction |
| vad | boolean | false | Enable Voice Activity Detection |
| vadLevel | number | 1 | VAD sensitivity (0-3) |
| compressionLevel | number | 10 | Opus compression level (0-10) |
See the [examples](./examples) directory for a complete demo with audio processing controls.
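As an illustration, a configuration for noisy recordings might enable denoising and VAD on top of the defaults. The option values below are only a suggestion; the config is passed to `convertAudioToWebM` exactly as in the demo component above:

```typescript
import type { FFmpegConfig } from 'use-stt';

// Suggested settings for noisy recordings (values are illustrative).
const noisyEnvironmentConfig: FFmpegConfig = {
  outputSampleRate: 16000,
  outputChannels: 1,
  bitrate: '24k',
  normalize: true,         // even out volume differences
  normalizationLevel: -16, // target level in dB
  denoise: true,           // apply noise reduction
  vad: true,               // trim non-speech segments
  vadLevel: 2,             // sensitivity 0-3
  compressionLevel: 10
};

// const processedBlob = await convertAudioToWebM(ffmpeg, audioBlob, noisyEnvironmentConfig);
```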
## API Reference
### useSTT Options
| Option | Type | Required | Description |
|--------|------|----------|-------------|
| provider | 'whisper' \| 'azure' \| 'google' | Yes | The STT provider to use |
| transcribe | (blob: Blob) => Promise<{transcript: string, confidence?: number}> | Yes | Function to handle transcription (typically a server action) |
| language | string | No | Language code (e.g., 'en', 'es') |
| model | string | No | Model name (provider-specific) |
### Return Value
| Property | Type | Description |
|----------|------|-------------|
| isRecording | boolean | Whether recording is in progress |
| isProcessing | boolean | Whether audio is being processed |
| transcript | string | The current transcript text |
| error | Error \| null | Any error that occurred |
| startRecording | () => Promise<void> | Start recording |
| stopRecording | () => Promise<void> | Stop recording and process audio |
| pauseRecording | () => void | Pause recording |
| resumeRecording | () => void | Resume recording |
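As a minimal sketch of how these pieces fit together (it sends the recorded blob straight to the server action and skips the FFmpeg processing covered earlier, so iOS recordings may still need the conversion step from the full example):

```tsx
'use client';
import { useSTT } from 'use-stt';
import { transcribe as transcribeAction } from '@/app/actions/transcribe';

export function MinimalRecorder() {
  const { isRecording, isProcessing, transcript, error, startRecording, stopRecording } = useSTT({
    provider: 'whisper',
    transcribe: async (blob) => {
      const formData = new FormData();
      formData.append('file', blob, 'audio.webm');
      return transcribeAction(formData);
    }
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording} disabled={isProcessing}>
        {isRecording ? 'Stop' : 'Record'}
      </button>
      {isProcessing && <p>Processing...</p>}
      <p>{transcript || 'No transcript yet'}</p>
      {error && <p>{error.message}</p>}
    </div>
  );
}
```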
### Controlling Hook Activity (Disabling/Pausing)
In some scenarios, you might want to temporarily "pause" or "disable" the `useSTT` hook, for example, when a parent form is submitting or when another UI element takes precedence. This prevents the hook from starting new recordings, processing audio, or updating its state while it's meant to be inactive.
You can achieve this by adding a `disabled` option to your `useSTT` hook.
**1. Modify `UseSTTOptions` in `useSTT`:**
Add a `disabled` option to your hook's options interface:
```typescript
// In your custom useSTT.ts or equivalent
export interface UseSTTOptions {
// ... existing options like provider, language, transcribe ...
disabled?: boolean; // New option to disable the hook
}
```
**2. Implement the `disabled` logic within `useSTT`:**

When the `disabled` option is `true`, your `useSTT` hook should:
- Not initiate new recordings or speech recognition processes.
- Avoid firing effects that update its internal state (e.g., `isRecording`, `isProcessing`, `transcript`, `error`).
- Clean up, or avoid establishing, any internal event listeners, timers, or connections to speech APIs.
- Optionally, reset its state to an idle or default state.
Here's a conceptual example of how to use the `disabled` flag within `useSTT`:
```typescript
// In your custom useSTT.ts or equivalent
import { useState, useEffect, useCallback } from 'react';

export function useSTT(options: UseSTTOptions) {
const {
provider,
language,
transcribe,
disabled = false // Default to false if not provided
} = options;
const [transcript, setTranscript] = useState('');
const [isRecording, setIsRecording] = useState(false);
const [isProcessing, setIsProcessing] = useState(false);
const [error, setError] = useState<Error | null>(null);
// ... other internal state and refs ...
// Example: Guarding an effect
useEffect(() => {
if (disabled) {
// If disabled, ensure no processing happens and clean up.
// This might involve stopping any active recognition,
// clearing timeouts, removing event listeners, etc.
setIsRecording(false);
setIsProcessing(false);
// setTranscript(''); // Optionally reset transcript
// setError(null); // Optionally clear errors
// Example: If using Web Speech API directly
// if (speechRecognitionInstanceRef.current) {
// speechRecognitionInstanceRef.current.stop();
// speechRecognitionInstanceRef.current.onresult = null;
// // ... remove other listeners ...
// }
return; // Bail out of the effect
}
// ... otherwise, proceed with normal effect logic ...
// (e.g., setting up speech recognition, handling events)
}, [disabled, /* ... other relevant dependencies ... */]);
// Example: Guarding a callback
const startRecording = useCallback(async () => {
if (disabled) {
console.log('useSTT is disabled, startRecording aborted.');
return;
}
// ... existing startRecording logic ...
}, [disabled, /* ... other dependencies for startRecording ... */]);
const stopRecording = useCallback(async () => {
if (disabled) {
console.log('useSTT is disabled, stopRecording aborted.');
// Even if disabled, you might want to allow stopping an ongoing recording
// that was started *before* it was disabled. This depends on desired behavior.
// If truly dormant, then perhaps this also becomes a no-op.
// For now, let's assume it should still try to stop if one was in progress.
}
// ... existing stopRecording logic ...
// Consider if isRecording should be set to false here if disabled is true.
}, [disabled, /* ... other dependencies for stopRecording ... */]);
// Ensure all other effects, callbacks, and internal processes
// similarly respect the 'disabled' flag.
return {
transcript,
isRecording,
isProcessing,
error,
startRecording,
stopRecording,
// ... other returned values
};
}
```
**Important:** The exact implementation details within `useSTT` (like how to stop underlying speech APIs or clear resources) will depend on your specific hook's internal architecture. The key is to make the hook as quiescent as possible when `disabled` is `true`.
**3. Use the `disabled` option in a parent component:**

Now, your parent component can pass its own state as the `disabled` option to `useSTT`:
```tsx
// Example: In a component like VoiceInput.tsx
// Assume 'isSubmitting' is a state variable in this parent component.
const {
transcript: sttTranscript,
isRecording,
isProcessing: isProcessingSTT,
error: sttError,
startRecording,
stopRecording
} = useSTT({
provider: 'whisper',
// ... other options ...
disabled: isSubmitting, // Pass the parent's state here
});
// UI elements can also use 'isSubmitting' to disable themselves
// <button onClick={startRecording} disabled={isSubmitting || isRecording}>
// Record
// </button>
```
By implementing this pattern, you gain fine-grained control over when `useSTT` is active, helping to prevent unexpected behavior and state updates during critical periods like form submissions.
See the [examples](./examples) directory for working examples.
## Next.js Integration

When using this library with Next.js 13+ (App Router), ensure your components are properly marked:
```tsx
// components/AudioRecorder.tsx
'use client'; // Required because this component uses browser APIs
import { useSTT } from 'use-stt';
export function AudioRecorder() {
const { transcript, startRecording } = useSTT({...});
// ...
}
```
```typescript
// app/api/transcribe/route.ts
// Server-side API route for transcription
export async function POST(request: Request) {
const formData = await request.formData();
const file = formData.get('file');
// Process with Whisper API...
}
```
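The body of such a route would mirror the `transcribe` server action shown earlier; a minimal sketch, assuming the same `OPENAI_API_KEY` environment variable:

```typescript
// app/api/transcribe/route.ts (sketch)
import { NextResponse } from 'next/server';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File | null;
  if (!file) {
    return NextResponse.json({ error: 'No file provided' }, { status: 400 });
  }

  // Forward the audio to the Whisper API, as in the server action above
  const whisperFormData = new FormData();
  whisperFormData.append('file', file);
  whisperFormData.append('model', 'whisper-1');

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` },
    body: whisperFormData
  });
  if (!response.ok) {
    return NextResponse.json({ error: await response.text() }, { status: response.status });
  }

  const result = await response.json();
  return NextResponse.json({ transcript: result.text });
}
```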
## FFmpeg WebAssembly Loading

The library uses FFmpeg-WASM from a CDN (unpkg.com). This means:
- No server-side FFmpeg installation needed
- First load might be slower (downloading WASM files)
- Requires internet connection to load FFmpeg
- ~31MB of WASM code loaded in browser
To optimize for production:
- Consider self-hosting FFmpeg-WASM files
- Implement proper loading states
- Add error handling for offline scenarios
```typescript
// Example of custom FFmpeg URL configuration
const ffmpeg = new FFmpeg();
await ffmpeg.load({
coreURL: '/ffmpeg/ffmpeg-core.js', // Self-hosted
wasmURL: '/ffmpeg/ffmpeg-core.wasm' // Self-hosted
});
```
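For loading states and offline handling, one approach is a small memoized loader. This is a sketch, not part of the library's public API; the `loadFFmpeg` name matches the helper referenced in Troubleshooting below, and it reuses the self-hosted paths from the snippet above:

```typescript
import { FFmpeg } from '@ffmpeg/ffmpeg';

let ffmpeg: FFmpeg | null = null;
let loading: Promise<FFmpeg> | null = null;

// Load FFmpeg once and reuse the instance on subsequent calls.
export function loadFFmpeg(): Promise<FFmpeg> {
  if (ffmpeg) return Promise.resolve(ffmpeg);
  if (loading) return loading;

  loading = (async () => {
    const instance = new FFmpeg();
    try {
      await instance.load({
        coreURL: '/ffmpeg/ffmpeg-core.js',  // self-hosted, as above
        wasmURL: '/ffmpeg/ffmpeg-core.wasm'
      });
    } catch (error) {
      loading = null; // allow a retry, e.g. once the user is back online
      throw new Error('Failed to load FFmpeg. Are you offline?');
    }
    ffmpeg = instance;
    return instance;
  })();

  return loading;
}
```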
## Browser Support

- Chrome/Chromium (Desktop & Android) ✅
- Firefox (Desktop & Android) ✅
- Safari (Desktop & iOS) ✅
- Edge (Chromium-based) ✅
- iOS/Safari: Records in M4A format, automatically converted to WebM
- Android: Records natively in WebM format
- Desktop: Format varies by browser, automatically handled
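The library relies on the following browser features: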
- `MediaRecorder` API
- WebAssembly support
- `AudioContext` API
- Service Workers (for worker functionality)
## Performance

- FFmpeg WASM (~31MB) is loaded on first use
- Consider implementing:

  ```typescript
  // Preload FFmpeg in a low-priority way
  const preloadFFmpeg = () => {
    const link = document.createElement('link');
    link.rel = 'preload';
    link.as = 'fetch';
    link.href = 'https://unpkg.com/@ffmpeg/core@0.12.10/dist/umd/ffmpeg-core.wasm';
    document.head.appendChild(link);
  };
  ```
- FFmpeg WASM typically uses 50-100MB RAM when processing
- Audio processing is done in chunks to manage memory
- Large recordings (>1 hour) may require additional memory management (see the sketch below)
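For long sessions, one approach (a sketch, assuming the module-level `ffmpeg` instance from the demo component) is to release object URLs and tear down the FFmpeg worker after a large job; `terminate()` frees the WASM memory at the cost of re-loading FFmpeg on the next use:

```typescript
import { FFmpeg } from '@ffmpeg/ffmpeg';

let ffmpeg: FFmpeg | null = null; // module-level instance, as in the demo component

// Sketch: free memory after processing a long recording.
export function releaseAudioResources(audioUrl?: string) {
  if (audioUrl) {
    URL.revokeObjectURL(audioUrl); // release the blob-backed object URL
  }
  if (ffmpeg) {
    ffmpeg.terminate();            // shut down the FFmpeg worker and its WASM memory
    ffmpeg = null;                 // re-created (and re-downloaded) on next use
  }
}
```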
- Initial FFmpeg download: ~31MB
- WebM audio file size: ~100KB per minute (at 24k bitrate)
- Whisper API uploads: Processed audio file size
## Troubleshooting

- **"FFmpeg not loaded" Error**

  ```typescript
  // Ensure FFmpeg is loaded before use
  if (!ffmpeg) {
    await loadFFmpeg();
  }
  ```
- **iOS Audio Format Issues**
  - Ensure `type: 'audio/webm'` in recorder options
  - Check FFmpeg conversion logs
  - Verify Whisper API accepts the format
- **Memory Issues**
  - Implement cleanup after processing
  - Use `URL.revokeObjectURL()` for audio URLs
  - Monitor memory usage in DevTools
- **CORS Issues with FFmpeg Loading**
  - When self-hosting FFmpeg files, ensure proper CORS headers
  - Check the browser console for CORS errors
  - Verify CDN access if using unpkg
Enable debug logging:
```typescript
const { transcript, error } = useSTT({
  provider: 'whisper',
  transcribe: transcribeAudio,
  debug: true // Enables detailed logging
});
```
## Security

- NEVER expose API keys in client-side code
- Use server actions or API routes for Whisper API calls
- Implement proper environment variable handling
- Audio processing happens locally in the browser
- Only processed audio is sent to Whisper API
- No intermediate storage on servers
- Consider implementing:
  ```typescript
  // Clear processed audio data
  const cleanup = () => {
    if (audioUrl) {
      URL.revokeObjectURL(audioUrl);
    }
    // Clear any stored blobs
    processedBlob = null;
  };
  ```
- If self-hosting FFmpeg files, set appropriate headers:

  ```nginx
  # Nginx configuration example
  location /ffmpeg/ {
    add_header Cross-Origin-Resource-Policy cross-origin;
    add_header Cross-Origin-Embedder-Policy require-corp;
  }
  ```
- Add the required cross-origin headers, e.g. in Next.js:

  ```javascript
  // next.config.js
  const nextConfig = {
    headers: async () => [{
      source: '/:path*',
      headers: [
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
        { key: 'Cross-Origin-Resource-Policy', value: 'cross-origin' }
      ]
    }]
  };

  module.exports = nextConfig;
  ```
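These headers relate to cross-origin isolation: `Cross-Origin-Embedder-Policy: require-corp` is typically required if you use the multi-threaded FFmpeg core (which relies on `SharedArrayBuffer`), and `Cross-Origin-Resource-Policy: cross-origin` allows the self-hosted FFmpeg assets to be embedded under that policy.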
## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.
## Publishing to npm

To publish a new version of the `use-stt` package to npm, follow these steps:

1. **Ensure all changes are committed:** Make sure your working directory is clean and all desired changes are included in your Git history.
2. **Update the package version:** Decide on the appropriate version bump (patch, minor, or major) according to Semantic Versioning (SemVer). You can use the `npm version` command to update the version in `package.json` and create a version commit and tag.
   - For a patch release (bug fixes): `npm version patch`
   - For a minor release (new, non-breaking features): `npm version minor`
   - For a major release (breaking changes): `npm version major`
3. **Push Git commits and tags** (if you used `npm version`, which creates them): `git push && git push --tags`
4. **Publish to npm:** Run `npm publish`. The `prepublishOnly` script in `package.json` will automatically run `npm run build` before publishing, ensuring that the latest code is compiled.