Support for Gemini, Gemini Live, Cloud Speech-to-Text, and Cloud Text-to-Speech.
Install the plugin:

```bash
npm install @omarimai/agents-plugin-google
```
Basic usage:

```ts
import { multimodal } from '@livekit/agents';
import * as google from '@omarimai/agents-plugin-google';

const model = new google.realtime.RealtimeModel({
  apiKey: process.env.GOOGLE_API_KEY,
  voice: 'Puck',
});

// Function context for tool/function calling; empty for a bare agent
const fncCtx = {};

const agent = new multimodal.MultimodalAgent({
  model,
  fncCtx,
});
```
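To run the agent against a live room, it can be wired into a worker entrypoint. The sketch below assumes the standard `defineAgent`/`JobContext` pattern from `@livekit/agents` and that `MultimodalAgent.start()` accepts the job's room:

```ts
import { type JobContext, defineAgent, multimodal } from '@livekit/agents';
import * as google from '@omarimai/agents-plugin-google';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect(); // join the room assigned to this job

    const model = new google.realtime.RealtimeModel({
      apiKey: process.env.GOOGLE_API_KEY,
      voice: 'Puck',
    });

    const agent = new multimodal.MultimodalAgent({ model, fncCtx: {} });
    await agent.start(ctx.room); // begin the realtime session in the room
  },
});
```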
Set your Google API key in one of two ways:

- set the `GOOGLE_API_KEY` environment variable, or
- pass the `apiKey` parameter to the constructor.
For Vertex AI, also set:

- the `GOOGLE_CLOUD_PROJECT` environment variable, and
- `GOOGLE_APPLICATION_CREDENTIALS` pointing to your service account key.
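As a sketch, using the `vertexai`, `project`, and `location` options from the configuration reference below:

```ts
const model = new google.realtime.RealtimeModel({
  vertexai: true, // authenticate through Google Cloud rather than an API key
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: 'us-central1',
});
```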
Build the plugin:

```bash
pnpm build
```
Create a simple test file to verify that it works with `MultimodalAgent`:

```ts
// test.ts
import { multimodal, llm } from '@livekit/agents';
import * as google from './src/index.js';

const model = new google.realtime.RealtimeModel({
  apiKey: 'your-api-key',
  voice: 'Puck',
});

const fncCtx = new llm.FunctionContext();

const agent = new multimodal.MultimodalAgent({
  model,
  fncCtx,
});

console.log('Google plugin integrated successfully!');
```
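Assuming a TypeScript runner such as `tsx` is available, it can be executed with `npx tsx test.ts`; it should print the success message without throwing.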
Next steps:

- Implement Google Live API connection: research Google's Live API documentation and implement the actual WebSocket connection
- Add authentication: implement proper Google Cloud authentication
- Complete audio processing: finish the audio streaming implementation
- Add function calling: implement function calling support in the realtime session
- Add error handling: implement robust error handling and reconnection logic
- Add tests: create comprehensive tests
- Add LLM/STT/TTS: complete the standard service implementations

Your plugin structure is now ready and should integrate seamlessly with the existing `MultimodalAgent`!
A TypeScript implementation of the Google Gemini Live API for real-time audio conversations with advanced features including function calling, conversation management, and turn detection.
- ✅ Real-time audio streaming with Gemini Live API
- ✅ Function calling and tool integration
- ✅ Advanced conversation management with `session.conversation.item.create()`
- ✅ Response generation control with `session.response.create()`
- ✅ Server-side Voice Activity Detection (VAD) with adaptive thresholds
- ✅ Multi-feature speech detection (audio level, energy, zero crossing rate)
- ✅ Event-driven architecture with comprehensive event emission
- ✅ Session management with recovery and error handling
Install dependencies:

```bash
npm install
```
Set your Google API key:

```bash
export GOOGLE_API_KEY="your-api-key-here"
```
```ts
import { llm } from '@livekit/agents'; // for llm.ChatContext
import { RealtimeModel } from './src/realtime/realtime_model.js';

// Create a realtime model with advanced features
const model = new RealtimeModel({
  model: 'gemini-2.0-flash-live-001',
  voice: 'Puck',
  instructions: 'You are a helpful AI assistant.',
  turnDetection: {
    type: 'server_vad',
    threshold: 0.1,
    silence_duration_ms: 1000
  }
});

// Create a session
const session = model.session({
  fncCtx: {},
  chatCtx: new llm.ChatContext()
});

// Advanced conversation management
session.conversation.item.create({
  role: 'user',
  text: 'Hello, how are you?'
});

// Start response generation
session.response.create();
```
```ts
// Enhanced conversation management: list all items
const items = session.conversation.item.list();
console.log('Conversation items:', items);

// Update a conversation item
session.conversation.item.update('msg_1', {
  content: 'Updated message content'
});

// Delete a conversation item
session.conversation.item.delete('msg_1');

// Clear all conversation items
session.conversation.item.clear();
```
The plugin includes sophisticated turn detection with multiple features:
```ts
const model = new RealtimeModel({
  turnDetection: {
    type: 'server_vad',
    threshold: 0.1,            // audio level threshold
    silence_duration_ms: 1000, // silence duration before turn end
    prefix_padding_ms: 200     // padding before speech start
  }
});

// Listen for turn detection events
session.on('turn_detected', (event) => {
  console.log('Turn detected:', event);
  // event.type: 'silence_threshold'
  // event.duration: silence duration in ms
  // event.timestamp: when the turn was detected
});

session.on('input_speech_started', (event) => {
  console.log('Speech started:', event);
  // event.audioLevel: current audio level
  // event.energyLevel: current energy level
  // event.threshold: adaptive threshold used
});
```
Register and use tools with the session:
```ts
// Register a tool
session.updateTools([
  {
    name: 'get_weather',
    description: 'Get current weather for a location',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string' }
      }
    },
    handler: async (args) => {
      const { location } = args;
      // Look up the weather for `location` here; static data for the example
      return { location, temperature: '72°F', condition: 'sunny' };
    }
  }
]);

// Listen for tool calls
session.on('toolCall', (toolCall) => {
  console.log('Tool called:', toolCall);
});
```
The plugin emits comprehensive events:
```ts
// Transcript events
session.on('transcript', (event) => {
  console.log('Transcript:', event.transcript, 'Final:', event.isFinal);
});

// Generation events
session.on('generation_created', (event) => {
  console.log('Generation started:', event.messageId);
});

// Error handling
session.on('error', (error) => {
  console.error('Session error:', error);
});

// Metrics
session.on('metrics_collected', (metrics) => {
  console.log('Usage metrics:', metrics);
});
```
Advanced session control features:
```ts
// Interrupt current generation
session.interrupt();

// Start user activity
session.startUserActivity();

// Truncate conversation at a specific message
session.truncate('msg_5', 5000); // truncate at msg_5, audio ends at 5 s

// Update session options
session.updateOptions({
  temperature: 0.7,
  maxOutputTokens: 1000
});

// Update instructions
session.updateInstructions('You are now a coding assistant.');

// Clear audio buffer
session.clearAudio();

// Commit audio for processing
session.commitAudio();
```
Handle audio frames with automatic resampling:
```ts
// Push audio frames (automatically resampled)
session.pushAudio(audioFrame);

// Push video frames
session.pushVideo(videoFrame);

// Get current audio buffer
const audioBuffer = session.inputAudioBuffer;
```
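As a usage sketch, a raw PCM buffer can be wrapped in an `AudioFrame` from `@livekit/rtc-node` and pushed to the session; the constructor signature shown here is an assumption based on that package:

```ts
import { AudioFrame } from '@livekit/rtc-node';

// 10 ms of 16 kHz mono PCM (160 samples); silence as a stand-in for mic data
const samples = new Int16Array(160);
const frame = new AudioFrame(samples, 16000, 1, samples.length);

session.pushAudio(frame); // resampled automatically if needed
```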
The plugin includes robust error recovery:
```ts
// Recover from a text response
session.recoverFromTextResponse('item_123');

// The session automatically retries on connection failures,
// using exponential backoff with configurable max retries.
```
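The retry behavior follows the familiar exponential-backoff pattern. Below is a minimal sketch of that strategy, not the plugin's internal implementation (`connectWithBackoff`, `maxRetries`, and `baseDelayMs` are illustrative names):

```ts
// Illustrative sketch: retry a connection with exponential backoff.
async function connectWithBackoff(
  connect: () => Promise<void>,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await connect();
      return;
    } catch (err) {
      if (attempt === maxRetries) throw err; // out of retries
      const delayMs = baseDelayMs * 2 ** attempt; // 500 ms, 1 s, 2 s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```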
Full configuration reference:

```ts
const model = new RealtimeModel({
  // Model configuration
  model: 'gemini-2.0-flash-live-001',
  voice: 'Puck',
  instructions: 'Custom instructions',

  // Generation parameters
  temperature: 0.8,
  maxOutputTokens: 1000,
  topP: 0.9,
  topK: 40,

  // Turn detection
  turnDetection: {
    type: 'server_vad',
    threshold: 0.1,
    silence_duration_ms: 1000
  },

  // Language and location
  language: 'en-US',
  location: 'us-central1',

  // Vertex AI (optional)
  vertexai: false,
  project: process.env.GOOGLE_CLOUD_PROJECT
});
```
`RealtimeModel` methods:

- `session(options)`: Create a new session
- `close()`: Close all sessions

Conversation management:

- `conversation.item.create(message)`: Create conversation item
- `conversation.item.update(id, updates)`: Update conversation item
- `conversation.item.delete(id)`: Delete conversation item
- `conversation.item.list()`: List all conversation items
- `conversation.item.get(id)`: Get specific conversation item
- `conversation.item.clear()`: Clear all conversation items

Response control:

- `response.create()`: Start response generation

Audio and video input:

- `pushAudio(frame)`: Push audio frame
- `pushVideo(frame)`: Push video frame
- `commitAudio()`: Commit audio for processing
- `clearAudio()`: Clear audio buffer

Session control:

- `interrupt()`: Interrupt current generation
- `startUserActivity()`: Start user activity
- `truncate(messageId, audioEndMs)`: Truncate conversation
- `updateOptions(options)`: Update session options
- `updateInstructions(instructions)`: Update instructions
- `updateTools(tools)`: Update available tools

Event handling:

- `on(event, listener)`: Listen for events
- `off(event, listener)`: Remove event listener
- `emit(event, ...args)`: Emit event
Available events:

- `transcript`: Text transcript updates
- `error`: Error events
- `toolCall`: Tool call events
- `generation_created`: New generation started
- `input_audio_transcription_completed`: Audio transcription completed
- `input_speech_started`: Speech started
- `metrics_collected`: Usage metrics
- `turn_detected`: Turn detection events
Licensed under Apache-2.0.