Voice Agents: The Something Big™
Executive Summary
Voice-enabled AI agents represent the next frontier in human-AI collaboration. These agents can:
- Listen passively to conversations and meetings
- Distinguish speakers through voice recognition
- Respond intelligently only when called upon
- Maintain context across entire conversations
- Collaborate with other agents in real-time
This is not just an incremental feature; it's a paradigm shift in how users interact with AI.
Architecture Overview
Core Components We Already Have (95%!) ✅ UPGRADED
```mermaid
graph TB
    subgraph "Existing Infrastructure - ENHANCED"
        VoiceButton[VoiceRecordButtonHybrid - NEW State Machine]
        StateMachine[Voice State Machine - NEW Clean Architecture]
        AgentSystem[Agent Personality System]
        AgentBench[AgentBench - Dynamic Attachment]
        WebSocket[WebSocket Real-time + Voice Handlers]
        Sessions[Session Management]
        Queues[Queue Processing]
        Memory[Memento System]
        Tools[Function Calling]
        Debug[VoiceDebugPanelV2 - NEW Enhanced Diagnostics]
    end
    subgraph "Voice Processing Layer - IMPLEMENTED"
        RealTimeAPI[OpenAI Realtime API]
        AudioStream[Audio Streaming - PCM16]
        HybridMode[Hybrid Voice Mode - NEW]
        ModelPreservation[Model Preservation - NEW]
        TriggerDetect[Trigger Detection]
    end
    VoiceButton --> AudioStream
    AudioStream --> RealTimeAPI
    RealTimeAPI --> Diarization
    Diarization --> Sessions
    TriggerDetect --> AgentSystem
    AgentSystem --> WebSocket
```
What We're Building
```typescript
interface VoiceAgent extends IAgent {
  // Existing agent properties +
  voiceCapabilities: {
    voiceId?: string; // Unique voice fingerprint
    wakeWords: string[]; // "Hey Marketing Agent", "Assistant"
    listeningMode: 'passive' | 'active'; // Passive by default
    responseVoice: {
      model: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer';
      speed: number; // 0.25 to 4.0
      emotion?: string; // Based on personality
    };
  };
  sessionState: {
    isListening: boolean;
    lastSpoke: Date;
    contextWindow: TranscriptSegment[];
    speakerProfiles: Map<string, SpeakerProfile>;
  };
}
```
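For illustration, a hypothetical marketing agent's voice configuration might look like this (the wake words, voice, speed, and emotion are example values, not defaults):

```typescript
// Hypothetical example values for the voiceCapabilities block above
const marketingAgentVoice: VoiceAgent['voiceCapabilities'] = {
  wakeWords: ['Hey Marketing Agent', 'Marketing Agent'],
  listeningMode: 'passive', // stay quiet until addressed
  responseVoice: {
    model: 'nova',      // upbeat voice to match the personality
    speed: 1.1,         // slightly faster than the default 1.0
    emotion: 'enthusiastic',
  },
};
```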
Quest Chain: From MVP to Magic
Bronze Tier: Voice-Activated Agents
Goal: Basic voice input/output with existing agents
Quest 1.1: Resurrect the Voice Button
- Audit existing `VoiceRecordButton` component
- Update to use modern Web Audio API
- Add visual feedback for recording state
- Test with existing text-based agents
Quest 1.2: OpenAI Realtime Integration
- Implement OpenAI Realtime API client
- Add WebSocket handler for audio streaming
- Create audio-to-text pipeline
- Test with `gpt-4o-mini-realtime-preview`
Quest 1.3: Voice Response
- Add text-to-speech for agent responses
- Implement voice selection based on agent personality
- Add audio playback controls
- Create "speaking" indicator on AgentBench
Deliverable: Agents can receive voice input and respond with voice
Silver Tier: Passive Listening Mode
Goal: Agents listen continuously and respond when called
Quest 2.1: Continuous Listening
- Implement streaming audio capture
- Add silence detection and segmentation
- Create rolling transcript buffer
- Add visual "listening" indicator
Quest 2.2: Wake Word Detection
- Implement trigger phrase detection
- Support both agent names and custom wake words
- Add "attention" animation when triggered
- Create cooldown to prevent spam (see the sketch below)
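A minimal sketch of the trigger matching and cooldown, assuming transcripts arrive as plain text segments and each agent exposes an `id` (the 5-second cooldown is an example value):

```typescript
// Sketch: match wake words in a transcript segment, with a per-agent cooldown
const COOLDOWN_MS = 5_000; // example value to prevent trigger spam
const lastTriggered = new Map<string, number>(); // agentId -> last trigger time

function detectWakeWord(agent: VoiceAgent, transcript: string): boolean {
  const text = transcript.toLowerCase();
  const matched = agent.voiceCapabilities.wakeWords.some((w) =>
    text.includes(w.toLowerCase()),
  );
  if (!matched) return false;

  const last = lastTriggered.get(agent.id) ?? 0;
  if (Date.now() - last < COOLDOWN_MS) return false; // still cooling down

  lastTriggered.set(agent.id, Date.now());
  return true;
}
```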
Quest 2.3: Context-Aware Responses
- Store conversation history in session
- Implement context window management
- Add "what did I miss?" capability
- Enable referencing previous speakers
Deliverable: Agents passively listen and respond when called by name
Gold Tier: Multi-Speaker Recognition
Goal: Distinguish and remember different speakers
Quest 3.1: Speaker Diarization
- Implement basic speaker segmentation
- Create speaker embedding profiles
- Add speaker labels to transcript
- Store speaker voices for session
Quest 3.2: Speaker Memory
- Link speakers to user profiles
- Remember speaker preferences
- Track speaker-specific context
- Enable "What did [Speaker] say about X?" (see the sketch below)
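One way the "What did [Speaker] say about X?" capability could work is a simple filter over the session transcript before handing the matches to the model; this sketch assumes each `TranscriptSegment` carries a `speakerId` and `text` field.

```typescript
// Sketch: pull one speaker's statements about a topic from the transcript buffer
function whatDidSpeakerSay(
  segments: TranscriptSegment[],
  speakerId: string,
  topic: string,
): TranscriptSegment[] {
  const needle = topic.toLowerCase();
  return segments.filter(
    (s) => s.speakerId === speakerId && s.text.toLowerCase().includes(needle),
  );
}
```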
Quest 3.3: Personalized Responses
- Tailor responses to specific speakers
- Remember past interactions with speakers
- Adjust formality based on speaker
- Track speaker sentiment/mood
Deliverable: Agents recognize and remember individual speakers
Diamond Tier: Intelligent Participation
Goal: Agents participate naturally in conversations
Quest 4.1: Conversation Flow Understanding
- Detect conversation topics and transitions
- Identify questions vs statements
- Recognize when input is expected
- Track conversation momentum
Quest 4.2: Proactive Contributions
- Detect long silences and offer help
- Identify confusion and clarify
- Suggest relevant information
- Flag important points for follow-up
Quest 4.3: Multi-Agent Orchestration
- Enable multiple voice agents simultaneously
- Implement agent hand-offs
- Coordinate agent responses
- Prevent agent interruptions
Deliverable: Agents participate naturally like human team members
Legendary Tier: Meeting Intelligence
Goal: Complete meeting assistant capabilities
Quest 5.1: Meeting Modes
- Standup mode (track updates by person)
- Brainstorm mode (capture all ideas)
- Decision mode (track action items)
- Review mode (summarize key points)
Quest 5.2: Visual Integration
- Screen share awareness
- Slide tracking and referencing
- Whiteboard capture
- Document correlation
Quest 5.3: Post-Meeting Intelligence
- Auto-generate meeting summaries
- Extract and assign action items
- Create follow-up reminders
- Generate meeting insights
Deliverable: Full AI meeting assistant that rivals human note-takers
Technical Implementation
Two-Layer WebSocket Architecture
Voice agents use a proxy architecture with two separate WebSocket connections:
```mermaid
graph LR
    Browser[Browser/Client] -->|Existing WebSocket| AWS[AWS API Gateway]
    AWS --> Lambda[Lambda Functions]
    Lambda -->|New WebSocket via 'ws' package| OpenAI[OpenAI Realtime API]
```
This design keeps API keys secure and enables custom processing. For a detailed explanation of why we need two WebSocket layers and the `ws` package, see Real-time API Integration Architecture.
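As a rough sketch of the Lambda side of that proxy (the connection URL and headers follow OpenAI's documented Realtime WebSocket handshake; forwarding logic and error handling are omitted):

```typescript
import WebSocket from 'ws';

// Sketch: the Lambda opens its own WebSocket to OpenAI, so the API key never reaches the browser
export function connectToOpenAI(model: string): WebSocket {
  const ws = new WebSocket(`wss://api.openai.com/v1/realtime?model=${model}`, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1',
    },
  });

  ws.on('message', (raw) => {
    const event = JSON.parse(raw.toString());
    // Forward audio deltas / transcripts in `event` back to the client
    // over the existing API Gateway connection here.
  });

  return ws;
}
```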
Serverless Session Persistence
The Problem: Session Loss Across Lambda Instances
In a serverless environment, voice sessions face a critical challenge:
```mermaid
sequenceDiagram
    participant Client
    participant Lambda A
    participant Lambda B
    participant OpenAI
    participant Memory
    Note over Lambda A,Lambda B: Different Lambda instances
    Client->>Lambda A: voice.session.start
    Lambda A->>OpenAI: Connect & Create Session
    OpenAI-->>Lambda A: Session Created
    Lambda A->>Memory: Store in activeVoiceSessions Map
    Note over Memory: Session exists only in Lambda A's memory
    Lambda A-->>Client: Session ready
    Client->>Lambda B: voice.text.input "Hello"
    Note over Lambda B: Different Lambda instance!
    Lambda B->>Memory: Check activeVoiceSessions Map
    Note over Memory: Map is empty (different process)
    Lambda B-->>Client: ❌ Voice session not found
```
The Solution: Database-Backed Session Recreation
We implemented a `getOrCreateBackend` pattern that enables any Lambda instance to recreate the session:
```mermaid
flowchart TB
    subgraph "Session Creation Flow"
        A[Client Request] --> B{Session in Memory?}
        B -->|Yes| C[Use Existing Session]
        B -->|No| D[Check Database]
        D --> E{Session in DB?}
        E -->|No| F[Return Error]
        E -->|Yes| G[Acquire Recreation Lock]
        G --> H{Lock Acquired?}
        H -->|No| I[Wait & Poll]
        H -->|Yes| J[Recreate Backend]
        J --> K[Connect to OpenAI]
        K --> L[Restore Configuration]
        L --> M[Store in Memory]
        M --> N[Release Lock]
        N --> C
        I --> O{Session Ready?}
        O -->|Yes| C
        O -->|No - Timeout| J
    end
    style F fill:#f96
    style C fill:#9f6
```
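A condensed sketch of that flow; `findVoiceSession`, `acquireRecreationLock`, `releaseRecreationLock`, and `waitForBackend` are assumed helpers standing in for the actual database and locking code:

```typescript
// Sketch: memory-first lookup with database-backed recreation of the OpenAI backend
async function getOrCreateBackend(sessionKey: string): Promise<OpenAIRealtimeBackend> {
  // Fast path: backend already lives in this Lambda's memory
  const existing = activeVoiceSessions.get(sessionKey);
  if (existing) return existing;

  // Slow path: load the stored session config from the database
  const session = await findVoiceSession(sessionKey);
  if (!session) throw new Error('Voice session not found');

  // Only one Lambda should recreate; everyone else waits and polls
  const locked = await acquireRecreationLock(sessionKey);
  if (!locked) return waitForBackend(sessionKey);

  // Recreate the backend, restore its configuration, cache it, release the lock
  const backend = new OpenAIRealtimeBackend();
  await backend.connect();
  await backend.updateSession({
    model: session.model,
    voice: session.voice,
    instructions: session.instructions,
  });
  activeVoiceSessions.set(sessionKey, backend);
  await releaseRecreationLock(sessionKey);
  return backend;
}
```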
Session Persistence Architecture
```mermaid
classDiagram
    class Connection {
        +connectionId: string
        +userId: string
        +voiceSession: VoiceSession
    }
    class VoiceSession {
        +sessionId: string
        +questId: string
        +sessionKey: string
        +agentId: string
        +model: string
        +voice: string
        +instructions: string
        +startedAt: Date
        +recreationLock: Date
        +recreationLambdaId: string
    }
    class OpenAIRealtimeBackend {
        +connect()
        +disconnect()
        +sendText(text: string)
        +sendAudioChunk(audio: string)
        +updateSession(config: object)
        +createResponse()
        +cancelResponse()
    }
    class ActiveVoiceSessions {
        <<Map>>
        +get(sessionKey): Backend
        +set(sessionKey, backend)
        +delete(sessionKey)
    }
    Connection "1" *-- "0..1" VoiceSession
    VoiceSession ..> OpenAIRealtimeBackend : recreates
    ActiveVoiceSessions o-- OpenAIRealtimeBackend : stores
```
Complete Voice Session Lifecycle
```mermaid
sequenceDiagram
    participant Client
    participant Lambda
    participant MongoDB
    participant OpenAI
    rect rgb(200, 230, 200)
        Note right of Client: Session Creation
        Client->>Lambda: voice.session.start
        Lambda->>OpenAI: Create Realtime Session
        OpenAI-->>Lambda: Session Created
        Lambda->>MongoDB: Store Voice Session Config
        Lambda-->>Client: voice.session.created
    end
    rect rgb(230, 200, 200)
        Note right of Client: Different Lambda Instance
        Client->>Lambda: voice.text.input
        Lambda->>Lambda: Check activeVoiceSessions
        Note over Lambda: Not found in memory
        Lambda->>MongoDB: Get Connection & Session
        Lambda->>Lambda: getOrCreateBackend()
        Lambda->>OpenAI: Recreate Connection
        Lambda->>Lambda: Store in activeVoiceSessions
        Lambda->>OpenAI: Send Text Input
        OpenAI-->>Lambda: Audio Response
        Lambda-->>Client: voice.audio.delta
    end
    rect rgb(200, 200, 230)
        Note right of Client: Session End
        Client->>Lambda: voice.session.end
        Lambda->>Lambda: getOrCreateBackend() if needed
        Lambda->>OpenAI: Disconnect
        Lambda->>MongoDB: Remove Voice Session
        Lambda-->>Client: voice.session.ended
    end
```
Key Implementation Details
1. Database Schema Enhancement:
   - Added `voiceSession` field to Connection model
   - Stores all necessary session configuration
   - Enables session recreation from any Lambda
2. Recreation Locking:
   - Prevents multiple Lambdas from recreating simultaneously
   - Uses atomic MongoDB operations (see the sketch below)
   - Includes timeout mechanism for stuck locks
3. Handler Updates:
   - All voice handlers use `getOrCreateBackend()`
   - Seamless fallback to the database on a memory miss
   - Maintains performance with a memory-first approach
4. Session Cleanup:
   - Automatic expiration after 30 minutes of inactivity
   - Proper disconnection from OpenAI
   - Database cleanup on session end
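The recreation lock can be taken with a single atomic `findOneAndUpdate`, as in this sketch (field names mirror the `VoiceSession` schema above; the 30-second stale-lock timeout is an example value):

```typescript
// Sketch: atomic lock acquisition so only one Lambda recreates a given session
const LOCK_TIMEOUT_MS = 30_000; // example timeout for stuck locks

async function acquireRecreationLock(sessionKey: string): Promise<boolean> {
  const staleBefore = new Date(Date.now() - LOCK_TIMEOUT_MS);
  const result = await Connection.findOneAndUpdate(
    {
      'voiceSession.sessionKey': sessionKey,
      // Lock is free, or has been held long enough to be considered stuck
      $or: [
        { 'voiceSession.recreationLock': null },
        { 'voiceSession.recreationLock': { $lt: staleBefore } },
      ],
    },
    {
      $set: {
        'voiceSession.recreationLock': new Date(),
        'voiceSession.recreationLambdaId': process.env.AWS_LAMBDA_LOG_STREAM_NAME,
      },
    },
  );
  return result !== null; // null means another Lambda holds a fresh lock
}
```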
Leveraging Existing Code
1. VoiceRecordButton Enhancement
```typescript
// Current: Basic recording
// Upgrade: Streaming with real-time API
interface EnhancedVoiceRecordProps {
  mode: 'push-to-talk' | 'continuous';
  onAudioStream?: (chunk: Float32Array) => void;
  onTranscript?: (text: string, speaker?: string) => void;
  enableDiarization?: boolean;
}
```
2. AgentBench Voice Indicators
```typescript
// Add to existing AgentBench
interface VoiceAgentChipProps {
  agent: IAgent;
  voiceState: 'listening' | 'thinking' | 'speaking' | 'idle';
  currentSpeaker?: string;
  audioLevel?: number; // 0-1 for visualizer
}
```
3. Session Enhancement
```typescript
// Extend existing session with voice context
interface IVoiceSession extends ISessionDocument {
  voiceEnabled: boolean;
  audioTranscript: {
    segments: TranscriptSegment[];
    speakers: Map<string, SpeakerProfile>;
    lastUpdated: Date;
  };
  voiceAgents: {
    agentId: string;
    state: VoiceAgentState;
    lastTriggered?: Date;
  }[];
}
```
Continuous Listening Architecture
Understanding the Timeout Model
The 30-second timeouts in the WebSocket handlers are per-message timeouts, not session limits:
```typescript
// This timeout applies to processing a single audio chunk, NOT the entire session
'voice:audio:chunk': {
  function: {
    memorySize: '512 MB',
    timeout: '30 seconds', // Time to process ONE audio chunk
    handler: 'packages/client/server/websocket/voice/voiceAudioStream.func',
  }
}
```
How Continuous Listening Works
- WebSocket Connection: Persists for hours/days
- OpenAI Realtime Connection: Maintained in server memory
- Audio Streaming: Chunked into small segments (e.g., 100ms chunks)
- Processing Pipeline:
```
Client → WebSocket → Lambda (30s) → OpenAI → Response
   ↑                                            ↓
   └──────────────── Continuous Loop ───────────┘
```
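A rough client-side sketch of that loop: capture microphone audio, convert it to PCM16, and ship ~100 ms chunks over the existing WebSocket (the route name and base64 framing are assumptions; a ScriptProcessor is used for brevity where an AudioWorklet would be the modern choice):

```typescript
// Sketch: stream ~100 ms PCM16 chunks over the existing WebSocket connection
async function startContinuousCapture(socket: WebSocket) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 24000 });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(2048, 1, 1); // ~85 ms at 24 kHz

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      pcm16[i] = Math.max(-1, Math.min(1, float32[i])) * 0x7fff; // float -> 16-bit
    }
    const bytes = new Uint8Array(pcm16.buffer);
    socket.send(JSON.stringify({
      action: 'voice:audio:chunk', // assumed route name, matching the handler above
      audio: btoa(String.fromCharCode(...bytes)),
    }));
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}
```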
Meeting/Speech Architecture
For long-form listening (meetings, speeches, rambles):
```typescript
interface ContinuousListeningConfig {
  // Session can run indefinitely
  sessionDuration: 'unlimited';
  // Audio chunks are small and process quickly
  audioChunkSize: 100; // milliseconds
  processTimeout: 30; // seconds per chunk
  // Heartbeat keeps connections alive
  heartbeatInterval: 30; // seconds
  // Memory management
  transcriptBuffer: {
    maxSize: 10000; // lines
    rollingWindow: 60; // minutes
  };
}
```
TODO: Future Enhancements for Long Sessions

1. Increase Critical Timeouts:

   ```typescript
   // For session start (includes model loading)
   'voice:session:start': {
     timeout: '60 seconds', // Increase for model initialization
   }
   ```

2. Add Session Heartbeat:

   ```typescript
   'voice:heartbeat': {
     function: {
       timeout: '10 seconds',
       handler: 'packages/client/server/websocket/voice/voiceHeartbeat.func',
     }
   }
   ```

3. Implement Transcript Streaming:
   - Stream transcripts to S3 for long sessions
   - Implement rolling buffer for memory efficiency
   - Add periodic summaries for context management

4. Add Session Recovery:
   - Auto-reconnect on network interruptions (see the sketch below)
   - Resume from last known state
   - Preserve transcript continuity
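Session recovery on the client could start as a simple exponential-backoff reconnect that resumes with the stored session key (a sketch; the `voice:session:resume` route is an assumed name):

```typescript
// Sketch: auto-reconnect with exponential backoff, then ask the server to resume the session
function reconnectWithBackoff(url: string, sessionKey: string, attempt = 0) {
  const delay = Math.min(30_000, 1_000 * 2 ** attempt); // 1s, 2s, 4s ... capped at 30s
  setTimeout(() => {
    const socket = new WebSocket(url);
    socket.onopen = () =>
      socket.send(JSON.stringify({ action: 'voice:session:resume', sessionKey }));
    socket.onclose = () => reconnectWithBackoff(url, sessionKey, attempt + 1);
  }, delay);
}
```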
Example: 2-Hour Meeting Session
```typescript
// Client-side continuous listening
const voiceSession = useVoice({
  mode: 'continuous',
  maxDuration: null, // No limit
  features: {
    speakerDiarization: true,
    autoTranscribe: true,
    contextWindow: 5000, // tokens
  }
});
// Server maintains connection for entire duration
// Each 100ms audio chunk processes in <1 second
// Total session: 2 hours = 72,000 chunks
```
Performance Considerations
- Lambda Concurrency: Each chunk is processed independently
- Memory Usage: ~2GB for voice session state
- Cost Optimization: Batch small chunks when possible
- Latency: <500ms end-to-end for real-time feel
OpenAI Realtime API Integration
```typescript
class VoiceAgentService {
  private realtimeClient: OpenAIRealtimeClient;
  private audioContext: AudioContext;
  private mediaStream: MediaStream;

  async initializeVoiceAgent(agent: IAgent, session: ISession) {
    // 1. Set up audio capture
    this.mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // 2. Connect to OpenAI Realtime
    this.realtimeClient = new OpenAIRealtimeClient({
      model: 'gpt-4o-realtime-preview',
      voice: agent.voiceCapabilities.responseVoice.model,
      instructions: this.buildAgentInstructions(agent),
    });

    // 3. Set up event handlers
    this.realtimeClient.on('transcript', this.handleTranscript);
    this.realtimeClient.on('audio', this.handleAudioResponse);
    this.realtimeClient.on('function_call', this.handleFunctionCall);
  }

  private buildAgentInstructions(agent: IAgent): string {
    return `You are ${agent.name}, ${agent.description}.
Personality: ${agent.personality.majorMotivation}, ${agent.personality.uniqueQuirk}
Mode: Passive listening - only respond when called by name or wake words: ${agent.voiceCapabilities.wakeWords}
Current speakers in room: ${this.getCurrentSpeakers()}
Maintain context of the entire conversation but only speak when addressed.`;
  }
}
```
Speaker Diarization Pipeline
```typescript
class SpeakerDiarization {
  private embeddings: Map<string, Float32Array>;

  async processSpeakerSegment(audio: Float32Array): Promise<string> {
    // 1. Extract speaker embedding
    const embedding = await this.extractEmbedding(audio);
    // 2. Compare with known speakers
    const speakerId = this.findClosestSpeaker(embedding);
    // 3. Update speaker profile
    this.updateSpeakerProfile(speakerId, embedding);
    return speakerId;
  }
}
```
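The `findClosestSpeaker` step could be a cosine-similarity search over the stored embeddings, as in this sketch (the similarity threshold and new-speaker naming are illustrative):

```typescript
// Sketch: nearest-neighbour match over known speaker embeddings
const SIMILARITY_THRESHOLD = 0.75; // below this, treat the voice as a new speaker

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function findClosestSpeaker(
  embedding: Float32Array,
  known: Map<string, Float32Array>,
): string {
  let bestId = '';
  let bestScore = -1;
  for (const [id, profile] of known) {
    const score = cosineSimilarity(embedding, profile);
    if (score > bestScore) {
      bestScore = score;
      bestId = id;
    }
  }
  // Unknown voice: register a new speaker for this session
  return bestScore >= SIMILARITY_THRESHOLD ? bestId : `speaker-${known.size + 1}`;
}
```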
Why This Is Something Big™
Market Differentiation
- First-mover advantage in voice-enabled AI agents
- Natural evolution from text to voice interaction
- Enterprise appeal for meeting intelligence
- Accessibility improvements for all users
Technical Innovation
- Leverages 80% of existing infrastructure
- Clean integration with current agent system
- Scalable architecture for future enhancements
- Real-time performance with OpenAI's new models
User Impact
- 10x productivity in meetings
- Hands-free operation for multitasking
- Natural interaction like human colleagues
- Personalized experience with speaker recognition
Success Metrics
Technical KPIs
- Voice recognition accuracy: >95%
- Response latency: <500ms
- Speaker identification accuracy: >90%
- Concurrent voice sessions: 1000+
User KPIs
- Meeting time saved: 30%+
- User satisfaction: 4.5+ stars
- Daily active voice users: 20%+
- Enterprise adoption rate: 40%+
Demo Scenarios
Scenario 1: Product Team Standup
PM: "Let's do our standup. I'll go first - yesterday I finalized the Q2 roadmap."
Dev: "I fixed the login bug and started on the API refactor."
PM: "Hey @ProductAssistant, what were the main themes from yesterday's customer calls?"
ProductAssistant: "Based on yesterday's 3 customer calls, the main themes were:
1) Request for mobile app (mentioned 5 times), 2) Faster onboarding process..."
Scenario 2: Brainstorming Session
Team: [discussing various ideas]
[After 30 seconds of silence]
CreativeAgent: "I notice we haven't explored the subscription model angle.
Would you like me to share some successful examples from similar products?"
Team: "Yes, please!"
Scenario 3: Technical Discussion
Dev1: "The API is timing out on large requests"
Dev2: "Maybe we need to implement pagination"
Dev1: "@TechArchitect, what's the best practice here?"
TechArchitect: "For your use case, I'd recommend cursor-based pagination
with a 100-item limit. Here's why..." [continues with technical explanation]
Getting Started
Phase 1: Foundation (Immediate)
- Set up OpenAI Realtime API access
- Create `voice-agents` feature branch
- Audit and update VoiceRecordButton
- Design voice agent UI components
Phase 2: MVP
- Implement basic voice input/output
- Add voice capabilities to one agent
- Test with team members
- Gather initial feedback
Phase 3: Rollout
- Expand to passive listening
- Add speaker recognition
- Beta test with customers
- Iterate based on feedback
Conclusion
Voice agents are not just a feature; they're a revolution in human-AI collaboration. By building on Bike4Mind's robust agent infrastructure and leveraging OpenAI's realtime capabilities, we can deliver an experience that feels like magic but is grounded in solid engineering.
This is THE Something Big™ that will differentiate Bike4Mind and delight users!
"The best way to predict the future is to invent it." - Alan Kay
Let's invent the future of AI collaboration together! π