# Voice Agents Implementation Guide

**You Already Have 80% of What You Need!** 🎉
This guide maps your existing codebase to the voice agent requirements, showing exactly what you can reuse and what needs to be built.
## 🟢 What You Already Have
### 1. Voice Input Infrastructure ✅ UPDATED

**Component:** `VoiceRecordButtonHybrid` in `SessionBottom.tsx`
```tsx
// Updated implementation with the new state machine
<VoiceRecordButtonHybrid
  sessionId={currentSessionId || 'temp'}
  questId={currentSessionId ? 'new' : 'temp'}
  agentId={displayAgents.length > 0 ? displayAgents[0].id : undefined}
  model={isImageModel(model) ? ChatModels.GPT4O_REALTIME_PREVIEW : (model as ChatModels)}
  onTranscript={async (transcript: string) => {
    setRecording(false);
    await handleSendClick(transcript);
  }}
  onError={() => setRecording(false)}
  onRecordingStart={() => setRecording(true)}
  onRecordingEnd={async (prompt: string) => {
    setRecording(false);
    await handleSendClick(prompt);
  }}
  onRecordingError={() => setRecording(false)}
  enableHybridMode={true} // NEW: preserves the user-selected model
/>
```
**Status:** ✅ Fully updated with the new state machine and hybrid mode support!
### 2. Agent System with Personality ✅

**What you have:**
- Full personality system with 20+ dimensions
- Agent creation and management
- Trigger word detection (`@agent` mentions)
- Dynamic agent attachment
- Agent state management
```typescript
// From AgentBench.tsx: dynamic agent mention detection
const detectAgentMentions = (text: string): string[] => {
  const mentions = text.match(/@(\w+)/g)?.map(m => m.slice(1).toLowerCase()) || [];
  return mentions;
};
```
### 3. Voice State Machine ✅ NEW

**Clean architecture:** the new state machine replaces the problematic dual-state approach.
```typescript
// New state machine with a single source of truth
import { useVoiceV2 } from './hooks/useVoiceV2';

const {
  state,                // VoiceState enum - single source of truth
  isRecording,          // derived from state
  isSpeaking,           // derived from state
  canRecord,            // derived from state
  canSpeak,             // derived from state
  canInterrupt,         // derived from state
  startSession,         // clean session management
  startRecording,       // immediate state updates
  interruptAndRecord,   // NEW: interrupt the AI to speak
  requestVoiceResponse  // NEW: programmatic voice output
} = useVoiceV2();
```
**Key Benefits:**
- ✅ No more race conditions between boolean flags
- ✅ Validated state transitions prevent impossible states (sketched below)
- ✅ Support for interrupting AI speech to respond
- ✅ Enhanced debugging with `VoiceDebugPanelV2`
- ✅ Model preservation (no forced switching)
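The "validated transitions" benefit comes from checking every state change against an explicit transition table. A minimal sketch of the idea, using hypothetical `VoiceState` values (the real enum lives in `useVoiceV2` and may differ):

```typescript
// Hypothetical sketch - the actual VoiceState enum in useVoiceV2 may differ.
enum VoiceState {
  Idle = 'idle',
  Recording = 'recording',
  Processing = 'processing',
  Speaking = 'speaking',
}

// Each state lists the states it may legally move to.
const validTransitions: Record<VoiceState, VoiceState[]> = {
  [VoiceState.Idle]: [VoiceState.Recording],
  [VoiceState.Recording]: [VoiceState.Processing, VoiceState.Idle],
  [VoiceState.Processing]: [VoiceState.Speaking, VoiceState.Idle],
  [VoiceState.Speaking]: [VoiceState.Recording, VoiceState.Idle], // interruptAndRecord
};

function transition(current: VoiceState, next: VoiceState): VoiceState {
  if (!validTransitions[current].includes(next)) {
    throw new Error(`Invalid voice state transition: ${current} -> ${next}`);
  }
  return next;
}
```

Because `isRecording`, `canSpeak`, and the rest are all derived from this single enum, two flags can never disagree.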
### 4. Real-time Communication ✅

**WebSocket Infrastructure:**
- Existing WebSocket context and connections
- Real-time message updates
- NEW: Voice-specific WebSocket handlers integrated
- Event-based communication
- Subscriber fanout service
```typescript
// You already handle real-time updates!
const { sendJsonMessage, readyState } = useWebsocket();
```
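Voice traffic can ride the same context. A minimal sketch of a voice-event handler, with hypothetical event names (the actual protocol is whatever the integrated voice handlers define):

```typescript
// Hypothetical event names - the real voice protocol may differ.
interface VoiceSocketEvent {
  type: string;
  payload: Record<string, unknown>;
}

function handleVoiceEvent(event: VoiceSocketEvent) {
  switch (event.type) {
    case 'voice.partial_transcript':
      // Update the live transcript UI as text streams in
      break;
    case 'voice.agent_speaking':
      // Flip the agent's AgentBench chip into its "speaking" state
      break;
    default:
      // Fall through to the existing message handlers
      break;
  }
}
```

Outbound voice messages reuse the same `sendJsonMessage` call shown above.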
### 5. Session Management ✅

**What you have:**
- Session creation and persistence
- Message history tracking
- Context management
- WorkBench pattern for pre-session state
### 6. Queue Processing ✅

**Existing queues that can be adapted** (see the sketch after this list):
- `questStartQueue` can handle voice processing tasks
- WebSocket handlers for real-time audio
- Dead letter queues for error handling
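As a sketch of the adaptation, a voice job can be enqueued like any other queue task; the queue name, job shape, and `enqueue` signature below are assumptions about your queue layer:

```typescript
// Hypothetical job shape and queue name - adjust to your queue layer.
interface VoiceProcessingJob {
  sessionId: string;
  audioKey: string; // pointer to stored audio, not raw bytes
  task: 'transcribe' | 'diarize' | 'summarize';
}

async function enqueueVoiceJob(
  enqueue: (queueName: string, job: VoiceProcessingJob) => Promise<void>,
  job: VoiceProcessingJob
) {
  // Failures fall through to the existing dead letter queue handling.
  await enqueue('voiceProcessingQueue', job);
}
```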
### 7. Memory System (Mementos) ✅

**Perfect for** (example sketched after this list):
- Storing voice transcripts
- Speaker profiles
- Conversation summaries
- Meeting notes
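For example, a transcript can be stored through the `createMemento` helper used later in this guide; the `voice_transcript` type string and payload fields here are illustrative assumptions:

```typescript
// Hypothetical memento type and payload - adjust to your Memento schema.
async function saveTranscriptMemento(sessionId: string, transcript: string, speakerId?: string) {
  return createMemento({
    type: 'voice_transcript',
    data: { transcript, speakerId, capturedAt: new Date().toISOString() },
    sessionId,
  });
}
```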
## 🟡 What Needs Enhancement

### 1. VoiceRecordButton → Streaming Audio

**Current:** records and sends complete audio
**Needed:** stream audio chunks in real time
```typescript
// Enhanced VoiceRecordButton props
interface EnhancedVoiceRecordButton {
  mode: 'push-to-talk' | 'continuous' | 'voice-activated';
  // Add streaming capability
  onAudioChunk?: (chunk: Float32Array) => void;
  // Add real-time transcription
  onPartialTranscript?: (text: string) => void;
  // Keep existing callbacks
  onRecordingEnd: (fullTranscript: string) => void;
}
```
### 2. Agent Detection → Wake Word Detection

**Current:** text-based `@mentions`
**Needed:** voice-based wake words
```typescript
// Extend the existing mention detection with voice wake words
const detectVoiceAgentTrigger = (transcript: string, agent: IAgent): boolean => {
  // Reuse existing mention detection
  const textMentions = detectAgentMentions(transcript);
  // Add voice-specific wake words
  const voiceWakeWords = agent.voiceCapabilities?.wakeWords || [];
  return (
    textMentions.includes(agent.name.toLowerCase()) ||
    voiceWakeWords.some(wake => transcript.toLowerCase().includes(wake))
  );
};
```
### 3. AgentBench → Voice Status Indicators

**Current:** shows attached agents
**Needed:** show voice activity status
```tsx
// Extend the AgentBench chip display
<Chip
  variant="soft"
  color={voiceState === 'listening' ? 'success' : 'warning'}
  startDecorator={
    <Box sx={{ position: 'relative' }}>
      <Avatar src={agent.visual?.portraitUrl} />
      {voiceState === 'speaking' && <PulsingIndicator />}
    </Box>
  }
>
  {agent.name}
</Chip>
```
## 🔴 What Needs to Be Built

### 1. OpenAI Realtime Client
```typescript
// New service to create
import WebSocket from 'ws'; // Node 'ws' client: the Realtime API authenticates via headers, so connect from the server

class OpenAIRealtimeService {
  private ws!: WebSocket;

  async connect(config: RealtimeConfig) {
    this.ws = new WebSocket(`wss://api.openai.com/v1/realtime?model=${config.model}`, {
      headers: {
        Authorization: `Bearer ${config.apiKey}`,
        'OpenAI-Beta': 'realtime=v1', // required while the API is in beta
      },
    });
    // Reuse your existing WebSocket patterns!
    this.ws.on('message', this.handleRealtimeMessage);
  }

  // Arrow function keeps `this` bound when passed as a listener
  private handleRealtimeMessage = (data: WebSocket.RawData) => {
    const event = JSON.parse(data.toString());
    // Route transcripts and audio deltas to the app based on event.type
    console.debug('realtime event:', event.type);
  };

  // Stream audio to OpenAI (the API expects base64-encoded 16-bit PCM)
  async streamAudio(audioChunk: Float32Array) {
    this.ws.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: float32ToPcm16Base64(audioChunk), // conversion sketched under utils/audioProcessing.ts below
    }));
  }
}
```
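A minimal usage sketch, assuming `RealtimeConfig` carries the model name and API key:

```typescript
// Assumed RealtimeConfig shape: { model: string; apiKey: string }
const realtime = new OpenAIRealtimeService();
await realtime.connect({
  model: 'gpt-4o-realtime-preview', // matches the model already referenced in SessionBottom.tsx
  apiKey: process.env.OPENAI_API_KEY!,
});

// Feed chunks from the enhanced VoiceRecordButton's onAudioChunk callback
await realtime.streamAudio(new Float32Array(4800)); // ~200 ms at 24 kHz
```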
### 2. Speaker Diarization Service
```typescript
// New capability to add
interface SpeakerProfile {
  id: string;
  embedding: Float32Array;
  name?: string;
  userId?: string; // link to existing users
}

class SpeakerService {
  constructor(private sessionId: string) {}

  // Store in the existing Memento system!
  async saveSpeakerProfile(profile: SpeakerProfile) {
    return createMemento({
      type: 'speaker_profile',
      data: profile,
      sessionId: this.sessionId,
    });
  }
}
```
### 3. Voice Agent State Manager
```typescript
// Extend the existing agent state
interface VoiceAgentState {
  agentId: string;
  isListening: boolean;
  lastActivated?: Date;
  currentSpeaker?: string;
  conversationBuffer: string[];
}

// Add to SessionsContext (initialize the Map so lookups never hit undefined)
const [voiceAgentStates, setVoiceAgentStates] = useState<Map<string, VoiceAgentState>>(new Map());
```
## 🛠️ Implementation Roadmap

### Foundation

- **Audit VoiceRecordButton** ✅ (2 hours)
  - Test current functionality
  - Identify Web Audio API updates needed
- **Set up OpenAI Realtime** (1 day)
  - Create service wrapper
  - Add to existing API structure
- **Extend Agent Model** (4 hours)
  - Add `voiceCapabilities` to `IAgent` (sketched below)
  - Update database schema
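A sketch of what that `IAgent` extension could look like. The field names are assumptions, though `voiceCapabilities.wakeWords` is already assumed by `detectVoiceAgentTrigger` earlier in this guide:

```typescript
// Hypothetical field names - only wakeWords is referenced elsewhere in this guide.
interface VoiceCapabilities {
  enabled: boolean;
  wakeWords?: string[]; // e.g. ['hey atlas'] for "Hey [Agent Name]" triggers
  voiceId?: string;     // preferred TTS voice for this agent
}

interface IAgent {
  // ...existing fields...
  voiceCapabilities?: VoiceCapabilities;
}
```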
### Core Voice Features

- **Implement Streaming Audio** (2 days)
  - Update VoiceRecordButton
  - Add continuous recording mode
  - Connect to OpenAI Realtime
- **Add Voice Indicators** (1 day)
  - Extend AgentBench UI
  - Add speaking/listening states
  - Create audio visualizer
- **Test Basic Flow** (1 day)
  - Voice input → agent response
  - Debug and refine
### Passive Listening

- **Continuous Recording Mode** (2 days)
  - Background audio capture
  - Silence detection
  - Efficient buffering
- **Wake Word Detection** (2 days)
  - Extend mention detection
  - Add voice triggers
  - Test with multiple agents
### Speaker Recognition

- **Basic Diarization** (3 days)
  - Speaker segmentation
  - Embedding generation
  - Profile storage
- **Integration & Testing** (2 days)
  - Connect all components
  - End-to-end testing
  - Performance optimization
## 💡 Quick Wins

### 1. Resurrect Voice Input
Just getting the existing VoiceRecordButton working again will excite users!

### 2. Voice Responses
Add TTS to agent responses using OpenAI's voice models for an instant "wow" factor (sketched below)!
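A minimal sketch using OpenAI's speech endpoint; call it from your backend so the API key stays server-side. The model and voice choices are illustrative:

```typescript
// Minimal TTS sketch - model and voice are illustrative choices.
async function speakAgentResponse(text: string, apiKey: string): Promise<ArrayBuffer> {
  const res = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'tts-1', voice: 'alloy', input: text }),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  return res.arrayBuffer(); // play via an <audio> element or the Web Audio API
}
```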
### 3. Simple Wake Words
Even basic "Hey [Agent Name]" functionality will feel magical.
## 🔗 Code Connection Points

### Existing Files to Modify

- `SessionBottom.tsx`
  - Enhance VoiceRecordButton
  - Add continuous recording toggle
  - Show voice agent states
- `AgentBench.tsx`
  - Add voice activity indicators
  - Show current speaker
  - Display listening status
- `LLMContext.tsx`
  - Add voice model selection
  - Store voice preferences
  - Manage TTS settings
- `WebsocketContext.tsx`
  - Add audio streaming events
  - Handle real-time transcripts
  - Manage voice sessions
### New Files to Create

- `services/VoiceAgentService.ts`
  - OpenAI Realtime client
  - Audio processing
  - Voice state management
- `components/VoiceIndicator.tsx`
  - Speaking animation
  - Audio level visualization
  - Recording status
- `hooks/useVoiceAgent.ts`
  - Voice agent state
  - Audio permissions
  - Recording management
- `utils/audioProcessing.ts`
  - Audio chunking
  - Format conversion
  - Silence detection (both sketched after this list)
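Two of those utilities sketched: RMS-based silence detection (the 0.01 threshold is an assumption to tune against real microphones) and the Float32-to-PCM16 conversion the Realtime client above needs:

```typescript
// Sketch for utils/audioProcessing.ts.

// isSilent: RMS energy check; the threshold is a starting point to tune.
export function isSilent(chunk: Float32Array, threshold = 0.01): boolean {
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) sumSquares += chunk[i] * chunk[i];
  return Math.sqrt(sumSquares / chunk.length) < threshold;
}

// float32ToPcm16Base64: convert Web Audio's Float32 samples to the
// base64-encoded 16-bit PCM the Realtime API expects.
export function float32ToPcm16Base64(chunk: Float32Array): string {
  const pcm = new Int16Array(chunk.length);
  for (let i = 0; i < chunk.length; i++) {
    const s = Math.max(-1, Math.min(1, chunk[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  // Buffer is Node-only; in the browser, base64-encode via btoa instead.
  return Buffer.from(pcm.buffer).toString('base64');
}
```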
## 🚀 Why This Will Work

- **Existing Infrastructure:** your WebSocket, session, and agent systems are perfect foundations
- **Proven Patterns:** you already handle real-time updates, state management, and agent interactions
- **Clean Architecture:** your modular design makes adding voice features straightforward
- **User Familiarity:** users already understand agents and `@mentions`; voice is a natural extension
## 🎯 Next Steps

- **Test VoiceRecordButton** - see what works today
- **Get OpenAI Realtime Access** - request API access
- **Create a Feature Branch** - `feature/voice-agents`
- **Start with Bronze Tier** - basic voice in/out
- **Iterate Fast** - get feedback early and often
You're not starting from scratch - you're adding voice to an already powerful agent system. This is absolutely achievable and will be AMAZING! 🚀

The fact that you already have:
- Agent personalities ✅
- Real-time WebSocket ✅
- Session management ✅
- Dynamic agent attachment ✅
- Memory system ✅

means you can focus on the fun parts - making agents come alive with voice! 🎙️✨