This document has been archived. Please see the Voice Agents - Complete Implementation Guide for the latest consolidated information about voice agents.
Voice Agents Client-Side Implementation Roadmap
MILESTONE: VOICE INPUT IS LIVE!
What's Working
- ✅ Push-to-talk voice input - Tap to start, tap to stop
- ✅ Real-time transcription - Your words appear as text
- ✅ Agent interaction - Nova receives and responds to voice input
- ✅ Visual feedback - Clear recording states
Discovered Opportunities
- Voice UX: Manual tap-to-end is cumbersome → Need auto-stop after silence
- Agent Awareness: Agents don't know input was voice → Add metadata
- Voice Responses: We can hear you, but you can't hear Nova yet! → Enable TTS
IMMEDIATE PRIORITIES
1. Enable Nova's Voice! (TODAY)
The audio player is ready - we just need to connect it!

```typescript
// When an assistant message arrives:
if (message.role === 'assistant' && voiceSession.active) {
  // Boom! Nova speaks!
  playAudioResponse(message.audioData);
}
```
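Before `playAudioResponse` can hand samples to the Web Audio graph, the incoming audio has to be decoded. A minimal sketch of that decode step, assuming `audioData` arrives as base64-encoded little-endian PCM16 (the wire format the rest of this roadmap describes) — the function name and format are assumptions, not the repo's actual implementation:

```typescript
// Decode base64 PCM16 into Float32 samples in [-1, 1],
// the format AudioBuffer.copyToChannel expects.
function base64Pcm16ToFloat32(base64: string): Float32Array {
  const binary = atob(base64);                  // base64 -> byte string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const samples = new Int16Array(bytes.buffer); // little-endian int16 view
  const floats = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) floats[i] = samples[i] / 32768;
  return floats;
}
```

The resulting `Float32Array` can be copied into an `AudioBuffer` and scheduled on an `AudioContext` for playback.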
2. Smart Silence Detection
The Goldilocks Problem: Not too aggressive, not too passive
- 2-4 second silence = end of speech
- Visual countdown when detecting silence
- Audio cue: subtle "ding" when processing
- Grace period to continue speaking
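The auto-stop logic above can be kept framework-free so it drops into a hook or worker. A sketch, where the RMS threshold and hold window are illustrative assumptions (the roadmap's 2-4 s range, defaulting to 3 s):

```typescript
type SilenceState = "speaking" | "counting-down" | "ended";

// Tracks how long the input has stayed below an RMS threshold.
// Any speech resets the countdown, giving the grace period for free.
class SilenceDetector {
  private silentSince: number | null = null;

  constructor(
    private threshold = 0.01, // RMS level treated as silence (assumed)
    private holdMs = 3000,    // within the roadmap's 2-4 s window
  ) {}

  // Feed one RMS reading per audio chunk or animation frame.
  feed(rms: number, nowMs: number): SilenceState {
    if (rms >= this.threshold) {
      this.silentSince = null; // speech resumed: countdown resets
      return "speaking";
    }
    if (this.silentSince === null) this.silentSince = nowMs;
    return nowMs - this.silentSince >= this.holdMs ? "ended" : "counting-down";
  }

  // Milliseconds until cutoff, for the visual countdown.
  remaining(nowMs: number): number {
    if (this.silentSince === null) return this.holdMs;
    return Math.max(0, this.holdMs - (nowMs - this.silentSince));
  }
}
```

In the browser, the RMS reading would come from an `AnalyserNode` (`getFloatTimeDomainData`) on the microphone stream.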
3. Voice Metadata Integration

```typescript
// Add to message creation:
metadata: {
  inputType: 'voice',
  duration: 5.2,      // seconds
  confidence: 0.95,
  emotion: 'curious'  // future
}
```
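One way to wire this in at message-creation time is a small helper that stamps the metadata onto whatever message shape the app already uses. The message shape and helper name here are hypothetical:

```typescript
interface VoiceMetadata {
  inputType: "voice";
  duration: number;   // seconds of recorded audio
  confidence: number; // transcription confidence, 0-1
}

// Attach voice metadata without caring about the rest of the message shape.
function withVoiceMetadata<T extends object>(
  message: T,
  duration: number,
  confidence: number,
): T & { metadata: VoiceMetadata } {
  return { ...message, metadata: { inputType: "voice", duration, confidence } };
}
```

Agents can then branch on `metadata.inputType === 'voice'` when shaping their replies.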
4. Intelligent Voice Responses
Context-Aware Speaking:
- Short responses (<100 words): Speak all
- Medium (100-500 words): Speak intro + "I've written more..."
- Long (>500 words): Summary only
- Lists: "I found 5 items. The first is..."
- Code: "I've prepared code for you to review"
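The length rules above reduce to a pure decision function. The word-count thresholds come straight from the list; the code and list detection heuristics are illustrative assumptions:

```typescript
type SpeakMode = "full" | "intro-then-refer" | "summary-only" | "announce";

// Decide how much of an assistant response to speak aloud.
function decideSpeakMode(text: string): SpeakMode {
  if (/```/.test(text)) return "announce";         // code: "I've prepared code…"
  if (/^\s*[-*\d]/m.test(text)) return "announce"; // lists: count + first item
  const words = text.trim().split(/\s+/).length;
  if (words < 100) return "full";                  // short: speak all
  if (words <= 500) return "intro-then-refer";     // medium: intro + pointer
  return "summary-only";                           // long: summary only
}
```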
Overview: From Silence to Symphony
This roadmap takes us from the existing (but dormant) voice infrastructure to a fully-featured voice agent system.
Sprint 1: Foundation (✅ COMPLETED!)
1.1 Audit & Resurrect VoiceRecordButton
Location: app/components/SessionBottom/VoiceRecordButton.tsx
- Review existing implementation
- Update to use Web Audio API
- Switch from blob recording to streaming
- Add visual feedback states
✅ Completed: Created VoiceRecordButtonRealtime.tsx with:
- Real-time streaming audio
- Beautiful state transitions
- Connection indicators
- Speaking animations
1.2 Create useVoice Hook
Location: app/hooks/useVoice.ts
```typescript
interface UseVoiceOptions {
  mode: 'push-to-talk' | 'continuous';
  sessionId: string;
  questId: string;
  agentId?: string;
  model: ChatModels;
  onTranscript?: (text: string, role: 'user' | 'assistant') => void;
  onAudioData?: (audioData: string) => void;
  onError?: (error: Error) => void;
}
```
✅ Completed: Full-featured useVoice hook with:
- Voice session management
- WebSocket integration
- Audio streaming
- Error handling
- State management
1.3 WebSocket Voice Integration
Update: app/hooks/useWebSocket.ts
- Add voice action handlers
- Handle voice:session:created
- Handle voice:audio:delta
- Handle voice:transcript
- Handle voice:error
✅ Completed: Integrated voice actions into WebSocket:
- Updated actions.ts to include all voice actions
- useVoice hook handles all voice WebSocket messages
- Full duplex communication ready
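The four handlers listed in 1.3 amount to a dispatch on the incoming message's action. A sketch of that dispatch — the action names mirror the list above, but the payload shapes and handler names are assumptions:

```typescript
interface VoiceMessage {
  action:
    | "voice:session:created"
    | "voice:audio:delta"
    | "voice:transcript"
    | "voice:error";
  payload: unknown;
}

interface VoiceHandlers {
  onSessionCreated?: (sessionId: string) => void;
  onAudioDelta?: (base64Audio: string) => void;
  onTranscript?: (text: string) => void;
  onError?: (message: string) => void;
}

// Route a voice WebSocket message to its handler.
// Returns false for non-voice messages so other listeners can run.
function dispatchVoiceMessage(msg: VoiceMessage, h: VoiceHandlers): boolean {
  switch (msg.action) {
    case "voice:session:created": h.onSessionCreated?.(String(msg.payload)); return true;
    case "voice:audio:delta":     h.onAudioDelta?.(String(msg.payload));     return true;
    case "voice:transcript":      h.onTranscript?.(String(msg.payload));     return true;
    case "voice:error":           h.onError?.(String(msg.payload));          return true;
    default: return false;
  }
}
```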
1.4 Audio Processing Utils
Create: app/utils/audio/
- audioEncoder.ts - PCM16 encoding
- audioDecoder.ts - PCM16 to playable audio
- audioStreamer.ts - Chunking & buffering (handled in useVoice)
- audioVisualizer.ts - Waveform display
✅ Mostly Completed:
- audioEncoder.ts: Float32 to PCM16 conversion, resampling, streaming processor
- audioDecoder.ts: PCM16 playback, queue management, volume control
- Streaming is handled within useVoice hook
- Visualizer pending for future sprint
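For reference, the Float32-to-PCM16 conversion that audioEncoder.ts performs follows a standard clamp-and-scale pattern; the exact implementation in the repo may differ, so treat this as a sketch:

```typescript
// Convert Web Audio Float32 samples in [-1, 1] to int16 PCM.
function float32ToPcm16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to [-1, 1]
    // The int16 range is asymmetric: -32768..32767, so scale each sign separately.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The inverse (divide by 32768) is what the decoder side does before playback.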
Additional Achievements
- Created comprehensive voice.ts types
- Backend WebSocket routes configured in sst.config.ts
- Voice actions added to common schemas
- Error handling with typed errors
- Mobile responsiveness
- USERS CAN TALK TO NOVA!!!
Sprint 2: Voice Responses & Intelligence (ACTIVE)
2.1 Enable Voice Responses (CRITICAL)
Make Nova Speak!
- Connect audio player to assistant messages
- Add voice response toggle
- Implement smart truncation
- Add "speaking" indicator
2.2 Smart Interaction Patterns
Natural Conversation Flow
- Auto-stop after 2-4 seconds silence
- Visual silence detection indicator
- Grace period before cutoff
- "Still thinking" audio cues
2.3 Voice-Aware Agents
Agents Know You're Speaking
- Add input type to message metadata
- Agent acknowledgment phrases
- Voice-specific responses
- Speaking style adaptation
2.4 Response Intelligence
When and How to Speak
- Length-based truncation rules
- Content-type awareness
- Context detection (driving, working)
- User preference system
Sprint 3: Advanced Features
3.1 Wake Word Activation
- "Hey Nova" detection
- Agent name triggers
- Custom wake phrases
- Activation confirmation
3.2 Continuous Conversation Mode
- Always-listening option
- Natural back-and-forth
- Interruption handling
- Context preservation
3.3 Multi-Modal Responses
- Voice + Visual sync
- Highlighting spoken text
- Code block handling
- List summarization
3.4 Voice Personalization
- Voice speed control
- Preferred response length
- Auto-speak preferences
- Per-agent voice settings
Sprint 4: Polish & UX
4.1 Voice Onboarding
- First-time tutorial
- Best practices guide
- Voice command hints
- Feature discovery
4.2 Accessibility Excellence
- Keyboard shortcuts
- Screen reader integration
- Visual voice indicators
- Closed captions
4.3 Performance & Reliability
- Offline fallbacks
- Network resilience
- Battery optimization
- Background capability
Success Metrics
Current Status
- ✅ Voice input: WORKING
- ✅ Transcription accuracy: HIGH
- ✅ User delight: ACHIEVED
- 🚧 Voice output: IN PROGRESS
Next Milestones
- Nova speaks: 1 day
- Smart silence: 2 days
- Full conversation: 1 week
The Vision
Imagine:
- Walking into a meeting: "Hey Nova, what's on my agenda?"
- Driving to work: "Marketing Agent, summarize yesterday's campaign results"
- Cooking dinner: "Tech Assistant, walk me through that Docker setup"
- Before bed: "Story Teller, continue where we left off"
We're not just adding voice - we're creating AI companions.
Voice Input: LIVE! Voice Output: COMING TODAY!