ARCHIVED DOCUMENT

This document has been archived. Please see the Voice Agents - Complete Implementation Guide for the latest consolidated information about voice agents.

πŸŽ™οΈ Voice Agents Client-Side Implementation Roadmap

πŸŽ‰ MILESTONE: VOICE INPUT IS LIVE!​

What's Working​

  • βœ… Push-to-talk voice input - Tap to start, tap to stop
  • βœ… Real-time transcription - Your words appear as text
  • βœ… Agent interaction - Nova receives and responds to voice input
  • βœ… Visual feedback - Clear recording states

Discovered Opportunities

  1. Voice UX: Manual tap-to-end is cumbersome β†’ Need auto-stop after silence
  2. Agent Awareness: Agents don't know input was voice β†’ Add metadata
  3. Voice Responses: We can hear you, but you can't hear Nova yet! β†’ Enable TTS

πŸš€ IMMEDIATE PRIORITIES​

1. Enable Nova's Voice! (TODAY)​

The audio player is ready - we just need to connect it!

// When an assistant message arrives:
if (message.role === 'assistant' && voiceSession.active) {
  // Boom! Nova speaks!
  playAudioResponse(message.audioData);
}
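
If responses arrive as base64-encoded PCM16 audio, playback could look roughly like the sketch below. The 24kHz sample rate and the throwaway AudioContext are assumptions; a real implementation would reuse one context and queue chunks, as audioDecoder.ts (section 1.4) does.

// Hypothetical sketch of playAudioResponse, assuming base64-encoded
// 24kHz mono PCM16 (both the format and the rate are assumptions).
function playAudioResponse(audioData: string): void {
  // Decode base64 into raw PCM16 bytes (little-endian assumed).
  const bytes = Uint8Array.from(atob(audioData), (c) => c.charCodeAt(0));
  const pcm16 = new Int16Array(bytes.buffer);

  // Convert PCM16 samples to Float32 in [-1, 1] for the Web Audio API.
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 32768;
  }

  // Wrap the samples in an AudioBuffer and play it once.
  const ctx = new AudioContext({ sampleRate: 24000 });
  const buffer = ctx.createBuffer(1, float32.length, 24000);
  buffer.copyToChannel(float32, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}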

2. Smart Silence Detection

The Goldilocks Problem: not too aggressive, not too passive. A detection sketch follows the list below.

  • 2-4 second silence = end of speech
  • Visual countdown when detecting silence
  • Audio cue: subtle "ding" when processing
  • Grace period to continue speaking
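
A minimal sketch of that detection loop, assuming we already hold the microphone's MediaStream; the RMS threshold and polling interval are illustrative, untuned values:

// Hypothetical silence detector: fires onSilence after ~2.5s below an
// RMS threshold. Threshold and timings are illustrative assumptions.
function watchForSilence(
  stream: MediaStream,
  onSilence: () => void,
  silenceMs = 2500,
  rmsThreshold = 0.01,
) {
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let silentSince: number | null = null;

  const timer = setInterval(() => {
    analyser.getFloatTimeDomainData(samples);
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    if (rms < rmsThreshold) {
      silentSince ??= Date.now();
      if (Date.now() - silentSince >= silenceMs) {
        silentSince = null; // re-arm so we fire once per silent stretch
        onSilence();
      }
    } else {
      silentSince = null; // speech resumed: the grace period resets
    }
  }, 100);

  // Cleanup function: stop monitoring and release the audio context.
  return () => {
    clearInterval(timer);
    ctx.close();
  };
}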

3. Voice Metadata Integration

// Add to message creation:
metadata: {
  inputType: 'voice',
  duration: 5.2, // seconds
  confidence: 0.95,
  emotion: 'curious' // future
}

4. Intelligent Voice Responses

Context-Aware Speaking (sketched in code after the list):

  • Short responses (<100 words): Speak all
  • Medium (100-500 words): Speak intro + "I've written more..."
  • Long (>500 words): Summary only
  • Lists: "I found 5 items. The first is..."
  • Code: "I've prepared code for you to review"
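
A sketch of just the length-based rules, using word count as the cutoff metric; the thresholds mirror the list above, and content-type handling (lists, code) is left out:

// Decide how much of a response to speak, per the rules above.
type SpeechPlan =
  | { kind: 'full'; text: string }
  | { kind: 'intro'; text: string }
  | { kind: 'summary' }; // caller generates a short spoken summary

function planSpokenResponse(text: string): SpeechPlan {
  const words = text.trim().split(/\s+/).length;
  if (words < 100) return { kind: 'full', text };
  if (words <= 500) {
    // Speak roughly the first two sentences, then defer to the written reply.
    const sentences = text.match(/[^.!?]+[.!?]+/g) ?? [text];
    const intro = sentences.slice(0, 2).map((s) => s.trim()).join(' ');
    return { kind: 'intro', text: `${intro} I've written more in the chat.` };
  }
  return { kind: 'summary' };
}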

πŸ—ΊοΈ Overview: From Silence to Symphony​

This roadmap takes us from the existing (but dormant) voice infrastructure to a fully-featured voice agent system.

πŸƒβ€β™‚οΈ Sprint 1: Foundation (βœ… COMPLETED!)​

1.1 Audit & Resurrect VoiceRecordButton​

Location: app/components/SessionBottom/VoiceRecordButton.tsx

  • Review existing implementation
  • Update to use Web Audio API
  • Switch from blob recording to streaming
  • Add visual feedback states

βœ… Completed: Created VoiceRecordButtonRealtime.tsx with:

  • Real-time streaming audio
  • Beautiful state transitions
  • Connection indicators
  • Speaking animations

1.2 Create useVoice Hook

Location: app/hooks/useVoice.ts

interface UseVoiceOptions {
  mode: 'push-to-talk' | 'continuous';
  sessionId: string;
  questId: string;
  agentId?: string;
  model: ChatModels;
  onTranscript?: (text: string, role: 'user' | 'assistant') => void;
  onAudioData?: (audioData: string) => void;
  onError?: (error: Error) => void;
}
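
For reference, a hypothetical call site. The hook's return shape (startRecording, stopRecording, isRecording) and the model id are assumptions, and imports are elided:

// Hypothetical push-to-talk button built on useVoice.
function PushToTalkButton({ sessionId, questId }: { sessionId: string; questId: string }) {
  const { startRecording, stopRecording, isRecording } = useVoice({
    mode: 'push-to-talk',
    sessionId,
    questId,
    model: 'gpt-4o-realtime' as ChatModels, // placeholder model id
    onTranscript: (text, role) => console.log(`${role}: ${text}`),
    onError: (error) => console.error('Voice error:', error),
  });

  return (
    <button onClick={isRecording ? stopRecording : startRecording}>
      {isRecording ? 'Stop' : 'Talk'}
    </button>
  );
}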

βœ… Completed: Full-featured useVoice hook with:

  • Voice session management
  • WebSocket integration
  • Audio streaming
  • Error handling
  • State management

1.3 WebSocket Voice Integration

Update: app/hooks/useWebSocket.ts

  • Add voice action handlers
  • Handle voice:session:created
  • Handle voice:audio:delta
  • Handle voice:transcript
  • Handle voice:error

βœ… Completed: Integrated voice actions into WebSocket:

  • Updated actions.ts to include all voice actions
  • useVoice hook handles all voice WebSocket messages
  • Full-duplex communication ready
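
A hypothetical shape for that dispatch; the action names are the ones listed above, but the payload fields are assumptions:

// Hypothetical dispatcher for incoming voice WebSocket messages.
interface VoiceHandlers {
  onSessionCreated: (sessionId: string) => void;
  onAudioDelta: (audioBase64: string) => void;
  onTranscript: (text: string, role: 'user' | 'assistant') => void;
  onError: (error: Error) => void;
}

function handleVoiceMessage(
  msg: { action: string; payload: Record<string, any> },
  handlers: VoiceHandlers,
): void {
  switch (msg.action) {
    case 'voice:session:created':
      handlers.onSessionCreated(msg.payload.sessionId);
      break;
    case 'voice:audio:delta':
      handlers.onAudioDelta(msg.payload.audio); // next chunk of Nova's speech
      break;
    case 'voice:transcript':
      handlers.onTranscript(msg.payload.text, msg.payload.role);
      break;
    case 'voice:error':
      handlers.onError(new Error(msg.payload.message));
      break;
  }
}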

1.4 Audio Processing Utils

Create: app/utils/audio/

  • audioEncoder.ts - PCM16 encoding
  • audioDecoder.ts - PCM16 to playable audio
  • audioStreamer.ts - Chunking & buffering (handled in useVoice)
  • audioVisualizer.ts - Waveform display

βœ… Mostly Completed:

  • audioEncoder.ts: Float32 to PCM16 conversion, resampling, streaming processor
  • audioDecoder.ts: PCM16 playback, queue management, volume control
  • Streaming is handled within useVoice hook
  • Visualizer pending for future sprint
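
The conversion at the heart of audioEncoder.ts is likely close to the standard Float32-to-PCM16 mapping; a self-contained sketch:

// Standard Float32 ([-1, 1]) to PCM16 conversion for streaming mic audio.
// Clamping guards against out-of-range samples.
function float32ToPcm16(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    // The negative range is 32768 wide, the positive range 32767 wide.
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output;
}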

πŸŽ‰ Additional Achievements​

  • Created comprehensive voice.ts types
  • Backend WebSocket routes configured in sst.config.ts
  • Voice actions added to common schemas
  • Error handling with typed errors
  • Mobile responsiveness
  • USERS CAN TALK TO NOVA!!!

🎯 Sprint 2: Voice Responses & Intelligence (ACTIVE)

2.1 Enable Voice Responses (CRITICAL)

Make Nova Speak!

  • Connect audio player to assistant messages
  • Add voice response toggle
  • Implement smart truncation
  • Add "speaking" indicator

2.2 Smart Interaction Patterns

Natural Conversation Flow

  • Auto-stop after 2-4 seconds silence
  • Visual silence detection indicator
  • Grace period before cutoff
  • "Still thinking" audio cues

2.3 Voice-Aware Agents

Agents Know You're Speaking

  • Add input type to message metadata
  • Agent acknowledgment phrases
  • Voice-specific responses
  • Speaking style adaptation
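
One possible use of the voice metadata from section 3, sketched as hypothetical prompt plumbing (the function and prompt text are illustrative):

// Hypothetical: adapt the system prompt when the user spoke rather than typed.
function buildSystemPrompt(basePrompt: string, metadata?: { inputType?: string }): string {
  if (metadata?.inputType !== 'voice') return basePrompt;
  return (
    basePrompt +
    '\nThe user is speaking aloud. Acknowledge naturally, keep replies ' +
    'conversational, and avoid long lists or raw code in the spoken reply.'
  );
}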

2.4 Response Intelligence

When and How to Speak

  • Length-based truncation rules
  • Content-type awareness
  • Context detection (driving, working)
  • User preference system

πŸš€ Sprint 3: Advanced Features​

3.1 Wake Word Activation​

  • "Hey Nova" detection
  • Agent name triggers
  • Custom wake phrases
  • Activation confirmation

3.2 Continuous Conversation Mode

  • Always-listening option
  • Natural back-and-forth
  • Interruption handling
  • Context preservation

3.3 Multi-Modal Responses

  • Voice + Visual sync
  • Highlighting spoken text
  • Code block handling
  • List summarization

3.4 Voice Personalization

  • Voice speed control
  • Preferred response length
  • Auto-speak preferences
  • Per-agent voice settings

🎨 Sprint 4: Polish & UX

4.1 Voice Onboarding

  • First-time tutorial
  • Best practices guide
  • Voice command hints
  • Feature discovery

4.2 Accessibility Excellence

  • Keyboard shortcuts
  • Screen reader integration
  • Visual voice indicators
  • Closed captions

4.3 Performance & Reliability

  • Offline fallbacks
  • Network resilience
  • Battery optimization
  • Background capability

πŸ“Š Success Metrics​

Current Status​

  • βœ… Voice input: WORKING
  • βœ… Transcription accuracy: HIGH
  • βœ… User delight: ACHIEVED
  • 🚧 Voice output: IN PROGRESS

Next Milestones

  • Nova speaks: 1 day
  • Smart silence: 2 days
  • Full conversation: 1 week

🎯 The Vision

Imagine:

  • Walking into a meeting: "Hey Nova, what's on my agenda?"
  • Driving to work: "Marketing Agent, summarize yesterday's campaign results"
  • Cooking dinner: "Tech Assistant, walk me through that Docker setup"
  • Before bed: "Story Teller, continue where we left off"

We're not just adding voice - we're creating AI companions.


Voice Input: LIVE! Voice Output: COMING TODAY! 🎡