ARCHIVED DOCUMENT

This document has been archived. Please see the Voice Agents - Complete Implementation Guide for the latest consolidated information about voice agents.

🎯 Voice Agents Development TODO

🎉 VOICE INPUT IS WORKING! 🎉

✅ Completed

  • Voice recording with real-time WebSocket
  • Speech-to-text transcription
  • Message submission from voice
  • Nova receives and responds to voice input
  • Voice response infrastructure ready
  • Smart truncation logic implemented
  • WebSocket handlers for voice responses
  • Fixed model switching recognition
  • Fixed microphone button disappearing
  • Auto-switch to voice model when needed

🚀 Voice Response (PRODUCTION READY!)

  • Modified useVoice hook with requestVoiceResponse
  • Implemented AudioStreamPlayer for queue-based playback (sketched after this list)
  • Created voice.text.input WebSocket handler (dot notation)
  • Created voice.response.cancel handler (dot notation)
  • Smart truncation for different content types (75 words max)
  • Full integration with chat completion flow
  • Audio controls: volume, pause, resume, stop
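
For reference, a minimal sketch of what a queue-based player along the lines of AudioStreamPlayer might look like. Class and method names here are illustrative, not the actual implementation:

// Illustrative queue-based audio player; the real AudioStreamPlayer may differ.
class QueuedAudioPlayer {
  private ctx = new AudioContext();
  private queue: AudioBuffer[] = [];
  private playing = false;

  // Decode an incoming chunk and play it once earlier chunks finish.
  async enqueue(chunk: ArrayBuffer): Promise<void> {
    this.queue.push(await this.ctx.decodeAudioData(chunk));
    if (!this.playing) this.playNext();
  }

  private playNext(): void {
    const buffer = this.queue.shift();
    if (!buffer) {
      this.playing = false;
      return;
    }
    this.playing = true;
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);
    source.onended = () => this.playNext(); // chain to the next queued chunk
    source.start();
  }

  stop(): void {
    this.queue.length = 0; // drop anything not yet played
    this.ctx.suspend();    // halt output immediately
  }
}

Pause/resume map naturally onto AudioContext.suspend()/resume(), and volume onto a GainNode inserted before the destination.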

🐛 Bug Fixes (Just Completed!)

  • Fixed model switching not being recognized when changing to GPT-4O Realtime
  • Fixed the microphone button disappearing after an error state
  • Mic button now always shown, with automatic model switching when needed
  • Proper session cleanup when switching models

🎯 Voice Implementation: COMPLETE ✅

Production Ready: User speaks → Nova responds with voice ✅

Smart truncation working:

  • Short responses: Full voice
  • Long responses: First 75 words + "more below"
  • Lists: Count + first few items
  • Code: "I've written code for you"

Edge Cases Discovered

  1. Voice UX: Manual tap-to-end is cumbersome

    • Solution: Auto-stop after 2-4 seconds of silence
  2. Agent Awareness: Agents don't know input was voice

    • Solution: Add metadata to messages (inputType: 'voice')
  3. Voice Responses: Audio ready but needs testing

    • Solution: Test the full flow!

Next Optimization Priorities

  1. ✅ Test Nova's Voice ← PRODUCTION READY!

    • Tap mic, ask a question → Nova speaks her response
  2. Smart Silence Detection (Future Enhancement)

    • 2-4 second silence threshold
    • Visual countdown indicator
  3. Voice Metadata (Future Enhancement)

    • Add to message: { inputType: 'voice', duration: 3.2, confidence: 0.95 }
  4. Voice-Aware Responses (Working)

    • Voice input → voice response ✅
    • Text input → text only ✅

Intelligent Voice Responses

Length-based Rules:

  • Less than 100 words: Speak entire response
  • 100-500 words: Speak intro + "I've written more below"
  • More than 500 words: Brief summary only
  • Code blocks: Never speak, just announce
  • Lists: Speak count + first 3 items

Context-Aware Modes:

  • Driving: Full audio responses
  • Working: Summary + key points
  • Meeting: Ultra-brief responses

Technical Debt

  • Silence detection algorithm
  • Voice activity indicator improvements
  • Error recovery for interrupted sessions
  • Background noise handling

Future Enhancements

  • Emotion detection in voice
  • Multi-language support
  • Voice personality presets
  • Interruption handling
  • Background conversation mode

🏗️ Architecture Notes

Timeout Clarification

The WebSocket handler timeouts (30 seconds) are per-message timeouts, not session duration limits (see the sketch after this list):

  • Each audio chunk must process within 30 seconds
  • The overall voice session can run indefinitely
  • WebSocket connections persist across many messages
  • OpenAI Realtime connections are maintained in server memory
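
A sketch of how such a per-message timeout might be applied server-side. The helper name and wiring are assumptions; only the 30-second figure comes from the handlers themselves:

const MESSAGE_TIMEOUT_MS = 30_000; // per-message budget, not a session limit

// Race a single message's processing against the 30 s budget;
// the WebSocket connection itself stays open either way.
async function withMessageTimeout<T>(work: Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Per-message timeout')), MESSAGE_TIMEOUT_MS);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}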

Continuous Listening Support

The current architecture already supports continuous listening for:

  • Multi-hour meetings
  • Long speeches or presentations
  • Ongoing conversations
  • Background monitoring

🔥 Critical Edge Cases Discovered

1. Voice Activation & Deactivation

Current State: Tap to start, tap to end

Issues:

  • Manual tap to end is cumbersome
  • Need automatic end detection after silence (2-4 seconds)
  • Must avoid aggressive cutoffs (the "OpenAI problem")

Proposed Solutions:

  • Wake word activation: "Hey Nova...", "Marketing Agent...", etc.
  • Configurable silence threshold (2-4 seconds)
  • Visual/audio cue when system thinks you're done (grace period)
  • "Hold to talk" mode as alternative
  • Hybrid: Wake word to start, silence to end

2. Agent Voice Awareness

Current State: Agents see voice as text, no awareness of input method

Issues:

  • Nova thought voice input was just text
  • No metadata about input source
  • No speaker identification

Proposed Solutions:

interface MessageMetadata {
  inputType: 'text' | 'voice' | 'image';
  speaker?: {
    id: string;
    name: string;
    voiceProfile?: string;
  };
  voiceMetrics?: {
    duration: number;
    confidence: number;
    emotion?: string;
  };
}

3. Voice Response Intelligence

Current State: Text-only responses

Issues:

  • When to speak vs. show text?
  • Long responses are tedious to hear
  • Context matters (driving vs. desk)

Proposed Solutions:

  • Smart truncation: Speak first 50-75 words, then "I've written more details..."
  • Context-aware responses:
    • Driving mode: Full audio
    • Working mode: Summary + text
    • Meeting mode: Brief audio
  • User preferences per agent
  • Response type hints: [BRIEF], [DETAILED], [LIST] (parsing sketch below)
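
If response type hints are adopted, parsing them out before speaking could be as simple as this sketch. The bracket syntax is the proposal above; the helper itself is hypothetical:

type ResponseHint = 'BRIEF' | 'DETAILED' | 'LIST';

// Strip a leading [BRIEF]/[DETAILED]/[LIST] marker and return it alongside the body.
function parseResponseHint(text: string): { hint: ResponseHint | null; body: string } {
  const match = text.match(/^\[(BRIEF|DETAILED|LIST)\]\s*/);
  return match
    ? { hint: match[1] as ResponseHint, body: text.slice(match[0].length) }
    : { hint: null, body: text };
}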

📋 Immediate TODOs

1. Enable Voice Responses (TODAY!)

// In useVoice hook - we already have the audio player!
// Just need to trigger it when assistant messages arrive
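
A minimal sketch of that wiring, assuming useVoice exposes the requestVoiceResponse mentioned above and that messages follow the usual role/content shape. The Message type and hook signature are assumptions:

import { useEffect } from 'react';

// Stub for the app's existing hook; import path and exact shape depend on the codebase.
declare function useVoice(): { requestVoiceResponse: (text: string) => void };

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

function useAutoSpeak(messages: Message[], autoSpeak: boolean): void {
  const { requestVoiceResponse } = useVoice();

  useEffect(() => {
    const last = messages[messages.length - 1];
    if (autoSpeak && last?.role === 'assistant') {
      requestVoiceResponse(last.content); // plays via the audio player we already have
    }
  }, [messages, autoSpeak, requestVoiceResponse]);
}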

2. Add Voice Metadata to Messages

// Update quest/message creation to include:
{
  content: transcribedText,
  metadata: {
    inputType: 'voice',
    voiceDuration: recordingDuration,
    voiceConfidence: 0.95
  }
}

3. Implement Smart Silence Detection

  • Use OpenAI's VAD (Voice Activity Detection)
  • Add a configurable silence threshold (client-side fallback sketched below)
  • Visual countdown indicator
  • "Still listening..." indicator for long pauses

4. Wake Word Detection

  • Client-side keyword spotting (transcript-matching sketch below)
  • Agent name recognition
  • Custom wake phrases
  • Visual/audio confirmation
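
True wake word detection runs on raw audio, but a first pass could simply match streaming transcripts against known phrases. The phrases and helper below are illustrative:

const WAKE_PHRASES = ['hey nova', 'marketing agent']; // example phrases from above

// Normalize a transcript fragment and return the first matching wake phrase, if any.
function matchWakePhrase(transcript: string): string | null {
  const normalized = transcript.toLowerCase().replace(/[^a-z\s]/g, ' ').trim();
  return WAKE_PHRASES.find((phrase) => normalized.includes(phrase)) ?? null;
}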

🚀 Bronze Tier Tasks (Current Sprint)

Quest 1.1: Resurrect Voice Button ✅

  • Backend infrastructure ready
  • Update VoiceRecordButton component
  • Implement Web Audio API streaming (sketch after this list)
  • Add visual recording feedback
  • VOICE INPUT WORKING!!!
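
As a reference for the streaming step above, one common pattern chunks mic audio via MediaRecorder and forwards it over the socket. The actual implementation may stream raw PCM instead; this is a sketch:

// Capture the mic and push compressed chunks over an open WebSocket.
async function startMicStreaming(ws: WebSocket): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });

  recorder.ondataavailable = (event) => {
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(event.data); // Blob chunks; the server reassembles/transcodes
    }
  };

  recorder.start(250); // emit a chunk every 250 ms
  return recorder;     // caller stops via recorder.stop()
}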

Quest 1.2: Client Integration ✅

  • Create useVoice hook
  • Handle WebSocket voice messages
  • Implement audio encoding/decoding (PCM16 → base64 sketch below)
  • Add connection state management
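
For the encoding/decoding item: OpenAI's Realtime API takes base64-encoded 16-bit PCM for audio input, so the conversion from Web Audio's float format looks roughly like this sketch:

// Convert Web Audio's Float32 samples (-1..1) to 16-bit PCM, then base64-encode.
function floatTo16BitPcmBase64(float32: Float32Array): string {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  let binary = '';
  const bytes = new Uint8Array(pcm.buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary); // ready for the Realtime API's audio input events
}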

Quest 1.3: Enable Voice Response 🚧

  • Audio playback system ready
  • Trigger playback on assistant messages
  • Add speaking indicators
  • Implement response truncation logic
  • Create voice preference UI

🥈 Silver Tier Tasks (Next Sprint)

Smart Voice Interaction

  • Automatic silence detection (VAD)
  • Wake word activation
  • Grace period before cutoff
  • "Thinking" audio cues

Voice-Aware Agents

  • Pass input type metadata
  • Agent acknowledgment of voice
  • Voice-specific responses
  • Emotion detection

Intelligent Response System

  • Context-aware speaking
  • Smart truncation
  • Summary generation
  • "Read more" prompts

🎯 Voice Response Strategies

1. Length-Based Rules

const words = response.split(/\s+/).length;

if (words < 100) {
  // Speak entire response
} else if (words < 500) {
  // Speak first paragraph + "There's more..."
} else {
  // Speak summary + "I've written details"
}

2. Content-Type Rules

  • Lists: "I found 5 items. First..."
  • Code: "I've written code for you to review"
  • Stories: Full narration (user preference)
  • Data: "Here's the summary..." (speak highlights)
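
A sketch of how these content-type rules might dispatch. Content detection itself is out of scope here and assumed to exist:

type ContentKind = 'list' | 'code' | 'story' | 'data';

// Map detected content type to the text that should actually be spoken.
function spokenVersion(kind: ContentKind, text: string, items: string[] = []): string {
  switch (kind) {
    case 'list':
      return `I found ${items.length} items. First: ${items.slice(0, 3).join(', ')}.`;
    case 'code':
      return "I've written code for you to review.";
    case 'story':
      return text; // full narration, subject to user preference
    case 'data':
      return `Here's the summary: ${text.split('\n')[0]}`; // speak the highlights
  }
}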

3. Context Rules

  • Driving: Full audio, no truncation
  • Headphones: Longer responses OK
  • Speaker: Brief responses
  • Silent mode: Text only
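
These context rules could be captured as a simple policy table. The word limits other than 75 are assumptions for illustration:

type VoiceContext = 'driving' | 'headphones' | 'speaker' | 'silent';

const speakingPolicy: Record<VoiceContext, { speak: boolean; maxWords: number | null }> = {
  driving:    { speak: true,  maxWords: null }, // full audio, no truncation
  headphones: { speak: true,  maxWords: 300  }, // longer responses OK (assumed limit)
  speaker:    { speak: true,  maxWords: 75   }, // brief responses
  silent:     { speak: false, maxWords: 0    }, // text only
};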

4. User Preferences

interface VoicePreferences {
  maxSpokenWords: number;                    // default: 75
  speakCodeBlocks: boolean;                  // default: false
  speakLists: 'full' | 'summary' | 'count';  // default: 'summary'
  autoSpeak: boolean;                        // default: true
  voiceSpeed: number;                        // 0.5-2.0, default: 1.0
}
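
The stated defaults, collected as a concrete object:

const defaultVoicePreferences: VoicePreferences = {
  maxSpokenWords: 75,
  speakCodeBlocks: false,
  speakLists: 'summary',
  autoSpeak: true,
  voiceSpeed: 1.0,
};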

🐛 Known Issues & Edge Cases

  1. The Pause Problem: How long is a thinking pause vs. "I'm done"?
  2. The Interruption Problem: How to handle mid-speech corrections?
  3. The Context Problem: How to know if user wants audio response?
  4. The Privacy Problem: Voice in public spaces
  5. The Attention Problem: How to indicate AI is speaking?

📊 Success Metrics to Track

  • Voice input success rate: >95%
  • False positive cutoffs: <5%
  • User satisfaction: >4.5/5
  • Response relevance: Track skipped audio

Last Updated: [Current Date]
Status: VOICE INPUT WORKING! 🎉