Commit ce268e4

Add details on TTS and audio processing
1 parent 12d77cd commit ce268e4

1 file changed: +94 -0 lines changed
coffee_ws/src/coffee_voice_agent/README.md

@@ -124,6 +124,100 @@ coffee_voice_agent/
└─────────────────┘ └─────────────────┘ └──────────────┘
```

### **TTS and Audio Processing Flow**

How text-to-speech and audio synthesis work in the refactored architecture:

#### **🔄 Two TTS Pathways**

**Path 1: Normal Conversation (User-initiated)**
```
User Speech → STT → LLM → CoffeeBaristaAgent.tts_node() → Emotion Processing → Audio Playback
```

**Path 2: Manual Announcements (System-initiated)**
```
Virtual Requests/Greetings → StateManager.say_with_emotion() → session.say() → CoffeeBaristaAgent.tts_node() → Audio Playback
```

#### **📍 TTS Processing Components**

**1. TTS Override - `agents/coffee_barista_agent.py` (Lines 79-159)**
- **Method**: `async def tts_node(self, text, model_settings=None)`
- **Role**: **Central TTS chokepoint** - all speech goes through here
- **Functions**:
  - Intercepts streaming text from the LLM or from manual calls
  - Processes the `emotion:text` delimiter format in real time
  - Extracts the emotion from the first 50 characters of the text stream
  - Updates the agent's emotional state
  - Logs animated eye expressions
  - Passes clean text to LiveKit's default TTS

**2. Manual TTS - `state/state_manager.py` (Lines 512-528)**
- **Method**: `async def say_with_emotion(self, text: str, emotion: str = None)`
- **Role**: Direct TTS for system announcements
- **Functions**:
  - Used for greetings, virtual request announcements, and timeouts
  - Calls `await self.session.say(text)` directly
  - Still routes through the `tts_node()` override for emotion processing
  - Bypasses the LLM but preserves emotion handling

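The convergence of both pathways on `tts_node()` can be illustrated with stubs. This is a minimal sketch, not the project's code: `StubSession`, `StubStateManager`, `stub_tts_node`, and `one_chunk` are hypothetical stand-ins, and the way `say_with_emotion()` encodes the emotion as an `emotion:text` prefix is an assumption the README does not confirm.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def one_chunk(text: str) -> AsyncIterator[str]:
    """Wrap a complete string as a single-chunk text stream."""
    yield text


async def stub_tts_node(text: AsyncIterator[str]) -> AsyncIterator[str]:
    """Toy override: strip an optional 'emotion:' prefix and pass the rest on.

    Deliberately naive: any colon is treated as the delimiter.
    """
    async for chunk in text:
        _, sep, rest = chunk.partition(":")
        yield rest if sep else chunk


class StubSession:
    """Hypothetical stand-in for the LiveKit session; not the real API."""

    def __init__(self, tts_node: Callable[[AsyncIterator[str]], AsyncIterator[str]]):
        self._tts_node = tts_node
        self.spoken: List[str] = []

    async def say(self, text: str) -> None:
        # The real session synthesizes audio; this stub only records the
        # clean text that tts_node() hands on for synthesis.
        async for clean in self._tts_node(one_chunk(text)):
            self.spoken.append(clean)


class StubStateManager:
    def __init__(self, session: StubSession):
        self.session = session

    async def say_with_emotion(self, text: str, emotion: Optional[str] = None) -> None:
        # Assumption: the emotion travels as an "emotion:text" prefix so the
        # tts_node() override can parse it. The LLM is never involved here.
        message = f"{emotion}:{text}" if emotion else text
        await self.session.say(message)


async def demo() -> List[str]:
    session = StubSession(stub_tts_node)
    manager = StubStateManager(session)
    await manager.say_with_emotion("Your order is ready!", emotion="excited")
    await manager.say_with_emotion("Welcome back")
    return session.spoken
```

Running `asyncio.run(demo())` records only the clean text for both announcements, which mirrors the claim that manual announcements skip the LLM yet still pass through the same override.
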
#### **🎵 Audio Synthesis and Playback**

**Final Audio Generation (Line 157 in `coffee_barista_agent.py`):**
```python
async for audio_frame in Agent.default.tts_node(self, processed_text, model_settings):
    yield audio_frame
```

**Audio Pipeline:**
1. **OpenAI TTS**: Uses model "tts-1" with voice "nova" (configurable)
2. **LiveKit Streaming**: Real-time audio frame streaming to connected clients
3. **Client Playback**: Audio plays through browser, room system, or connected devices

#### **🎭 Emotion Processing Integration**

**Emotion Flow in TTS Override:**
```python
# Abridged excerpt: the setup of first_chunk_buffer and parts is not shown here.
# 1. Text stream arrives (with potential emotion:text format)
async for text_chunk in text:
    if ":" in first_chunk_buffer:
        # 2. Extract emotion from delimiter
        emotion = parts[0].strip()
        text_after_delimiter = parts[1]

        # 3. Update emotional state
        if emotion != self.state_manager.current_emotion:
            self.state_manager.current_emotion = emotion
            self.state_manager.log_animated_eyes(emotion)

        # 4. Yield clean text for audio synthesis
        yield text_after_delimiter
```

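The buffering behind the `emotion:text` format can be sketched as a standalone async generator. This is an illustration under stated assumptions, not the code in `coffee_barista_agent.py`: `strip_emotion_prefix`, `collect`, and the plain `dict` used in place of `StateManager` are hypothetical, and the fallback for text without a delimiter is a guess at reasonable behavior.

```python
import asyncio
from typing import AsyncIterator, Dict, List

EMOTION_LOOKAHEAD = 50  # the README states only the first 50 characters are checked


async def strip_emotion_prefix(
    text: AsyncIterator[str], state: Dict[str, str]
) -> AsyncIterator[str]:
    """Yield clean text, recording a leading 'emotion:' prefix in state."""
    buffer = ""
    resolved = False  # True once we know whether a prefix exists
    async for chunk in text:
        if resolved:
            yield chunk  # past the prefix: stream straight through
            continue
        buffer += chunk
        if ":" in buffer[:EMOTION_LOOKAHEAD]:
            emotion, _, rest = buffer.partition(":")
            state["current_emotion"] = emotion.strip()
            resolved = True
            if rest:
                yield rest
        elif len(buffer) >= EMOTION_LOOKAHEAD:
            # No delimiter inside the lookahead window: treat as plain text.
            resolved = True
            yield buffer
    if not resolved and buffer:
        yield buffer  # short stream that never contained a delimiter


async def collect(chunks: List[str], state: Dict[str, str]) -> List[str]:
    """Helper for trying the generator on a list of chunks."""
    async def source() -> AsyncIterator[str]:
        for c in chunks:
            yield c

    return [part async for part in strip_emotion_prefix(source(), state)]
```

For example, `asyncio.run(collect(["exci", "ted: Hel", "lo there"], state))` sets `state["current_emotion"]` to `"excited"` even though the delimiter arrives in the second chunk. Once the prefix is resolved, later chunks are forwarded without further buffering.
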
#### **⚙️ Technical Details**

**Threading Model:**
- **Main Thread**: LiveKit agent and TTS processing
- **Wake Word Thread**: Porcupine audio processing (synchronous)
- **WebSocket Thread**: Order notification server

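One common way for a synchronous worker thread to notify an asyncio main thread is `loop.call_soon_threadsafe`. The sketch below assumes that pattern. The README does not say how the wake word thread actually communicates with the agent, and `wake_word_worker` with its fake detection loop is hypothetical.

```python
import asyncio
import threading
from typing import List


def wake_word_worker(loop: asyncio.AbstractEventLoop, events: asyncio.Queue) -> None:
    """Hypothetical synchronous detector loop running in its own thread."""
    for keyword in ("hey barista",):  # stand-in for blocking Porcupine audio reads
        # asyncio.Queue is not thread-safe, so schedule the put on the loop thread.
        loop.call_soon_threadsafe(events.put_nowait, keyword)


async def main() -> List[str]:
    loop = asyncio.get_running_loop()
    events: asyncio.Queue = asyncio.Queue()
    thread = threading.Thread(
        target=wake_word_worker, args=(loop, events), daemon=True
    )
    thread.start()
    # The main thread stays free for agent and TTS coroutines while it waits.
    detected = [await asyncio.wait_for(events.get(), timeout=2.0)]
    thread.join()
    return detected
```

Keeping the blocking audio reads off the event loop is what lets TTS streaming continue uninterrupted on the main thread.
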
**Audio Configuration:**
- **STT**: OpenAI Whisper ("whisper-1")
- **TTS**: OpenAI TTS ("tts-1", voice configurable via `VOICE_AGENT_VOICE`)
- **VAD**: Silero Voice Activity Detection
- **Streaming**: Real-time audio frame streaming via LiveKit

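As a rough illustration of how these settings might be gathered, the sketch below reads `VOICE_AGENT_VOICE` from the environment. `AudioConfig` and `load_audio_config` are hypothetical names, and the fallback to "nova" is inferred from the audio pipeline description rather than confirmed by the code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AudioConfig:
    """Hypothetical container for the model names listed in the README."""
    stt_model: str = "whisper-1"
    tts_model: str = "tts-1"
    tts_voice: str = "nova"


def load_audio_config(env: Mapping[str, str] = os.environ) -> AudioConfig:
    # VOICE_AGENT_VOICE is named in the README; defaulting to "nova" is an
    # assumption based on the pipeline description.
    return AudioConfig(tts_voice=env.get("VOICE_AGENT_VOICE", "nova"))
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary, for example `load_audio_config({"VOICE_AGENT_VOICE": "alloy"})`.
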
**State Synchronization:**
- All TTS calls pass through `tts_node()`, which updates `StateManager.current_emotion` when the emotion changes
- Emotion changes trigger eye animation logging
- Session events coordinate conversation flow and TTS timing

**Performance Characteristics:**
- **Minimal Buffering**: Only the first 50 characters are checked for an emotion prefix
- **Streaming**: Audio synthesis starts as soon as clean text is available
- **Low Latency**: Text is forwarded to synthesis as it streams in, rather than after the full response is complete

## Dependencies

### Environment Variables
