@@ -124,6 +124,100 @@ coffee_voice_agent/
124124└─────────────────┘ └─────────────────┘ └──────────────┘
125125```
126126
127+ ### **TTS and Audio Processing Flow**
128+
129+ How text-to-speech and audio synthesis work in the refactored architecture:
130+
131+ #### **🔄 Two TTS Pathways**
132+
133+ **Path 1: Normal Conversation (User-initiated)**
134+ ```
135+ User Speech → STT → LLM → CoffeeBaristaAgent.tts_node() → Emotion Processing → Audio Playback
136+ ```
137+
138+ **Path 2: Manual Announcements (System-initiated)**
139+ ```
140+ Virtual Requests/Greetings → StateManager.say_with_emotion() → session.say() → CoffeeBaristaAgent.tts_node() → Audio Playback
141+ ```
142+
143+ #### **📍 TTS Processing Components**
144+
145+ **1. TTS Override - `agents/coffee_barista_agent.py` (Lines 79-159)**
146+ - **Method**: `async def tts_node(self, text, model_settings=None)`
147+ - **Role**: **Central TTS bottleneck** - all speech goes through here
148+ - **Functions**:
149+   - Intercepts streaming text from the LLM or from manual calls
150+   - Processes the `emotion:text` delimiter format in real time
151+   - Extracts the emotion from the first 50 characters of the text stream
152+   - Updates the agent's emotional state
153+   - Logs animated eye expressions
154+   - Passes clean text to LiveKit's default TTS
155+
156+ **2. Manual TTS - `state/state_manager.py` (Lines 512-528)**
157+ - **Method**: `async def say_with_emotion(self, text: str, emotion: str = None)`
158+ - **Role**: Direct TTS for system announcements
159+ - **Functions**:
160+   - Used for greetings, virtual request announcements, and timeouts
161+   - Calls `await self.session.say(text)` directly
162+   - Still routes through the `tts_node()` override for emotion processing
163+   - Bypasses the LLM but preserves emotion handling
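
The manual path can be sketched as follows. This is a minimal illustration, not the code in `state/state_manager.py`: `FakeSession` and `StateManagerSketch` are hypothetical stand-ins, and passing the emotion along as an `emotion:text` prefix is an assumption about how the `emotion` argument reaches `tts_node()`.

```python
import asyncio
from typing import Optional


class FakeSession:
    """Hypothetical stand-in for the agent session; records what would be spoken."""

    def __init__(self) -> None:
        self.spoken: list[str] = []

    async def say(self, text: str) -> None:
        self.spoken.append(text)


class StateManagerSketch:
    """Hypothetical, reduced version of the state manager's announcement helper."""

    def __init__(self, session: FakeSession) -> None:
        self.session = session

    async def say_with_emotion(self, text: str, emotion: Optional[str] = None) -> None:
        # Assumption: the emotion travels as an "emotion:text" prefix so the
        # tts_node() override can parse it the same way it parses LLM output.
        message = f"{emotion}:{text}" if emotion else text
        await self.session.say(message)


session = FakeSession()
manager = StateManagerSketch(session)
asyncio.run(manager.say_with_emotion("Welcome to the coffee bar!", "friendly"))
asyncio.run(manager.say_with_emotion("Are you still there?"))
print(session.spoken)
```

Because no LLM call is involved, the announcement text is fixed by the caller; only the emotion handling is shared with the conversational path.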
164+
165+ #### **🎵 Audio Synthesis and Playback**
166+
167+ **Final Audio Generation (Line 157 in `coffee_barista_agent.py`):**
168+ ```python
169+ async for audio_frame in Agent.default.tts_node(self, processed_text, model_settings):
170+     yield audio_frame
171+ ```
172+
173+ **Audio Pipeline:**
174+ 1. **OpenAI TTS**: Uses model "tts-1" with voice "nova" (configurable)
175+ 2. **LiveKit Streaming**: Real-time audio frame streaming to connected clients
176+ 3. **Client Playback**: Audio plays through browser, room system, or connected devices
177+
178+ #### **🎭 Emotion Processing Integration**
179+
180+ **Emotion Flow in TTS Override:**
181+ ```python
182+ # 1. Text stream arrives (with potential emotion:text format)
183+ async for text_chunk in text:
184+     if ":" in first_chunk_buffer:
185+         # 2. Extract emotion from delimiter
186+         emotion = parts[0].strip()
187+         text_after_delimiter = parts[1]
188+
189+         # 3. Update emotional state
190+         if emotion != self.state_manager.current_emotion:
191+             self.state_manager.current_emotion = emotion
192+             self.state_manager.log_animated_eyes(emotion)
193+
194+         # 4. Yield clean text for audio synthesis
195+         yield text_after_delimiter
196+ ```
197+
198+ #### **⚙️ Technical Details**
199+
200+ **Threading Model:**
201+ - **Main Thread**: LiveKit agent and TTS processing
202+ - **Wake Word Thread**: Porcupine audio processing (synchronous)
203+ - **WebSocket Thread**: Order notification server
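
How the worker threads hand events to the main thread is not spelled out here. A common pattern, shown as a sketch under that assumption, is to let each thread post into an `asyncio.Queue` through `loop.call_soon_threadsafe`, since `asyncio.Queue` itself is not thread-safe. The worker functions and event names below are hypothetical.

```python
import asyncio
import threading


async def main() -> list[str]:
    loop = asyncio.get_running_loop()
    events = asyncio.Queue()

    def wake_word_worker() -> None:
        # Stands in for the blocking wake-word loop running in its own thread.
        loop.call_soon_threadsafe(events.put_nowait, "wake_word_detected")

    def order_worker() -> None:
        # Stands in for the order notification server thread.
        loop.call_soon_threadsafe(events.put_nowait, "new_order")

    threads = [
        threading.Thread(target=worker, daemon=True)
        for worker in (wake_word_worker, order_worker)
    ]
    for thread in threads:
        thread.start()

    # The main thread (event loop) consumes events without blocking on the workers.
    received = [await events.get() for _ in threads]
    for thread in threads:
        thread.join()
    return sorted(received)


print(asyncio.run(main()))
```

Keeping all agent and TTS work on the event loop means the worker threads never touch session state directly; they only enqueue events.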
204+
205+ **Audio Configuration:**
206+ - **STT**: OpenAI Whisper ("whisper-1")
207+ - **TTS**: OpenAI TTS ("tts-1", voice configurable via `VOICE_AGENT_VOICE`)
208+ - **VAD**: Silero Voice Activity Detection
209+ - **Streaming**: Real-time audio frame streaming via LiveKit
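
As an illustration of the settings above, the voice override could be resolved as in the sketch below. `AudioConfig` and `load_audio_config` are hypothetical names, and treating "nova" as the fallback when `VOICE_AGENT_VOICE` is unset is an assumption based on the audio pipeline description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AudioConfig:
    stt_model: str = "whisper-1"
    tts_model: str = "tts-1"
    tts_voice: str = "nova"  # assumed default voice


def load_audio_config(env: Optional[Mapping[str, str]] = None) -> AudioConfig:
    """Build the audio settings, letting VOICE_AGENT_VOICE override the voice."""
    env = os.environ if env is None else env
    return AudioConfig(tts_voice=env.get("VOICE_AGENT_VOICE", "nova"))


print(load_audio_config({}))
print(load_audio_config({"VOICE_AGENT_VOICE": "alloy"}))
```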
210+
211+ **State Synchronization:**
212+ - All TTS calls update `StateManager.current_emotion`
213+ - Emotion changes trigger eye animation logging
214+ - Session events coordinate conversation flow and TTS timing
215+
216+ **Performance Characteristics:**
217+ - **Minimal Buffering**: Only the first 50 characters are checked for an emotion prefix
218+ - **Streaming**: Audio synthesis starts as soon as clean text is available
219+ - **Low Latency**: Text is processed as it streams, keeping conversations responsive
220+
127221## Dependencies
128222
129223### Environment Variables