
Conversation

@martin-purplefish (Contributor) commented Jan 4, 2026

Summary

Fixes a bug in StreamPacerWrapper where calling end_input() did not immediately send remaining buffered sentences to TTS, causing multi-second delays in agent responses.

The Bug

When end_input() is called (indicating the user has finished speaking), the pacer continues to wait based on the remaining_audio timer calculation instead of immediately sending all remaining text:

  1. end_input() only woke the send task conditionally - _wakeup_event.set() was called only when the audio emitter's destination channel was already closed, not while it was still open
  2. The send condition didn't account for input ending - the send loop only sent text when it was the first sentence, or when generation had stopped and remaining audio was low (see the sketch after this list)
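
To make the flow concrete, here is a minimal sketch of the buggy pattern described above. It is illustrative only: the class and the `_dest_closed` flag are assumed names, not the actual StreamPacerWrapper code; only `_wakeup_event`, `_sentences`, and `_input_ended` come from the description above.

```python
import asyncio

# Minimal sketch of the buggy flow, for illustration only; the real
# StreamPacerWrapper is more involved and `_dest_closed` is an assumed name.
class BuggyPacerSketch:
    def __init__(self) -> None:
        self._wakeup_event = asyncio.Event()
        self._sentences: list[str] = []
        self._input_ended = False
        self._dest_closed = False  # audio emitter's destination channel state

    def end_input(self) -> None:
        self._input_ended = True
        # Bug 1: the send task is woken only when the destination channel is
        # already closed; with an open emitter it keeps sleeping on its timer.
        if self._dest_closed:
            self._wakeup_event.set()

    def _should_send(self, is_first: bool, gen_stopped: bool,
                     remaining_audio: float, min_remaining_audio: float) -> bool:
        # Bug 2: no branch for "input ended with sentences still buffered".
        return is_first or (gen_stopped and remaining_audio <= min_remaining_audio)
```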

Example of the Problem

With min_remaining_audio = 5.0s:

  • t=0.0s: First sentence sent; TTS produces 10s audio
  • t=0.5s: Two more sentences queued
  • t=0.6s: end_input() called while audio emitter is still open
    • _input_ended = True, but no wakeup occurs
    • Send task sleeps on timer: remaining_audio - min_remaining_audio = 10 - 5 = 5s
  • t=5.5s: Next send finally happens

Result: ~5 second delay after user finishes speaking before remaining sentences are synthesized.

Changes

  1. Always wake the send task on end_input() - moved _wakeup_event.set() outside the conditional
  2. Added a send condition for ended input - (self._input_ended and self._sentences) triggers immediate sending (see the sketch below)
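
For contrast, a sketch of the same two spots after the change; again, everything beyond the attribute names quoted above is hypothetical, not the actual implementation.

```python
import asyncio

# Sketch of the two changes; illustrative only, not the actual StreamPacerWrapper.
class FixedPacerSketch:
    def __init__(self) -> None:
        self._wakeup_event = asyncio.Event()
        self._sentences: list[str] = []
        self._input_ended = False

    def end_input(self) -> None:
        self._input_ended = True
        # Change 1: wake the send task unconditionally.
        self._wakeup_event.set()

    def _should_send(self, is_first: bool, gen_stopped: bool,
                     remaining_audio: float, min_remaining_audio: float) -> bool:
        # Change 2: once input has ended, any buffered sentences are sent
        # immediately instead of waiting on the remaining_audio timer.
        if self._input_ended and self._sentences:
            return True
        return is_first or (gen_stopped and remaining_audio <= min_remaining_audio)
```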

Why this is correct

The purpose of pacing is to:

  1. Reduce waste from interruptions - not relevant once input ends; we're committed to this response
  2. Send larger chunks for better speech quality - still respected via max_text_length batching

Once input has ended, we know exactly what text needs to be synthesized and there's no benefit to delaying. The max_text_length batching is still respected, so we're not bypassing quality optimizations - just the waiting.

Test plan

  • Verify that when end_input() is called with pending sentences, they are sent immediately (within ~1 event loop iteration)
  • Verify that max_text_length batching is still respected when input ends
  • Verify normal pacing behavior is unchanged when input has not ended (a minimal test sketch follows)
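
A minimal test sketch for the first and third bullets, written against the simplified send condition from the sketches above rather than the real StreamPacerWrapper API; `should_send` and its parameters are hypothetical.

```python
# Hypothetical helper mirroring the simplified send condition; not the real API.
def should_send(input_ended: bool, pending: int, is_first: bool,
                gen_stopped: bool, remaining_audio: float,
                min_remaining_audio: float) -> bool:
    if input_ended and pending > 0:
        return True
    return is_first or (gen_stopped and remaining_audio <= min_remaining_audio)

def test_end_input_flushes_pending_sentences() -> None:
    # Input ended with sentences queued: must not wait on the remaining_audio timer.
    assert should_send(input_ended=True, pending=2, is_first=False,
                       gen_stopped=False, remaining_audio=10.0,
                       min_remaining_audio=5.0)

def test_pacing_unchanged_before_end_input() -> None:
    # Input still open and plenty of audio buffered: the pacer keeps waiting.
    assert not should_send(input_ended=False, pending=2, is_first=False,
                           gen_stopped=False, remaining_audio=10.0,
                           min_remaining_audio=5.0)
```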

🤖 Generated with Claude Code

@martin-purplefish (Contributor, Author) commented Jan 4, 2026

Hunting down random delays in TTS - this seems promising. It feels like an obvious bug - though I'm not convinced about sending/flushing all of the sentences versus dropping them. Tested this out locally.

@longcw (Contributor) commented Jan 4, 2026

> Hunting down random delays in TTS

do you have text_pacing enabled? it's disabled by default and it's used to slow down the TTS generation after the first sentence to save TTS usage in case the speech is interrupted.

> When end_input() is called (indicating the user has finished speaking), the pacer continues to wait based on the remaining_audio timer calculation instead of immediately sending all remaining text

this is the intended behavior. end_input is called after LLM generation, which is usually faster than audio playout, so we don't want to start the rest of the TTS generation immediately; it should wait for the audio playout as usual even after end_input is called.

> With min_remaining_audio = 5.0s:
>
> t=0.0s: First sentence sent; TTS produces 10s audio
> t=0.5s: Two more sentences queued
> t=0.6s: end_input() called while audio emitter is still open
> _input_ended = True, but no wakeup occurs
> Send task sleeps on timer: remaining_audio - min_remaining_audio = 10 - 5 = 5s
> t=5.5s: Next send finally happens
> Result: ~5 second delay after user finishes speaking before remaining sentences are synthesized.

this is the expected result with tts text pacing enabled; I guess the AI tool raised this as a bug because it didn't know we are playing the audio in real time instead of trying to generate audio as fast as possible.
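
A rough back-of-envelope using the numbers from the quoted example, to make the real-time playout point concrete (illustrative only, not library code):

```python
# With real-time playout, the next sentence only needs to reach TTS shortly
# before the buffered audio drops below the pacing threshold.
audio_buffered = 10.0       # seconds of audio produced from the first sentence
min_remaining_audio = 5.0   # pacing threshold from the example
next_send_at = audio_buffered - min_remaining_audio  # ~5.0s of wall-clock time
print(f"next sentence can safely wait until ~t={next_send_at:.1f}s of playout")
```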

@longcw (Contributor) commented Jan 4, 2026

here is an example of using tts text pacing: https://github.com/livekit/agents/blob/[email protected]/examples/voice_agents/tts_text_pacing.py, where you can see logs like the following indicating the sentences were sent to the TTS while the remaining audio was about to finish playing:

22:23:28.321 DEBUG livekit.agents sent text to tts
{"text": " She was known for ...", "remaining_audio": 4.859836101531982, "pid": 2875548, "job_id": "AJ_Btwoh8EoeLjY", "room_id": "RM_R6ZuwrSQUVY5"}

@martin-purplefish (Contributor, Author) commented
Oh got it, so this is expected then. No, we don't have text_pacing enabled. Wouldn't the behavior still happen though? Will close!

@longcw
Copy link
Contributor

longcw commented Jan 4, 2026

it shouldn't happen if text pacing is not enabled explicitly, as in the example.
