Conversation

@chenghao-mou (Member) commented Dec 1, 2025

@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch from 74235c3 to d2a97ee Compare December 2, 2025 14:05
@chenghao-mou chenghao-mou requested a review from a team December 2, 2025 14:09
@chenghao-mou chenghao-mou marked this pull request as ready for review December 2, 2025 14:09
@chenghao-mou chenghao-mou changed the title refine speech start time in spans refine speech start time in spans and recording alignment Dec 2, 2025
},
)
self.__padded = True
frames = [
@longcw (Contributor) commented Dec 4, 2025:

If I understand correctly, the issue is that the silent duration is not counted when both the input and output buffers are empty. So it's not only the duration from the first response to the time the user enables the mic, but also the time before the first response is generated (LLM + TTS time).

Can we record the time RecordIO.start is called as last_take_time, then modify RecorderAudioInput.take_buf to return a silence frame with duration current_time - last_take_time before any frame has been added to the input buffer?

Then the total padding duration should be started_wall_time - RecordIO.start_wall_time.
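The padding scheme suggested here could be sketched roughly as below. This is a minimal standalone sketch under stated assumptions, not the real livekit-agents API: the class shape, the `(kind, duration)` pseudo-frames, and the `push` helper are all hypothetical.

```python
class RecorderAudioInput:
    """Hypothetical sketch: pad the input channel with a silence frame
    covering the gap between RecordIO.start and the first real frame."""

    def __init__(self, start_wall_time: float) -> None:
        # record the time RecordIO.start was called as the initial last_take_time
        self._last_take_time = start_wall_time
        self._buf: list[tuple[str, float]] = []  # (kind, duration) pseudo-frames

    def push(self, duration: float) -> None:
        self._buf.append(("audio", duration))

    def take_buf(self, now: float) -> list[tuple[str, float]]:
        frames, self._buf = self._buf, []
        if not frames:
            # no real audio yet: return a silence frame spanning the elapsed gap,
            # so the silent duration before the first frame is still counted
            frames = [("silence", now - self._last_take_time)]
        self._last_take_time = now
        return frames

rec = RecorderAudioInput(start_wall_time=100.0)
print(rec.take_buf(now=101.5))  # [('silence', 1.5)]
```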

@chenghao-mou (Member, Author) replied:

[Screenshot attached: 2025-12-04 at 10.04.50]

Not sure how, but it seems we must have padding somewhere outside agents:

- output speech started at 1764842333.011143 (first output frame capture time)
- input speech started at 1764842347.058694 (first input frame capture time)
- agent speech span started at 1764842333.011
- user speech span started at 1764842347.058

Yet the recording started almost 1 second before the agent turn even started.
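As a quick sanity check on these numbers (values copied from the comment; the millisecond tolerance is my assumption about how the span times are truncated):

```python
# timestamps quoted above
output_first_frame = 1764842333.011143  # first output frame capture time
input_first_frame = 1764842347.058694   # first input frame capture time
agent_span_start = 1764842333.011
user_span_start = 1764842347.058

# span starts match the frame capture times to within a millisecond
assert abs(output_first_frame - agent_span_start) < 1e-3
assert abs(input_first_frame - user_span_start) < 1e-3
print(f"user speech begins {input_first_frame - output_first_frame:.3f}s after agent speech")
# -> user speech begins 14.048s after agent speech
```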

Contributor replied:

Sorry, what was the issue in this case? The audio waveform looks matched with the span. Do you mean it matched because you added padding outside agents, or do you think it's not matched?

@chenghao-mou (Member, Author) replied:

It is matched, but AFAIK we didn't do any padding in agents before the first agent_speaking.

@chenghao-mou (Member, Author) replied:

Here is what we have confirmed: the audio recording in observability only starts playing when the cursor hits start_wall_time (the first agent or user frame capture time), essentially padding silence into both channels.

But if we are padding this in agents, we are essentially pushing RecordIO.start_wall_time to last_take_time, right?

@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch from 939b976 to 7221955 Compare December 4, 2025 10:38
@chenghao-mou chenghao-mou requested a review from a team December 5, 2025 09:55
@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch from 7221955 to 01b87f3 Compare December 5, 2025 10:33
stopped_speaking_at: float | None = None

def _on_first_frame(_: asyncio.Future[None]) -> None:
def _on_first_frame(fut: asyncio.Future[float] | asyncio.Future[None]) -> None:
Member reviewer commented:

It isn't obvious that the float inside the future is the started_speaking_at timestamp.

@chenghao-mou (Member, Author) replied:

Good point. I will add a comment!
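For illustration, the clarifying comment could read something like this (a hypothetical sketch using plain asyncio, not the actual patch):

```python
import asyncio

async def main() -> float:
    loop = asyncio.get_running_loop()
    # the float carried by the future is started_speaking_at: the wall-clock
    # time at which the first audio frame was captured
    first_frame_fut: asyncio.Future[float] = loop.create_future()

    def _on_first_frame(fut: asyncio.Future[float]) -> None:
        # read the capture time back out of the resolved future
        print(f"speech started at {fut.result():.3f}")

    first_frame_fut.add_done_callback(_on_first_frame)
    first_frame_fut.set_result(1764842333.011)  # simulated capture time
    await asyncio.sleep(0)  # let the done callback run
    return first_frame_fut.result()

asyncio.run(main())
```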

self._playback_enabled = asyncio.Event()
self._playback_enabled.set()

self._first_frame: rtc.AudioFrame | None = None
Member reviewer commented:

Is there any reason to store the first_frame?

@chenghao-mou (Member, Author) replied:

Good question. I think this is more of a defensive implementation by checking both the future and the identity. I can simplify it.

# wait for the first frame to be captured
if self._first_frame and self._first_frame_fut:
try:
await self._first_frame_fut
Member reviewer commented:

I'm not sure I understand why we're doing this.
Why do we need to wait for the first frame before directly pushing the rest?

@chenghao-mou (Member, Author) replied Dec 8, 2025:

We start a speech span and calculate the pushed duration based on the upstream frame push time:

  1. this sometimes creates a span a few hundred ms earlier than the actual speech;
  2. if the speech is interrupted before the frame reaches the audio source, we would have created short but invalid spans and recordings.

So this makes sure the audio source captures the first frame before committing more frames. The other approach I was thinking of is to create an on_playback_started event so callers can use that for span creation or recording processing.
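The first-frame gate described here can be sketched with plain asyncio (the function name and the simulated capture signal are assumptions; the real gate lives inside the agents audio output path):

```python
import asyncio

async def push_frames(frames: list[str]) -> list[str]:
    loop = asyncio.get_running_loop()
    first_frame_fut: asyncio.Future[None] = loop.create_future()
    committed: list[str] = []

    # push the first frame, then wait until the audio source reports it was
    # actually captured before committing the rest; an interruption would
    # cancel the future instead, so no span or recording is created for
    # speech that never played
    committed.append(frames[0])
    loop.call_soon(first_frame_fut.set_result, None)  # simulated capture signal
    try:
        await first_frame_fut
    except asyncio.CancelledError:
        return committed  # interrupted before playback started
    committed.extend(frames[1:])
    return committed

print(asyncio.run(push_frames(["f0", "f1", "f2"])))  # ['f0', 'f1', 'f2']
```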

Member reviewer replied:

I think it would make more sense to introduce a new on_playback_started event. It's going to make things more explicit.

@chenghao-mou (Member, Author) replied:

I don't really have a strong opinion. cc @longcw would love to hear your thoughts as well.

@chenghao-mou (Member, Author) replied:

This is now updated to use an on_playback_started event.
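For reference, a minimal sketch of what an event-based API could look like (the class name, the callback signature, and the float payload being the first-frame capture time are all assumptions, not the merged implementation):

```python
from typing import Callable

class AudioOutput:
    """Hypothetical sketch: expose an explicit on_playback_started event so
    callers can open spans or align recordings off the first-frame capture
    time instead of awaiting an internal future."""

    def __init__(self) -> None:
        self._playback_started_cbs: list[Callable[[float], None]] = []

    def on_playback_started(self, cb: Callable[[float], None]) -> None:
        self._playback_started_cbs.append(cb)

    def _first_frame_captured(self, started_at: float) -> None:
        # called internally once the audio source captures the first frame
        for cb in self._playback_started_cbs:
            cb(started_at)

out = AudioOutput()
span_starts: list[float] = []
out.on_playback_started(span_starts.append)  # open the speech span here
out._first_frame_captured(1764842333.011)    # simulated capture time
print(span_starts)  # [1764842333.011]
```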

@chenghao-mou chenghao-mou requested a review from a team December 9, 2025 11:30
@chenghao-mou chenghao-mou changed the title refine speech start time in spans and recording alignment AGT-2316: refine timestamps in spans and recording alignment Dec 9, 2025
@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch 2 times, most recently from be78c45 to d7b7386 Compare December 21, 2025 15:03
# we could pad with silence here with some fixed SR and channels,
# but it's better for the user to know that this is happening
elif pad_since and self.__started_time is None and not self.__padded and not frames:
logger.warning(
Member reviewer commented:

Could this be an issue if the user is muted by default?

@chenghao-mou (Member, Author) replied:

I think muting is fine; it still generates silence frames. The only case where this would happen is when the user doesn't allow microphone access at the browser level at all.

@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch from 897a51c to 2639401 Compare January 3, 2026 19:24
@chenghao-mou chenghao-mou force-pushed the fix/accurate_span_times branch from 2639401 to 1e825a1 Compare January 8, 2026 15:00
@tinalenguyen (Member) commented:

/test-stt

@github-actions (bot) commented Jan 8, 2026:

STT Test Results

Status: ✗ Some tests failed

| Metric | Count |
| --- | --- |
| ✓ Passed | 23 |
| ✗ Failed | 0 |
| × Errors | 1 |
| → Skipped | 15 |
| ▣ Total | 39 |
| ⏱ Duration | 198.1s |
Failed Tests
  • tests.test_stt::test_stream[livekit.plugins.aws]
    def finalizer() -> None:
            """Yield again, to finalize."""
      
            async def async_finalizer() -> None:
                try:
                    await gen_obj.__anext__()  # type: ignore[union-attr]
                except StopAsyncIteration:
                    pass
                else:
                    msg = "Async generator fixture didn't stop."
                    msg += "Yield only once."
                    raise ValueError(msg)
      
            task = _create_task_in_context(event_loop, async_finalizer(), context)
    >       event_loop.run_until_complete(task)
    
    .venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:347: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <_UnixSelectorEventLoop running=False closed=True debug=False>
    future = <Task finished name='Task-114' coro=<_wrap_asyncgen_fixture.<locals>._asyncgen_fixture_wrapper.<locals>.finalizer.<loc... File "/home/runner/work/agents/agents/.venv/lib/python3.12/site-packages/smithy_http/aio/crt.py", line 104, in chunks>
    
        def run_until_complete(self, future):
            """Run until the Future is done.
      
            If the argument is a coroutine, it is wrapped in a Task.
      
            WARNING: It would be disastrous to call run_until_complete()
            with the same coroutine twice -- it would wrap it in two
            different Tasks and that can't be good.
      
            Return the Future's result, or raise its exception.
            """
            self._check_closed()
            self._check_running()
      
            new_task = not futures.isfuture(future)
            future = tasks.ensure_future(future, loop=self)
            if new_task:
                # An exception is raised if the future didn't complete, so there
                # is no need to log the "destroy pending task" message
                future._log_destroy_pending = False
      
            future.add_done_callback(_run_until_complete_cb)
            try:
                self.run_forever()
            except:
                if new_task and future.done() and not future.canc
    
Skipped Tests

| Test | Reason |
| --- | --- |
| tests.test_stt::test_recognize[livekit.plugins.assemblyai] | universal-streaming-english@AssemblyAI does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.speechmatics] | unknown@Speechmatics does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.fireworksai] | unknown@FireworksAI does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.cartesia] | ink-whisper@Cartesia does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.nvidia] | unknown@unknown does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.soniox] | stt-rt-v3@Soniox does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.aws] | unknown@Amazon Transcribe does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.deepgram.STTv2] | flux-general-en@Deepgram does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.gradium.STT] | unknown@Gradium does not support batch recognition |
| tests.test_stt::test_recognize[livekit.agents.inference] | unknown@livekit does not support batch recognition |
| tests.test_stt::test_recognize[livekit.plugins.azure] | unknown@Azure STT does not support batch recognition |
| tests.test_stt::test_stream[livekit.plugins.elevenlabs] | Scribe@ElevenLabs does not support streaming |
| tests.test_stt::test_stream[livekit.plugins.fal] | Wizper@Fal does not support streaming |
| tests.test_stt::test_stream[livekit.plugins.mistralai] | voxtral-mini-latest@MistralAI does not support streaming |
| tests.test_stt::test_stream[livekit.plugins.openai] | [email protected] does not support streaming |

Triggered by workflow run #87

5 participants