AGT-2316: refine timestamps in spans and recording alignment #4131
base: main
Conversation
Force-pushed 74235c3 to d2a97ee
    },
)
self.__padded = True
frames = [
If I understand correctly, the issue is that the silence duration is not counted when both the input and output buffers are empty. So it is not only the duration from the first response to the time the user enables the mic, but also the time before the first response is generated (LLM + TTS time).
Can we record the time RecordIO.start was called as last_take_time, then modify RecorderAudioInput.take_buf to return a silence frame with duration current_time - last_take_time before any frame is added to the input buffer?
Then the total padding duration should be started_wall_time - RecordIO.start_wall_time.
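For illustration, a minimal sketch of this proposal; the class/method names (take_buf, last_take_time) mirror the discussion, but the buffer handling and 16-bit PCM frame format here are assumptions, not the actual livekit-agents API:

```python
import time

class RecorderAudioInput:
    SAMPLE_RATE = 48000
    BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono (assumed for the sketch)

    def __init__(self, start_wall_time: float) -> None:
        # record the time RecordIO.start was called
        self._last_take_time = start_wall_time
        self._buffer: list[bytes] = []

    def push(self, frame: bytes) -> None:
        self._buffer.append(frame)

    def take_buf(self) -> list[bytes]:
        now = time.time()
        if not self._buffer:
            # no real frames yet: synthesize silence covering the gap since
            # the last take, so the LLM + TTS time before the first response
            # is also accounted for
            num_samples = int((now - self._last_take_time) * self.SAMPLE_RATE)
            frames = [b"\x00" * (num_samples * self.BYTES_PER_SAMPLE)]
        else:
            frames, self._buffer = self._buffer, []
        self._last_take_time = now
        return frames
```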
Not sure how, but it seems we must have padding somewhere outside agents:
output speech started at 1764842333.011143 (first output frame capture time)
input speech started at 1764842347.058694 (first input frame capture time)
agent speech span started at 1764842333.011
user speech span started at 1764842347.058
but the recording started almost 1 second before the agent turn even started.
Sorry, what was the issue in this case? The audio waveform looks matched with the span. Do you mean it matched because you added padding outside agents, or do you think it is not matched?
It is matched, but we didn't do any padding in agents before the first agent_speaking AFAIK.
Here is what we have confirmed: the audio recording in observability only starts playing when the cursor hits the start_wall_time (the first agent or user frame capture time), essentially padding silence in both channels.
But if we do this padding in agents, we are essentially pushing RecordIO.start_wall_time to last_take_time, right?
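To make the relationship concrete, a small illustration using the timestamps quoted above; the recording start value is hypothetical, chosen to match the "almost 1 second" observation:

```python
# the player pads silence until the cursor reaches each channel's first
# captured frame, so the per-channel pad equals the gap from recording start
record_start_wall_time = 1764842332.0     # hypothetical RecordIO.start_wall_time
first_agent_frame_at = 1764842333.011143  # first output frame capture time
first_user_frame_at = 1764842347.058694   # first input frame capture time

agent_pad = first_agent_frame_at - record_start_wall_time  # ~1.01 s of silence
user_pad = first_user_frame_at - record_start_wall_time    # ~15.06 s of silence
```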
Force-pushed 939b976 to 7221955
Force-pushed 7221955 to 01b87f3
stopped_speaking_at: float | None = None

- def _on_first_frame(_: asyncio.Future[None]) -> None:
+ def _on_first_frame(fut: asyncio.Future[float] | asyncio.Future[None]) -> None:
It isn't obvious that the float inside the future is the started_speaking_at timestamp.
Good point. I will add a comment!
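Something along these lines, perhaps; the exact shape is an assumption based on the diff above, not the actual code:

```python
import asyncio

def _on_first_frame(fut: asyncio.Future[float] | asyncio.Future[None]) -> None:
    # NOTE: when the future resolves to a float, that value is the capture
    # time of the first audio frame and is used as started_speaking_at, so
    # the speech span starts at actual playback rather than at push time
    result = fut.result()
    if isinstance(result, float):
        print(f"started_speaking_at = {result}")

async def main() -> None:
    fut: asyncio.Future[float] = asyncio.get_running_loop().create_future()
    fut.add_done_callback(_on_first_frame)
    fut.set_result(1764842333.011143)  # first output frame capture time
    await asyncio.sleep(0)  # let the callback run

asyncio.run(main())
```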
self._playback_enabled = asyncio.Event()
self._playback_enabled.set()

self._first_frame: rtc.AudioFrame | None = None
Is there any reason to store the first_frame?
Good question. I think this is more of a defensive implementation, checking both the future and the frame identity. I can simplify it.
# wait for the first frame to be captured
if self._first_frame and self._first_frame_fut:
    try:
        await self._first_frame_fut
I'm not sure I understand why we're doing this. Why do we need to wait for the first frame before directly pushing the rest?
We start a speech span and calculate the pushed duration based on the upstream frame push time:
- this sometimes creates a span a few hundred ms earlier than the actual speech;
- if the speech is interrupted before the frame reaches the audio source, we would have created short but invalid spans and recordings.

So this makes sure the audio source captures the first frame before committing more frames. The other approach I was thinking of is to create an on_playback_started event so callers can use that for span creation or recording processing.
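Roughly, the gating looks like this; FakeAudioSource and the surrounding names are hypothetical stand-ins for the sketch, not the real livekit-agents classes:

```python
import asyncio
import time

class FakeAudioSource:
    async def capture_frame(self, frame: bytes) -> float:
        await asyncio.sleep(0.01)  # simulate the frame reaching the device
        return time.time()         # capture time of the frame

async def push_audio(source: FakeAudioSource, frames: list[bytes]) -> None:
    if not frames:
        return
    # wait for the first frame to be captured before committing the rest;
    # interrupted speech therefore never produces a span or recording segment
    started_speaking_at = await source.capture_frame(frames[0])
    print(f"speech span opens at {started_speaking_at}")
    for frame in frames[1:]:
        await source.capture_frame(frame)

asyncio.run(push_audio(FakeAudioSource(), [b"\x00\x00"] * 3))
```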
I think it would make more sense to introduce a new on_playback_started event. It's going to make things more explicit.
I don't really have a strong opinion. cc @longcw would love to hear your thoughts as well.
This is now updated to use an on_playback_started event.
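For reference, one possible shape of such an event; only the event name comes from the thread, while the payload and emitter API here are guesses:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlaybackStartedEvent:
    started_at: float  # wall-clock capture time of the first played frame

class AudioOutput:
    def __init__(self) -> None:
        self._callbacks: list[Callable[[PlaybackStartedEvent], None]] = []
        self._first_frame_seen = False

    def on_playback_started(self, cb: Callable[[PlaybackStartedEvent], None]) -> None:
        self._callbacks.append(cb)

    def capture_frame(self, frame: bytes) -> None:
        if not self._first_frame_seen:
            self._first_frame_seen = True
            ev = PlaybackStartedEvent(started_at=time.time())
            for cb in self._callbacks:
                cb(ev)
        # ... push the frame downstream ...

# callers open the speech span (or start the recording segment) only once
# playback has actually begun
out = AudioOutput()
out.on_playback_started(lambda ev: print(f"span starts at {ev.started_at}"))
out.capture_frame(b"\x00\x00")
```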
Force-pushed be78c45 to d7b7386
# we could pad with silence here with some fixed SR and channels,
# but it's better for the user to know that this is happening
elif pad_since and self.__started_time is None and not self.__padded and not frames:
    logger.warning(
Could this be an issue if the user is muted by default?
I think muting is fine; it still generates silence frames. The only case where this would happen is when the user doesn't allow microphone access at the browser level at all.
Force-pushed 897a51c to 2639401
Co-authored-by: Long Chen <[email protected]>
Force-pushed 2639401 to 1e825a1
/test-stt
STT Test Results
Status: ✗ Some tests failed
Failed Tests
Skipped Tests
Triggered by workflow run #87