AGT-2316: refine timestamps in spans and recording alignment #4131
base: main
Conversation
Force-pushed 74235c3 to d2a97ee
    },
)
self.__padded = True
frames = [
If I understand correctly, the issue is that the silence duration is not counted when both the input and output buffers are empty. So it is not only the duration from the first response to the time the user enables the mic, but also the time before the first response is generated (LLM + TTS time).
Can we record the time RecordIO.start was called as last_take_time, then modify RecorderAudioInput.take_buf to return a silence frame with duration current_time - last_take_time before any frame is added to the input buffer?
Then the total padding duration should be started_wall_time - RecordIO.start_wall_time.
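For illustration, a minimal sketch of this proposal; the class/method names (take_buf, last_take_time) mirror the discussion, but the buffer handling and 16-bit PCM frame format here are assumptions, not the actual livekit-agents API:

```python
import time

class RecorderAudioInput:
    SAMPLE_RATE = 48000
    BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono (assumed for the sketch)

    def __init__(self, start_wall_time: float) -> None:
        # record the time RecordIO.start was called
        self._last_take_time = start_wall_time
        self._buffer: list[bytes] = []

    def push(self, frame: bytes) -> None:
        self._buffer.append(frame)

    def take_buf(self) -> list[bytes]:
        now = time.time()
        if not self._buffer:
            # no real frames yet: synthesize silence covering the gap since
            # the last take, so the LLM + TTS time before the first response
            # is also accounted for
            num_samples = int((now - self._last_take_time) * self.SAMPLE_RATE)
            frames = [b"\x00" * (num_samples * self.BYTES_PER_SAMPLE)]
        else:
            frames, self._buffer = self._buffer, []
        self._last_take_time = now
        return frames
```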
Not sure how, but it seems we must have padding somewhere outside agents:
output speech started at 1764842333.011143 (first output frame capture time)
input speech started at 1764842347.058694 (first input frame capture time)
agent speech span started at 1764842333.011
user speech span started at 1764842347.058
but the recording started almost 1 second before the agent turn even started.
Sorry, what was the issue in this case? The audio waveform looks matched with the span. Do you mean it matched because you added padding outside agents, or do you think it is not matched?
It is matched, but we didn't do any padding in agents before the first agent_speaking AFAIK.
Here is what we have confirmed: the audio recording in observability only starts playing when the cursor hits the start_wall_time (the first agent or user frame capture time), essentially padding silence in both channels.
But if we do this padding in agents, we are essentially pushing RecordIO.start_wall_time to last_take_time, right?
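To make the relationship concrete, a small illustration using the timestamps quoted above; the recording start value is hypothetical, chosen to match the "almost 1 second" observation:

```python
# the player pads silence until the cursor reaches each channel's first
# captured frame, so the per-channel pad equals the gap from recording start
record_start_wall_time = 1764842332.0     # hypothetical RecordIO.start_wall_time
first_agent_frame_at = 1764842333.011143  # first output frame capture time
first_user_frame_at = 1764842347.058694   # first input frame capture time

agent_pad = first_agent_frame_at - record_start_wall_time  # ~1.01 s of silence
user_pad = first_user_frame_at - record_start_wall_time    # ~15.06 s of silence
```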
Force-pushed 939b976 to 7221955
Force-pushed 7221955 to 01b87f3
stopped_speaking_at: float | None = None

- def _on_first_frame(_: asyncio.Future[None]) -> None:
+ def _on_first_frame(fut: asyncio.Future[float] | asyncio.Future[None]) -> None:
It isn't obvious that the float inside the future is the started_speaking_at timestamp.
Good point. I will add a comment!
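Something along these lines, perhaps; the exact shape is an assumption based on the diff above, not the actual code:

```python
import asyncio

def _on_first_frame(fut: asyncio.Future[float] | asyncio.Future[None]) -> None:
    # NOTE: when the future resolves to a float, that value is the capture
    # time of the first audio frame and is used as started_speaking_at, so
    # the speech span starts at actual playback rather than at push time
    result = fut.result()
    if isinstance(result, float):
        print(f"started_speaking_at = {result}")

async def main() -> None:
    fut: asyncio.Future[float] = asyncio.get_running_loop().create_future()
    fut.add_done_callback(_on_first_frame)
    fut.set_result(1764842333.011143)  # first output frame capture time
    await asyncio.sleep(0)  # let the callback run

asyncio.run(main())
```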
self._playback_enabled = asyncio.Event()
self._playback_enabled.set()

self._first_frame: rtc.AudioFrame | None = None
Is there any reason to store the first_frame?
Good question. I think this is more of a defensive implementation, checking both the future and the frame identity. I can simplify it.
# wait for the first frame to be captured
if self._first_frame and self._first_frame_fut:
    try:
        await self._first_frame_fut
I'm not sure I understand why we're doing this. Why do we need to wait for the first frame before directly pushing the rest?
We start a speech span and calculate the pushed duration based on the upstream frame push time:
- this sometimes creates a span a few hundred ms earlier than the actual speech;
- if the speech is interrupted before the frame reaches the audio source, we would have created short but invalid spans and recordings.

So this makes sure the audio source captures the first frame before committing more frames. The other approach I was thinking of is to create an on_playback_started event so callers can use that for span creation or recording processing.
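Roughly, the gating looks like this; FakeAudioSource and the surrounding names are hypothetical stand-ins for the sketch, not the real livekit-agents classes:

```python
import asyncio
import time

class FakeAudioSource:
    async def capture_frame(self, frame: bytes) -> float:
        await asyncio.sleep(0.01)  # simulate the frame reaching the device
        return time.time()         # capture time of the frame

async def push_audio(source: FakeAudioSource, frames: list[bytes]) -> None:
    if not frames:
        return
    # wait for the first frame to be captured before committing the rest;
    # interrupted speech therefore never produces a span or recording segment
    started_speaking_at = await source.capture_frame(frames[0])
    print(f"speech span opens at {started_speaking_at}")
    for frame in frames[1:]:
        await source.capture_frame(frame)

asyncio.run(push_audio(FakeAudioSource(), [b"\x00\x00"] * 3))
```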
I think it would make more sense to introduce a new on_playback_started event. It's going to make things more explicit.
I don't really have a strong opinion. cc @longcw would love to hear your thoughts as well.
This is now updated to use an on_playback_started event.
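For reference, one possible shape of such an event; only the event name comes from the thread, while the payload and emitter API here are guesses:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlaybackStartedEvent:
    started_at: float  # wall-clock capture time of the first played frame

class AudioOutput:
    def __init__(self) -> None:
        self._callbacks: list[Callable[[PlaybackStartedEvent], None]] = []
        self._first_frame_seen = False

    def on_playback_started(self, cb: Callable[[PlaybackStartedEvent], None]) -> None:
        self._callbacks.append(cb)

    def capture_frame(self, frame: bytes) -> None:
        if not self._first_frame_seen:
            self._first_frame_seen = True
            ev = PlaybackStartedEvent(started_at=time.time())
            for cb in self._callbacks:
                cb(ev)
        # ... push the frame downstream ...

# callers open the speech span (or start the recording segment) only once
# playback has actually begun
out = AudioOutput()
out.on_playback_started(lambda ev: print(f"span starts at {ev.started_at}"))
out.capture_frame(b"\x00\x00")
```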
Force-pushed be78c45 to d7b7386
# we could pad with silence here with some fixed SR and channels,
# but it's better for the user to know that this is happening
elif pad_since and self.__started_time is None and not self.__padded and not frames:
    logger.warning(
Could this be an issue if the user is muted by default?
I think muting is fine; it still generates silence frames. The only case where this would happen is when the user doesn't allow microphone access at the browser level at all.
Force-pushed 897a51c to 2639401
Co-authored-by: Long Chen <[email protected]>
Force-pushed 2639401 to 1e825a1
/test-stt
STT Test Results
Status: ✗ Some tests failed
Failed Tests
Skipped Tests
Triggered by workflow run #87