fix(inference): set STT capabilities.diarization from extra_kwargs #5283
russellmartin-livekit wants to merge 1 commit into main
The inference STT capabilities.diarization was hardcoded to False, which caused MultiSpeakerAdapter to not work since it checks capabilities.diarization before enabling diarization.

This change:
- Adds diarize option to DeepgramOptions TypedDict
- Adds speaker_labels option to AssemblyaiOptions TypedDict
- Detects diarization params in extra_kwargs and sets capabilities
- Updates capabilities when update_options() is called with diarization
- Adds comprehensive tests for diarization capability detection

Fixes AGT-2608

Slack thread: https://live-kit.slack.com/archives/C06TN33TV44/p1772573869144129?thread_ts=1771977322.899519&cid=C06TN33TV44
https://claude.ai/code/session_01VRKQuBXiq8BHKr9AiJ6uEw
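The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name wants_diarization and the DIARIZATION_KEYS tuple are hypothetical, and treating only truthy values as a request for diarization is an assumption (the PR may check key presence instead). The key names diarize and speaker_labels come from the description.

```python
from typing import Any

# Provider-specific keys that request diarization, per the PR description:
# "diarize" (Deepgram) and "speaker_labels" (AssemblyAI).
# The tuple itself is a hypothetical name used only for this sketch.
DIARIZATION_KEYS = ("diarize", "speaker_labels")


def wants_diarization(extra_kwargs: dict[str, Any]) -> bool:
    """Return True if any known diarization key is set to a truthy value.

    Assumption: a falsy value (e.g. diarize=False) does not enable diarization.
    """
    return any(bool(extra_kwargs.get(key)) for key in DIARIZATION_KEYS)
```

The result of such a check would feed capabilities.diarization both at construction time and whenever update_options() receives new extra_kwargs.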
🔴 Diarization capability declared but speaker_id never populated in transcripts
The PR sets capabilities.diarization = True when diarize or speaker_labels is in extra_kwargs, but _process_transcript (lines 665-681) never extracts speaker_id from the server response data and never passes it to SpeechData, so the speaker_id field keeps its default of None.
This breaks MultiSpeakerAdapter, which checks stt.capabilities.diarization at livekit-agents/livekit/agents/stt/multi_speaker_adapter.py:47 and will accept this STT instance. However, when processing events, _PrimarySpeakerDetector.on_stt_event at livekit-agents/livekit/agents/stt/multi_speaker_adapter.py:244 checks if sd.speaker_id is None and short-circuits, so speaker detection/suppression never works. Compare with the Deepgram plugin at livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py:742 which correctly populates speaker_id=f"S{speaker}" from the response.
(Refers to lines 665-681)
Prompt for agents
In livekit-agents/livekit/agents/inference/stt.py, the _process_transcript method (lines 651-681) needs to extract speaker_id from the server response data and pass it to the SpeechData constructor. The exact field name in the server response depends on the gateway's response format (likely "speaker" or "speaker_id" in the data dict, or possibly in individual word entries, similar to how Deepgram returns it in word["speaker"]). Add speaker_id extraction logic similar to what livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py:730-734 does, and pass it as the speaker_id parameter to stt.SpeechData() at line 665. For example, extract speaker = data.get("speaker") or derive it from the words list, then set speaker_id=f"S{speaker}" if speaker is not None else None.
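One way the suggested extraction could look, as a standalone sketch. The helper name extract_speaker_id is hypothetical, and the response shape is an assumption: the gateway's actual field names are not confirmed anywhere in this thread. The fallback to the most frequent per-word speaker is also just one possible policy, not necessarily what the Deepgram plugin does.

```python
from collections import Counter
from typing import Any, Optional


def extract_speaker_id(data: dict[str, Any]) -> Optional[str]:
    """Derive a speaker_id string from a transcript payload, or None.

    Assumptions (not confirmed by the gateway's response format):
    - a top-level "speaker" field may be present, or
    - each entry in "words" may carry its own "speaker" field.
    """
    speaker = data.get("speaker")
    if speaker is None:
        words = data.get("words") or []
        speakers = [
            w["speaker"]
            for w in words
            if isinstance(w, dict) and w.get("speaker") is not None
        ]
        if speakers:
            # Pick the speaker attributed to the most words in this segment.
            speaker = Counter(speakers).most_common(1)[0][0]
    # Compare against None explicitly so that speaker 0 is kept.
    return f"S{speaker}" if speaker is not None else None
```

The returned value would then be passed as speaker_id to stt.SpeechData(), so that the check for sd.speaker_id is None in _PrimarySpeakerDetector.on_stt_event no longer short-circuits for diarized transcripts.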