
fix(inference): set STT capabilities.diarization from extra_kwargs #5283

Open
russellmartin-livekit wants to merge 1 commit into main from
claude/slack-support-diarization-stt-providers-cWpcE
Conversation

@russellmartin-livekit
Contributor

The inference STT hardcoded capabilities.diarization to False. Because MultiSpeakerAdapter checks capabilities.diarization before enabling diarization, it did not work with the inference STT.

This change:

  • Adds diarize option to DeepgramOptions TypedDict
  • Adds speaker_labels option to AssemblyaiOptions TypedDict
  • Detects diarization params in extra_kwargs and sets capabilities
  • Updates capabilities when update_options() is called with diarization
  • Adds comprehensive tests for diarization capability detection
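
As a rough illustration of the detection step in the list above, the sketch below checks extra_kwargs for the two option names this PR mentions (diarize and speaker_labels). The helper name wants_diarization and the DIARIZATION_KEYS tuple are hypothetical and not taken from the diff. Treating only truthy values as enabling diarization is also an assumption.

```python
from typing import Any, Mapping, Optional

# Option names come from the PR description; everything else here is hypothetical.
DIARIZATION_KEYS = ("diarize", "speaker_labels")


def wants_diarization(extra_kwargs: Optional[Mapping[str, Any]]) -> bool:
    """Return True if any known diarization option is set to a truthy value."""
    if not extra_kwargs:
        return False
    return any(bool(extra_kwargs.get(key)) for key in DIARIZATION_KEYS)
```

Under these assumptions, the same check could be reused both at construction time and when update_options() receives new extra_kwargs, so the capability flag stays in sync with the options.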

Fixes AGT-2608

Slack thread: https://live-kit.slack.com/archives/C06TN33TV44/p1772573869144129?thread_ts=1771977322.899519&cid=C06TN33TV44

https://claude.ai/code/session_01VRKQuBXiq8BHKr9AiJ6uEw

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.


🔴 Diarization capability declared but speaker_id never populated in transcripts

The PR sets capabilities.diarization = True when diarize or speaker_labels is in extra_kwargs, but _process_transcript (lines 665-681) never extracts speaker_id from the server response data and never passes it to SpeechData, so the speaker_id field keeps its default of None.

This breaks MultiSpeakerAdapter, which checks stt.capabilities.diarization at livekit-agents/livekit/agents/stt/multi_speaker_adapter.py:47 and will therefore accept this STT instance. When processing events, however, _PrimarySpeakerDetector.on_stt_event at livekit-agents/livekit/agents/stt/multi_speaker_adapter.py:244 checks whether sd.speaker_id is None and short-circuits, so speaker detection and suppression never run. Compare with the Deepgram plugin at livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py:742, which correctly populates speaker_id=f"S{speaker}" from the response.

(Refers to lines 665-681)

Prompt for agents
In livekit-agents/livekit/agents/inference/stt.py, the _process_transcript method (lines 651-681) needs to extract speaker_id from the server response data and pass it to the SpeechData constructor. The exact field name in the server response depends on the gateway's response format (likely "speaker" or "speaker_id" in the data dict, or possibly in individual word entries, similar to how Deepgram returns it in word["speaker"]). Add speaker_id extraction logic similar to what livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py:730-734 does, and pass it as the speaker_id parameter to stt.SpeechData() at line 665. For example, extract speaker = data.get("speaker") or derive it from the words list, then set speaker_id=f"S{speaker}" if speaker is not None else None.
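
The extraction suggested above might look roughly like the following sketch. It is not the gateway's confirmed response format: the "speaker" and "words" field names are the guesses already stated in the prompt, the helper name extract_speaker_id is hypothetical, and picking the most frequent per-word speaker is just one possible way to derive a single label from the words list.

```python
from collections import Counter
from typing import Any, Mapping, Optional


def extract_speaker_id(data: Mapping[str, Any]) -> Optional[str]:
    """Derive a speaker_id string such as "S0" from a transcript payload.

    Assumes (unconfirmed) that the payload carries either a top-level
    "speaker" field or per-word "speaker" fields inside "words".
    """
    speaker = data.get("speaker")
    if speaker is None:
        # Fall back to the most frequent speaker label among the words.
        word_speakers = [
            word["speaker"]
            for word in (data.get("words") or [])
            if isinstance(word, Mapping) and word.get("speaker") is not None
        ]
        if word_speakers:
            speaker = Counter(word_speakers).most_common(1)[0][0]
    # Keep None when no speaker information is present at all.
    return f"S{speaker}" if speaker is not None else None
```

The result would then be passed as the speaker_id argument when building SpeechData, so that a None value still short-circuits speaker detection exactly as it does today when diarization is off.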