
ASR and TTS v3 update #3598

Open

dhruvladia-sarvam wants to merge 4 commits into pipecat-ai:main from dhruvladia-sarvam:sarvam-v3-update

Conversation

@dhruvladia-sarvam
Contributor

This PR adds support for Sarvam AI's v3 models in both Speech-to-Text (STT) and Text-to-Speech (TTS) services, while maintaining backward compatibility with existing models.

Key additions:

  • STT: Adds saaras:v3 model with new mode parameter, retains saaras:v2.5 (STT-Translate) support
  • TTS: Adds bulbul:v3-beta model with new temperature parameter and 25 new speaker voices

Supported STT Models:

| Model | Language Prompt | Mode | Endpoint |
| --- | --- | --- | --- |
| saarika:v2.5 | Required (default: "unknown") | ✗ | speech_to_text_streaming |
| saaras:v2.5 | Auto-detect | ✗ | speech_to_text_translate_streaming |
| saaras:v3 | Required (default: "en-IN") | ✅ (default: transcribe) | speech_to_text_streaming |

New Features:

  • saaras:v3 model support with new mode parameter
    • Modes: transcribe, translate, verbatim, translit, codemix
    • Default mode: transcribe
  • Retained saaras:v2.5 (STT-Translate) with auto language detection
  • Model-specific validation for parameters (prompt, mode, language)
  • Dynamic endpoint selection based on model type
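The dynamic endpoint selection described above can be sketched roughly as follows. This is an illustration only, not the actual pipecat internals; the function name and set layout are assumptions:

```python
# Hypothetical sketch: saaras:v2.5 is the STT-Translate model and uses its
# own endpoint, while saarika:v2.5 and saaras:v3 share the plain
# streaming endpoint.
TRANSLATE_MODELS = {"saaras:v2.5"}

def endpoint_for_model(model: str) -> str:
    if model in TRANSLATE_MODELS:
        return "speech_to_text_translate_streaming"
    return "speech_to_text_streaming"
```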

API Changes:

  • New mode parameter in InputParams and __init__
  • set_language() raises ValueError for saaras:v2.5 (auto-detects)
  • set_prompt() now supports both saaras:v2.5 and saaras:v3
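As a hedged illustration of the set_language() behavior listed above (the class here is a stand-in, not pipecat's actual service class):

```python
# Illustrative only: mimics the described rule that set_language()
# raises ValueError for saaras:v2.5, which auto-detects language.
class SttSettings:
    def __init__(self, model: str):
        self.model = model
        self.language = None

    def set_language(self, language: str) -> None:
        if self.model == "saaras:v2.5":
            raise ValueError(
                "saaras:v2.5 auto-detects language; set_language() is not supported"
            )
        self.language = language
```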

Supported TTS Models:

| Model | Pitch | Loudness | Pace | Temperature | Default Sample Rate | Default Speaker |
| --- | --- | --- | --- | --- | --- | --- |
| bulbul:v2 | ✅ (-0.75 to 0.75) | ✅ (0.3-3.0) | ✅ (0.3-3.0) | ✗ | 22050 Hz | anushka |
| bulbul:v3-beta | ✗ | ✗ | ✅ (0.5-2.0) | ✅ (0.01-1.0) | 24000 Hz | aditya |

New Features:

  • bulbul:v3-beta model support with temperature control
  • New enums for type safety:
    • SarvamTTSModel: Model variants
    • SarvamTTSSpeakerV2: 7 speakers for v2
    • SarvamTTSSpeakerV3: 25 speakers for v3-beta
  • get_speakers_for_model() helper function
  • Automatic parameter clamping for pace when outside v3 range
  • Model-specific defaults for sample rate, speaker, and preprocessing
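The get_speakers_for_model() helper could look roughly like this. The PR uses enums (SarvamTTSSpeakerV2/SarvamTTSSpeakerV3), so the plain string tuples here are a simplification; only the speaker names themselves come from the PR description:

```python
# Sketch of the described helper; pipecat's real version may return
# enum members rather than plain strings.
SPEAKERS_V2 = ("anushka", "abhilash", "manisha", "vidya", "arya", "karun", "hitesh")
SPEAKERS_V3 = (
    "aditya", "ritu", "priya", "neha", "rahul", "pooja", "rohan", "simran",
    "kavya", "amit", "dev", "ishita", "shreya", "ratan", "varun", "manan",
    "sumit", "roopa", "kabir", "aayan", "shubh", "ashutosh", "advait",
    "amelia", "sophia",
)

def get_speakers_for_model(model: str) -> tuple[str, ...]:
    if model == "bulbul:v2":
        return SPEAKERS_V2
    if model == "bulbul:v3-beta":
        return SPEAKERS_V3
    raise ValueError(f"Unknown Sarvam TTS model: {model}")
```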

Speakers:

  • bulbul:v2 (7): anushka, abhilash, manisha, vidya, arya, karun, hitesh
  • bulbul:v3-beta (25): aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

API Changes:

  • New temperature parameter in InputParams (0.01-1.0, default 0.6)
  • Warnings logged when using incompatible parameters (e.g., pitch with v3)
  • Both SarvamHttpTTSService and SarvamTTSService (WebSocket) updated
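The warning-and-clamping behavior above (pitch unsupported on v3, pace clamped to the v3 range) might be sketched like this; the function name and dict-based parameter handling are assumptions for illustration:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch: drop pitch with a warning for bulbul:v3-beta,
# and clamp pace into the v3 range of 0.5-2.0.
V3_PACE_MIN, V3_PACE_MAX = 0.5, 2.0

def normalize_v3_params(params: dict) -> dict:
    out = dict(params)
    if "pitch" in out:
        logger.warning("bulbul:v3-beta does not support pitch; ignoring")
        out.pop("pitch")
    if "pace" in out:
        out["pace"] = min(max(out["pace"], V3_PACE_MIN), V3_PACE_MAX)
    return out
```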


codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 0% with 162 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pipecat/services/sarvam/tts.py | 0.00% | 125 Missing ⚠️ |
| src/pipecat/services/sarvam/stt.py | 0.00% | 37 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/pipecat/services/sarvam/stt.py | 0.00% <0.00%> (ø) |
| src/pipecat/services/sarvam/tts.py | 0.00% <0.00%> (ø) |

... and 27 files with indirect coverage changes


@dhruvladia-sarvam
Contributor Author

@markbackman can we have this PR reviewed and merged at the earliest, please? It's urgent.

Contributor

@markbackman left a comment

Review of STT changes:

My biggest concern is maintainability. The different model code is spread throughout the class making it harder to understand. Could we instead centralize the model configuration and then modify the code to use the configuration based on the model used?

I spent a few minutes with Claude to demonstrate what I'm thinking. Here's the attached code reworked:
stt.py

You'll see I:

  • Added the ModelConfig as an immutable configuration
  • Then added the MODEL_CONFIGS dictionary using the ModelConfig to specify what the models are capable of
  • Then that MODEL_CONFIGS is used in the source

In this file, I also removed mode from __init__ and removed the repetition in the language settings.

WDYT?
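A minimal sketch of the suggested shape (the attached stt.py is not reproduced here, so the field names below are assumptions, not the reviewer's actual code):

```python
from dataclasses import dataclass
from typing import Optional

# Frozen dataclass makes each per-model configuration immutable,
# and MODEL_CONFIGS becomes the single source of truth that the
# service code looks up instead of branching on model strings.
@dataclass(frozen=True)
class ModelConfig:
    endpoint: str
    supports_mode: bool
    supports_prompt: bool
    default_language: Optional[str]  # None means auto-detect

MODEL_CONFIGS = {
    "saarika:v2.5": ModelConfig("speech_to_text_streaming", False, False, "unknown"),
    "saaras:v2.5": ModelConfig("speech_to_text_translate_streaming", False, True, None),
    "saaras:v3": ModelConfig("speech_to_text_streaming", True, True, "en-IN"),
}
```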

```python
model: str = "saarika:v2.5",
sample_rate: Optional[int] = None,
input_audio_codec: str = "wav",
mode: Optional[
```
Contributor

Why is mode initialized in two places? I would recommend that it be removed from __init__ and kept in InputParams only.
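The recommendation above, sketched as a hypothetical InputParams (a plain dataclass here; pipecat's actual InputParams class may differ):

```python
from dataclasses import dataclass
from typing import Optional

# mode lives only in InputParams, not duplicated in __init__.
@dataclass
class InputParams:
    language: Optional[str] = None
    # Options per the PR: transcribe, translate, verbatim, translit, codemix
    mode: Optional[str] = "transcribe"
```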

Contributor Author

I feel the current implementation of mode sits fine with the required logic for its value assignment.

```python
- "saaras:v3": Advanced STT model (supports mode and prompts)
sample_rate: Audio sample rate. Defaults to 16000 if not specified.
input_audio_codec: Audio codec/format of the input file. Defaults to "wav".
mode: Mode of operation for saaras:v3 models only. Options: transcribe, translate,
```
Contributor

Remove mode from docstring if removing mode from __init__.

```python
"model": self.model_name,
"vad_signals": vad_signals_str,
"high_vad_sensitivity": high_vad_sensitivity_str,
"input_audio_codec": self._input_audio_codec,
```
Contributor

Is this an intentional removal?

Contributor Author

Yes

Contributor

@markbackman left a comment

The TTS implementation has similar maintainability issues due to the models and configurations. I would recommend taking a similar approach to pulling the configuration out into separate code then using it in the service classes.

It could make sense to move the model config code outside of the stt.py and tts.py files into a separate models.py file. This would centralize the model configuration code and then keep the services focused on the core logic.

Can you make those changes and I can take a look again after the changes are made?
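A rough sketch of what such a models.py could hold, under the stated assumptions (the config class and field names here are illustrative, not the actual refactor):

```python
# models.py -- hypothetical layout for the suggested split: shared model
# configuration pulled out of stt.py and tts.py so the service classes
# stay focused on core logic.
from dataclasses import dataclass

@dataclass(frozen=True)
class STTModelConfig:
    endpoint: str
    supports_mode: bool

@dataclass(frozen=True)
class TTSModelConfig:
    default_sample_rate: int
    default_speaker: str

STT_MODELS = {
    "saaras:v3": STTModelConfig("speech_to_text_streaming", True),
}
TTS_MODELS = {
    "bulbul:v2": TTSModelConfig(22050, "anushka"),
    "bulbul:v3-beta": TTSModelConfig(24000, "aditya"),
}
```

stt.py and tts.py would then each do something like `from .models import STT_MODELS` and look up their configuration by model name.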

@dhruvladia-sarvam
Contributor Author

> Review of STT changes:
>
> My biggest concern is maintainability. The different model code is spread throughout the class making it harder to understand. Could we instead centralize the model configuration and then modify the code to use the configuration based on the model used?
>
> I spent a few minutes with Claude to demonstrate what I'm thinking. Here's the attached code reworked: stt.py
>
> You'll see I:
>
>   • Added the ModelConfig as an immutable configuration
>   • Then added the MODEL_CONFIGS dictionary using the ModelConfig to specify what the models are capable of
>   • Then that MODEL_CONFIGS is used in the source
>
> In this file, I also removed mode from __init__ and removed the repetition in the language settings.
>
> WDYT?

I agree, this sounds great in the long run too with a single source of truth

@dhruvladia-sarvam
Contributor Author

> It could make sense to move the model config code outside of the stt.py and tts.py files into a separate models.py file. This would centralize the model configuration code and then keep the services focused on the core logic.
>
> Can you make those changes and I can take a look again after the changes are made?

It's a reasonable refactor, but I feel it's not essential. The current structure works well because:

  1. Each service is self-contained
  2. The configs are semantically distinct
