fix(sarvam-tts): wrap raw PCM in RIFF/WAVE header when wav codec returns headerless bytes #5280

Open
IgnazioDS wants to merge 1 commit into livekit:main from IgnazioDS:fix/sarvam-wav-raw-pcm

Conversation

@IgnazioDS

Summary

Fixes #5267.

When output_audio_codec="wav" is set, the Sarvam API can return raw PCM bytes without a RIFF/WAVE header. The plugin then calls output_emitter.initialize(mime_type="audio/wav"), so downstream decoders expect a RIFF/WAVE header and crash with "Invalid WAV file: missing RIFF/WAVE".

Root cause

mime_type = f"audio/{self._opts.output_audio_codec}" correctly signals audio/wav, but base64.b64decode(b64) yields raw PCM bytes when Sarvam omits the container header.

Fix

Added a _pcm_to_wav() helper that prepends the standard RIFF/WAVE header (computed from sample_rate, num_channels, bit_depth=16) to raw PCM data.
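
For readers who want to see what such a header looks like, here is a minimal sketch of a helper of this kind. It is not the code from the PR diff: the name `pcm_to_wav`, its signature, and the defaults are assumptions for illustration. It builds the canonical 44-byte PCM header with `struct` and assumes little-endian 16-bit samples.

```python
import struct


def pcm_to_wav(
    pcm: bytes, sample_rate: int, num_channels: int = 1, bit_depth: int = 16
) -> bytes:
    """Prepend a canonical 44-byte RIFF/WAVE header to raw PCM (hypothetical helper)."""
    block_align = num_channels * bit_depth // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align        # bytes per second of audio
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF",
        36 + len(pcm),   # RIFF chunk size: everything after these first 8 bytes
        b"WAVE",
        b"fmt ",
        16,              # size of the fmt sub-chunk for plain PCM
        1,               # audio format 1 = uncompressed PCM
        num_channels,
        sample_rate,
        byte_rate,
        block_align,
        bit_depth,
        b"data",
        len(pcm),        # size of the data sub-chunk
    )
    return header + pcm
```

The two size fields depend on the total payload length, which is why the full PCM payload has to be known before the header can be written.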

REST path (Synthesize._run):

  • Collect all base64-decoded chunks
  • If bytes don't start with b"RIFF", wrap them with _pcm_to_wav()
  • Push the complete WAV to the emitter
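
The REST steps above can be sketched roughly as follows. This is an illustration, not the plugin code: `decode_rest_audio` and its parameters are hypothetical names, and the sketch uses the standard library `wave` module to write the container rather than the PR's `_pcm_to_wav()` helper. It assumes 16-bit samples.

```python
import base64
import io
import wave


def decode_rest_audio(
    b64_chunks: list[str], sample_rate: int, num_channels: int = 1
) -> bytes:
    """Decode base64 chunks and return WAV bytes (hypothetical helper)."""
    # Collect every decoded chunk first: the header needs the total length.
    data = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)

    # A response that already carries a container is passed through untouched.
    if data.startswith(b"RIFF"):
        return data

    # Otherwise treat the bytes as raw 16-bit PCM and wrap them.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(num_channels)
        w.setsampwidth(2)  # assumption: 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(data)
    return buf.getvalue()
```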

Streaming WebSocket path (SynthesizeStream):

  • When output_audio_codec == "wav", buffer raw PCM chunks in _wav_buffer (instead of pushing individual chunks, which can't form valid standalone WAV frames)
  • On event_type == "final", assemble the buffer into a complete WAV and push before calling output_emitter.end_input()

The b"RIFF" check on the assembled bytes makes the fix safe for Sarvam API responses that do include a proper WAV header.
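
A rough sketch of the buffering idea for the streaming path, under stated assumptions: `WavChunkBuffer` is a hypothetical class, not part of the plugin, and it assumes mono-or-multichannel 16-bit little-endian PCM. Chunks accumulate until the caller decides the stream is complete (for example on the "final" event), then one WAV payload is produced.

```python
import struct
from typing import List, Optional


class WavChunkBuffer:
    """Accumulate raw PCM chunks and emit a single WAV payload (hypothetical)."""

    def __init__(self, sample_rate: int, num_channels: int = 1) -> None:
        self._sample_rate = sample_rate
        self._num_channels = num_channels
        self._chunks: List[bytes] = []

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def flush(self) -> Optional[bytes]:
        """Return the assembled audio, or None if nothing was buffered."""
        if not self._chunks:
            return None
        data = b"".join(self._chunks)
        self._chunks.clear()
        if data.startswith(b"RIFF"):
            return data  # the server already sent a container
        block_align = self._num_channels * 2  # assumption: 16-bit samples
        header = struct.pack(
            "<4sI4s4sIHHIIHH4sI",
            b"RIFF", 36 + len(data), b"WAVE",
            b"fmt ", 16, 1, self._num_channels, self._sample_rate,
            self._sample_rate * block_align, block_align, 16,
            b"data", len(data),
        )
        return header + data
```

Because `flush()` clears the buffer and returns None when it is empty, calling it more than once is harmless.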

Test plan

  • output_audio_codec="wav" with bulbul:v3 no longer raises "Invalid WAV file: missing RIFF/WAVE"
  • output_audio_codec="mp3" (default) is unaffected
  • Streaming path emits audio correctly after "final" event with wav codec
  • _pcm_to_wav() produces a parseable WAV file (standard RIFF header verified)
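
One way the last item could be checked is with the standard library `wave` module, which rejects input that lacks a RIFF/WAVE header. The sketch below is illustrative only; `describe_wav` is a hypothetical name and is not part of the PR's test suite.

```python
import io
import wave
from typing import Tuple


def describe_wav(data: bytes) -> Tuple[int, int, int, int]:
    """Parse WAV bytes and return (channels, sample width, frame rate, frames).

    Raises wave.Error if the bytes do not form a valid RIFF/WAVE file.
    """
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes()
```

Passing the output of the header helper through a parser like this confirms that the size fields and format fields agree with the payload.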

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 609115a9e1


Comment on lines +1070 to +1073
    if self._opts.output_audio_codec == "wav":
        # Buffer raw PCM; the complete WAV container will be pushed once
        # the "final" event is received (see _handle_event_message).
        self._wav_buffer.append(audio_bytes)

[P1] Flush buffered WAV audio on non-final stream termination

For output_audio_codec == "wav", _handle_audio_message now only appends chunks to _wav_buffer, and the buffer is emitted only in _handle_event_message when event_type == "final". If send_completion_event=False (a supported option wired into the WebSocket URL) or the server closes the socket without sending a final event, _run still calls end_input() but never pushes buffered audio, so the stream can complete with missing/empty output. Please add a fallback flush path when the WS loop exits without a final event.
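
The general shape of such a fallback can be sketched as below. This is a simplified, synchronous illustration rather than the plugin's async WebSocket loop: `drain_wav_stream`, the `(kind, payload)` message tuples, and the `push` callback are all hypothetical, and the WAV wrapping step is left out to keep the focus on the flush.

```python
from typing import Callable, Iterable, List, Tuple


def drain_wav_stream(
    messages: Iterable[Tuple[str, bytes]], push: Callable[[bytes], None]
) -> None:
    """Buffer audio and flush on "final", or on exit if "final" never arrives."""
    buffer: List[bytes] = []

    def flush() -> None:
        # Safe to call repeatedly: an empty buffer pushes nothing.
        if buffer:
            push(b"".join(buffer))
            buffer.clear()

    try:
        for kind, payload in messages:
            if kind == "audio":
                buffer.append(payload)
            elif kind == "final":
                flush()
    finally:
        # Fallback: the stream ended without a "final" event.
        flush()
```

Making the flush idempotent means the normal "final" path and the fallback path can coexist without emitting the audio twice.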


@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.


Comment on lines +1070 to +1073
    if self._opts.output_audio_codec == "wav":
        # Buffer raw PCM; the complete WAV container will be pushed once
        # the "final" event is received (see _handle_event_message).
        self._wav_buffer.append(audio_bytes)

🔴 Streaming wav audio silently dropped when WebSocket closes without "final" event

The new wav buffering logic in _handle_audio_message (lines 1070-1073) appends audio chunks to self._wav_buffer instead of pushing them to the emitter. The buffer is only flushed in _handle_event_message when a "final" event is received (lines 1122-1132). However, when send_completion_event=False is configured (a user-facing option at tts.py:395), the server never sends a "final" event; the WebSocket simply closes after all audio is streamed. The recv_task breaks out of its loop on the WS close at line 937, _run_ws completes normally, and the buffered audio is never flushed. The _run method's finally block (lines 831-833) calls output_emitter.end_input(), which closes the emitter's write channel and permanently discards the buffered data.

Before this PR, audio chunks were pushed to the emitter immediately regardless of codec, so audio was always delivered. This is a regression that causes silent audio data loss.

Prompt for agents
In livekit-plugins/livekit-plugins-sarvam/livekit/plugins/sarvam/tts.py, the _wav_buffer is only flushed when a "final" event is received in _handle_event_message (lines 1122-1132). When send_completion_event=False, the server never sends this event, so the buffer is never flushed.

To fix this, add a fallback flush of the wav buffer at the end of _run_ws (around lines 983-987, in the inner finally block after asyncio.gather completes). After the gracefully_cancel call and before setting tasks to None, add:

    if self._wav_buffer:
        all_pcm = b"".join(self._wav_buffer)
        self._wav_buffer.clear()
        if not all_pcm.startswith(b"RIFF"):
            all_pcm = _pcm_to_wav(
                all_pcm, self._opts.speech_sample_rate, 1
            )
        output_emitter.push(all_pcm)

This ensures that even when the "final" event is never received (e.g. send_completion_event=False, or unexpected WS close), any buffered wav audio is still emitted.

Development

Successfully merging this pull request may close these issues.

[sarvam tts] output_audio_codec="wav" causes "Invalid WAV file: missing RIFF/WAVE" error
