fix(sarvam-tts): wrap raw PCM in RIFF/WAVE header when wav codec returns headerless bytes #5280
IgnazioDS wants to merge 1 commit into livekit:main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 609115a9e1
if self._opts.output_audio_codec == "wav":
    # Buffer raw PCM; the complete WAV container will be pushed once
    # the "final" event is received (see _handle_event_message).
    self._wav_buffer.append(audio_bytes)
Flush buffered WAV audio on non-final stream termination
For output_audio_codec == "wav", _handle_audio_message now only appends chunks to _wav_buffer, and the buffer is emitted only in _handle_event_message when event_type == "final". If send_completion_event=False (a supported option wired into the WebSocket URL) or the server closes the socket without sending a final event, _run still calls end_input() but never pushes buffered audio, so the stream can complete with missing/empty output. Please add a fallback flush path when the WS loop exits without a final event.
if self._opts.output_audio_codec == "wav":
    # Buffer raw PCM; the complete WAV container will be pushed once
    # the "final" event is received (see _handle_event_message).
    self._wav_buffer.append(audio_bytes)
🔴 Streaming wav audio silently dropped when WebSocket closes without "final" event
The new wav buffering logic in _handle_audio_message (line 1070-1073) appends audio chunks to self._wav_buffer instead of pushing them to the emitter. The buffer is only flushed in _handle_event_message when a "final" event is received (line 1122-1132). However, when send_completion_event=False is configured (a user-facing option at tts.py:395), the server never sends a "final" event; the WebSocket simply closes after all audio is streamed. The recv_task breaks out of its loop on the WS close at line 937, _run_ws completes normally, and the buffered audio is never flushed. The _run method's finally block (line 831-833) calls output_emitter.end_input(), which closes the emitter's write channel and permanently discards the buffered data.
Before this PR, audio chunks were pushed to the emitter immediately regardless of codec, so audio was always delivered. This is a regression that causes silent audio data loss.
Prompt for agents
In livekit-plugins/livekit-plugins-sarvam/livekit/plugins/sarvam/tts.py, the _wav_buffer is only flushed when a "final" event is received in _handle_event_message (lines 1122-1132). When send_completion_event=False, the server never sends this event, so the buffer is never flushed.
To fix this, add a fallback flush of the wav buffer at the end of _run_ws (around lines 983-987, in the inner finally block after asyncio.gather completes). After the gracefully_cancel call and before setting tasks to None, add:
    if self._wav_buffer:
        all_pcm = b"".join(self._wav_buffer)
        self._wav_buffer.clear()
        if not all_pcm.startswith(b"RIFF"):
            all_pcm = _pcm_to_wav(
                all_pcm, self._opts.speech_sample_rate, 1
            )
        output_emitter.push(all_pcm)
This ensures that even when the "final" event is never received (e.g. send_completion_event=False, or unexpected WS close), any buffered wav audio is still emitted.
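The fallback described above can be exercised outside the plugin with a small simulation. The sketch below is only an illustration under stated assumptions: WavBufferSketch and run_stream are hypothetical names, the push and wrap callables stand in for output_emitter.push and _pcm_to_wav, and the message tuples are a simplification of the real WebSocket frames. It shows a flush that is safe to call twice, so the "final" handler and the finally block can both invoke it.

```python
from typing import Callable, Iterable, List, Optional, Tuple

RIFF_MAGIC = b"RIFF"


class WavBufferSketch:
    """Hypothetical stand-in for the wav buffering state in SynthesizeStream."""

    def __init__(self, push: Callable[[bytes], None], wrap: Callable[[bytes], bytes]) -> None:
        self._push = push  # assumption: plays the role of output_emitter.push
        self._wrap = wrap  # assumption: plays the role of _pcm_to_wav
        self._chunks: List[bytes] = []

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def flush(self) -> bool:
        """Push buffered audio once; return False when the buffer was empty."""
        if not self._chunks:
            return False
        data = b"".join(self._chunks)
        self._chunks.clear()
        # Only wrap when the assembled bytes lack a container header.
        if not data.startswith(RIFF_MAGIC):
            data = self._wrap(data)
        self._push(data)
        return True


def run_stream(messages: Iterable[Tuple[str, Optional[bytes]]], buf: WavBufferSketch) -> None:
    """Simulated receive loop; messages are ("audio", bytes) or ("final", None)."""
    try:
        for kind, payload in messages:
            if kind == "audio" and payload is not None:
                buf.append(payload)
            elif kind == "final":
                buf.flush()
    finally:
        # Fallback: a no-op if "final" already flushed the buffer.
        buf.flush()
```

Because flush() clears the buffer before pushing, calling it from both places cannot emit the same audio twice.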
Was this helpful? React with π or π to provide feedback.
Summary
Fixes #5267.
When output_audio_codec="wav" is set, the Sarvam API can return raw PCM bytes without a RIFF/WAVE header. The plugin then calls output_emitter.initialize(mime_type="audio/wav"), which causes downstream decoders to expect a RIFF/WAVE header and crash with "Invalid WAV file: missing RIFF/WAVE".
Root cause
mime_type = f"audio/{self._opts.output_audio_codec}" correctly signals audio/wav, but base64.b64decode(b64) yields raw PCM bytes when Sarvam omits the container header.
Fix
Added a _pcm_to_wav() helper that prepends the standard RIFF/WAVE header (computed from sample_rate, num_channels, bit_depth=16) to raw PCM data.
REST path (Synthesize._run):
- If the decoded bytes do not start with b"RIFF", wrap them with _pcm_to_wav().
Streaming WebSocket path (SynthesizeStream):
- When output_audio_codec == "wav", buffer raw PCM chunks in _wav_buffer (instead of pushing individual chunks, which can't form valid standalone WAV frames).
- On event_type == "final", assemble the buffer into a complete WAV and push it before calling output_emitter.end_input().
The b"RIFF" check on the assembled bytes makes the fix safe for Sarvam API responses that do include a proper WAV header.
Test plan
- output_audio_codec="wav" with bulbul:v3 no longer raises "Invalid WAV file: missing RIFF/WAVE"
- output_audio_codec="mp3" (default) is unaffected
- _pcm_to_wav() produces a parseable WAV file (standard RIFF header verified)
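For readers who want to check the header layout named in the Fix section, here is a minimal sketch of a helper in the spirit of _pcm_to_wav(). It is not the PR's actual implementation: the name pcm_to_wav_sketch, its signature, and the 22050 Hz rate in the usage lines are assumptions chosen for illustration. It builds the canonical 44-byte PCM header with struct and verifies the result with the standard-library wave module.

```python
import io
import struct
import wave


def pcm_to_wav_sketch(
    pcm: bytes, sample_rate: int, num_channels: int = 1, bit_depth: int = 16
) -> bytes:
    """Prepend a canonical 44-byte RIFF/WAVE header to raw little-endian PCM."""
    block_align = num_channels * bit_depth // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align        # bytes per second
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF",
        36 + len(pcm),   # RIFF chunk size: rest of header plus audio data
        b"WAVE",
        b"fmt ",
        16,              # fmt chunk size for plain PCM
        1,               # audio format 1 = uncompressed PCM
        num_channels,
        sample_rate,
        byte_rate,
        block_align,
        bit_depth,
        b"data",
        len(pcm),        # data chunk size
    )
    return header + pcm


# Usage: wrap silent 16-bit mono samples and parse them back.
# The 22050 Hz rate is an arbitrary value for the example.
pcm = b"\x00\x00" * 2200
wav_bytes = pcm_to_wav_sketch(pcm, 22050)
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    parsed_rate = reader.getframerate()
    parsed_frames = reader.getnframes()
    parsed_pcm = reader.readframes(parsed_frames)
```

Parsing the output with wave is one way to cover the last test-plan item, since wave.open rejects input that does not start with a RIFF id.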