TTS: Audio generation stops when sending second text with continue_=True on same context #59

@mihbt

Description

I'm experiencing an issue where sending a second text chunk to the same TTS context causes the ongoing audio generation to immediately stop, even when using the continue_ parameter correctly.

Expected Behavior

When sending multiple text chunks to the same context with continue_=True, the audio should continue generating sequentially for all chunks.

Actual Behavior

The audio generation for the first text stops immediately when the second context.send() is called.

Environment

Cartesia SDK Version: 2.0.17
Python Version: 3.13.2

Code Example

from cartesia import AsyncCartesia, OutputFormat_RawParams, TtsRequestIdSpecifierParams

# Setup (the snippet below runs inside an async function / event loop)
client = AsyncCartesia(api_key="...")
ws = await client.tts.websocket()
context = ws.context("my-context-id")

# First text - this works fine
await context.send(
    model_id="sonic-3",
    transcript="First text to convert to speech.",
    voice=TtsRequestIdSpecifierParams(mode="id", id="ac197a78-cec7-4c50-93e5-93bdc1910b11"),
    stream=True,
    output_format=OutputFormat_RawParams(
        container="raw",
        encoding="pcm_s16le",
        sample_rate=22050,
    ),
    continue_=False  # First message
)

# Second text - this causes the audio to stop generating
await context.send(
    model_id="sonic-3",
    transcript="Second text to convert to speech.",
    voice=TtsRequestIdSpecifierParams(mode="id", id="ac197a78-cec7-4c50-93e5-93bdc1910b11"),
    stream=True,
    output_format=OutputFormat_RawParams(
        container="raw",
        encoding="pcm_s16le",
        sample_rate=22050,
    ),
    continue_=True  # Continuation of previous
)

await context.no_more_inputs()

# Receiving audio
async for output in context.receive():
    if output.audio:
        # Audio for second text never arrives
        handle_audio(output.audio)

Observations

  • The first text generates audio successfully
  • Audio generation stops exactly when context.send() is called for the second text (e.g. if I add a 1 s delay before the second send, audio streams for 1 s and then stops)
  • Waiting for the first audio to finish generating before sending the second text doesn't help either; the second text is never generated at all
  • The continue_ flag is set correctly (False for first, True for subsequent)
  • I'm receiving audio in a concurrent task that starts before sending any text
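To make the last observation concrete, here is the shape of my receive-before-send ordering as a self-contained asyncio sketch, with an asyncio.Queue standing in for the websocket context (no Cartesia calls here):

```python
import asyncio

async def demo():
    # asyncio.Queue stands in for the TTS context; None plays the role
    # of no_more_inputs() signalling that no further chunks will arrive.
    queue: asyncio.Queue = asyncio.Queue()
    received = []

    async def receiver():
        # Drain items until the sentinel, mirroring
        # `async for output in context.receive()`.
        while True:
            item = await queue.get()
            if item is None:
                break
            received.append(item)

    # The receiver task starts BEFORE any text is sent.
    task = asyncio.create_task(receiver())
    await queue.put("audio-for-first-text")
    await queue.put("audio-for-second-text")
    await queue.put(None)
    await task
    return received

print(asyncio.run(demo()))
```

In this stand-in both items arrive as expected; with the real SDK, everything after the second send is missing.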

Questions

  1. Is it expected to call context.send() multiple times on the same context?
  2. Does continue_=True require a specific timing or pattern between sends?
  3. Should I be using separate contexts for each text chunk instead? If so, how do I keep voice consistency/prosody between them?
  4. Is there a way to queue multiple text chunks for sequential processing? (to keep voice consistency/prosody)

My Use Case

I'm streaming LLM-generated responses to TTS and need to send chunks as they arrive for minimal latency. I want to maintain voice consistency across chunks, which is why I'm trying to use the same context with continue_=True.
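To illustrate the flag pattern I'm aiming for, here is a tiny helper (plain Python; tag_continuations is a hypothetical name I'm using for this report, not part of the SDK) that tags streamed chunks with the continue_ value I intend to send:

```python
from typing import Iterable, Iterator, Tuple

def tag_continuations(chunks: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (chunk, continue_) pairs: False for the first chunk,
    True for every subsequent one."""
    first = True
    for chunk in chunks:
        yield chunk, not first
        first = False

# Intended usage with the context from the repro above (sketch):
# for chunk, cont in tag_continuations(llm_chunks):
#     await context.send(..., transcript=chunk, continue_=cont)
# await context.no_more_inputs()
```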

Any guidance would be greatly appreciated! 🙏
