
Releases: pipecat-ai/pipecat

v0.0.106

19 Mar 06:43
8750c26

Added

  • Added optional service field to ServiceUpdateSettingsFrame (and its subclasses LLMUpdateSettingsFrame, TTSUpdateSettingsFrame, STTUpdateSettingsFrame) to target a specific service instance. When service is set, only the matching service applies the settings; others forward the frame unchanged. This enables updating a single service when multiple services of the same type exist in the pipeline.
    (PR #4004)
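The dispatch rule is simple: a processor applies the frame only when it is untargeted or targeted at that instance. A minimal stand-alone sketch (the frame and service classes here are simplified stand-ins, not pipecat's actual classes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpdateSettingsFrame:
    settings: dict
    service: Optional[object] = None  # target instance, or None for "all"

class FakeTTSService:
    def __init__(self):
        self.settings = {}
        self.forwarded = []

    def process_frame(self, frame):
        if frame.service is not None and frame.service is not self:
            # Not addressed to us: forward the frame unchanged.
            self.forwarded.append(frame)
            return
        self.settings.update(frame.settings)

tts_a, tts_b = FakeTTSService(), FakeTTSService()
frame = UpdateSettingsFrame(settings={"voice": "alloy"}, service=tts_b)
tts_a.process_frame(frame)
tts_b.process_frame(frame)
```

With `service=None`, both instances would apply the update, preserving the previous broadcast behavior.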

  • Added sip_provider and room_geo parameters to configure() in the Daily runner. These convenience parameters let callers specify a SIP provider name and geographic region directly without manually constructing DailyRoomProperties and DailyRoomSipParams.
    (PR #4005)

  • Added PerplexityLLMAdapter that automatically transforms conversation messages to satisfy Perplexity's stricter API constraints (strict role alternation, no non-initial system messages, last message must be user/tool). Previously, certain conversation histories could cause Perplexity API errors that didn't occur with OpenAI (PerplexityLLMService subclasses OpenAILLMService since Perplexity uses an OpenAI-compatible API).
    (PR #4009)
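The core of such an adapter is a message-normalization pass. A simplified sketch of one of the constraints, merging consecutive same-role messages so roles strictly alternate (the real adapter also handles system-message placement and tool results):

```python
def normalize_for_strict_alternation(messages):
    """Merge consecutive same-role messages so roles strictly alternate.

    Illustrative only: PerplexityLLMAdapter applies more transforms than
    this single rule.
    """
    out = []
    for msg in messages:
        if out and out[-1]["role"] == msg["role"]:
            merged = out[-1]["content"] + "\n" + msg["content"]
            out[-1] = {"role": msg["role"], "content": merged}
        else:
            out.append(dict(msg))
    return out

history = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},  # would violate alternation
    {"role": "assistant", "content": "Hello!"},
]
normalized = normalize_for_strict_alternation(history)
```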

  • Added DTMF input event support to the Daily transport. Incoming DTMF tones are now received via Daily's on_dtmf_event callback and pushed into the pipeline as InputDTMFFrame, enabling bots to react to keypad presses from phone callers.
    (PR #4047)

  • Added WakePhraseUserTurnStartStrategy for triggering user turns based on wake phrases, with support for single_activation mode. Deprecates WakeCheckFilter.
    (PR #4064)
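A rough sketch of the triggering logic, assuming single_activation means the wake phrase is required only once per session (class and method names here are illustrative, not pipecat's API):

```python
class WakePhraseDetector:
    """Hypothetical stand-in for the wake-phrase decision logic."""

    def __init__(self, phrases, single_activation=False):
        self._phrases = [p.lower() for p in phrases]
        self._single_activation = single_activation
        self._activated = False

    def should_start_turn(self, transcript: str) -> bool:
        # Once activated in single_activation mode, every utterance counts.
        if self._activated and self._single_activation:
            return True
        if any(p in transcript.lower() for p in self._phrases):
            self._activated = True
            return True
        return False

detector = WakePhraseDetector(["hey pipecat"], single_activation=True)
```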

  • Added default_user_turn_start_strategies() and default_user_turn_stop_strategies() helper functions for composing custom strategy lists.
    (PR #4064)

Changed

  • Changed tool result JSON serialization to use ensure_ascii=False, preserving UTF-8 characters instead of escaping them. This reduces context size and token usage for non-English languages.
    (PR #3457)
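The effect is easy to see with the standard json module: the default escapes every non-ASCII character into a six-character \uXXXX sequence, inflating the serialized payload:

```python
import json

result = {"answer": "¿Cómo estás? 今日は晴れです"}

escaped = json.dumps(result)                       # default: ensure_ascii=True
preserved = json.dumps(result, ensure_ascii=False)

# `escaped` expands each non-ASCII character to \uXXXX, so the
# preserved form is substantially shorter in context.
```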

  • OpenAIRealtimeSTTService's noise_reduction parameter is now part of OpenAIRealtimeSTTSettings, making it runtime-updatable via STTUpdateSettingsFrame. The direct noise_reduction init argument is deprecated as of 0.0.106.
    (PR #3991)

  • Updated sarvamai dependency from 0.1.26a2 (alpha) to 0.1.26 (stable release).
    (PR #3997)

  • SimliVideoService now extends AIService instead of FrameProcessor, aligning it with the HeyGen and Tavus video services. It supports SimliVideoService.Settings(...) for configuration and uses start()/stop()/cancel() lifecycle methods. Existing constructor usage (api_key, face_id, etc.) remains unchanged.
    (PR #4001)

  • Updated pipecat-ai-small-webrtc-prebuilt to 2.4.0.
    (PR #4023)

  • Nova Sonic assistant text transcripts are now delivered in real-time using speculative text events instead of delayed final text events. Previously, assistant text only arrived after all audio had finished playing, causing laggy transcripts in client UIs. Speculative text arrives before each audio chunk, providing text synchronized with what the bot is saying. This also simplifies the internal text handling by removing the interruption re-push hack and assistant text buffer.
    (PR #4042)

  • Updated daily-python dependency to 0.25.0.
    (PR #4047)

  • Added enable_dialout parameter to configure() in pipecat.runner.daily to support dial-out rooms. Also narrowed misleading Optional type hints and deduplicated token expiry calculation.
    (PR #4048)

  • Extended ProcessFrameResult to stop strategies, allowing a stop strategy to short-circuit evaluation of subsequent strategies by returning STOP.
    (PR #4064)

  • GradiumSTTService now takes both encoding and sample_rate constructor arguments, which are assembled in the class to form the input_format. PCM accepts 8000, 16000, and 24000 Hz sample rates.
    (PR #4066)
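A hypothetical sketch of that assembly and validation (the function name and the `encoding_rate` format template are assumptions, not Gradium's actual wire format):

```python
# Sample rates the changelog states PCM accepts.
PCM_RATES = {8000, 16000, 24000}

def build_input_format(encoding: str, sample_rate: int) -> str:
    """Assemble an input format string from the two constructor arguments."""
    if encoding == "pcm" and sample_rate not in PCM_RATES:
        raise ValueError(
            f"PCM supports {sorted(PCM_RATES)} Hz, got {sample_rate}"
        )
    return f"{encoding}_{sample_rate}"
```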

  • Improved GradiumSTTService transcription accuracy by reworking how text fragments are accumulated and finalized. Previously, trailing words could be dropped when the server's flushed response arrived before all text tokens were delivered. The service now uses a short aggregation delay after flush to capture trailing tokens, producing complete utterances.
    (PR #4066)

Deprecated

  • SimliVideoService.InputParams is deprecated. Use the direct constructor parameters max_session_length, max_idle_time, and enable_logging instead.
    (PR #4001)

  • Deprecated LocalSmartTurnAnalyzerV2 and LocalCoreMLSmartTurnAnalyzer. Use LocalSmartTurnAnalyzerV3 instead. Instantiating these analyzers will now emit a DeprecationWarning.
    (PR #4012)

  • Deprecated WakeCheckFilter in favor of WakePhraseUserTurnStartStrategy.
    (PR #4064)

Fixed

  • Fixed an issue where the default model for OpenAILLMService and AzureLLMService was mistakenly reverted to gpt-4o. The defaults are now restored to gpt-4.1.
    (PR #4000)

  • Fixed a race condition where EndTaskFrame could cause the pipeline to shut down before in-flight frames (e.g. LLM function call responses) finished processing. EndTaskFrame and StopTaskFrame now flow through the pipeline as ControlFrames, ensuring all pending work is flushed before shutdown begins. CancelTaskFrame and InterruptionTaskFrame remain immediate (SystemFrame).
    (PR #4006)

  • Fixed ParallelPipeline dropping or misordering frames during lifecycle synchronization. Buffered frames are now flushed in the correct order relative to synchronization frames (StartFrame goes first, EndFrame/CancelFrame go after), and frames added to the buffer during flush are also drained.
    (PR #4007)

  • Fixed TTSService potentially canceling in-flight audio during shutdown. The stop sequence now waits for all queued audio contexts to finish processing before canceling the stop frame task.
    (PR #4007)

  • Fixed Language enum values (e.g. Language.ES) not being converted to service-specific codes when passed via settings=Service.Settings(language=Language.ES) at init time. This caused API errors (e.g. 400 from Rime) because the raw enum was sent instead of the expected language code (e.g. "spa"). Runtime updates via UpdateSettingsFrame were unaffected. The fix centralizes conversion in the base TTSService and STTService classes so all services handle this consistently.
    (PR #4024)

  • Fixed DeepgramSTTService ignoring the base_url scheme when using ws:// or http://. Previously these were silently overwritten with wss:// / https://, breaking air-gapped or private deployments that don't use TLS. All scheme choices (wss://, https://, ws://, http://, or bare hostname) are now respected.
    (PR #4026)
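The fixed behavior amounts to: honor an explicit scheme, and only default bare hostnames to wss://. A sketch (the function name is illustrative):

```python
def resolve_base_url(base_url: str) -> str:
    """Respect an explicit scheme; default only bare hostnames to wss://."""
    if "://" in base_url:
        # ws://, wss://, http://, https:// are all kept as given,
        # so non-TLS private deployments keep working.
        return base_url
    return f"wss://{base_url}"

insecure = resolve_base_url("ws://deepgram.internal:8080")
default = resolve_base_url("api.deepgram.com")
```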

  • Fixed LLMSwitcher.register_function() and register_direct_function() not accepting or forwarding the timeout_secs parameter.
    (PR #4037)

  • Fixed empty user transcriptions in Nova Sonic causing spurious interruptions. Previously, an empty transcription could trigger an interruption of the assistant's response even though the user hadn't actually spoken.
    (PR #4042)

  • Fixed SonioxSTTService and OpenAIRealtimeSTTService crash when language parameters contain plain strings instead of Language enum values.
    (PR #4046)

  • Fixed premature user turn stops caused by late transcriptions arriving between turns. A stale transcript from the previous turn could persist into the next turn and trigger a stop before the current turn's real transcript arrived. Stop strategies are now reset at both turn start and turn stop to prevent state from leaking across turn boundaries.
    (PR #4057)

  • Fixed raw language strings like "de-DE" silently failing when passed to TTS/STT services (e.g. ElevenLabs producing no audio). Raw strings now go through the same Language enum resolution as enum values, so regional codes like "de-DE" are properly converted to service-expected formats like "de". Unrecognized strings log a warning instead of failing silently.
    (PR #4058)
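The described resolution can be sketched with a stand-in enum: raw strings are first resolved to the Language enum, then mapped to the service's expected code, and unrecognized strings fall through with a warning (the enum, map, and function here are simplified stand-ins):

```python
from enum import Enum

class Language(Enum):  # tiny stand-in for pipecat's Language enum
    DE = "de"
    DE_DE = "de-DE"

def resolve_language(value, service_map):
    """Resolve a raw string or enum to a service-specific code."""
    if isinstance(value, str):
        try:
            value = Language(value)  # same path enum values take
        except ValueError:
            print(f"warning: unrecognized language {value!r}")
            return value  # warn instead of failing silently
    return service_map.get(value, value.value)

# Hypothetical service map: ElevenLabs expects "de" for German variants.
elevenlabs_map = {Language.DE_DE: "de", Language.DE: "de"}
code = resolve_language("de-DE", elevenlabs_map)
```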

  • Fixed Deepgram STT list-type settings (keyterm, keywords, search, redact, replace) being stringified instead of passed as lists to the SDK, which caused them to be sent as literal strings (e.g. "['pipecat']") in the WebSocket query params.
    (PR #4063)
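This class of bug is easy to reproduce with the standard library: stringifying a list yields one literal query value, while passing the list through (with doseq=True) yields one parameter per item. This illustrates the failure mode, not the Deepgram SDK's internals:

```python
from urllib.parse import urlencode

keyterms = ["pipecat", "latency"]

# Bug: stringifying the list produces one literal "['pipecat', 'latency']" value.
broken = urlencode({"keyterm": str(keyterms)})

# Fix: pass the list so each item becomes its own query parameter.
fixed = urlencode({"keyterm": keyterms}, doseq=True)
```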

  • ...

v0.0.105

11 Mar 01:01
7e88b13

Added

  • Added concurrent audio context support: CartesiaTTSService can now synthesize the next sentence while the previous one is still playing, by setting pause_frame_processing=False and routing each sentence through its own audio context queue.
    (PR #3804)

  • Added custom video track support to Daily transport. Use video_out_destinations in DailyParams to publish multiple video tracks simultaneously, mirroring the existing audio_out_destinations feature.
    (PR #3831)

  • Added ServiceSwitcherStrategyFailover that automatically switches to the next service when the active service reports a non-fatal error. Recovery policies can be implemented via the on_service_switched event handler.
    (PR #3861)

  • Added optional timeout_secs parameter to register_function() and register_direct_function() for per-tool function call timeout control, overriding the global function_call_timeout_secs default.
    (PR #3915)
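One plausible shape for the timeout resolution, where a per-tool timeout_secs, when given, overrides the global function_call_timeout_secs default (the registry class and handler here are entirely illustrative):

```python
import asyncio

class FunctionRegistry:
    """Illustrative sketch, not pipecat's internals."""

    def __init__(self, function_call_timeout_secs: float = 30.0):
        self._default_timeout = function_call_timeout_secs
        self._tools = {}

    def register_function(self, name, handler, timeout_secs=None):
        self._tools[name] = (handler, timeout_secs)

    async def call(self, name, *args):
        handler, timeout = self._tools[name]
        # Per-tool timeout wins; otherwise fall back to the global default.
        effective = timeout if timeout is not None else self._default_timeout
        return await asyncio.wait_for(handler(*args), timeout=effective)

async def slow_lookup(query):
    await asyncio.sleep(0.05)
    return f"results for {query}"

registry = FunctionRegistry(function_call_timeout_secs=10.0)
registry.register_function("lookup", slow_lookup, timeout_secs=0.01)
try:
    result = asyncio.run(registry.call("lookup", "weather"))
except asyncio.TimeoutError:
    result = "timed out"
```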

  • Added cloud-audio-only recording option to Daily transport's enable_recording property.
    (PR #3916)

  • Wired up system_instruction in BaseOpenAILLMService, AnthropicLLMService, and AWSBedrockLLMService so it works as a default system prompt, matching the behavior of the Google services. This enables sharing a single LLMContext across multiple LLM services, where each service provides its own system instruction independently.

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        system_instruction="You are a helpful assistant.",
    )
    
    context = LLMContext()
    
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message({"role": "user", "content": "Please introduce yourself."})
        await task.queue_frames([LLMRunFrame()])

    (PR #3918)

  • Added vad_threshold parameter to AssemblyAIConnectionParams for configuring voice activity detection sensitivity in U3 Pro. Aligning this with external VAD thresholds (e.g., Silero VAD) prevents the "dead zone" where AssemblyAI transcribes speech that VAD hasn't detected yet.
    (PR #3927)

  • Added push_empty_transcripts parameter to BaseWhisperSTTService and OpenAISTTService to allow empty transcripts to be pushed downstream as TranscriptionFrame instead of discarding them (the default behavior). This is intended for situations where VAD fires even though the user did not speak. In these cases, it is useful to know that nothing was transcribed so that the agent can resume speaking, instead of waiting longer for a transcription.
    (PR #3930)

  • LLM services (BaseOpenAILLMService, AnthropicLLMService, AWSBedrockLLMService) now log a warning when both system_instruction and a system message in the context are set. The constructor's system_instruction takes precedence.
    (PR #3932)

  • Runtime settings updates (via STTUpdateSettingsFrame) now work for AWS Transcribe, Azure, Cartesia, Deepgram, ElevenLabs Realtime, Gradium, and Soniox STT services. Previously, changing settings at runtime only stored the new values without reconnecting.
    (PR #3946)

  • Exposed on_summary_applied event on LLMAssistantAggregator, allowing users to listen for context summarization events without accessing private members.
    (PR #3947)

  • Deepgram Flux STT settings (keyterm, eot_threshold, eager_eot_threshold, eot_timeout_ms) can now be updated mid-stream via STTUpdateSettingsFrame without triggering a reconnect. The new values are sent to Deepgram as a Configure WebSocket message on the existing connection.
    (PR #3953)

  • Added system_instruction parameter to run_inference across all LLM services, allowing callers to override the system prompt for one-shot inference calls. Used by _generate_summary to pass the summarization prompt cleanly.
    (PR #3968)

Changed

  • Audio context management (previously in AudioContextTTSService) is now built into TTSService. All WebSocket providers (cartesia, elevenlabs, asyncai, inworld, rime, gradium, resembleai) now inherit from WebsocketTTSService directly. Word-timestamp baseline is set automatically on the first audio chunk of each context instead of requiring each provider to call start_word_timestamps() in their receive loop.
    (PR #3804)

  • Daily transport now uses CustomVideoSource/CustomVideoTrack instead of VirtualCameraDevice for the default camera output, mirroring how audio already works with CustomAudioSource/CustomAudioTrack.
    (PR #3831)

  • ⚠️ Updated DeepgramSTTService to use deepgram-sdk v6. The LiveOptions class was removed from the SDK and is now provided by pipecat directly; import it from pipecat.services.deepgram.stt instead of deepgram.
    (PR #3848)

  • ServiceSwitcherStrategy base class now provides a handle_error() hook for subclasses to implement error-based switching. ServiceSwitcher defaults to ServiceSwitcherStrategyManual and strategy_type is now optional.
    (PR #3861)

  • Support for Voice Focus 2.0 models.

    • Updated aic-sdk to ~=2.1.0 to support Voice Focus 2.0 models.
    • Cleaned unused ParameterFixedError exception handling in AICFilter
      parameter setup.
      (PR #3889)
  • max_context_tokens and max_unsummarized_messages in LLMAutoContextSummarizationConfig (and deprecated LLMContextSummarizationConfig) can now be set to None independently to disable that summarization threshold. At least one must remain set.
    (PR #3914)
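The at-least-one rule can be sketched as a small validating dataclass (field names follow the config described above; the defaults and structure are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SummarizationThresholds:
    max_context_tokens: Optional[int] = 8000
    max_unsummarized_messages: Optional[int] = 20

    def __post_init__(self):
        # Either threshold may be disabled with None, but not both.
        if self.max_context_tokens is None and self.max_unsummarized_messages is None:
            raise ValueError("at least one summarization threshold must be set")

tokens_only = SummarizationThresholds(max_unsummarized_messages=None)
```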

  • ⚠️ Removed formatted_finals and word_finalization_max_wait_time from AssemblyAIConnectionParams as these were v2 API parameters not supported in v3. Clarified that format_turns only applies to Universal-Streaming models; U3 Pro has automatic formatting built-in.
    (PR #3927)

  • Changed DeepgramTTSService to send a Clear message on interruption instead of disconnecting and reconnecting the WebSocket, allowing the connection to persist throughout the session.
    (PR #3958)

  • Re-added enhancement_level support to AICFilter with runtime FilterEnableFrame control, applying ProcessorParameter.Bypass and ProcessorParameter.EnhancementLevel together.
    (PR #3961)

  • Updated daily-python dependency from ~=0.23.0 to ~=0.24.0.
    (PR #3970)

  • Updated FishAudioTTSService default model from s1 to s2-pro, matching Fish Audio's latest recommended model for improved quality and speed.
    (PR #3973)

  • AzureSTTService region parameter is now optional when private_endpoint is provided. A ValueError is raised if neither is given, and a warning is logged if both are provided (private_endpoint takes priority).
    (PR #3974)

Deprecated

  • Deprecated AudioContextTTSService and AudioContextWordTTSService. Subclass WebsocketTTSService directly instead; audio context management is now part of the base TTSService.

    • Deprecated WordTTSService, WebsocketWordTTSService, and InterruptibleWordTTSService. Word timestamp logic is now always active in TTSService and no longer needs to be opted into via a subclass.
      (PR #3804)
  • Deprecated pipecat.services.google.llm_vertex, pipecat.services.google.llm_openai, and pipecat.services.google.gemini_live.llm_vertex modules. Use pipecat.services.google.vertex.llm, pipecat.services.google.openai.llm, and pipecat.services.google.gemini_live.vertex.llm instead. The old import paths still work but will emit a DeprecationWarning.
    (PR #3980)

Removed

  • ⚠️ Removed supports_word_timestamps parameter from TTSService.__init__(). Word timestamp logic is now always active. Remove this argument from any custom subclass super().__init__() calls.
    (PR #3804)

Fixed

  • Fixed DeepgramSTTService keepalive ping timeout disconnections. The deepgram-sdk v6 removed automatic keepalive; pipecat now sends explicit KeepAlive messages every 5 seconds, within the recommended 3–5 second interval before Deepgram's 10-second inactivity timeout.
    (PR #3848)

  • Fixed BufferError: Existing exports of data: object cannot be re-sized in AICFilter caused by holding a memoryview on the mutable audio buffer across async yield points.
    (PR #3889)
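The underlying pitfall is reproducible in a few lines of plain Python: a bytearray cannot be resized while any memoryview export on it is alive:

```python
buf = bytearray(b"\x00" * 16)
view = memoryview(buf)  # exports the buffer

# While the export is alive, resizing raises BufferError.
try:
    buf.extend(b"\x00" * 16)   # simulates new audio arriving mid-processing
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

view.release()                  # drop the export before mutating
buf.extend(b"\x00" * 16)        # now the resize succeeds
```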

  • Fixed TTS context not being appended to the assistant message history when using TTSSpeakFrame with append_to_context=True with some TTS providers.
    (PR #3936)

v0.0.104

03 Mar 05:25
5940731

Added

  • Added TextAggregationMetricsData metric measuring the time from the first LLM token to the first complete sentence, representing the latency cost of sentence aggregation in the TTS pipeline.
    (PR #3696)

  • Added support for using strongly-typed objects instead of dicts for updating service settings at runtime.

    Instead of, say:

    await task.queue_frame(
        STTUpdateSettingsFrame(settings={"language": Language.ES})
    )

    you'd do:

    await task.queue_frame(
        STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
    )

    Each service now vends strongly-typed classes like DeepgramSTTSettings representing the service's runtime-updatable settings.
    (PR #3714)

  • Added support for specifying private endpoints for Azure Speech-to-Text, enabling use in private networks behind firewalls.
    (PR #3764)

  • Added LemonSliceTransport and LemonSliceApi to support adding real-time LemonSlice Avatars to any Daily room.
    (PR #3791)

  • Added output_medium parameter to AgentInputParams and OneShotInputParams in Ultravox service to control initial output medium (text or voice) at call creation time.
    (PR #3806)

  • Added TurnMetricsData as a generic metrics class for turn detection, with e2e processing time measurement. KrispVivaTurn now emits TurnMetricsData with e2e_processing_time_ms tracking the interval from VAD speech-to-silence transition to turn completion.
    (PR #3809)

  • Added on_audio_context_interrupted() and on_audio_context_completed() callbacks to AudioContextTTSService. Subclasses can override these to perform provider-specific cleanup instead of overriding _handle_interruption().
    (PR #3814)

  • Added on_summary_applied event to LLMContextSummarizer for observability, providing message counts before and after context summarization.
    (PR #3855)

  • Added summary_message_template to LLMContextSummarizationConfig for customizing how summaries are formatted when injected into context (e.g., wrapping in XML tags).
    (PR #3855)

  • Added summarization_timeout to LLMContextSummarizationConfig (default 120s) to prevent hung LLM calls from permanently blocking future summarizations.
    (PR #3855)

  • Added optional llm field to LLMContextSummarizationConfig for routing summarization to a dedicated LLM service (e.g., a cheaper/faster model) instead of the pipeline's primary model.
    (PR #3855)

  • Added AssemblyAI u3-rt-pro model support with a built-in turn detection mode.
    (PR #3856)

  • Added LLMSummarizeContextFrame to trigger on-demand context summarization from anywhere in the pipeline (e.g. a function call tool). Accepts an optional config: LLMContextSummaryConfig to override summary generation settings per request.
    (PR #3863)

  • Added LLMContextSummaryConfig (summary generation params: target_context_tokens, min_messages_after_summary, summarization_prompt) and LLMAutoContextSummarizationConfig (auto-trigger thresholds: max_context_tokens, max_unsummarized_messages, plus a nested summary_config). These replace the monolithic LLMContextSummarizationConfig.
    (PR #3863)

  • Added support for the speed_alpha parameter to the arcana model in RimeTTSService.
    (PR #3873)

  • Added ClientConnectedFrame, a new SystemFrame pushed by all transports (Daily, LiveKit, FastAPI WebSocket, WebSocket Server, SmallWebRTC, HeyGen, Tavus) when a client connects. Enables observers to track transport readiness timing.
    (PR #3881)

  • Added StartupTimingObserver for measuring how long each processor's start() method takes during pipeline startup. Also measures transport readiness — the time from StartFrame to first client connection — via the on_transport_timing_report event.
    (PR #3881)

  • Added BotConnectedFrame for SFU transports and on_transport_timing_report event to StartupTimingObserver with bot and client connection timing.
    (PR #3881)

  • Added optional direction parameter to PipelineTask.queue_frame() and PipelineTask.queue_frames(), allowing frames to be pushed upstream from the end of the pipeline.
    (PR #3883)

  • Added on_latency_breakdown event to UserBotLatencyObserver providing per-service TTFB, text aggregation, user turn duration, and function call latency metrics for each user-to-bot response cycle.
    (PR #3885)

  • Added on_first_bot_speech_latency event to UserBotLatencyObserver measuring the time from client connection to first bot speech. An on_latency_breakdown is also emitted for this first speech event.
    (PR #3885)

  • Added broadcast_interruption() to FrameProcessor. This method pushes an InterruptionFrame both upstream and downstream directly from the calling processor, avoiding the round-trip through the pipeline task that push_interruption_task_frame_and_wait() required.
    (PR #3896)

Changed

  • Added text_aggregation_mode parameter to TTSService and all TTS subclasses with a new TextAggregationMode enum (SENTENCE, TOKEN). All text now flows through text aggregators regardless of mode, enabling pattern detection and tag handling in TOKEN mode.
    (PR #3696)

  • ⚠️ Refactored runtime-updatable service settings to use strongly-typed classes (TTSSettings, STTSettings, LLMSettings, and service-specific subclasses) instead of plain dicts. Each service's _settings now holds these strongly-typed objects. For service maintainers, see changes in COMMUNITY_INTEGRATIONS.md.
    (PR #3714)

  • Word timestamp support has been moved from WordTTSService into TTSService via a new supports_word_timestamps parameter. Services that previously extended WordTTSService, AudioContextWordTTSService, or WebsocketWordTTSService now pass supports_word_timestamps=True to their parent __init__ instead.
    (PR #3786)

  • Improved Ultravox TTFB measurement accuracy by using VAD speech end time instead of UserStoppedSpeakingFrame timing.
    (PR #3806)

  • Aligned UltravoxRealtimeLLMService frame handling with OpenAI/Gemini realtime services: added InterruptionFrame handling with metrics cleanup, processing metrics at response boundaries, and improved agent transcript handling for both voice and text output modalities.
    (PR #3806)

  • Updated OpenAIRealtimeLLMService default model to gpt-realtime-1.5.
    (PR #3807)

  • Added api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and KrispVivaFilter for Krisp SDK v1.6.1+ licensing. Falls back to KRISP_VIVA_API_KEY environment variable.
    (PR #3809)

  • Bumped nltk minimum version from 3.9.1 to 3.9.3 to resolve a security vulnerability.
    (PR #3811)

  • ServiceUpdateSettingsFrames are now UninterruptibleFrames. Generally speaking, you don't want a user interruption to prevent a service settings change from taking effect. Note that you usually don't use ServiceUpdateSettingsFrame directly; you use one of its subclasses:

    • LLMUpdateSettingsFrame
    • TTSUpdateSettingsFrame
    • STTUpdateSettingsFrame
      (PR #3819)
  • Updated context summarization to use user role instead of assistant for summary messages.
    (PR #3855)

  • Renamed AssemblyAISTTService parameter min_end_of_turn_silence_when_confident to min_turn_silence (the old name is still supported with a deprecation warning).
    (PR #3856)

  • ⚠️ Renamed LLMAssistantAggregatorParams fields: enable_context_summarization → enable_auto_context_summarization and context_summarization_config → auto_context_summarization_config (now accepts LLMAutoContextSummarizationConfig). The old names still work with a DeprecationWarning for one release cycle.
    (PR #3863)

  • ElevenLabsRealtimeSTTService now sets TranscriptionFrame.finalized to True when using CommitStrategy.MANUAL.
    (PR #3865)

  • Updated numba version pin from ==0.61.2 to >=0.61.2.
    (PR #3868)

  • Updated tracing code to use ServiceSettings dataclass API (given_fields(), attribute access) instead of dict-style access (.items(), in, subscript).
    (PR [...

v0.0.103

21 Feb 00:47
b67af19

Added

  • Added "timestampTransportStrategy": "ASYNC" to InworldAITTSService. This allows timestamp info to trail the arrival of audio chunks, resulting in much better first-audio-chunk latency.
    (PR #3625)

  • Added model-specific InputParams to RimeTTSService: arcana params (repetition_penalty, temperature, top_p) and mistv2 params (no_text_normalization, save_oovs, segment). Model, voice, and param changes now trigger WebSocket reconnection.
    (PR #3642)

  • Added write_transport_frame() hook to BaseOutputTransport allowing transport subclasses to handle custom frame types that flow through the audio queue.
    (PR #3719)

  • Added DailySIPTransferFrame and DailySIPReferFrame to the Daily transport. These frames queue SIP transfer and SIP REFER operations with audio, so the operation executes only after the bot finishes its current utterance.
    (PR #3719)

  • Added keepalive support to SarvamSTTService to prevent idle connection timeouts (e.g. when used behind a ServiceSwitcher).
    (PR #3730)

  • Added UserIdleTimeoutUpdateFrame to enable or disable user idle detection at runtime by updating the timeout dynamically.
    (PR #3748)

  • Added broadcast_sibling_id field to the base Frame class. This field is automatically set by broadcast_frame() and broadcast_frame_instance() to the ID of the paired frame pushed in the opposite direction, allowing receivers to identify broadcast pairs.
    (PR #3774)

  • Added ignored_sources parameter to RTVIObserverParams and add_ignored_source()/remove_ignored_source() methods to RTVIObserver to suppress RTVI messages from specific pipeline processors (e.g. a silent evaluation LLM).
    (PR #3779)

  • Added DeepgramSageMakerTTSService for running Deepgram TTS models deployed on AWS SageMaker endpoints via HTTP/2 bidirectional streaming. Supports the Deepgram TTS protocol (Speak, Flush, Clear, Close), interruption handling, and per-turn TTFB metrics.
    (PR #3785)

Changed

  • ⚠️ RimeTTSService now defaults to model="arcana" and the wss://users-ws.rime.ai/ws3 endpoint. InputParams defaults changed from mistv2-specific values to None — only explicitly-set params are sent as query params.
    (PR #3642)

  • AICFilter now shares read-only AIC models via a singleton AICModelManager
    in aic_filter.py.

    • Multiple filters using the same model path or (model_id, model_download_dir) share one loaded model, with reference counting and concurrent load deduplication.
    • Model file I/O runs off the event loop so the filter does not block.
      (PR #3684)
  • Added X-User-Agent and X-Request-Id headers to InworldTTSService for better traceability.
    (PR #3706)

  • DailyUpdateRemoteParticipantsFrame is no longer deprecated and is now queued with audio like other transport frames.
    (PR #3719)

  • Bumped Pillow dependency upper bound from <12 to <13 to allow Pillow 12.x.
    (PR #3728)

  • Moved STT keepalive mechanism from WebsocketSTTService to the STTService base class, allowing any STT service (not just websocket-based ones) to use idle-connection keepalive via the keepalive_timeout and keepalive_interval parameters.
    (PR #3730)

  • Improved audio context management in AudioContextTTSService by moving context ID tracking to the base class and adding reuse_context_id_within_turn parameter to control concurrent TTS request handling.

    • Added helper methods: has_active_audio_context(), get_active_audio_context_id(), remove_active_audio_context(), reset_active_audio_context()
    • Simplified Cartesia, ElevenLabs, Inworld, Rime, AsyncAI, and Gradium TTS implementations by removing duplicate context management code
      (PR #3732)
  • UserIdleController is now always created with a default timeout of 0 (disabled). The user_idle_timeout parameter changed from Optional[float] = None to float = 0 in UserTurnProcessor, LLMUserAggregatorParams, and UserIdleController.
    (PR #3748)

  • Changed the version specifier from >=0.2.8 to ~=0.2.8 for the speechmatics-voice package to ensure compatibility with future patch versions.
    (PR #3761)

  • Updated InworldTTSService and InworldHttpTTSService to use ASYNC timestamp transport strategy by default
    (PR #3765)

  • Added start_time and end_time parameters to start_ttfb_metrics(), stop_ttfb_metrics(), start_processing_metrics(), and stop_processing_metrics() in FrameProcessor and FrameProcessorMetrics, allowing custom timestamps for metrics measurement. STTService now uses these instead of custom TTFB tracking.
    (PR #3776)

  • Updated default Anthropic model from claude-sonnet-4-5-20250929 to claude-sonnet-4-6.
    (PR #3792)

Deprecated

  • Deprecated unused Traceable, @traceable, @traced, and AttachmentStrategy in pipecat.utils.tracing.class_decorators. This module will be removed in a future release.
    (PR #3733)

Fixed

  • Fixed race condition where RTVIObserver could send messages before DailyTransport join completed. Outbound messages are now queued & delivered after the transport is ready.
    (PR #3615)

  • Fixed async generator cleanup in OpenAI LLM streaming to prevent AttributeError with uvloop on Python 3.12+ (MagicStack/uvloop#699).
    (PR #3698)

  • Fixed SmallWebRTCTransport input audio resampling to properly handle all sample rates, including 8kHz audio.
    (PR #3713)

  • Fixed a race condition in RTVIObserver where bot output messages could be sent before the bot-started-speaking event.
    (PR #3718)

  • Fixed Grok Realtime session.updated event parsing failure caused by the API returning prefixed voice names (e.g. "human_Ara" instead of "Ara").
    (PR #3720)

  • Fixed context ID reuse issue in ElevenLabsTTSService, InworldTTSService, RimeTTSService, CartesiaTTSService, AsyncAITTSService, and PlayHTTTSService. Services now properly reuse the same context ID across multiple run_tts() invocations within a single LLM turn, preventing context tracking issues and incorrect lifecycle signaling.
    (PR #3729)

  • Fixed word timestamp interleaving issue in ElevenLabsTTSService when processing multiple sentences within a single LLM turn.
    (PR #3729)

  • Fixed tracing service decorators executing the wrapped function twice when the function itself raised an exception (e.g., LLM rate limit, TTS timeout).
    (PR #3735)

  • Fixed LLMUserAggregator broadcasting mute events before StartFrame reaches downstream processors.
    (PR #3737)

  • Fixed UserIdleController false idle triggers caused by gaps between user and bot activity frames. The idle timer now starts only after BotStoppedSpeakingFrame and is suppressed during active user turns and function calls.
    (PR #3744)

  • Fixed incorrect sample_rate assignment in TavusInputTransport._on_participant_audio_data (was using audio.audio_frames instead of audio.sample_rate).
    (PR #3768)

  • Fixed RTVIObserver not processing upstream-only frames. Previously, all upstream frames were filtered out to avoid duplicate messages from broadcasted frames. Now only upstream copies of broadcasted frames are skipped.
    (PR #3774)

  • Fixed mutable default arguments in LLMContextAggregatorPair.__init__() that could cause shared state across instances.
    (PR #3782)
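This is the classic Python mutable-default pitfall: the default list is created once at function definition time and shared by every call that omits the argument:

```python
def broken_history(messages=[]):        # one shared list for every call
    messages.append("start")
    return messages

def fixed_history(messages=None):       # fresh list per call
    if messages is None:
        messages = []
    messages.append("start")
    return messages

a, b = broken_history(), broken_history()  # both names alias the same list
c, d = fixed_history(), fixed_history()
```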

  • Fixed DeepgramSageMakerSTTService to properly track finalize lifecycle using request_finalize() / confirm_finalize() and use is_final (instead of is_final and speech_final) for final transcription detection, matching DeepgramSTTService behavior.
    (PR #3784)

  • Fixed a race condition in AudioContextTTSService where the audio context could time out between consecutive TTS requests within the same turn, causing audio to be discarded.
    (PR #3787)

  • Fixed push_interruption_task_frame_and_wait() hanging indefinitely when the InterruptionFrame does not reach the pipeline sink within the timeout. Added a timeout keyword argument to customize the wait duration.
    (PR [#3789](https://github.com...


v0.0.102

11 Feb 02:40
640940a


Added

  • Added ResembleAITTSService for text-to-speech using Resemble AI's streaming WebSocket API with word-level timestamps and jitter buffering for smooth audio playback.
    (PR #3134)

  • Added UserBotLatencyObserver for tracking user-to-bot response latency. When tracing is enabled, latency measurements are automatically recorded as turn.user_bot_latency_seconds attributes on OpenTelemetry turn spans.
    (PR #3355)

  • Added append_to_context parameter to TTSSpeakFrame for conditional LLM context addition.

    • Allows fine-grained control over whether text should be added to conversation context
    • Defaults to True to maintain backward compatibility
      (PR #3584)
  • Added TTS context tracking system with context_id field to trace audio generation through the pipeline.

    • TTSAudioRawFrame, TTSStartedFrame, TTSStoppedFrame now include context_id
    • AggregatedTextFrame and TTSTextFrame now include context_id
    • Enables tracking which TTS request generated specific audio chunks
      (PR #3584)
  • Added support for Inworld TTS Websocket Auto Mode for improved latency
    (PR #3593)

  • Added new frames for context summarization: LLMContextSummaryRequestFrame and LLMContextSummaryResultFrame.
    (PR #3621)

  • Added context summarization feature to automatically compress conversation history when conversation length limits (by token or message count) are reached, enabling efficient long-running conversations.

    • Configure via enable_context_summarization=True in LLMAssistantAggregatorParams
    • Customize behavior with LLMContextSummarizationConfig (max tokens, thresholds, etc.)
    • Automatically preserves incomplete function call sequences during summarization
    • See new examples:
      examples/foundational/54-context-summarization-openai.py and
      examples/foundational/54a-context-summarization-google.py
      (PR #3621)
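
The compression step can be sketched in plain Python. This is an illustrative sketch only, not the Pipecat implementation; the real feature is configured through `LLMAssistantAggregatorParams` and `LLMContextSummarizationConfig`, and the names below (`summarize_history`, `keep_recent`) are hypothetical:

```python
def summarize_history(messages, max_messages, summarize, keep_recent=4):
    """Compress history once it exceeds max_messages (illustrative sketch).

    `summarize` turns the older messages into one summary string; the
    most recent `keep_recent` messages are kept verbatim.
    """
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent

# Example: a trivial "summarizer" that just counts the messages it replaces.
history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compressed = summarize_history(
    history, max_messages=6,
    summarize=lambda old: f"(summary of {len(old)} earlier messages)")
```

In the real feature the summary is produced by the LLM itself, and incomplete function-call sequences are preserved as noted above; both are omitted from this sketch.
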
  • Added RTVI function call lifecycle events (llm-function-call-started, llm-function-call-in-progress, llm-function-call-stopped) with configurable security levels via RTVIObserverParams.function_call_report_level. Supports per-function control over what information is exposed (DISABLED, NONE, NAME, or FULL).
    (PR #3630)

  • Added RequestMetadataFrame and metadata handling for ServiceSwitcher to ensure STT services correctly emit STTMetadataFrame when switching between services. Only the active service's metadata is propagated downstream; switching services triggers the newly active service to re-emit its metadata; and proper frame ordering is maintained at startup.
    (PR #3637)

  • Added STTMetadataFrame to broadcast STT service latency information at pipeline start.

    • STT services broadcast P99 time-to-final-segment (ttfs_p99_latency) to downstream processors
    • Turn stop strategies automatically configure their STT timeout from this metadata
    • Developers can override ttfs_p99_latency via constructor argument for custom deployments
    • Added measured P99 values for STT providers.
    • See stt-benchmark to measure latency for your configuration
      (PR #3637)
  • Added support for is_sandbox parameter in LiveAvatarNewSessionRequest to enable sandbox mode for HeyGen LiveAvatar sessions.
    (PR #3653)

  • Added support for video_settings parameter in LiveAvatarNewSessionRequest to configure video encoding (H264/VP8) and quality levels.
    (PR #3653)

  • Added OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI's Realtime API WebSocket transcription sessions. Supports local VAD and server-side VAD modes, noise reduction, and automatic reconnection.
    (PR #3656)

  • Added bulbul:v3-beta TTS model support for Sarvam AI with temperature control and 25 new speaker voices.
    (PR #3671)

  • Added saaras:v3 STT model support for Sarvam AI with new mode parameter (transcribe, translate, verbatim, translit, codemix) and prompt support.
    (PR #3671)

  • Added new OpenAI TTS voice options marin and cedar.
    (PR #3682)

  • Added UserMuteStartedFrame and UserMuteStoppedFrame system frames, and corresponding user-mute-started / user-mute-stopped RTVI messages, so clients can observe when mute strategies activate or deactivate.
    (PR #3687)

Changed

  • Updated all 30+ TTS service implementations to support context tracking with context_id.

    • Services now generate and propagate context IDs through TTS frames
    • Enables end-to-end tracing of TTS requests through the pipeline
      (PR #3584)
  • ⚠️ TTSService.run_tts() now requires a context_id parameter for context tracking.

    • Custom TTS service implementations must update their run_tts() signature
    • Before: async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    • After: async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
      (PR #3584)
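
A minimal migration sketch for custom services follows. The `Frame` class below is a stand-in for illustration, not a real Pipecat frame type:

```python
import asyncio
from typing import AsyncGenerator

class Frame:
    """Stand-in for illustration; real services yield TTSAudioRawFrame etc."""
    def __init__(self, text: str, context_id: str):
        self.text = text
        self.context_id = context_id

class MyTTSService:
    # Before 0.0.102: async def run_tts(self, text: str) -> ...
    # After: context_id must be accepted and attached to emitted frames.
    async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
        yield Frame(text, context_id)

async def _demo():
    svc = MyTTSService()
    return [f async for f in svc.run_tts("hello", context_id="ctx-1")]

frames = asyncio.run(_demo())
```
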
  • Simplified context aggregators to use frame.append_to_context flag instead of tracking internal state.

    • Cleaner logic in LLMResponseAggregator and LLMResponseUniversalAggregator
    • More consistent behavior across aggregator implementations
      (PR #3584)
  • Updated timestamps to be cumulative within an agent turn, using the flushCompleted message as an indication of when server timestamps reset to 0.
    (PR #3593)

  • Changed KokoroTTSService to use kokoro-onnx instead of kokoro as the underlying TTS engine.
    (PR #3612)

  • Improved user turn stop timing in TranscriptionUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy.

    • Timeout now starts on VADUserStoppedSpeakingFrame for tighter, more predictable timing
    • Added support for finalized transcripts (TranscriptionFrame.finalized=True) to trigger earlier
    • Added fallback timeout for edge cases where transcripts arrive without VAD events
    • Removed InterimTranscriptionFrame handling (no longer affects timing)
      (PR #3637)
  • Improved the accuracy of the UserBotLatencyObserver and UserBotLatencyLogObserver by measuring from the time when the user actually starts speaking.
    (PR #3637)

  • ⚠️ Renamed timeout parameter to user_speech_timeout in TranscriptionUserTurnStopStrategy.
    (PR #3637)

  • Updated the VADUserStartedSpeakingFrame to include start_secs and timestamp and VADUserStoppedSpeakingFrame to include stop_secs and timestamp, removing the need to separately handle the SpeechControlParamsFrame for VADParams values.
    (PR #3637)

  • ⚠️ Renamed TranscriptionUserTurnStopStrategy to SpeechTimeoutUserTurnStopStrategy. The old name is deprecated and will be removed in a future release.
    (PR #3637)

  • AssemblyAISTTService now automatically configures optimal settings for manual turn detection when vad_force_turn_endpoint=True. This sets end_of_turn_confidence_threshold=1.0 and max_turn_silence=2000 by default, which disables model-based turn detection and reduces latency by relying on external VAD for turn endpoints. Warnings are logged if conflicting settings are detected.
    (PR #3644)

  • Upgraded the pipecat-ai-small-webrtc-prebuilt package to v2.1.0.
    (PR #3652)

  • Changed default session mode from "CUSTOM" to "LITE" in HeyGen LiveAvatar integration, with VP8 as the default video encoding.
    (PR #3653)

  • ⚠️ The default VADParams stop_secs value is changing from 0.8 seconds to 0.2 seconds. This change both simplifies the developer experience and improves the performance of STT services. With a shorter stop_secs value, STT services using a local VAD can finalize sooner, resulting in faster transcription.

    • SpeechTimeoutUserTurnStopStrategy: control how long to wait for additional user speech using user_speech_timeout (default: 0.6 sec).
    • TurnAnalyzerUserTurnStopStrategy: the turn analyzer automatically adjusts the user wait time based on the audio input.
      (PR #3659)
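
The interaction between the shorter VAD stop and the stop strategy's own wait can be sketched as follows. This is a hedged illustration; `turn_should_stop` is a hypothetical helper, not Pipecat API:

```python
def turn_should_stop(now, vad_stopped_at, user_speech_timeout=0.6):
    """Core check of a speech-timeout stop strategy (illustrative sketch).

    With stop_secs=0.2 the VAD reports the stop quickly; the strategy
    then waits user_speech_timeout for more speech before ending the turn.
    """
    if vad_stopped_at is None:  # user is still speaking
        return False
    return (now - vad_stopped_at) >= user_speech_timeout

still_waiting = turn_should_stop(now=1.0, vad_stopped_at=0.7)  # ~0.3s elapsed
turn_over = turn_should_stop(now=1.0, vad_stopped_at=0.3)      # ~0.7s elapsed
```
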
  • Moved interruption wait event from per-processor instance state to InterruptionFrame itself. Added InterruptionFrame.complete() to signal when the interruption has fully traversed the pipeline. Custom processors that block or consume an InterruptionFrame before it reaches the pipeline sink must call frame.complete() to avoid stalling `push_interruption_...


v0.0.101

31 Jan 07:01
7853e5c


Added

  • Additions for AICFilter and AICVADAnalyzer:

    • Added model downloading support to AICFilter with model_id and model_download_dir parameters.
    • Added model_path parameter to AICFilter for loading local .aicmodel files.
    • Added unit tests for AICFilter and AICVADAnalyzer.
      (PR #3408)
  • Added handling for server_content.interrupted signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.
    (PR #3429)

  • Added new GenesysFrameSerializer for the Genesys AudioHook WebSocket protocol, enabling bidirectional audio streaming between Pipecat pipelines and Genesys Cloud contact center.
    (PR #3500)

  • Added reached_upstream_types and reached_downstream_types read-only properties to PipelineTask for inspecting current frame filters.
    (PR #3510)

  • Added add_reached_upstream_filter() and add_reached_downstream_filter() methods to PipelineTask for appending frame types.
    (PR #3510)

  • Added UserTurnCompletionLLMServiceMixin for LLM services to detect and filter incomplete user turns. When enabled via filter_incomplete_user_turns in LLMUserAggregatorParams, the LLM outputs a turn completion marker at the start of each response: ✓ (complete), ○ (incomplete short), or ◐ (incomplete long). Incomplete turns are suppressed, and configurable timeouts automatically re-prompt the user.
    (PR #3518)
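
The marker scheme can be illustrated with a small parser. This is a hedged sketch; `parse_turn_marker` is hypothetical and the real filtering lives inside the mixin:

```python
def parse_turn_marker(response: str):
    """Split a response into (status, text) per the scheme described above.

    ✓ = complete turn, ○ = incomplete (short), ◐ = incomplete (long).
    Illustrative only; not the actual mixin implementation.
    """
    markers = {"✓": "complete", "○": "incomplete_short", "◐": "incomplete_long"}
    if response and response[0] in markers:
        return markers[response[0]], response[1:].lstrip()
    return "unknown", response

status, text = parse_turn_marker("✓ Sure, I can help with that.")
```
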

  • Added FrameProcessor.broadcast_frame_instance(frame) method to broadcast a frame instance by extracting its fields and creating new instances for each direction.
    (PR #3519)

  • PipelineTask now automatically adds RTVIProcessor and registers RTVIObserver when enable_rtvi=True (default), simplifying pipeline setup.
    (PR #3519)

  • Added RTVIProcessor.create_rtvi_observer() factory method for creating RTVI observers.
    (PR #3519)

  • Added video_out_codec parameter to TransportParams allowing configuration of the preferred video codec (e.g., "VP8", "H264", "H265") for video output in DailyTransport.
    (PR #3520)

  • Added location parameter to Google TTS services (GoogleHttpTTSService, GoogleTTSService, GeminiTTSService) for regional endpoint support.
    (PR #3523)

  • Added new PIPECAT_SMART_TURN_LOG_DATA environment variable, which causes Smart Turn input data to be saved to disk
    (PR #3525)

  • Added result_callback parameter to UserImageRequestFrame to support deferred function call results.
    (PR #3571)

  • Added function_call_timeout_secs parameter to LLMService to configure timeout for deferred function calls (defaults to 10.0 seconds).
    (PR #3571)

  • Added vad_analyzer parameter to LLMUserAggregatorParams. VAD analysis is now handled inside the LLMUserAggregator rather than in the transport, keeping voice activity detection closer to where it is consumed. The vad_analyzer on BaseInputTransport is now deprecated.

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    (PR #3583)

  • Added VADProcessor for detecting speech in audio streams within a pipeline. Pushes VADUserStartedSpeakingFrame, VADUserStoppedSpeakingFrame, and UserSpeakingFrame downstream based on VAD state changes.
    (PR #3583)

  • Added VADController for managing voice activity detection state and emitting speech events independently of transport or pipeline processors.
    (PR #3583)

  • Added local PiperTTSService for offline text-to-speech using Piper voice models. The existing HTTP-based service has been renamed to PiperHttpTTSService.
    (PR #3585)

  • main() in pipecat.runner.run now accepts an optional argparse.ArgumentParser, allowing bots to define custom CLI arguments accessible via runner_args.cli_args.
    (PR #3590)
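
The parsing side uses the standard argparse API; a sketch (the --persona and --max-turns arguments are made up for illustration):

```python
import argparse

# The bot defines its own parser and, per this release, passes it to
# pipecat.runner.run.main(); parsed values then appear on
# runner_args.cli_args. Only the argparse side is shown here.
parser = argparse.ArgumentParser(description="my bot")
parser.add_argument("--persona", default="helpful")
parser.add_argument("--max-turns", type=int, default=10)

# In the runner this would come from sys.argv; parse a sample list here.
cli_args = parser.parse_args(["--persona", "pirate"])
```
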

  • Added KokoroTTSService for local text-to-speech synthesis using the Kokoro-82M model.
    (PR #3595)

Changed

  • Updated AICFilter and AICVADAnalyzer to use aic-sdk ~= 2.0.1.
    (PR #3408)

  • Improved the STT TTFB (Time To First Byte) measurement, reporting the delay between when the user stops speaking and when the final transcription is received. Note: unlike traditional TTFB, which measures from a discrete request, STT services receive continuous audio input, so we measure from speech end to final transcript; this captures the latency that matters for voice AI applications. In support of this change, added a finalized field to TranscriptionFrame to indicate when a transcript is the final result for an utterance.
    (PR #3495)
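
In other words, the reported value is simply the gap between the end of speech and the finalized transcript; a sketch (`stt_ttfb` is a hypothetical helper, not Pipecat API):

```python
def stt_ttfb(speech_end_time, final_transcript_time):
    """Latency from end of speech to the finalized transcript, in seconds.

    This is the metric described above: not time-from-request (the audio
    stream is continuous), but the delay a voice app actually experiences.
    Clamped at zero for transcripts that finalize before speech ends.
    """
    return max(0.0, final_transcript_time - speech_end_time)

latency = stt_ttfb(speech_end_time=12.40, final_transcript_time=12.95)
```
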

  • SarvamSTTService now defaults vad_signals and high_vad_sensitivity to None (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults.
    (PR #3495)

  • Changed frame filter storage from tuples to sets in PipelineTask.
    (PR #3510)

  • Changed default Inworld TTS model from inworld-tts-1 to inworld-tts-1.5-max.
    (PR #3531)

  • FrameSerializer now subclasses from BaseObject to enable event support.
    (PR #3560)

  • Added support for TTFS in SpeechmaticsSTTService and set the default mode to EXTERNAL to support Pipecat-controlled VAD.

    • Changed dependency to speechmatics-voice[smart]>=0.2.8
      (PR #3562)
  • ⚠️ Changed function call handling to use timeout-based completion instead of immediate callback execution.

    • Function calls that defer their results (e.g., UserImageRequestFrame) now use a timeout mechanism
    • The result_callback is invoked automatically when the deferred operation completes or after timeout
    • This change affects examples using UserImageRequestFrame - the result_callback should now be passed to the frame instead of being called immediately
      (PR #3571)
  • Pipecat runner now uses DAILY_ROOM_URL instead of DAILY_SAMPLE_ROOM_URL.
    (PR #3582)

  • Updates to GradiumSTTService:

    • Now flushes pending transcriptions when VAD detects the user stopped speaking, improving response latency.
    • GradiumSTTService now supports InputParams for configuring language and delay_in_frames settings.
      (PR #3587)

Deprecated

  • ⚠️ Deprecated vad_analyzer parameter on BaseInputTransport. Pass vad_analyzer to LLMUserAggregatorParams instead or use VADProcessor in the pipeline.
    (PR #3583)

Removed

  • Removed deprecated AICFilter parameters: enhancement_level, voice_gain, noise_gate_enable.
    (PR #3408)

Fixed

  • Fixed an issue where OpenRouterLLMService with a Gemini model wouldn't handle multiple "system" messages as expected (and as GoogleLLMService does), which is to convert subsequent ones into "user" messages. Instead, the latest "system" message would overwrite the previous ones.
    (PR #3406)

  • Transports now properly broadcast InputTransportMessageFrame frames both upstream and downstream instead of only pushing downstream.
    (PR #3519)

  • Fixed FrameProcessor.broadcast_frame() to deep copy kwargs, preventing shared mutable references between the downstream and upstream frame instances.
    (PR #3519)

  • Fixed OpenAI LLM services to emit ErrorFrame on completion timeout, enabling proper error handling and LLMSwitcher failover.
    (PR #3529)

  • Fixed a logging issue where non-ASCII characters (e.g., Japanese, Chinese, etc.) were being unnecessarily escaped to Unicode sequences when a function call occurred.
    (PR #3536)

  • Fixed how audio tracks are synchronized inside the AudioBufferProcessor to fix timing issues where silence and audio were misaligned between user and bot buffers.
    (PR #3541)

  • Fixed race condition in OpenAIRealtimeBetaLLMService that could cause an error when truncating the conversation....


v0.0.100

21 Jan 03:37
768d395


Added

  • Added Hathora service to support Hathora-hosted TTS and STT models (only non-streaming)
    (PR #3169)

  • Added CambTTSService, using Camb.ai's TTS integration with MARS models (mars-flash, mars-pro, mars-instruct) for high-quality text-to-speech synthesis.
    (PR #3349)

  • Added the additional_headers param to WebsocketClientParams, allowing WebsocketClientTransport to send custom headers on connect, for cases such as authentication.
    (PR #3461)

  • Added UserIdleController for detecting user idle state, integrated into LLMUserAggregator and UserTurnProcessor via optional user_idle_timeout parameter. Emits on_user_turn_idle event for application-level handling. Deprecated UserIdleProcessor in favor of the new compositional approach.
    (PR #3482)

  • Added on_user_mute_started and on_user_mute_stopped event handlers to LLMUserAggregator for tracking user mute state changes.
    (PR #3490)

Changed

  • Enhanced interruption handling in AsyncAITTSService by supporting multi-context WebSocket sessions for more robust context management.
    (PR #3287)

  • Throttle UserSpeakingFrame to broadcast at most every 200ms instead of on every audio chunk, reducing frame processing overhead during user speech.
    (PR #3483)
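
The throttling pattern is straightforward; a sketch (the `Throttle` class is illustrative, not Pipecat API):

```python
class Throttle:
    """Emit at most once per `interval` seconds, like the 200ms
    UserSpeakingFrame throttle described above (illustrative sketch)."""
    def __init__(self, interval: float):
        self.interval = interval
        self.last = None

    def should_emit(self, now: float) -> bool:
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False

t = Throttle(0.2)
emitted = [t.should_emit(now) for now in (0.0, 0.05, 0.1, 0.2, 0.25, 0.41)]
```
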

Deprecated

  • For consistency with other package names, we deprecated pipecat.turns.mute (introduced in Pipecat 0.0.99) in favor of pipecat.turns.user_mute.
    (PR #3479)

Fixed

  • Corrected TTFB metric calculation in AsyncAIHttpTTSService.
    (PR #3287)

  • Fixed an issue where the "bot-llm-text" RTVI event would not fire for realtime (speech-to-speech) services:

    • AWSNovaSonicLLMService
    • GeminiLiveLLMService
    • OpenAIRealtimeLLMService
    • GrokRealtimeLLMService

    The issue was that these services weren't pushing LLMTextFrames. Now they do.
    (PR #3446)

  • Fixed an issue where on_user_turn_stop_timeout could fire while a user is talking when using ExternalUserTurnStrategies.
    (PR #3454)

  • Fixed an issue where user turn start strategies were not being reset after a user turn started, causing incorrect strategy behavior.
    (PR #3455)

  • Fixed MinWordsUserTurnStartStrategy to not aggregate transcriptions, preventing incorrect turn starts when words are spoken with pauses between them.
    (PR #3462)

  • Fixed an issue where Grok Realtime would error out when running with SmallWebRTC transport.
    (PR #3480)

  • Fixed a Mem0MemoryService issue where passing async_mode: true was causing an error. See https://docs.mem0.ai/platform/features/async-mode-default-change.
    (PR #3484)

  • Fixed AWSNovaSonicLLMService.reset_conversation(), which would previously error out. Now it successfully reconnects and "rehydrates" from the context object.
    (PR #3486)

  • Fixed AzureTTSService transcript formatting issues:

    • Punctuation now appears without extra spaces (e.g., "Hello!" instead of "Hello !")
    • CJK languages (Chinese, Japanese, Korean) no longer have unwanted spaces between characters
      (PR #3489)
  • Fixed an issue where UninterruptibleFrame frames would not be preserved in some cases.
    (PR #3494)

  • Fixed memory leak in LiveKitTransport when video_in_enabled is False.
    (PR #3499)

  • Fixed an issue in AIService where unhandled exceptions in start(), stop(), or cancel() implementations would prevent process_frame() from continuing, and therefore StartFrame, EndFrame, or CancelFrame from being pushed downstream, causing the pipeline to not start or stop properly.
    (PR #3503)

  • Moved NVIDIATTSService and NVIDIASTTService client initialization from constructor to start() for better error handling.
    (PR #3504)

  • Optimized NVIDIATTSService to process incoming audio frames immediately.
    (PR #3509)

  • Optimized NVIDIASTTService by removing unnecessary queue and task.
    (PR #3509)

  • Fixed a CambTTSService issue where client was being initialized in the constructor which wouldn't allow for proper Pipeline error handling.
    (PR #3511)

v0.0.99

14 Jan 01:19
86ed485


Added

  • Introducing user turn strategies. User turn strategies indicate when the user turn starts or stops. In conversational agents, these are often referred to as start/stop speaking or turn-taking plans or policies.

    User turn start strategies indicate when the user starts speaking (e.g. using VAD events or when a user says one or more words).

    User turn stop strategies indicate when the user stops speaking (e.g. using an end-of-turn detection model or by observing incoming transcriptions).

    A list of strategies can be specified for both start and stop; strategies are evaluated in order until one evaluates to true.

    Available user turn start strategies:
    - VADUserTurnStartStrategy
    - TranscriptionUserTurnStartStrategy
    - MinWordsUserTurnStartStrategy
    - ExternalUserTurnStartStrategy

    Available user turn stop strategies:
    - TranscriptionUserTurnStopStrategy
    - TurnAnalyzerUserTurnStopStrategy
    - ExternalUserTurnStopStrategy

    The default strategies are:
    - start: [VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]
    - stop: [TranscriptionUserTurnStopStrategy]

    Turn strategies are configured when setting up LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                stop=[
                    TurnAnalyzerUserTurnStopStrategy(
                      turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
                    )
                ],
            )
        ),
    )

    In order to use the user turn strategies you must update to the new universal LLMContext and LLMContextAggregatorPair. (PR #3045)
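
The "evaluate in order until one fires" rule can be sketched in plain Python. The callables below are illustrative stand-ins for the strategy classes, not Pipecat API:

```python
def evaluate_strategies(strategies, frame):
    """Evaluate turn strategies in order until one fires, as described above.

    Each strategy is any callable frame -> bool here; the real classes
    (VADUserTurnStartStrategy, etc.) encapsulate this logic. Sketch only.
    """
    for strategy in strategies:
        if strategy(frame):
            return True
    return False

# Illustrative stand-ins for VAD- and transcription-based checks.
vad_says_speech = lambda f: f.get("vad_speaking", False)
has_transcription = lambda f: bool(f.get("transcript"))

started = evaluate_strategies(
    [vad_says_speech, has_transcription], {"transcript": "hello"})
```
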

  • Added RNNoiseFilter for real-time noise suppression using RNNoise neural network via pyrnnoise library. (PR #3205)

  • Added GrokRealtimeLLMService for xAI's Grok Voice Agent API with real-time voice conversations:

    • Support for real-time audio streaming with WebSocket connection
    • Built-in server-side VAD (Voice Activity Detection)
    • Multiple voice options: Ara, Rex, Sal, Eve, Leo
    • Built-in tools support: web_search, x_search, file_search
    • Custom function calling with standard Pipecat tools schema
    • Configurable audio formats (PCM at 8kHz-48kHz)
      (PR #3267)
  • Added an approximation of TTFB for Ultravox.
    (PR #3268)

  • Added a new AudioContextTTSService to the TTS service base classes. The AudioContextWordTTSService now inherits from AudioContextTTSService and WebsocketWordTTSService. (PR #3289)

  • LLMUserAggregator now exposes the following events:

    • on_user_turn_started: triggered when a user turn starts
    • on_user_turn_stopped: triggered when a user turn ends
    • on_user_turn_stop_timeout: triggered when a user turn does not stop and times out
      (PR #3291)
  • Introducing user mute strategies. User mute strategies indicate when user input should be muted based on the current system state.

    In conversational agents, user mute strategies are used to prevent user input from interrupting bot speech, tool execution, or other critical system operations.

    A list of strategies can be specified; all strategies are evaluated for every frame so that each strategy can maintain its internal state. A user frame is muted if any of the configured strategies indicates it should be muted.

    Available user mute strategies:

    • FirstSpeechUserMuteStrategy
    • MuteUntilFirstBotCompleteUserMuteStrategy
    • AlwaysUserMuteStrategy
    • FunctionCallUserMuteStrategy

    User mute strategies replace the legacy STTMuteFilter and provide a more flexible and composable approach to muting user input.

    User mute strategies are configured when setting up the LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_mute_strategies=[
                FirstSpeechUserMuteStrategy(),
            ]
        ),
    )

    In order to use user mute strategies you should update to the new universal LLMContext and LLMContextAggregatorPair.
    (PR #3292)
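
The evaluation rule, every strategy sees every frame so it can keep state, and input is muted if any strategy says so, can be sketched as follows. The class and helper below are illustrative, not the real FirstSpeechUserMuteStrategy:

```python
class FirstSpeechMute:
    """Sketch of a stateful mute strategy: mute user input only until the
    bot has finished speaking for the first time (illustrative only)."""
    def __init__(self):
        self.bot_finished_once = False

    def should_mute(self, event: str) -> bool:
        if event == "bot_stopped_speaking":
            self.bot_finished_once = True
        return not self.bot_finished_once

def is_muted(strategies, event):
    # Every strategy sees every frame (so it can update internal state);
    # the input is muted if ANY strategy says so.
    results = [s.should_mute(event) for s in strategies]
    return any(results)

strategies = [FirstSpeechMute()]
before = is_muted(strategies, "user_audio")           # bot hasn't spoken yet
during = is_muted(strategies, "bot_stopped_speaking") # bot just finished
after = is_muted(strategies, "user_audio")
```
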

  • Added use_ssl parameter to NvidiaSTTService, NvidiaSegmentedSTTService and NvidiaTTSService.
    (PR #3300)

  • Added an enable_interruptions constructor argument to all user turn strategies. This tells the LLMUserAggregator whether or not to push an InterruptionFrame.
    (PR #3316)

  • Added split_sentences parameter to SpeechmaticsSTTService to control sentence splitting behavior for finals on sentence boundaries.
    (PR #3328)

  • Added word-level timestamp support to AzureTTSService for accurate text-to-audio synchronization.
    (PR #3334)

  • Added pronunciation_dict_id parameter to CartesiaTTSService.InputParams and CartesiaHttpTTSService.InputParams to support Cartesia's pronunciation dictionary feature for custom pronunciations.
    (PR #3346)

  • Added support for using the HeyGen LiveAvatar API with the HeyGenTransport (see https://www.liveavatar.com/).
    (PR #3357)

  • Added image support to OpenAIRealtimeLLMService via InputImageRawFrame:

    • New start_video_paused parameter to control initial video input state
    • New video_frame_detail parameter to set image processing quality ("auto", "low", or "high"). This corresponds to OpenAI Realtime's image_detail parameter.
    • set_video_input_paused() method to pause/resume video input at runtime
    • set_video_frame_detail() method to adjust video frame quality dynamically
    • Automatic rate limiting (1 frame per second) to prevent API overload
      (PR #3360)
  • Added UserTurnProcessor, a frame processor built on UserTurnController that pushes UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames and interruptions based on the controller's user turn strategies.
    (PR #3372)

  • Added UserTurnController to manage user turns. It emits on_user_turn_started, on_user_turn_stopped, and on_user_turn_stop_timeout events, and can be integrated into processors to detect and handle user turns. LLMUserAggregator and UserTurnProcessor are implemented using this controller.
    (PR #3372)

  • Added should_interrupt property to DeepgramFluxSTTService, DeepgramSTTService, and SpeechmaticsSTTService to configure whether the bot should be interrupted when the external service detects user speech.
    (PR #3374)

  • LLMAssistantAggregator now exposes the following events:

    • on_assistant_turn_started: triggered when the assistant turn starts
    • on_assistant_turn_stopped: triggered when the assistant turn ends
    • on_assistant_thought: triggered when there's an assistant thought available
      (PR #3385)
  • Added KrispVivaTurn analyzer for end of turn detection using the Krisp VIVA SDK (requires krisp_audio).
    (PR #3391)

  • Added support for setting up a pipeline task from external files. You can now register custom pipeline task setup files by setting the PIPECAT_SETUP_FILES environment variable. This variable should contain a colon-separated list of Python files (e.g. export PIPECAT_SETUP_FILES="setup1.py:setup.py:..."). Each file must define a function with the following signature:

    async def setup_pipeline_task(task: PipelineTask):
        ...

    (PR #3397)
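
Splitting the variable is plain stdlib work; a sketch (loading and executing the files is omitted, and the filenames are examples):

```python
import os

# PIPECAT_SETUP_FILES is a colon-separated list of Python files, each
# defining `async def setup_pipeline_task(task): ...` as described above.
os.environ["PIPECAT_SETUP_FILES"] = "setup1.py:setup2.py"
setup_files = [
    p for p in os.environ.get("PIPECAT_SETUP_FILES", "").split(":") if p
]
```
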

  • Added a keepalive task for InworldTTSService to keep the service connected in the event of no generations for longer periods of time.
    (PR #3403)

  • Added enable_vad to Params for use in the GladiaSTTService. When enabled, GladiaSTTService acts as the turn controller, emitting UserStartedSpeakingFrame, UserStoppedSpeakingFrame, and optionally InterruptionFrame.
    (PR #3404)

  • Added should_interrupt property to GladiaSTTService to configure whether the bot should be interrupted when the external service detects user speech.
    (PR #3404)

  • Added VonageFrameSerializer for the Vonage Video API Audio Connector WebSocket protocol.
    (PR #3410)

  • Added append_trailing_space parameter to TTSService to automatically append a trailing space to text before sending to TTS, helping prevent some services from vocalizing trailing punctuation.
    (PR #3424)

Changed

  • Updated ElevenLabsRealtimeSTTService to accept the include_language_detection parameter to detect language.

      stt = ElevenLabsRealtimeSTTService(
          api_key=os.getenv("...

v0.0.98

17 Dec 19:31
f9fef78


Added

  • Added RimeNonJsonTTSService which supports non-JSON streaming mode. This new class supports websocket streaming for the Arcana model.
    (PR #3085)

  • Added additional functionality related to "thinking", for Google and Anthropic LLMs.

    1. New typed parameters for Google and Anthropic LLMs that control the models' thinking behavior (like how much thinking to do, and whether to output thoughts or thought summaries):
      • AnthropicLLMService.ThinkingConfig
      • GoogleLLMService.ThinkingConfig
    2. New frames for representing thoughts output by LLMs:
      • LLMThoughtStartFrame
      • LLMThoughtTextFrame
      • LLMThoughtEndFrame
    3. A generic mechanism for recording LLM thoughts to context, used specifically to support Anthropic, whose thought signatures are expected to appear alongside the text of the thoughts within assistant context messages. See:
      • LLMThoughtEndFrame.signature
      • LLMAssistantAggregator handling of the above field
      • AnthropicLLMAdapter handling of "thought" context messages
    4. Google-specific logic for inserting thought signatures into the context, to help maintain thinking continuity in a chain of LLM calls. See:
      • GoogleLLMService sending LLMMessagesAppendFrames to add LLM-specific "thought_signature" messages to context
      • GeminiLLMAdapter handling of "thought_signature" messages
    5. An expansion of TranscriptProcessor to process LLM thoughts in addition to user and assistant utterances. See:
      • TranscriptProcessor(process_thoughts=True) (defaults to False)
      • ThoughtTranscriptionMessage, which is now also emitted with the
        "on_transcript_update" event
        (PR #3175)
  • Data and control frames can now be marked as non-interruptible by using the UninterruptibleFrame mixin. Frames marked as UninterruptibleFrame will not be interrupted during processing, and any queued frames of this type will be retained in the internal queues. This is useful when you need ordered frames (data or control) that should not be discarded or cancelled due to interruptions.
    (PR #3189)
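
The retention behavior can be sketched with a marker class and a filter. All names below are illustrative stand-ins, not the Pipecat frame classes:

```python
class UninterruptibleFrame:
    """Marker mixin, as described above (illustrative stand-in)."""

class OrderedDataFrame(UninterruptibleFrame):
    def __init__(self, payload):
        self.payload = payload

class AudioFrame:
    def __init__(self, payload):
        self.payload = payload

def handle_interruption(queue):
    # On interruption, discard queued frames EXCEPT those marked
    # uninterruptible, which are retained in order. Sketch only.
    return [f for f in queue if isinstance(f, UninterruptibleFrame)]

queue = [AudioFrame(b"a"), OrderedDataFrame("result-1"), AudioFrame(b"b")]
kept = handle_interruption(queue)
```
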

  • Added on_conversation_detected event to VoicemailDetector.
    (PR #3207)

  • Added x-goog-api-client header with Pipecat's version to all Google services' requests.
    (PR #3208)

  • Added support for the HeyGen LiveAvatar API (see https://www.liveavatar.com/).
    (PR #3210)

  • Added functionality to AWSNovaSonicLLMService related to the new (and now default) Nova 2 Sonic model ("amazon.nova-2-sonic-v1:0"):

    • Added the endpointing_sensitivity parameter to control how quickly the model decides the user has stopped speaking.
    • Made the assistant-response-trigger hack a no-op. It's only needed for the older Nova Sonic model.
      (PR #3212)
  • Ultravox Realtime is now a supported speech-to-speech service.

    • Added UltravoxRealtimeLLMService for the integration.
    • Added 49-ultravox-realtime.py example (with tool calling).
      (PR #3227)
  • Added Daily PSTN dial-in support to the development runner with --dialin flag. This includes:

    • /daily-dialin-webhook endpoint that handles incoming Daily PSTN webhooks
    • Automatic Daily room creation with SIP configuration
    • DialinSettings and DailyDialinRequest types in pipecat.runner.types for type-safe dial-in data
    • The runner now mimics Pipecat Cloud's dial-in webhook handling for local development
      (PR #3235)
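A rough sketch of what the webhook endpoint does with an incoming payload; the payload keys and DialinSettings fields below are assumptions for illustration, not Pipecat's exact runner types:

```python
from dataclasses import dataclass

@dataclass
class DialinSettings:
    """Hypothetical shape of the type-safe dial-in data."""
    call_id: str
    call_domain: str

def parse_dialin_webhook(payload: dict) -> DialinSettings:
    """Extract the fields a bot needs to answer an incoming PSTN call."""
    return DialinSettings(
        call_id=payload["callId"],
        call_domain=payload["callDomain"],
    )

settings = parse_dialin_webhook({"callId": "abc-123", "callDomain": "example"})
print(settings.call_id)  # → abc-123
```

In the real runner, the parsed settings would then drive Daily room creation with SIP configuration before the bot joins.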
  • Added the Gladia session ID to logs for GladiaSTTService.
    (PR #3236)

  • Added InworldHttpTTSService which uses Inworld's HTTP based TTS service in either streaming or non-streaming mode. Note: This class was previously named InworldTTSService.
    (PR #3239)

  • Added language_hints_strict parameter to SonioxSTTService that strictly enforces language hints. This ensures that transcription occurs in the specified language.
    (PR #3245)

  • Added Pipecat library version info to the about field in the bot-ready RTVI message.
    (PR #3248)

  • Added VisionFullResponseStartFrame, VisionFullResponseEndFrame and VisionTextFrame. These are used by vision services, similar to the corresponding LLM frames.
    (PR #3252)

Changed

  • FunctionCallInProgressFrame and FunctionCallResultFrame have changed from system frames to a control frame and a data frame, respectively, and are now both marked as UninterruptibleFrame.
    (PR #3189)

  • UserBotLatencyLogObserver now uses VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame to determine latency from user stopped speaking to bot started speaking.
    (PR #3206)

  • Updated HeyGenVideoService and HeyGenTransport to support both HeyGen APIs (Interactive Avatar and Live Avatar).

    Using them is as simple as specifying the service_type when creating the HeyGenVideoService and the HeyGenTransport:

    heygen = HeyGenVideoService(
        api_key=os.getenv("HEYGEN_LIVE_AVATAR_API_KEY"),
        service_type=ServiceType.LIVE_AVATAR,
        session=session,
    )

    (PR #3210)

  • Made "amazon.nova-2-sonic-v1:0" the new default model for AWSNovaSonicLLMService.
    (PR #3212)

  • Updated the run_inference methods in the LLM service classes (AnthropicLLMService, AWSBedrockLLMService, GoogleLLMService, and OpenAILLMService and its base classes) to use the provided LLM configuration parameters.
    (PR #3214)

  • Updated default models for:

    • GeminiLiveLLMService to gemini-2.5-flash-native-audio-preview-12-2025.
    • GeminiLiveVertexLLMService to gemini-live-2.5-flash-native-audio.
      (PR #3228)
  • Changed the reason field in EndFrame, CancelFrame, EndTaskFrame, and CancelTaskFrame from str to Any to indicate that it can hold values other than strings.
    (PR #3231)

  • Updated websocket STT services to use the WebsocketSTTService base class. This base class manages the websocket connection and handles reconnects.

    Updated services:

    • AssemblyAISTTService
    • AWSTranscribeSTTService
    • GladiaSTTService
    • SonioxSTTService
      (PR #3236)
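A generic reconnect-with-backoff loop, in the spirit of what a websocket base class like WebsocketSTTService might do; the connect() callable, delays, and retry count here are illustrative, not Pipecat's actual implementation:

```python
import asyncio

async def connect_with_backoff(connect, max_attempts=5, base_delay=0.5):
    """Retry connect(), doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(delay)
            delay *= 2

# Demo: a connection that fails twice, then succeeds.
attempts = 0

async def flaky_connect():
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient")
    return "connected"

result = asyncio.run(connect_with_backoff(flaky_connect, base_delay=0.01))
print(result)  # → connected
```

Centralizing this loop in a base class means each STT service only has to implement the protocol-specific send/receive logic.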
  • Changed Inworld's TTS service implementations:

    • Previously, the HTTP implementation was named InworldTTSService. That has been moved to InworldHttpTTSService. This service now supports word-timestamp alignment data in both streaming and non-streaming modes.
    • Updated the InworldTTSService class to use Inworld's Websocket API. This class now has support for word-timestamp alignment data and tracks contexts for each user turn.
      (PR #3239)
  • ⚠️ Breaking change: WordTTSService.start_word_timestamps() and WordTTSService.reset_word_timestamps() are now async.
    (PR #3240)

  • Updated the current RTVI version to 1.1.0 to reflect recent additions and deprecations.

    • New RTVI Messages: send-text and bot-output
    • Deprecated Messages: append-to-context and bot-transcription
      (PR #3248)
  • MoondreamService now pushes VisionFullResponseStartFrame, VisionFullResponseEndFrame and VisionTextFrame.
    (PR #3252)

Deprecated

  • FalSmartTurnAnalyzer and LocalSmartTurnAnalyzer are deprecated and will be removed in a future version. Use LocalSmartTurnAnalyzerV3 instead.
    (PR #3219)

Removed

  • Removed the deprecated VLLM-based open source Ultravox STT service.
    (PR #3227)

Fixed

  • Fixed a bug in AWSNovaSonicLLMService where we would mishandle cancelled tool calls in the context, resulting in errors.
    (PR #3212)

  • Better support conversation history with Gemini 2.5 Flash Image (model "gemini-2.5-flash-image"). Prior to this fix, the model had no memory of previous images it had generated, so it wouldn't be able to iterate on them.
    (PR #3224)

  • Support conversations with Gemini 3 Pro Image (model "gemini-3-pro-image-preview"). Prior to this fix, after the model generated an image the conversation would not be able to progress.
    (PR #3224)

  • Fixed an issue where ElevenLabsHttpTTSService was not updating voice settings when receiving a TTSUpdateSettingsFrame.
    (PR #3226)

  • Fixed the return type of the SmallWebRTCRequestHandler.handle_web_request() function.
    (PR #3230)

  • Fixed a bug in LLM context audio content handling
    ...

Read more

v0.0.97

05 Dec 23:59
4cefe13

Choose a tag to compare

Added

  • Added new Gradium services, GradiumSTTService and GradiumTTSService, for speech-to-text and text-to-speech functionality using Gradium's API.

  • Additions for AsyncAITTSService and AsyncAIHttpTTSService:

    • Added new languages: pt, nl, ar, ru, ro, ja, he, hy, tr, hi, zh.
    • Updated the default model to asyncflow_multilingual_v1.0 for improved accuracy and broader language coverage.
  • Added optional tool and tool output filters for MCP services.

Changed

  • Updated Deepgram logging to include Deepgram request IDs for improved debugging.

  • Text Aggregation Improvements:

    • Breaking Change: BaseTextAggregator.aggregate() now returns AsyncIterator[Aggregation] instead of Optional[Aggregation]. This enables the aggregator to return multiple results based on the provided text.
    • Refactored text aggregators to use inheritance: SkipTagsAggregator and PatternPairAggregator now inherit from SimpleTextAggregator, reusing the base class's sentence detection logic.
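The new aggregate() contract can be sketched as an async generator that yields zero or more aggregations per chunk of incoming text. The sentence splitting below is deliberately naive (a bare "." split); Pipecat's real aggregators are more careful with punctuation:

```python
import asyncio
from typing import AsyncIterator

class SketchAggregator:
    """Illustrative aggregator following the AsyncIterator contract."""

    def __init__(self):
        self._buffer = ""

    async def aggregate(self, text: str) -> AsyncIterator[str]:
        self._buffer += text
        # Yield every complete sentence accumulated so far.
        while "." in self._buffer:
            sentence, _, rest = self._buffer.partition(".")
            self._buffer = rest
            yield sentence.strip() + "."

async def main():
    agg = SketchAggregator()
    results = []
    async for s in agg.aggregate("Hello there. How are"):
        results.append(s)
    async for s in agg.aggregate(" you today."):
        results.append(s)
    return results

results = asyncio.run(main())
print(results)  # → ['Hello there.', 'How are you today.']
```

Compared with the old Optional[Aggregation] return, one streamed chunk containing two sentence endings can now produce two results instead of one.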
  • Improved interruption handling to prevent bots from repeating themselves. LLM services that return multiple sentences in a single response (e.g., GoogleLLMService) are now split into individual sentences before being sent to TTS. This ensures interruptions occur at sentence boundaries, preventing the bot from repeating content after being interrupted during long responses.

  • Updated AICFilter to use Quail STT as the default model (AICModelType.QUAIL_STT). Quail STT is optimized for human-to-machine interaction (e.g., voice agents, speech-to-text) and operates at a native sample rate of 16 kHz with fixed enhancement parameters.

  • If an unexpected exception is caught, or if FrameProcessor.push_error() is called with an exception, the file name and line number where the exception occurred are now logged.

  • Updated Smart Turn model weights to v3.1.

  • Smart Turn analyzer now uses the full context of the turn rather than just the audio since VAD last triggered.

  • Updated CartesiaSTTService to return the full transcription result in the TranscriptionFrame and InterimTranscriptionFrame. This provides access to word timestamp data.

  • Added tracking headers (X-Hume-Client-Name and X-Hume-Client-Version) to all requests made by HumeTTSService to the Hume API for better usage tracking and analytics.

    • Added stop() and cancel() cleanup methods to HumeTTSService to properly close the HTTP client and prevent resource leaks.

Deprecated

  • NVIDIA Services name changes (all functionality is unchanged):

    • NimLLMService is now deprecated, use NvidiaLLMService instead.
    • RivaSTTService is now deprecated, use NvidiaSTTService instead.
    • RivaTTSService is now deprecated, use NvidiaTTSService instead.
    • Use uv pip install pipecat-ai[nvidia] instead of uv pip install pipecat-ai[riva]
  • The noise_gate_enable parameter in AICFilter is deprecated and no longer has any effect. Noise gating is now handled automatically by the AIC VAD system. Use AICFilter.create_vad_analyzer() for VAD functionality instead.

  • Package pipecat.sync is deprecated, use pipecat.utils.sync instead.

Fixed

  • Fixed bug in PatternPairAggregator where pattern handlers could be called multiple times for KEEP or AGGREGATE patterns.

  • Fixed sentence aggregation to correctly handle ambiguous punctuation in streaming text, such as currency ("$29.95") and abbreviations ("Mr. Smith").
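The kind of boundary check this fix implies can be sketched as follows; the specific rules (decimal points, a small abbreviation list) are assumptions for illustration and may differ from what Pipecat actually applies:

```python
import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Ms."}

def is_sentence_boundary(text: str, i: int) -> bool:
    """True if the '.' at index i ends a sentence."""
    if text[i] != ".":
        return False
    # Decimal point: digits on both sides (e.g. "$29.95").
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    # Abbreviation: the word ending at this period is a known title.
    word = re.search(r"\S+$", text[: i + 1])
    if word and word.group() in ABBREVIATIONS:
        return False
    return True

s = "Mr. Smith paid $29.95 today."
boundaries = [i for i, c in enumerate(s) if c == "." and is_sentence_boundary(s, i)]
print(boundaries)  # only the final period is a real sentence boundary
```

With checks like these, a streaming aggregator can hold "Mr." and "$29." in its buffer instead of flushing them to TTS as spurious sentence ends.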

  • Fixed an issue in AWSTranscribeSTTService where the region argument was always set to us-east-1 even when an AWS_REGION env var was provided.

  • Fixed an issue in SarvamTTSService where the last sentence was not being spoken. Now, audio is flushed when the TTS service receives the LLMFullResponseEndFrame or EndFrame.

  • Fixed an issue in DeepgramTTSService where a TTSStoppedFrame was incorrectly pushed after a function call. This caused an issue with the voice-ui-kit's conversational panel rendering of the LLM output after a function call.

  • Fixed an issue where LLMTextFrame.skip_tts was being overwritten by LLM services.

  • Fixed an issue that caused WebsocketService instances to attempt reconnection during shutdown.

  • Fixed an issue in ElevenLabsTTSService where character usage metrics were only reported on the first TTS generation per turn.