Commit ceb1b5f

Merge pull request #2 from NVIDIA/develop
nvidia-pipecat 0.3.0 release changes
2 parents d746b90 + fe0906f commit ceb1b5f

144 files changed: +33342 / -5959 lines


.gitignore

Lines changed: 6 additions & 0 deletions

```diff
@@ -28,3 +28,9 @@ docs/build/
 
 # Ignore .DS_Store
 .DS_Store
+
+*.log
+nim_cache/
+riva_cache/
+llm_cache/
+audio_dumps/
```

CHANGELOG.md

Lines changed: 25 additions & 0 deletions

```diff
@@ -3,6 +3,31 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.3.0] - 2025-11-07
+
+### Added
+- WebRTC-based voice agent example and custom UI
+- NeMo Agent Toolkit integration and voice agent example with agentic AI
+- Scripts for latency and throughput performance benchmarking of voice agents
+- Support for dynamic LLM prompt ingestion and TTS voice selection via the WebRTC UI
+- Full-Duplex-Bench evaluation inference client script
+- BlingFireTextAggregator for the TTS service
+- Steps for LLM deployment with KV cache support
+
+### Changed
+- Updated pipecat to version 0.0.85
+- Renamed the GitHub repository to voice-agent-examples
+- Switched to the Magpie TTS Multilingual model
+- Pinned NIM version tags in the examples
+
+### Fixed
+- Fixed user transcription and docker compose volume issues
+- Split long TTS sentences to handle the Riva TTS character-limit error
+
+### Removed
+- Removed Animation and Audio2Face support
+- Removed ACE naming references
+
 ## [0.2.0] - 2025-06-17
 
 ### Added
```

NVIDIA_PIPECAT.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,5 +1,5 @@
 # NVIDIA Pipecat
 
-The NVIDIA Pipecat library augments [the Pipecat framework](https://github.com/pipecat-ai/pipecat) by adding additional frame processors and services, as well as new multimodal frames to facilitate the creation of human-avatar interactions. This includes the integration of NVIDIA services and NIMs such as [NVIDIA Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html), [NVIDIA Audio2Face](https://build.nvidia.com/nvidia/audio2face-3d), and [NVIDIA Foundational RAG](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline). It also introduces a few processors with a focus on improving the end-user experience for multimodal conversational agents, along with speculative speech processing to reduce latency for faster bot responses.
+The NVIDIA Pipecat library augments [the Pipecat framework](https://github.com/pipecat-ai/pipecat) by adding additional frame processors and NVIDIA services. This includes the integration of NVIDIA services and NIMs such as [Riva ASR](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr), [Riva TTS](https://build.nvidia.com/nvidia/magpie-tts-multilingual), [LLM NIMs](https://build.nvidia.com/models), [NAT (NeMo Agent Toolkit)](https://github.com/NVIDIA/NeMo-Agent-Toolkit), and [Foundational RAG](https://github.com/NVIDIA-AI-Blueprints/rag). It also introduces a few processors with a focus on improving the end-user experience for multimodal conversational agents, along with speculative speech processing to reduce latency for faster bot responses.
 
-The nvidia-pipecat source code can be found in [the GitHub repository](https://github.com/NVIDIA/ace-controller). Follow [the documentation](https://docs.nvidia.com/ace/ace-controller-microservice/latest/index.html) for more details.
+The nvidia-pipecat source code can be found in [the GitHub repository](https://github.com/NVIDIA/voice-agent-examples).
```

README.md

Lines changed: 24 additions & 13 deletions

````diff
@@ -1,42 +1,44 @@
-# ACE Controller SDK
+# Riva Voice Agent Examples
 
-The ACE Controller SDK allows you to build your own ACE Controller service to manage multimodal, real-time interactions with voice bots and avatars using NVIDIA ACE. With the SDK, you can create controllers that leverage the Python-based open-source [Pipecat framework](https://github.com/pipecat-ai/pipecat) for creating real-time, voice-enabled, and multimodal conversational AI agents. The SDK contains enhancements to the Pipecat framework, enabling developers to effortlessly customize, debug, and deploy complex pipelines while integrating robust NVIDIA Services into the Pipecat ecosystem.
+This repository contains examples demonstrating how to build voice-enabled conversational AI agents with NVIDIA services, built using [the Pipecat framework](https://github.com/pipecat-ai/pipecat). The examples cover a range of implementation patterns, from simple LLM-based conversations to complex agentic workflows, and from WebSocket-based solutions to advanced WebRTC implementations with real-time capabilities.
 
-## Main Features
+## Examples Overview
 
-- **Pipecat Extension:** A Pipecat extension to connect with ACE services and NVIDIA NIMs, facilitating the creation of human-avatar interactions. The NVIDIA Pipecat library augments [the Pipecat framework](https://github.com/pipecat-ai/pipecat) by adding additional frame processors and services, as well as new multimodal frames to enhance avatar interactions. This includes the integration of NVIDIA services and NIMs such as [NVIDIA Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html), [NVIDIA Audio2Face](https://build.nvidia.com/nvidia/audio2face-3d), and [NVIDIA Foundational RAG](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline).
+- **[Voice Agent WebSocket](examples/voice_agent_websocket/)**: A simple voice assistant pipeline using WebSocket-based transport. This example demonstrates integration with the NVIDIA LLM service and the Riva ASR and TTS NIMs.
+- **[Voice Agent WebRTC](examples/voice_agent_webrtc/)**: A more advanced voice agent using WebRTC transport with real-time transcripts, dynamic prompt configuration, and TTS voice selection via the UI.
+- **[NAT Agent (NeMo Agent Toolkit)](examples/nat_agent/)**: An end-to-end intelligent voice assistant powered by the NeMo Agent Toolkit. The ReWOO agent uses a planning-based approach for efficient task decomposition and execution, with custom tools for menu browsing, pricing, and cart management.
 
-- **HTTP and WebSocket Server Implementation:** The SDK provides a FastAPI-based HTTP and WebSocket server implementation compatible with ACE. It includes functionality for stream and pipeline management by offering new Pipecat pipeline runners and transports. For ease of use and distribution, this functionality is currently included in the `nvidia-pipecat` Python library as well.
+We recommend starting with the Voice Agent WebSocket example for a simple introduction, then progressing to the WebRTC-based examples for production use cases. More details can be found in the [examples README.md](examples/README.md).
 
-## ACE Controller Microservice
+## NVIDIA Pipecat
 
-The ACE Controller SDK was used to build the [ACE Controller Microservice](https://docs.nvidia.com/ace/ace-controller-microservice/latest/index.html). Check out the [ACE documentation](https://docs.nvidia.com/ace/tokkio/latest/customization/customization-options.html) for more details on how to configure the ACE Controller MS with your custom pipelines.
+The NVIDIA Pipecat library augments [the Pipecat framework](https://github.com/pipecat-ai/pipecat) by adding additional frame processors and NVIDIA services. This includes the integration of NVIDIA services and NIMs such as [Riva ASR](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr), [Riva TTS](https://build.nvidia.com/nvidia/magpie-tts-multilingual), [LLM NIMs](https://build.nvidia.com/models), [NAT (NeMo Agent Toolkit)](https://github.com/NVIDIA/NeMo-Agent-Toolkit), and [Foundational RAG](https://github.com/NVIDIA-AI-Blueprints/rag). It also introduces a few processors with a focus on improving the end-user experience for multimodal conversational agents, along with speculative speech processing to reduce latency for faster bot responses.
 
 
-## Getting Started
+### Getting Started
 
 The NVIDIA Pipecat package is released as a wheel on PyPI. Create a Python virtual environment and use the pip command to install the nvidia-pipecat package.
 
 ```bash
 pip install nvidia-pipecat
 ```
 
-You can start building pipecat pipelines utilizing services from the NVIDIA Pipecat package. For more details, follow [the ACE Controller](https://docs.nvidia.com/ace/ace-controller-microservice/latest/index.html) and [the Pipecat Framework](https://docs.pipecat.ai/getting-started/overview) documentation.
+You can start building pipecat pipelines utilizing services from the NVIDIA Pipecat package.
 
-## Hacking on the framework itself
+### Hacking on the framework itself
 
 If you wish to work directly with the source code or modify services from the nvidia-pipecat package, you can utilize either the UV or Nix development setup as outlined below.
 
-### Using UV
+#### Using UV
 
 
 To get started, first install the [UV package manager](https://docs.astral.sh/uv/#highlights).
 
 Then, create a virtual environment with all the required dependencies by running the following commands:
 ```bash
 uv venv
-uv sync
 source .venv/bin/activate
+uv sync
 ```
 
 Once the environment is set up, you can begin building pipelines or modifying the services in the source code.
@@ -59,7 +61,7 @@ ruff check
 ```
 
 
-### Using Nix
+#### Using Nix
 
 To set up your development environment using [Nix](https://nixos.org/download/#nix-install-linux), follow these steps:
 
@@ -76,6 +78,15 @@ To ensure that all checks such as the formatting and linter for the repository a
 nix flake check
 ```
 
+## Documentation
+
+The project documentation includes:
+
+- **[Voice Agent Examples](./examples/README.md)** - Voice agent examples built using Pipecat and NVIDIA services
+- **[NVIDIA Pipecat](./docs/NVIDIA_PIPECAT.md)** - Custom Pipecat processors implemented for NVIDIA services
+- **[Best Practices](./docs/BEST_PRACTICES.md)** - Performance optimization guidelines and production deployment strategies
+- **[Speculative Speech Processing](./docs/SPECULATIVE_SPEECH_PROCESSING.md)** - Advanced speech processing techniques for reducing latency
+
 ## CONTRIBUTING
 
 We invite contributions! Open a GitHub issue or pull request! See the contributing guidelines [here](./CONTRIBUTING.md).
````

docs/BEST_PRACTICES.md

Lines changed: 259 additions & 0 deletions

# Voice Agent Best Practices

Building production-grade voice agents requires careful consideration of multiple dimensions: technical performance, user experience, security, and operational excellence. This guide consolidates best practices and lessons learned from deploying voice agents at scale.

---

## Key Success Metrics

- **Latency**: Time from the end of user speech to the start of the bot response (target: 600-1500 ms)
- **Accuracy**: ASR word error rate (WER), factual correctness, LLM generation quality, etc.
- **Availability**: System uptime and fault tolerance (target: 99.9%+)
- **User Satisfaction**: Task completion rate and user feedback scores

---

## 1. Modular and Event-Driven Pipeline Design

Structure your voice agent as a composable pipeline of independent components:

```
Audio Input → VAD → ASR → Agent → TTS → Audio Output
```

Implement event-driven patterns for:
- Real-time transcription updates
- Intermediate processing results
- System health events
- User interaction events

Use async/await patterns so that no stage blocks the others.

**Benefits:**
- Easy to test and scale individual components
- Swap providers without full rewrites
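
As a minimal illustration of this composable, event-driven structure, the sketch below wires independent stages together with asyncio queues; the ASR/agent/TTS handlers are stubs standing in for real services (in this repository, the same structure is what a Pipecat pipeline of frame processors provides).

```python
# Framework-agnostic sketch of a composable, event-driven pipeline:
# each stage is an independent async worker connected by queues, so stages
# can be tested in isolation or swapped without touching the rest.
import asyncio


async def stage(name, handler, inbox: asyncio.Queue, outbox: asyncio.Queue | None):
    while (item := await inbox.get()) is not None:   # None signals end of stream
        result = await handler(item)
        print(f"[{name}] {result!r}")
        if outbox is not None:
            await outbox.put(result)
    if outbox is not None:
        await outbox.put(None)


# Stub handlers standing in for real VAD/ASR/LLM/TTS services.
async def asr(audio: bytes) -> str:
    return "what is on the menu"

async def agent(text: str) -> str:
    return "We have pizza and pasta today."

async def tts(text: str) -> bytes:
    return b"<synthesized audio>"


async def main():
    q_audio, q_text, q_reply = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(stage("asr", asr, q_audio, q_text)),
        asyncio.create_task(stage("agent", agent, q_text, q_reply)),
        asyncio.create_task(stage("tts", tts, q_reply, None)),
    ]
    await q_audio.put(b"<user speech frames>")
    await q_audio.put(None)   # end of stream
    await asyncio.gather(*workers)


asyncio.run(main())
```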

---

## 2. Optimizing Pipeline Latency

To optimize latency, first measure end-to-end and per-component latency. Voice agent latency comes from multiple pipeline components, and understanding each contributor enables targeted optimization. A minimal measurement sketch is shown below; the contributors themselves are broken down in the subsections that follow.
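
The sketch below is one simple, framework-agnostic way to record per-stage and end-to-end latency; the stub stages are placeholders for real ASR/LLM/TTS client calls.

```python
# Minimal per-stage latency instrumentation (framework-agnostic sketch).
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record wall-clock latency in milliseconds for one pipeline stage."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage] = (time.monotonic() - start) * 1000.0


def run_asr(audio: bytes) -> str:
    time.sleep(0.08)   # simulated ASR call
    return "what's on the menu"

def run_llm(text: str) -> str:
    time.sleep(0.40)   # simulated LLM call
    return "We have pizza and pasta today."

def run_tts(text: str) -> bytes:
    time.sleep(0.18)   # simulated time to first audio chunk
    return b"\x00" * 3200


with timed("asr"):
    transcript = run_asr(b"\x00" * 16000)
with timed("llm"):
    reply = run_llm(transcript)
with timed("tts_first_chunk"):
    _ = run_tts(reply)

timings["e2e"] = sum(timings.values())
print({k: round(v, 1) for k, v in timings.items()})
```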

### 2.1 Audio Processing Latency

**Voice Activity Detection (VAD):**
- **Contribution**: 200-500 ms (end-of-speech detection)
- **Optimization**:
  - Use streaming VAD with shorter silence thresholds (see the sketch below)
  - Explore shorter EOU detection with Riva ASR and open-source smart turn-detection models
  - Implement adaptive VAD sensitivity based on environment noise
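
As one way to shorten the silence threshold, the sketch below configures pipecat's Silero VAD with a smaller stop window; the parameter values are illustrative starting points, not recommended settings.

```python
# Tighter end-of-speech detection by shortening the VAD stop window
# (assumes pipecat's Silero VAD API; values are illustrative).
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.7,   # speech-probability threshold
        start_secs=0.2,   # speech must persist this long before "speaking" starts
        stop_secs=0.5,    # shorter silence window -> earlier end of speech, lower latency
        min_volume=0.6,
    )
)
# Pass `vad_analyzer` to the transport's audio-input parameters when building the
# pipeline; too small a stop window clips slow speakers, so tune against real traffic.
```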

**Audio Buffering:**
- **Contribution**: 50-200 ms (network buffering, codec processing)
- **Optimization**:
  - Use lower-latency audio codecs (Opus at 20 ms frames)
  - Minimize audio buffer sizes while maintaining quality
  - Implement jitter buffers to absorb network variation

### 2.2 ASR (Automatic Speech Recognition) Latency

**Model Processing:**
- **Contribution**: 50-100 ms for Riva ASR
- **Optimization**:
  - Prefer deploying the Riva ASR NIM locally
  - Utilize the latest GPU hardware and optimized models
  - Maintain consistent latency when handling multiple concurrent requests
  - Use streaming ASR with interim results for early processing

### 2.3 Language Model (LLM) Processing Latency

**Model Inference:**
- **Contribution**: 200-800 ms depending on model size and complexity
- **Optimization**:
  - **Model Selection**: Use smaller, faster models (8B vs. 70B parameters)
  - **TensorRT-LLM Optimization**: Use TensorRT-LLM-optimized NIM deployments
  - **Quantization**: INT8/FP16 models for 2-3x speedup
  - **KV-Cache Optimization**: Enable KV caching for lower TTFB and tune it for your use case

**Context Management:**
- **Contribution**: 50-200 ms for large contexts
- **Optimization**:
  - Implement context truncation strategies (a sketch follows below)
  - Enable KV caching with an adequate cache size
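
A minimal sketch of one truncation strategy: keep the system prompt plus the most recent turns within a token budget. The token estimate here is a crude word-count proxy; use the model's tokenizer for real budgets.

```python
# Keep the system prompt plus the most recent turns within a token budget.
def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)   # rough proxy; use the model tokenizer in practice


def truncate_context(messages: list[dict], max_tokens: int = 2048) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):            # walk newest turns first
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a concise voice assistant."},
    {"role": "user", "content": "What's on the menu today?"},
    {"role": "assistant", "content": "We have pizza and pasta."},
    {"role": "user", "content": "How much is the pasta?"},
]
print(truncate_context(history, max_tokens=64))
```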

### 2.4 TTS (Text-to-Speech) Latency

**Synthesis Time:**
- **Contribution**: 150-300 ms for the first audio chunk
- **Optimization**:
  - **Streaming TTS**: Start playback before full synthesis
  - **Local Riva TTS**: 150-200 ms with the TRT-optimized Magpie model
  - **Chunked Generation**: Process sentences as they're generated

**Audio Post-processing:**
- **Contribution**: 50-100 ms (normalization, encoding)
- **Optimization**:
  - Minimize the audio post-processing pipeline
  - Use hardware-accelerated audio codecs

### 2.5 Network and Infrastructure Latency

- **Geographic Distribution**: Use distributed multi-node deployments based on user demographics
- **Load Balancing**: Use sticky sessions to avoid context switching
- **Monitoring**: Monitor key latency metrics in the production deployment

### 2.6 Advanced Latency Reduction Techniques

**Speculative Speech Processing:**
- Process interim ASR transcripts before speech ends
- Pre-generate likely responses during user speech
- **Potential Savings**: 200-400 ms reduction in perceived latency
- For more details, see the [speculative speech processing docs](SPECULATIVE_SPEECH_PROCESSING.md)

**Filler Words or Intermediate Responses:**
- Generate or use random filler words to reduce perceived latency (see the sketch below)
- For high-latency agents or reasoning models, generate intermediate responses based on function calls or thinking tokens
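
A minimal sketch of the filler-word idea: speak a short acknowledgement immediately while the slower agent response is still being generated. `speak` and `run_agent` are stand-ins for pushing text to the TTS service and calling the LLM or agent.

```python
# Speak a short filler immediately, then the real answer once the agent finishes.
import asyncio
import random

FILLERS = ["One moment.", "Let me check.", "Sure, give me a second."]


async def speak(text: str) -> None:
    print(f"[TTS] {text}")        # stand-in for pushing text to the TTS service

async def run_agent(query: str) -> str:
    await asyncio.sleep(1.2)      # simulated slow agent / reasoning model
    return "Your order total is $18.50."


async def respond(query: str) -> None:
    agent_task = asyncio.create_task(run_agent(query))
    await speak(random.choice(FILLERS))   # perceived latency drops immediately
    await speak(await agent_task)         # real answer once it is ready


asyncio.run(respond("How much is my order?"))
```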

---

## 3. Designing User Experience

### 3.1 Conversation Design Principles

**Natural Turn-Taking:**
- Allow interruptions (barge-in)
- Implement proper silence handling
- Use conversational markers ("um", "let me check")

**Progressive Disclosure:**

```python
# Don't overwhelm the user with options.
# Bad:
bad = ("You can check balance, transfer funds, pay bills, view history, "
       "update profile, set alerts, or lock your card. What would you like?")

# Good:
good = "What would you like to do today?"
# (Let the user guide the conversation; offer suggestions if they seem confused.)
```

### 3.2 Persona & Tone Consistency

**Define Agent Personality:**
- Professional vs. casual
- Proactive vs. reactive
- Verbose vs. concise
- Empathetic vs. neutral

**Maintain Consistency:**
- Document persona guidelines
- Use system prompts for LLMs
- Implement tone checkers
- Conduct regular quality reviews

### 3.3 Voice Selection

**Considerations:**
- Match the voice to your brand and use case
- Consider user demographics
- Regional accent preferences
- Gender neutrality options
- Custom IPA dictionaries for mispronunciations

**Quality Metrics:**
- Naturalness (MOS score > 4.0)
- Prosody and intonation
- Emotional expressiveness
- Consistency across sessions

### 3.4 Response Optimization for Voice

**Voice-Specific Adaptations:**
- Keep responses concise (1-3 sentences per turn)
- Use conversational language (contractions, simple words)
- Structure information hierarchically
- Avoid lists with more than 3-4 items
- Use explicit transitions

### 3.5 Prompt Design

**System Prompt Instructions:**
- Include persona and tone guidelines directly in the system prompt for consistency
- Provide clear instructions to avoid outputting formatting (bullet points, markdown, URLs) that doesn't translate to voice
- Define conversation boundaries and scope to keep interactions focused and prevent rambling
- Include examples of ideal voice responses in the prompt for few-shot guidance
- Add instructions for progressive disclosure of options and context-aware suggestions (an example prompt follows below)
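
Putting those guidelines together, here is a hypothetical system prompt for a voice agent; the persona, scope, and example exchange are placeholders to adapt to your own use case.

```python
# A hypothetical system prompt illustrating the guidelines above.
SYSTEM_PROMPT = """\
You are Ava, a friendly and concise voice assistant for Acme Bank's phone line.

Style:
- Speak in short, natural sentences (one to three per turn) and use contractions.
- Never use bullet points, markdown, URLs, or long digit-by-digit numbers.

Scope:
- Only help with balances, transfers, and card management.
- If asked about anything else, say you can't help with that and offer what you can do.

Progressive disclosure:
- Start with an open question such as "What would you like to do today?".
- Offer at most two or three options at a time, based on what the user just said.

Example:
User: I want to move some money.
Assistant: Sure. Which account would you like to transfer from, checking or savings?
"""
```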

### 3.6 ASR Transcript Quality

- Implement custom vocabulary boosting for domain terms (see the sketch below)
- Use inverse text normalization (ITN) for proper formatting
- Make sure the user's audio quality is good
- Avoid resampling if possible
- Riva ASR models are robust to noise, so skip extra noise processing
- Base critical decisions on final transcripts only
- Fine-tune the ASR model on domain data if needed
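
As one way to boost domain terms, the sketch below uses the Riva Python client's word-boosting helper; the vocabulary and boost score are placeholders, and the exact integration point in nvidia-pipecat may differ.

```python
# Domain-term boosting with the Riva Python client (nvidia-riva-client).
import riva.client

config = riva.client.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,
)
# Boost domain terms so they win over acoustically similar words.
riva.client.add_word_boosting_to_config(
    config,
    boosted_lm_words=["margherita", "Pipecat", "NIM"],
    boosted_lm_score=20.0,
)
streaming_config = riva.client.StreamingRecognitionConfig(config=config, interim_results=True)
# Pass `streaming_config` (or the equivalent parameters of the nvidia-pipecat
# Riva ASR service) to the streaming recognize call.
```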

### 3.7 User-Facing Error Handling

**Error Categories:**

```python
ERROR_MESSAGES = {
    "asr_failure": "I didn't catch that. Could you say that again?",
    "service_unavailable": "I'm having trouble connecting. Let me try again.",
    "timeout": "This is taking longer than expected. Please hold on.",
    "out_of_scope": "I'm not able to help with that, but I can help you with...",
}
```

**Recovery Strategies:**
- Offer alternative input methods (DTMF, transfer to a human)
- Provide clear next steps
- Terminate the conversation gracefully when recovery fails

### 3.8 Continuous Testing

- Implement unit and integration testing
- Run load tests to find latency bottlenecks
- Prepare test data covering different conversation scenarios
- Use A/B testing to improve the user experience

---

## 4. Scalability & Performance

### 4.1 Horizontal Scaling

**Stateless Services:**
- Deploy ASR/TTS behind load balancers
- Use container orchestration (Kubernetes)
- Auto-scale based on CPU/memory/queue depth

**Stateful Services:**
- Use sticky sessions
- Use distributed session storage (Redis); see the sketch below
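
A minimal sketch of distributed session storage with Redis, so any replica can resume a conversation; the key naming and TTL are illustrative choices.

```python
# Store per-session conversation state in Redis so any replica can resume it.
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 1800  # drop idle sessions after 30 minutes


def save_session(session_id: str, state: dict) -> None:
    r.set(f"voice_session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)


def load_session(session_id: str) -> dict:
    raw = r.get(f"voice_session:{session_id}")
    return json.loads(raw) if raw else {"messages": [], "cart": []}


# Usage inside a pipeline worker:
state = load_session("abc-123")
state["messages"].append({"role": "user", "content": "Add a margherita pizza."})
save_session("abc-123", state)
```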

### 4.2 Resource Optimization

**Model Optimization:**
- Quantization (FP16, INT8) and TRT optimization for inference
- Select smaller models for a lower footprint
- Batch inference where possible
- GPU sharing and multiplexing

### 4.3 Network Optimization

**WebRTC Best Practices:**
- Use TURN servers for NAT traversal
- Implement adaptive bitrate
- Support multiple codecs (Opus preferred)
- Handle network transitions (WiFi to cellular)

---

## Conclusion

Building production voice agents requires a holistic approach balancing technical performance, user experience, and operational excellence. Key takeaways:

1. **Design for Latency**: Every millisecond counts in conversational AI
2. **Handle Errors Gracefully**: Users should never feel lost
3. **Monitor Everything**: You can't improve what you don't measure
4. **Test Thoroughly**: Automated testing catches issues before users do
5. **Iterate Based on Data**: Use real user feedback to improve
6. **Plan for Scale**: Design for 10x your current load
7. **Prioritize Security**: Protect user data as your top responsibility
