# Voice Agent Best Practices

Building production-grade voice agents requires careful consideration of multiple dimensions: technical performance, user experience, security, and operational excellence. This guide consolidates best practices and lessons learned from deploying voice agents at scale.

---
## Key Success Metrics

- **Latency**: Time from user speech end to bot response start (target: 600-1500ms)
- **Accuracy**: ASR word error rate (WER), factual correctness, and LLM generation quality
- **Availability**: System uptime and fault tolerance (target: 99.9%+)
- **User Satisfaction**: Task completion rate and user feedback scores

---

## 1. Modular and Event-Driven Pipeline Design

Structure your voice agent as a composable pipeline of independent components:

```
Audio Input → VAD → ASR → Agent → TTS → Audio Output
```

Implement event-driven patterns for:
- Real-time transcription updates
- Intermediate processing results
- System health events
- User interaction events

Use async/await patterns to keep each stage non-blocking.

**Benefits:**
- Easy to test and scale individual components
- Swap providers without full rewrites
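The pipeline above can be sketched as independent async stages joined by queues. This is a minimal illustration using Python's `asyncio`; the stage handlers are placeholders, not a specific framework's API — swapping a provider means swapping one handler:

```python
import asyncio

async def stage(name, handler, inbox, outbox):
    """Generic pipeline stage: consume events, process, emit downstream."""
    while True:
        event = await inbox.get()
        if event is None:          # sentinel: propagate shutdown downstream
            await outbox.put(None)
            break
        await outbox.put(handler(event))

async def run_pipeline(audio_frames):
    # One queue between each pair of stages keeps them independently testable.
    q_audio, q_text, q_reply, q_out = (asyncio.Queue() for _ in range(4))

    # Placeholder handlers; wire in real ASR / agent / TTS providers here.
    asr = lambda frame: f"transcript({frame})"
    agent = lambda text: f"reply({text})"
    tts = lambda reply: f"audio({reply})"

    tasks = [
        asyncio.create_task(stage("asr", asr, q_audio, q_text)),
        asyncio.create_task(stage("agent", agent, q_text, q_reply)),
        asyncio.create_task(stage("tts", tts, q_reply, q_out)),
    ]
    for frame in audio_frames:
        await q_audio.put(frame)
    await q_audio.put(None)

    outputs = []
    while (item := await q_out.get()) is not None:
        outputs.append(item)
    await asyncio.gather(*tasks)
    return outputs
```

Because each stage only sees its inbox and outbox, you can unit-test a stage in isolation or scale it independently of the rest of the pipeline.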

---

## 2. Optimizing Pipeline Latency

To optimize latency, first measure end-to-end and per-component latency. Voice agent latency accumulates across multiple pipeline components, and understanding each contributor enables targeted optimization:
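A lightweight timing helper makes per-component measurement concrete. This is a sketch; the component names are illustrative and the `sleep` calls stand in for real pipeline stages:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyTracker:
    """Accumulates per-component wall-clock timings for a conversation turn."""
    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time in milliseconds for this component.
            self.samples[component].append((time.perf_counter() - start) * 1000)

    def report(self):
        # Per-component means plus their sum as an end-to-end estimate, in ms.
        means = {c: sum(v) / len(v) for c, v in self.samples.items()}
        means["e2e"] = sum(means.values())
        return means

tracker = LatencyTracker()
with tracker.measure("asr"):
    time.sleep(0.01)   # stand-in for the real ASR call
with tracker.measure("llm"):
    time.sleep(0.02)   # stand-in for the real LLM call
print(tracker.report())
```

Wrapping every stage call this way in production gives you the component-wise breakdown the sections below optimize against.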

### 2.1 Audio Processing Latency

**Voice Activity Detection (VAD):**
- **Contribution**: 200-500ms (end-of-speech detection)
- **Optimization**:
  - Use streaming VAD with shorter silence thresholds
  - Explore shorter EOU detection with Riva ASR and open-source smart turn detection models
  - Implement adaptive VAD sensitivity based on environment noise

**Audio Buffering:**
- **Contribution**: 50-200ms (network buffering, codec processing)
- **Optimization**:
  - Use lower-latency audio codecs (Opus at 20ms frames)
  - Minimize audio buffer sizes while maintaining quality
  - Implement jitter buffers for network variations

### 2.2 ASR (Automatic Speech Recognition) Latency

**Model Processing:**
- **Contribution**: 50-100ms for Riva ASR
- **Optimization**:
  - Prefer deploying Riva ASR NIM locally
  - Utilize latest GPU hardware and optimized models
  - Maintain consistent latency when handling multiple concurrent requests
  - Use streaming ASR with interim results for early processing

### 2.3 Language Model (LLM) Processing Latency

**Model Inference:**
- **Contribution**: 200-800ms depending on model size and complexity
- **Optimization**:
  - **Model Selection**: Use smaller, faster models (8B vs. 70B parameters)
  - **TRT-LLM Optimization**: Use TRT-LLM optimized NIM deployments
  - **Quantization**: INT8/FP16 models for 2-3x speedup
  - **KV-Cache Optimization**: Enable KV caching for lower TTFB and tune it for your use case

**Context Management:**
- **Contribution**: 50-200ms for large contexts
- **Optimization**:
  - Implement context truncation strategies
  - Enable KV caching with adequate cache size
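A simple truncation strategy keeps the system prompt plus the newest turns that fit a token budget. A minimal sketch, assuming OpenAI-style message dicts and a rough 4-characters-per-token estimate in place of a real tokenizer:

```python
def truncate_context(messages, max_tokens=2048):
    """Keep the system message(s) plus the newest turns within budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    """
    est = lambda m: len(m["content"]) // 4 + 4   # crude per-message token estimate
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(est(m) for m in system)
    kept = []
    for msg in reversed(rest):                   # walk newest-first
        cost = est(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    # Restore chronological order for the kept turns.
    return system + list(reversed(kept))
```

Dropping oldest turns first preserves the system prompt and recent context, which also keeps the KV-cache prefix stable across turns.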

### 2.4 TTS (Text-to-Speech) Latency

**Synthesis Time:**
- **Contribution**: 150-300ms for first audio chunk
- **Optimization**:
  - **Streaming TTS**: Start playback before full synthesis
  - **Local Riva TTS**: 150-200ms with TRT-optimized Magpie model
  - **Chunked Generation**: Process sentences as they're generated
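Chunked generation can be sketched as a generator that flushes complete sentences from a streaming LLM output to TTS as soon as they end. The delimiter set below is an assumption; production systems also handle abbreviations, numbers, and currencies:

```python
import re

SENTENCE_END = re.compile(r'([.!?])\s')

def sentence_chunks(token_stream):
    """Yield complete sentences as they appear in a streaming LLM output."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush every complete sentence currently sitting in the buffer.
        while (m := SENTENCE_END.search(buffer)):
            yield buffer[:m.end(1)].strip()
            buffer = buffer[m.end():]
    if buffer.strip():        # flush whatever remains at stream end
        yield buffer.strip()

# Each yielded sentence can be sent to TTS immediately, so playback of the
# first sentence starts while later sentences are still being generated.
tokens = ["Hel", "lo there. ", "How can I ", "help you today?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help you today?']
```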

**Audio Post-processing:**
- **Contribution**: 50-100ms (normalization, encoding)
- **Optimization**:
  - Minimize the audio post-processing pipeline
  - Use hardware-accelerated audio codecs

### 2.5 Network and Infrastructure Latency

- **Geographic Distribution**: Distribute multi-node deployments based on user demographics
- **Load Balancing**: Use sticky sessions to avoid context switching
- **Monitoring**: Monitor key latency metrics in the production deployment

### 2.6 Advanced Latency Reduction Techniques

**Speculative Speech Processing:**
- Process interim ASR transcripts before speech ends
- Pre-generate likely responses during user speech
- **Potential Savings**: 200-400ms reduction in perceived latency
- For more details, see the [docs](SPECULATIVE_SPEECH_PROCESSING.md)

**Filler Words and Intermediate Responses:**
- Play generic filler phrases to reduce perceived latency
- For high-latency agents or reasoning models, generate intermediate responses based on function calls or thinking tokens
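The filler pattern can be sketched with `asyncio`: if the slow step (a tool call or a large model) hasn't responded within a threshold, speak a filler line first. The threshold, filler texts, and `speak` callback are illustrative choices:

```python
import asyncio
import random

FILLERS = ["Let me check that for you.", "One moment.", "Looking into it."]

async def respond_with_filler(slow_task, threshold_s=0.8, speak=print):
    """Speak a filler if the real answer takes longer than threshold_s."""
    task = asyncio.ensure_future(slow_task)
    done, _ = await asyncio.wait({task}, timeout=threshold_s)
    if not done:                      # answer is late: bridge the gap
        speak(random.choice(FILLERS))
    answer = await task
    speak(answer)
    return answer

async def slow_tool_call():
    await asyncio.sleep(1.2)          # stand-in for a slow function call
    return "Your balance is forty-two dollars."

# asyncio.run(respond_with_filler(slow_tool_call()))
```

Fast answers skip the filler entirely, so the pattern only adds speech when there is a silence to fill.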

---

## 3. Designing User Experience

### 3.1 Conversation Design Principles

**Natural Turn-Taking:**
- Allow interruptions (barge-in)
- Implement proper silence handling
- Use conversational markers ("um", "let me check")

**Progressive Disclosure:**
```python
# Don't overwhelm with options
# Bad:
"You can check balance, transfer funds, pay bills, view history,
update profile, set alerts, or lock your card. What would you like?"

# Good:
"What would you like to do today?"
# (Let user guide, offer suggestions if confused)
```

### 3.2 Persona & Tone Consistency

**Define Agent Personality:**
- Professional vs. casual
- Proactive vs. reactive
- Verbose vs. concise
- Empathetic vs. neutral

**Maintain Consistency:**
- Document persona guidelines
- Use system prompts for LLMs
- Implement tone checkers
- Regular quality reviews

### 3.3 Voice Selection

**Considerations:**
- Match voice to brand and use case
- Consider user demographics
- Regional accent preferences
- Gender neutrality options
- Custom IPA dictionaries to correct mispronunciations

**Quality Metrics:**
- Naturalness (MOS score > 4.0)
- Prosody and intonation
- Emotional expressiveness
- Consistency across sessions

### 3.4 Response Optimization for Voice

**Voice-Specific Adaptations:**
- Keep responses concise (1-3 sentences per turn)
- Use conversational language (contractions, simple words)
- Structure information hierarchically
- Avoid lists with more than 3-4 items
- Use explicit transitions

### 3.5 Prompt Design

**System Prompt Instructions:**
- Include persona and tone guidelines directly in the system prompt for consistency
- Provide clear instructions to avoid outputting formatting (bullet points, markdown, URLs) that doesn't translate to voice
- Define conversation boundaries and scope to keep interactions focused and prevent rambling
- Include examples of ideal voice responses in the prompt for few-shot guidance
- Instruct the agent to progressively disclose options and offer context-aware suggestions
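These instructions can be combined into a single system prompt. The persona, scope, and wording below are a hypothetical example for a banking voice agent, not a recommended production prompt:

```python
# Hypothetical system prompt illustrating the guidelines above:
# persona, voice-safe formatting, scope limits, and a few-shot example turn.
VOICE_SYSTEM_PROMPT = """\
You are Ava, a friendly and concise phone banking assistant.

Style:
- Speak in 1-3 short sentences per turn; use contractions.
- Never use bullet points, markdown, URLs, or spelled-out symbols.
- Offer at most two options at a time; let the caller guide the conversation.

Scope:
- Only discuss account balances, transfers, and card services.
- If asked about anything else, say you can't help with that and offer
  to transfer the caller to a human agent.

Example turn:
Caller: What's my balance?
Ava: Your checking account has two hundred ten dollars. Anything else?
"""
```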

### 3.6 ASR Transcript Quality

- Implement custom vocabulary boosting for domain terms
- Use inverse text normalization (ITN) for proper formatting
- Ensure user audio quality is good
- Avoid resampling if possible
- Riva ASR models are robust to noise, so additional noise preprocessing can be skipped
- Base critical decisions on final transcripts only
- Fine-tune the ASR model on domain data if needed

### 3.7 User-Facing Error Handling

**Error Categories:**

```python
ERROR_MESSAGES = {
    "asr_failure": "I didn't catch that. Could you say that again?",
    "service_unavailable": "I'm having trouble connecting. Let me try again.",
    "timeout": "This is taking longer than expected. Please hold on.",
    "out_of_scope": "I'm not able to help with that, but I can help you with..."
}
```

**Recovery Strategies:**
- Offer alternative input methods (DTMF, transfer to human)
- Provide clear next steps
- Graceful conversation termination
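These strategies can be combined into a simple recovery wrapper around any pipeline step. This is a sketch: the retry limit and spoken messages are illustrative, and `speak` stands in for the TTS output path:

```python
def handle_with_recovery(operation, max_retries=2, speak=print):
    """Retry a failing pipeline step, narrating each attempt to the user,
    then fall back to a graceful handoff instead of leaving the caller lost."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt < max_retries:
                # Transient failure: tell the user what's happening and retry.
                speak("I'm having trouble connecting. Let me try again.")
            else:
                # Retries exhausted: offer clear next steps before ending.
                speak("I'm still unable to do that. I can connect you "
                      "to a human agent, or you can try again later.")
    return None
```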

### 3.8 Continuous Testing

- Implement unit and integration testing
- Load testing to find latency bottlenecks
- Prepare test data with different conversation scenarios
- A/B testing to improve user experience
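Scenario tests can be expressed as expected-intent tables run against the agent. In this sketch, `classify_intent` is a hypothetical stand-in for your agent's NLU step; the utterances deliberately include ASR-style disfluencies:

```python
def classify_intent(utterance):
    """Hypothetical stand-in for the agent's intent classification step."""
    text = utterance.lower()
    if "balance" in text:
        return "check_balance"
    if "transfer" in text:
        return "transfer_funds"
    return "fallback"

# Each scenario pairs a realistic utterance with the expected intent.
SCENARIOS = [
    ("what's my balance", "check_balance"),
    ("uh can you transfer money to mom", "transfer_funds"),
    ("tell me a joke", "fallback"),
]

def run_scenarios():
    failures = [(u, expected, classify_intent(u))
                for u, expected in SCENARIOS if classify_intent(u) != expected]
    return failures   # empty list means all scenarios passed

print(run_scenarios())  # → []
```

Running such tables in CI after every prompt or model change catches regressions before users do.
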
---

## 4. Scalability & Performance

### 4.1 Horizontal Scaling

**Stateless Services:**
- Deploy ASR/TTS behind load balancers
- Use container orchestration (Kubernetes)
- Auto-scaling based on CPU/memory/queue depth

**Stateful Services:**
- Use sticky sessions
- Distributed session storage (Redis)

### 4.2 Resource Optimization

**Model Optimization:**
- Quantization (FP16, INT8) and TRT optimization for inference
- Select smaller models for a lower footprint
- Batch inference where possible
- GPU sharing and multiplexing

### 4.3 Network Optimization

**WebRTC Best Practices:**
- Use TURN servers for NAT traversal
- Implement adaptive bitrate
- Support multiple codecs (Opus preferred)
- Handle network transitions (WiFi to cellular)

---

## Conclusion

Building production voice agents requires a holistic approach balancing technical performance, user experience, and operational excellence. Key takeaways:

1. **Design for Latency**: Every millisecond counts in conversational AI
2. **Handle Errors Gracefully**: Users should never feel lost
3. **Monitor Everything**: You can't improve what you don't measure
4. **Test Thoroughly**: Automated testing catches issues before users do
5. **Iterate Based on Data**: Use real user feedback to improve
6. **Plan for Scale**: Design for 10x your current load
7. **Prioritize Security**: Protect user data as your top responsibility