`examples/voice_agent/README.md`
A fully open-source NVIDIA NeMo Voice Agent example demonstrating a simple way to…
As of now, we only support English input and output, but more languages will be supported in the future.
## 📋 Table of Contents
- [✨ Key Features](#-key-features)
- [💡 Upcoming Next](#-upcoming-next)
- [📅 Latest Updates](#-latest-updates)
- [🚀 Quick Start](#-quick-start)
- [📑 Supported Models and Features](#-supported-models-and-features)
  - [🤖 LLM](#-llm)
    - [Thinking/reasoning Mode for LLMs](#thinkingreasoning-mode-for-llms)
  - [🎤 ASR](#-asr)
  - [💬 Speaker Diarization](#-speaker-diarization)
  - [🔉 TTS](#-tts)
  - [🔄 Turn-taking](#-turn-taking)
  - [🔧 Tool Calling](#-tool-calling)
- [📝 Notes \& FAQ](#-notes--faq)
- [☁️ NVIDIA NIM Services](#️-nvidia-nim-services)
- [Acknowledgments](#acknowledgments)
- [Contributing](#contributing)
## ✨ Key Features
- Low latency TTS for fast audio response generation.
- Speaker diarization up to 4 speakers in different user turns.
- WebSocket server for easy deployment.
- Tool calling for LLMs to use external tools and adjust its own behavior.
## 💡 Upcoming Next
- Combine ASR and speaker diarization model to handle overlapping speech.
## 📅 Latest Updates
- 2025-12-31: Added examples for [tool calling](#-tool-calling), such as changing the speaking speed, switching between male/female voices and British/American accents, and getting the current weather of a city. The diarization model is updated to [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1) with improved performance.
- 2025-11-14: Added support for joint ASR and EOU detection with the [Parakeet-realtime-eou-120m](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) model.
- 2025-10-10: Added support for the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model.
- 2025-10-03: Added support for serving the LLM with vLLM, with automatic switching between vLLM and HuggingFace, and added [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) as the default LLM.
- 2025-09-05: First release of NeMo Voice Agent.
## 🚀 Quick Start
### Hardware requirements
Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever port you have configured for the client).
You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button.
**If using the Chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.** You may also need to restart the browser for the change to take effect.
If you want to use a different port for client connection, you can modify `client/vite.config.js` to change the `port` variable.
## 📑 Supported Models and Features
### 🤖 LLM
Most LLMs from HuggingFace are supported. A few examples are:
Here are the supported TTS models:
We will support more TTS models in the future.
### 🔄 Turn-taking
As the new turn-taking prediction model is not yet released, we use the VAD-based turn-taking prediction for now. You can set the `vad.stop_secs` to the desired value in `server/server_configs/default.yaml` to control the amount of silence needed to indicate the end of a user's turn.
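For illustration, a minimal sketch of the relevant snippet in `server/server_configs/default.yaml`; the exact nesting and the value are assumptions, only the `vad.stop_secs` key comes from this README:

```yaml
vad:
  # Seconds of trailing silence that end the user's turn.
  # Lower values respond faster but may cut the user off mid-sentence.
  stop_secs: 0.8  # illustrative value, not necessarily the shipped default
```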
Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which means phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in the server config to the desired list of phrases, or to the path of a YAML file containing the list. By default, the phrases in `server/backchannel_phrases.yaml` are used. Setting it to `null` disables backchannel detection, so the VAD will interrupt the bot as soon as the user starts speaking.
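A hedged sketch of the corresponding turn-taking options (the key name comes from this README; the phrase values and nesting are illustrative):

```yaml
turn_taking:
  # Option 1: an inline list of phrases to ignore while the bot is talking.
  backchannel_phrases:
    - "uh-huh"
    - "yeah"
    - "okay"
  # Option 2: a path to a YAML file with the list,
  #           e.g. "server/backchannel_phrases.yaml" (the default).
  # Option 3: null, which disables backchannel detection entirely.
```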
### 🔧 Tool Calling
We support tool calling so that the LLM can use external tools (e.g., getting the current weather of a city) or adjust its own behavior (e.g., changing the speaking speed). Some example queries to try with the default server config:
1. Getting the current weather of a city:
   - "What's the weather in New York City?"
   - "What's the weather in Paris?"
   - "What's the weather in Paris, Texas, USA?"
2. Changing the speaking speed of the voice agent:
   - "Can you speak faster?"
   - "Can you speak slower?"
   - "Reset to the original speaking speed."
   - "Speak twice as fast."
   - "Speak half as slow."
3. Switching between British and American accents, and changing the gender of the voice:
   - "Speak in a British accent."
   - "Switch to a male voice."
   - "Switch to a female voice."
   - "Reset to the original language and voice."
Currently, tool calling is only supported when serving the LLM with vLLM, and only for specific LLM models. More LLMs can be supported by referring to their documentation on how to enable tool calling in vLLM; note that the system prompt may need to be tuned accordingly.
More tools will be added later. However, if you cannot wait to hack and add your own tools, please read the following section.
#### Adding new tools
Additional tools can be added in two ways:
- Adding a new [direct function](https://docs.pipecat.ai/guides/learn/function-calling#using-direct-functions-shorthand) such as the `get_city_weather` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/basic_tools.py` (see the sketch after this list).
- Adding new tools that adjust the behavior of the STT/TTS/Diar/LLM/TurnTaking components, by adding the `ToolCallingMixin` to the component and implementing the `setup_tool_calling` method, as is done in the `KokoroTTSService` class in `nemo/agents/voice_agent/pipecat/services/nemo/tts.py`.
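As an illustration of the first approach, here is a minimal sketch in the style of Pipecat's direct functions; the `get_current_time` tool is hypothetical and not part of this repository, and Pipecat derives the tool schema from the signature and docstring:

```python
from datetime import datetime, timezone

from pipecat.services.llm_service import FunctionCallParams


async def get_current_time(params: FunctionCallParams, timezone_name: str = "UTC"):
    """Get the current time.

    Args:
        timezone_name: IANA timezone name; this sketch only handles "UTC".
    """
    # Direct functions report results through the callback instead of `return`.
    now = datetime.now(timezone.utc).isoformat()
    await params.result_callback({"time": now, "timezone": timezone_name})
```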
The tools are then registered with the LLM via the `register_direct_tools_to_llm` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/mixins.py`, as shown in `examples/voice_agent/server/bot_websocket_server.py`.
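A sketch of what that registration might look like in `bot_websocket_server.py`; the exact signature of `register_direct_tools_to_llm` is an assumption, so check `mixins.py` for the real call shape:

```python
from nemo.agents.voice_agent.pipecat.utils.tool_calling.basic_tools import get_city_weather
from nemo.agents.voice_agent.pipecat.utils.tool_calling.mixins import (
    register_direct_tools_to_llm,
)

# Assumed call shape: the LLM service plus the direct functions to expose.
register_direct_tools_to_llm(llm, [get_city_weather])
```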
More details on tool calling with Pipecat can be found in the [Pipecat documentation](https://docs.pipecat.ai/guides/learn/function-calling).
#### Notes on system prompt with tools
We notice that sometimes the LLM refuses to do anything unrelated to the provided tools, or claims to be using a tool without actually calling it. To alleviate this issue, we insert additional instructions into the system prompt (e.g., in `server/server_configs/llm_configs/nemotron_nano_v2.yaml`):
- "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools."
## 📝 Notes & FAQ
NVIDIA also provides a variety of [NIM](https://developer.nvidia.com/nim?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.nim%3Adesc%2Ctitle%3Aasc&hitsPerPage=12) services for better ASR, TTS and LLM performance with more efficient deployment on either cloud or local servers.
You can also modify `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM, ASR, and TTS performance by referring to these Pipecat services:
- [NVIDIA NIM LLM Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nim/llm.py)
- [NVIDIA Riva ASR Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
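For example, swapping the LLM for a NIM endpoint might look like the following sketch; the constructor arguments follow Pipecat's OpenAI-compatible services and the model name is illustrative, so verify both against the linked source:

```python
import os

from pipecat.services.nim.llm import NimLLMService

# NIM exposes an OpenAI-compatible API; any NIM-hosted model ID can be used here.
llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    model="nvidia/nvidia-nemotron-nano-9b-v2",  # illustrative model ID
)
```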