Commit 194f1d5
[voice agent] Add examples for tool calling (#15243)
* fix llm model config manager
* add tool calling
* update readme for tool calling
* update dependency
* clean up
* update default diar to nvidia/diar_streaming_sortformer_4spk-v2.1
* clean up nano-v2 tool calling
* fix linting, clean up
* fix linting
* refactor and add voice change
* update docstring
* update readme
* update readme
* fix linting
* update readme
* refactor
* update readme
* fix typos
* add timeout for python_weather
* add more details on tool calling to readme
* redo taejin's change on nano-v2 parser
* improve tool calling
* fix dependency
* add reset to tool parser
* relax vllm exception
* relax vllm exception
* update vllm for consecutive user turns
* update vllm server
* fix config
* clean up

---------

Signed-off-by: stevehuang52 <heh@nvidia.com>
1 parent 4b37542 commit 194f1d5

File tree

19 files changed: +1177 additions, -60 deletions

examples/voice_agent/README.md

Lines changed: 73 additions & 7 deletions
@@ -4,6 +4,24 @@ A fully open-source NVIDIA NeMo Voice Agent example demonstrating a simple way t
 
 As of now, we only support English input and output, but more languages will be supported in the future.
 
+## 📋 Table of Contents
+- [✨ Key Features](#-key-features)
+- [💡 Upcoming Next](#-upcoming-next)
+- [📅 Latest Updates](#-latest-updates)
+- [🚀 Quick Start](#-quick-start)
+- [📑 Supported Models and Features](#-supported-models-and-features)
+  - [🤖 LLM](#-llm)
+    - [Thinking/reasoning Mode for LLMs](#thinkingreasoning-mode-for-llms)
+  - [🎤 ASR](#-asr)
+  - [💬 Speaker Diarization](#-speaker-diarization)
+  - [🔉 TTS](#-tts)
+  - [🔄 Turn-taking](#-turn-taking)
+  - [🔧 Tool Calling](#-tool-calling)
+- [📝 Notes \& FAQ](#-notes--faq)
+- [☁️ NVIDIA NIM Services](#️-nvidia-nim-services)
+- [Acknowledgments](#acknowledgments)
+- [Contributing](#contributing)
+
 
 ## ✨ Key Features
 
@@ -13,6 +31,7 @@ As of now, we only support English input and output, but more languages will be
 - Low latency TTS for fast audio response generation.
 - Speaker diarization up to 4 speakers in different user turns.
 - WebSocket server for easy deployment.
+- Tool calling for LLMs to use external tools and adjust their own behavior.
 
 
 ## 💡 Upcoming Next
@@ -21,13 +40,15 @@ As of now, we only support English input and output, but more languages will be
 - Combine ASR and speaker diarization model to handle overlapping speech.
 
 
-## Latest Updates
+## 📅 Latest Updates
+- 2025-12-31: Added examples for [tool calling](#-tool-calling), such as changing the speaking speed, switching between male/female voices and British/American accents, and getting the current weather of a city. The diarization model is updated to [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1) with improved performance.
 - 2025-11-14: Added support for joint ASR and EOU detection with [Parakeet-realtime-eou-120m](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) model.
 - 2025-10-10: Added support for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model.
 - 2025-10-03: Add support for serving LLM with vLLM and auto-switch between vLLM and HuggingFace, add [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) as default LLM.
 - 2025-09-05: First release of NeMo Voice Agent.
 
 
+
 ## 🚀 Quick Start
 
 ### Hardware requirements
@@ -106,17 +127,17 @@ Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatev
 
 You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button.
 
-**If using chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.**
+**If using the Chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.** You may also need to restart the browser for the changes to take effect.
 
 If you want to use a different port for client connection, you can modify `client/vite.config.js` to change the `port` variable.
 
-## 📑 Supported Models
+## 📑 Supported Models and Features
 
 ### 🤖 LLM
 
 Most LLMs from HuggingFace are supported. A few examples are:
 - [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) (default)
-  - Please use `server/server_configs/llm_configs/nemotron-nano-v2.yaml` as the server config.
+  - Please use `server/server_configs/llm_configs/nemotron_nano_v2.yaml` as the server config.
 - [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
   - Please use `server/server_configs/llm_configs/qwen2.5-7B.yaml` as the server config.
 - [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
@@ -177,11 +198,56 @@ Here are the supported TTS models:
 We will support more TTS models in the future.
 
 
-### Turn-taking
+### 🔄 Turn-taking
 
 As the new turn-taking prediction model is not yet released, we use the VAD-based turn-taking prediction for now. You can set the `vad.stop_secs` to the desired value in `server/server_configs/default.yaml` to control the amount of silence needed to indicate the end of a user's turn.
 
-Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which it means phrases such as "uh-huh", "yeah", "okay" will not interrupt the bot while it's talking. To control the backchannel phrases to be used, you can set the `turn_taking.backchannel_phrases` in the server config to the desired list of phrases or a file path to a yaml file containing the list of phrases. By default, it will use the phrases in `server/backchannel_phrases.yaml`. Setting it to `null` will disable detecting backchannel phrases, and that the VAD will interrupt the bot immediately when the user starts speaking.
+Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which means phrases such as "uh-huh", "yeah", "okay" will not interrupt the bot while it's talking. To control the backchannel phrases to be used, you can set `turn_taking.backchannel_phrases` in the server config to the desired list of phrases or a file path to a yaml file containing the list of phrases. By default, it will use the phrases in `server/backchannel_phrases.yaml`. Setting it to `null` will disable backchannel phrase detection, so that the VAD will interrupt the bot immediately when the user starts speaking.
+
+
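The backchannel handling described in this hunk can be sketched in a few lines. This is a minimal illustration only, not the actual NeMo turn-taking code; the class and method names are hypothetical, and it assumes phrases are matched after lowercasing and stripping trailing punctuation:

```python
# Hypothetical sketch of backchannel-aware interruption logic
# (illustrative names; not the actual NeMo implementation).

class TurnTakingFilter:
    def __init__(self, backchannel_phrases):
        # None disables backchannel detection: every utterance interrupts the bot.
        self.backchannel_phrases = (
            {p.strip().lower() for p in backchannel_phrases}
            if backchannel_phrases is not None
            else None
        )

    def should_interrupt(self, transcript: str, bot_is_speaking: bool) -> bool:
        """Decide whether a user utterance should interrupt the bot."""
        if not bot_is_speaking:
            return True  # nothing to interrupt; treat as a normal user turn
        if self.backchannel_phrases is None:
            return True  # detection disabled: VAD interrupts immediately
        normalized = transcript.strip().lower().rstrip(".,!?")
        # Ignore pure backchannels ("uh-huh", "yeah", "okay") while the bot talks.
        return normalized not in self.backchannel_phrases


f = TurnTakingFilter(["uh-huh", "yeah", "okay"])
print(f.should_interrupt("Yeah.", bot_is_speaking=True))       # False: ignored
print(f.should_interrupt("Wait, stop!", bot_is_speaking=True))  # True: interrupts
```

A list loaded from a file like `server/backchannel_phrases.yaml` would simply be passed in as `backchannel_phrases`.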
+### 🔧 Tool Calling
+
+We support tool calling, which lets the LLM use external tools (e.g., getting the current weather of a city) or adjust the agent's own behavior (e.g., changing the speaking speed). Some example queries to try with the default server config:
+
+1. Getting the current weather of a city:
+   - "What's the weather in New York City?"
+   - "What's the weather in Paris?"
+   - "What's the weather in Paris, Texas, USA?"
+
+2. Changing the speaking speed of the voice agent:
+   - "Can you speak faster?"
+   - "Can you speak slower?"
+   - "Reset to the original speaking speed."
+   - "Speak twice as fast."
+   - "Speak half as slow."
+
+3. Switching between British and American accents, and changing the gender of the voice:
+   - "Speak in a British accent."
+   - "Switch to a male voice."
+   - "Switch to a female voice."
+   - "Reset to the original language and voice."
+
+Currently, tool calling is only supported with the vLLM server and specific LLM models:
+- [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) (default)
+
+More LLMs can be supported by referring to their documentation on how to enable tool calling in vLLM. Note that the system prompt may need to be tuned accordingly.
+
+More tools will be added later. However, if you cannot wait to hack and add your own tools, please read the following section.
+
+#### Adding new tools
+
+Additional tools can be added in two ways:
+- Adding a new [direct function](https://docs.pipecat.ai/guides/learn/function-calling#using-direct-functions-shorthand), such as the `get_city_weather` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/basic_tools.py`.
+- Adding new tools that adjust the behavior of each of the STT/TTS/Diar/LLM/TurnTaking components, by adding the `ToolCallingMixin` to the component and implementing the `setup_tool_calling` method, as done in the `KokoroTTSService` class in `nemo/agents/voice_agent/pipecat/services/nemo/tts.py`.
+
+The tools are then registered to the LLM via the `register_direct_tools_to_llm` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/mixins.py`, as shown in the example in `examples/voice_agent/server/bot_websocket_server.py`.
+
+More details on tool calling with Pipecat can be found in the [Pipecat documentation](https://docs.pipecat.ai/guides/learn/function-calling).
+
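Pipecat's direct-function shorthand works by deriving a tool's JSON schema from the Python function itself (its name, type hints, and docstring). The helper below is a rough, hypothetical sketch of that idea using only the standard library; it is not Pipecat's actual implementation, and real direct functions are async and also receive a call-parameters object for returning results:

```python
import inspect
import typing

# Hypothetical sketch: build an OpenAI-style tool schema from a plain Python
# function, similar in spirit to Pipecat's "direct functions" shorthand.

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(func) -> dict:
    hints = typing.get_type_hints(func)
    hints.pop("return", None)
    # Map each annotated parameter to a JSON-schema property.
    params = {name: {"type": _JSON_TYPES.get(tp, "string")} for name, tp in hints.items()}
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": (func.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

def get_city_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"Weather lookup for {city} is stubbed out in this sketch."

print(tool_schema(get_city_weather)["function"]["name"])  # get_city_weather
```

This shows why a well-typed signature and a clear docstring matter for direct functions: they become the schema the LLM sees when deciding whether to call the tool.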
+#### Notes on system prompt with tools
+
+We noticed that sometimes the LLM refuses to do anything unrelated to the provided tools, or claims to be using a tool without actually calling it. To alleviate this issue, we insert additional instructions into the system prompt (e.g., in `server/server_configs/llm_configs/nemotron_nano_v2.yaml`):
+- "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools."
 
 
 ## 📝 Notes & FAQ
@@ -199,7 +265,7 @@ Additionally, the voice agent supports ignoring back-channel phrases while the b
 
 NVIDIA also provides a variety of [NIM](https://developer.nvidia.com/nim?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.nim%3Adesc%2Ctitle%3Aasc&hitsPerPage=12) services for better ASR, TTS and LLM performance with more efficient deployment on either cloud or local servers.
 
-You can also modify the `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM,ASR and TTS performance, by refering to these Pipecat services:
+You can also modify the `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM, ASR and TTS performance, by referring to these Pipecat services:
 - [NVIDIA NIM LLM Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nim/llm.py)
 - [NVIDIA Riva ASR Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
 - [NVIDIA Riva TTS Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/tts.py)

examples/voice_agent/environment.yaml

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,7 @@ dependencies:
   - addict==2.4.0
   - aiofiles==24.1.0
   - aiohappyeyeballs==2.6.1
-  - aiohttp==3.12.15
+  - aiohttp==3.13.2
   - aiosignal==1.4.0
   - alabaster==1.0.0
   - alembic==1.16.5
@@ -332,6 +332,7 @@ dependencies:
   - python-dotenv==1.1.1
   - python-json-logger==3.3.0
   - python-multipart==0.0.20
+  - python-weather==2.1.1
   - pytorch-lightning==2.5.5
   - pytz==2025.2
   - pyyaml==6.0.3

examples/voice_agent/server/bot_websocket_server.py

Lines changed: 9 additions & 2 deletions
@@ -20,7 +20,6 @@
 import sys
 
 from loguru import logger
-
 from pipecat.audio.vad.silero import SileroVADAnalyzer
 from pipecat.frames.frames import EndTaskFrame
 from pipecat.pipeline.pipeline import Pipeline
@@ -41,6 +40,8 @@
     WebsocketServerTransport,
 )
 from nemo.agents.voice_agent.pipecat.utils.text.simple_text_aggregator import SimpleSegmentedTextAggregator
+from nemo.agents.voice_agent.pipecat.utils.tool_calling.basic_tools import get_city_weather
+from nemo.agents.voice_agent.pipecat.utils.tool_calling.mixins import register_direct_tools_to_llm
 from nemo.agents.voice_agent.utils.config_manager import ConfigManager
 
 
@@ -218,14 +219,20 @@ async def run_bot_websocket_server(host: str = "0.0.0.0", port: int = 8765):
     logger.info("TTS service initialized")
 
     context = OpenAILLMContext(
-        [
+        messages=[
             {
                 "role": SYSTEM_ROLE,
                 "content": SYSTEM_PROMPT,
             }
         ],
     )
 
+    if server_config.llm.get("enable_tool_calling", False):
+        logger.info("Tool calling for LLM is enabled by config, registering tools...")
+        register_direct_tools_to_llm(llm=llm, context=context, tool_mixins=[tts], tools=[get_city_weather])
+    else:
+        logger.info("Tool calling for LLM is disabled by config, skipping tool registration.")
+
     original_messages = copy.deepcopy(context.get_messages())
     original_context = copy.deepcopy(context)
     original_context.set_llm_adapter(llm.get_llm_adapter())
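The registration pattern used in this file (components expose their own tools via a mixin method, and one call wires direct functions plus component tools to the LLM) can be sketched with toy classes. Everything below is illustrative stand-in code, not the actual NeMo/Pipecat `register_direct_tools_to_llm` or `ToolCallingMixin` implementation:

```python
# Toy sketch of mixin-based tool registration (hypothetical classes;
# not the actual NeMo/Pipecat implementation).

class ToolCallingMixin:
    def setup_tool_calling(self) -> dict:
        """Return a mapping of tool name -> callable for this component."""
        raise NotImplementedError

class FakeTTSService(ToolCallingMixin):
    def __init__(self):
        self.speed = 1.0

    def set_speaking_speed(self, factor: float) -> str:
        self.speed = factor
        return f"Speaking speed set to {factor}x."

    def setup_tool_calling(self):
        # The TTS component contributes its own behavior-changing tool.
        return {"set_speaking_speed": self.set_speaking_speed}

class FakeLLM:
    def __init__(self):
        self.tools = {}

    def register_function(self, name, handler):
        self.tools[name] = handler

def register_tools_to_llm(llm, tool_mixins=(), tools=()):
    # Direct functions are registered under their own names...
    for func in tools:
        llm.register_function(func.__name__, func)
    # ...and each mixin component contributes the tools it set up.
    for mixin in tool_mixins:
        for name, handler in mixin.setup_tool_calling().items():
            llm.register_function(name, handler)

def get_city_weather(city: str) -> str:
    """Toy stand-in for a weather lookup tool (no live data)."""
    return f"No live weather in this sketch for {city}."

llm, tts = FakeLLM(), FakeTTSService()
register_tools_to_llm(llm, tool_mixins=[tts], tools=[get_city_weather])
print(sorted(llm.tools))  # ['get_city_weather', 'set_speaking_speed']
```

Keeping behavior-changing tools on the components themselves (rather than in the server script) means each service stays the single owner of its own state, which mirrors how the diff passes `tool_mixins=[tts]` above.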
