`examples/voice_agent/README.md`
A fully open-source NVIDIA NeMo Voice Agent example demonstrating a simple way to…
As of now, we only support English input and output, but more languages will be supported in the future.
## 📋 Table of Contents
- [✨ Key Features](#-key-features)
- [💡 Upcoming Next](#-upcoming-next)
- [📅 Latest Updates](#-latest-updates)
- [🚀 Quick Start](#-quick-start)
- [📑 Supported Models and Features](#-supported-models-and-features)
  - [🤖 LLM](#-llm)
    - [Thinking/reasoning Mode for LLMs](#thinkingreasoning-mode-for-llms)
  - [🎤 ASR](#-asr)
  - [💬 Speaker Diarization](#-speaker-diarization)
  - [🔉 TTS](#-tts)
  - [🔄 Turn-taking](#-turn-taking)
  - [🔧 Tool Calling](#-tool-calling)
- [📝 Notes \& FAQ](#-notes--faq)
- [☁️ NVIDIA NIM Services](#️-nvidia-nim-services)
- [Acknowledgments](#acknowledgments)
- [Contributing](#contributing)
## ✨ Key Features
- Low latency TTS for fast audio response generation.
- Speaker diarization up to 4 speakers in different user turns.
- WebSocket server for easy deployment.
- Tool calling for LLMs to use external tools and adjust its own behavior.
## 💡 Upcoming Next
- Combine ASR and speaker diarization model to handle overlapping speech.
## 📅 Latest Updates
- 2025-12-31: Added examples for [tool calling](#-tool-calling), such as changing the speaking speed, switching between male/female voices and British/American accents, and getting the current weather of a city. The diarization model is updated to [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1) with improved performance.
- 2025-11-14: Added support for joint ASR and EOU detection with the [Parakeet-realtime-eou-120m](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) model.
- 2025-10-10: Added support for the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model.
- 2025-10-03: Added support for serving the LLM with vLLM, with automatic switching between vLLM and HuggingFace, and added [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) as the default LLM.
- 2025-09-05: First release of NeMo Voice Agent.
## 🚀 Quick Start
### Hardware requirements
Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever port you have configured for the client).
You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button.
**If using the Chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.** You may also need to restart the browser for the change to take effect.
If you want to use a different port for client connection, you can modify `client/vite.config.js` to change the `port` variable.
## 📑 Supported Models and Features
### 🤖 LLM
Most LLMs from HuggingFace are supported. A few examples are:
Here are the supported TTS models:
We will support more TTS models in the future.
### 🔄 Turn-taking
As the new turn-taking prediction model is not yet released, we use the VAD-based turn-taking prediction for now. You can set the `vad.stop_secs` to the desired value in `server/server_configs/default.yaml` to control the amount of silence needed to indicate the end of a user's turn.
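For illustration, a minimal sketch of the relevant snippet in `server/server_configs/default.yaml`; the exact nesting and the value are assumptions, only the `vad.stop_secs` key comes from this README:

```yaml
vad:
  # Seconds of trailing silence that end the user's turn.
  # Lower values respond faster but may cut the user off mid-sentence.
  stop_secs: 0.8  # illustrative value, not necessarily the shipped default
```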
Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which means phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in the server config to the desired list of phrases, or to the path of a YAML file containing the list. By default, the phrases in `server/backchannel_phrases.yaml` are used. Setting it to `null` disables backchannel detection, so the VAD will interrupt the bot as soon as the user starts speaking.
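A hedged sketch of the corresponding turn-taking options (the key name comes from this README; the phrase values and nesting are illustrative):

```yaml
turn_taking:
  # Option 1: an inline list of phrases to ignore while the bot is talking.
  backchannel_phrases:
    - "uh-huh"
    - "yeah"
    - "okay"
  # Option 2: a path to a YAML file with the list,
  #           e.g. "server/backchannel_phrases.yaml" (the default).
  # Option 3: null, which disables backchannel detection entirely.
```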
### 🔧 Tool Calling
We support tool calling so that the LLM can use external tools (e.g., getting the current weather of a city) or adjust its own behavior (e.g., changing the speaking speed). Some example queries to try with the default server config:
1. Getting the current weather of a city:
   - "What's the weather in New York City?"
   - "What's the weather in Paris?"
   - "What's the weather in Paris, Texas, USA?"
2. Changing the speaking speed of the voice agent:
   - "Can you speak faster?"
   - "Can you speak slower?"
   - "Reset to the original speaking speed."
   - "Speak twice as fast."
   - "Speak half as slow."
3. Switching between British and American accents, and changing the gender of the voice:
   - "Speak in a British accent."
   - "Switch to a male voice."
   - "Switch to a female voice."
   - "Reset to the original language and voice."
Currently, tool calling is only supported when serving the LLM with vLLM, and only for specific LLM models. More LLMs can be supported by referring to their documentation on how to enable tool calling in vLLM; note that the system prompt may need to be tuned accordingly.
More tools will be added later. However, if you cannot wait to hack and add your own tools, please read the following section.
#### Adding new tools
Additional tools can be added in two ways:
- Adding a new [direct function](https://docs.pipecat.ai/guides/learn/function-calling#using-direct-functions-shorthand) such as the `get_city_weather` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/basic_tools.py` (see the sketch after this list).
- Adding new tools that adjust the behavior of the STT/TTS/Diar/LLM/TurnTaking components, by adding the `ToolCallingMixin` to the component and implementing the `setup_tool_calling` method, as is done in the `KokoroTTSService` class in `nemo/agents/voice_agent/pipecat/services/nemo/tts.py`.
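As an illustration of the first approach, here is a minimal sketch in the style of Pipecat's direct functions; the `get_current_time` tool is hypothetical and not part of this repository, and Pipecat derives the tool schema from the signature and docstring:

```python
from datetime import datetime, timezone

from pipecat.services.llm_service import FunctionCallParams


async def get_current_time(params: FunctionCallParams, timezone_name: str = "UTC"):
    """Get the current time.

    Args:
        timezone_name: IANA timezone name; this sketch only handles "UTC".
    """
    # Direct functions report results through the callback instead of `return`.
    now = datetime.now(timezone.utc).isoformat()
    await params.result_callback({"time": now, "timezone": timezone_name})
```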
The tools are then registered with the LLM via the `register_direct_tools_to_llm` function in `nemo/agents/voice_agent/pipecat/utils/tool_calling/mixins.py`, as shown in `examples/voice_agent/server/bot_websocket_server.py`.
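A sketch of what that registration might look like in `bot_websocket_server.py`; the exact signature of `register_direct_tools_to_llm` is an assumption, so check `mixins.py` for the real call shape:

```python
from nemo.agents.voice_agent.pipecat.utils.tool_calling.basic_tools import get_city_weather
from nemo.agents.voice_agent.pipecat.utils.tool_calling.mixins import (
    register_direct_tools_to_llm,
)

# Assumed call shape: the LLM service plus the direct functions to expose.
register_direct_tools_to_llm(llm, [get_city_weather])
```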
More details on tool calling with Pipecat can be found in the [Pipecat documentation](https://docs.pipecat.ai/guides/learn/function-calling).
#### Notes on system prompt with tools
We notice that sometimes the LLM refuses to do anything unrelated to the provided tools, or claims to be using a tool without actually calling it. To alleviate this issue, we insert additional instructions into the system prompt (e.g., in `server/server_configs/llm_configs/nemotron_nano_v2.yaml`):
- "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools."
## 📝 Notes & FAQ
NVIDIA also provides a variety of [NIM](https://developer.nvidia.com/nim?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.nim%3Adesc%2Ctitle%3Aasc&hitsPerPage=12) services for better ASR, TTS and LLM performance with more efficient deployment on either cloud or local servers.
You can also modify `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM, ASR, and TTS performance by referring to these Pipecat services:
- [NVIDIA NIM LLM Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nim/llm.py)
- [NVIDIA Riva ASR Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
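For example, swapping the LLM for a NIM endpoint might look like the following sketch; the constructor arguments follow Pipecat's OpenAI-compatible services and the model name is illustrative, so verify both against the linked source:

```python
import os

from pipecat.services.nim.llm import NimLLMService

# NIM exposes an OpenAI-compatible API; any NIM-hosted model ID can be used here.
llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    model="nvidia/nvidia-nemotron-nano-9b-v2",  # illustrative model ID
)
```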