
Commit 208cf2d

[voice agent] Improve tool calling and logging ux (NVIDIA-NeMo#15269)
* refactor
* update tts
* improve ux
* fix linting
* refactor tts tool
* Moving voice agent tests from example to test folder
* Apply isort and black reformatting
* Moving the temporary test file to example folder

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: taejinp <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: taejinp <[email protected]>
Co-authored-by: tango4j <[email protected]>
1 parent ab5fabc commit 208cf2d

File tree

14 files changed: +201 / −51 lines changed


examples/voice_agent/README.md

Lines changed: 7 additions & 4 deletions
```diff
@@ -11,7 +11,6 @@ As of now, we only support English input and output, but more languages will be
 - [🚀 Quick Start](#-quick-start)
 - [📑 Supported Models and Features](#-supported-models-and-features)
 - [🤖 LLM](#-llm)
-  - [Thinking/reasoning Mode for LLMs](#thinkingreasoning-mode-for-llms)
 - [🎤 ASR](#-asr)
 - [💬 Speaker Diarization](#-speaker-diarization)
 - [🔉 TTS](#-tts)
@@ -171,6 +170,9 @@ For vLLM server, if you specify `--reasoning_parser` in `vllm_server_params`, th
 
 We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:
 - [nvidia/parakeet_realtime_eou_120m-v1](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) (default)
+  - This model supports EOU prediction and is optimized for the lowest latency, but does not support punctuation and capitalization.
+- [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
+  - This model has better ASR accuracy and supports punctuation and capitalization, but does not predict EOU.
 - [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms)
 - [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)
@@ -244,10 +246,11 @@ The tools are then registered to the LLM via the `register_direct_tools_to_llm`
 
 More details on tool calling with Pipecat can be found in the [Pipecat documentation](https://docs.pipecat.ai/guides/learn/function-calling).
 
-#### Notes on system prompt with tools
+#### Notes on tool calling issues
 
-We notice that sometimes the LLM cannot do anything that's not related to the provided tools, or it might not actually use the tools even though it says it's using them. To alleviate this issue, we insert additional instructions to the system prompt (e.g., in `server/server_configs/llm_configs/nemotron_nano_v2.yaml`):
-- "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools."
+We notice that sometimes the LLM cannot do anything that's not related to the provided tools, or it might not actually use the tools even though it says it's using them. To alleviate this issue, we insert additional instructions into the system prompt to regulate its behavior (e.g., in `server/server_configs/llm_configs/nemotron_nano_v2.yaml`).
+
+Sometimes, after answering a question related to the tools, the LLM might refuse to answer questions that are not related to the tools, or vice versa. This phenomenon is sometimes called "commitment bias" or "tunnel vision". To alleviate this issue, we can insert additional instructions into the system prompt that explicitly ask the LLM to use or not use the tools for the user's query.
 
 
 ## 📝 Notes & FAQ
```
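For readers skimming this commit, here is a minimal sketch of the shape of a direct tool. Only the registration call in the comment is taken from `bot_websocket_server.py` in this diff; the body and signature of `tool_get_city_weather` are illustrative assumptions, not the actual implementation in `basic_tools.py`.

```python
# Hypothetical sketch of a direct tool. The real tool_get_city_weather lives in
# nemo/agents/voice_agent/pipecat/utils/tool_calling/basic_tools.py; its exact
# signature and return format may differ from what is shown here.
async def tool_get_city_weather(city: str) -> dict:
    """Return a weather report for the given city (mocked here)."""
    # A real implementation would query a weather service instead of returning a constant.
    return {"city": city, "condition": "sunny", "temperature_c": 22}

# Registration, as it appears in this commit (llm, context, and tts are the
# service objects constructed earlier in bot_websocket_server.py):
# register_direct_tools_to_llm(llm=llm, context=context, tool_mixins=[tts], tools=[tool_get_city_weather])
```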

examples/voice_agent/client/src/app.ts

Lines changed: 51 additions & 1 deletion
```diff
@@ -41,6 +41,8 @@ class WebsocketClientApp {
   private analyser: AnalyserNode | null = null;
   private microphone: MediaStreamAudioSourceNode | null = null;
   private volumeUpdateInterval: number | null = null;
+  private currentBotMessageElement: HTMLDivElement | null = null;
+  private currentBotMessage: string = '';
 
   // Server configurations
   private readonly serverConfigs = {
@@ -110,6 +112,19 @@
     console.log(message);
   }
 
+  /**
+   * Create a bot message element and add it to the debug log
+   */
+  private createBotMessageElement(initialText: string): HTMLDivElement | null {
+    if (!this.debugLog) return null;
+    const entry = document.createElement('div');
+    entry.style.color = '#4CAF50';
+    entry.textContent = `${new Date().toISOString()} - ${initialText}`;
+    this.debugLog.appendChild(entry);
+    this.debugLog.scrollTop = this.debugLog.scrollHeight;
+    return entry;
+  }
+
   /**
    * Update the connection status display
    */
@@ -240,7 +255,34 @@
          this.log(`User: ${data.text}`);
        }
      },
-      onBotTranscript: (data) => this.log(`Bot: ${data.text}`),
+      onBotTranscript: (data) => {
+        // If no current element exists, create one (fallback in case BOT_LLM_STARTED didn't fire)
+        if (!this.currentBotMessageElement) {
+          this.currentBotMessage = '';
+          this.currentBotMessageElement = this.createBotMessageElement('Bot: ');
+        }
+
+        // Accumulate the text
+        this.currentBotMessage += data.text;
+
+        // Update the current element
+        if (this.currentBotMessageElement) {
+          const timestamp = new Date().toISOString();
+          this.currentBotMessageElement.textContent = `${timestamp} - Bot: ${this.currentBotMessage}`;
+          this.debugLog?.scrollTo({ top: this.debugLog.scrollHeight, behavior: 'smooth' });
+        }
+      },
+      onBotLlmStarted: () => {
+        // Only create a new bot message element if the current one has content
+        if (this.currentBotMessage !== '') {
+          this.currentBotMessage = '';
+          this.currentBotMessageElement = this.createBotMessageElement('Bot: ');
+        } else if (!this.currentBotMessageElement) {
+          // Create element if it doesn't exist at all
+          this.currentBotMessage = '';
+          this.currentBotMessageElement = this.createBotMessageElement('Bot: ');
+        }
+      },
      onMessageError: (error) => console.error('Message error:', error),
      onError: (error) => console.error('Error:', error),
    },
@@ -313,6 +355,10 @@
    // Stop volume monitoring
    this.stopVolumeMonitoring();
 
+    // Clean up bot message state
+    this.currentBotMessage = '';
+    this.currentBotMessageElement = null;
+
    // Reset mute state
    this.isMuted = false;
 
@@ -357,6 +403,10 @@
    // Stop volume monitoring
    this.stopVolumeMonitoring();
 
+    // Clean up bot message state
+    this.currentBotMessage = '';
+    this.currentBotMessageElement = null;
+
    // Reset mute state
    this.isMuted = false;
 
```

examples/voice_agent/server/bot_websocket_server.py

Lines changed: 14 additions & 11 deletions
```diff
@@ -26,21 +26,21 @@
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
-from pipecat.processors.frameworks.rtvi import RTVIAction, RTVIConfig, RTVIProcessor
+from pipecat.processors.frameworks.rtvi import RTVIAction, RTVIConfig, RTVIObserverParams, RTVIProcessor
 from pipecat.serializers.protobuf import ProtobufFrameSerializer
 
 from nemo.agents.voice_agent.pipecat.processors.frameworks.rtvi import RTVIObserver
 from nemo.agents.voice_agent.pipecat.services.nemo.diar import NemoDiarService
 from nemo.agents.voice_agent.pipecat.services.nemo.llm import get_llm_service_from_config
-from nemo.agents.voice_agent.pipecat.services.nemo.stt import NemoSTTService
+from nemo.agents.voice_agent.pipecat.services.nemo.stt import ASR_EOU_MODELS, NemoSTTService
 from nemo.agents.voice_agent.pipecat.services.nemo.tts import KokoroTTSService, NeMoFastPitchHiFiGANTTSService
 from nemo.agents.voice_agent.pipecat.services.nemo.turn_taking import NeMoTurnTakingService
 from nemo.agents.voice_agent.pipecat.transports.network.websocket_server import (
     WebsocketServerParams,
     WebsocketServerTransport,
 )
 from nemo.agents.voice_agent.pipecat.utils.text.simple_text_aggregator import SimpleSegmentedTextAggregator
-from nemo.agents.voice_agent.pipecat.utils.tool_calling.basic_tools import get_city_weather
+from nemo.agents.voice_agent.pipecat.utils.tool_calling.basic_tools import tool_get_city_weather
 from nemo.agents.voice_agent.pipecat.utils.tool_calling.mixins import register_direct_tools_to_llm
 from nemo.agents.voice_agent.utils.config_manager import ConfigManager
 
@@ -83,7 +83,7 @@ def setup_logging():
 vad_params = config_manager.get_vad_params()
 
 # STT configuration
-STT_MODEL_PATH = config_manager.STT_MODEL_PATH
+STT_MODEL = config_manager.STT_MODEL
 STT_DEVICE = config_manager.STT_DEVICE
 stt_params = config_manager.get_stt_params()
 
@@ -137,6 +137,9 @@ async def run_bot_websocket_server(host: str = "0.0.0.0", port: int = 8765):
     )
     logger.info("VAD analyzer initialized")
 
+    has_turn_taking = True if STT_MODEL in ASR_EOU_MODELS else False
+    logger.info(f"Setting STT service has_turn_taking to `{has_turn_taking}` based on model name: `{STT_MODEL}`")
+
     ws_transport = WebsocketServerTransport(
         params=WebsocketServerParams(
             serializer=ProtobufFrameSerializer(),
@@ -146,8 +149,8 @@
             vad_analyzer=vad_analyzer,
             session_timeout=None,  # Disable session timeout
             audio_in_sample_rate=SAMPLE_RATE,
-            can_create_user_frames=TURN_TAKING_BACKCHANNEL_PHRASES_PATH
-            is None,  # if backchannel phrases are disabled, we can use VAD to interrupt the bot immediately
+            can_create_user_frames=TURN_TAKING_BACKCHANNEL_PHRASES_PATH is None
+            or not has_turn_taking,  # if backchannel phrases are disabled, we can use VAD to interrupt the bot immediately
             audio_out_10ms_chunks=TRANSPORT_AUDIO_OUT_10MS_CHUNKS,
         ),
         host=host,
@@ -157,12 +160,12 @@
     logger.info("Initializing STT service...")
 
     stt = NemoSTTService(
-        model=STT_MODEL_PATH,
+        model=STT_MODEL,
         device=STT_DEVICE,
         params=stt_params,
         sample_rate=SAMPLE_RATE,
         audio_passthrough=True,
-        has_turn_taking=True,
+        has_turn_taking=has_turn_taking,
         backend="legacy",
         decoder_type="rnnt",
     )
@@ -229,7 +232,7 @@
 
     if server_config.llm.get("enable_tool_calling", False):
         logger.info("Tools calling for LLM is enabled by config, registering tools...")
-        register_direct_tools_to_llm(llm=llm, context=context, tool_mixins=[tts], tools=[get_city_weather])
+        register_direct_tools_to_llm(llm=llm, context=context, tool_mixins=[tts], tools=[tool_get_city_weather])
     else:
         logger.info("Tools calling for LLM is disabled by config, skipping tool registration.")
 
@@ -288,7 +291,7 @@ async def reset_context_handler(rtvi_processor: RTVIProcessor, service: str, arg
 
     pipeline = Pipeline(pipeline)
 
-    rtvi_text_aggregator = SimpleSegmentedTextAggregator(punctuation_marks=".!?\n")
+    rtvi_params = RTVIObserverParams(bot_llm_enabled=False)
     task = PipelineTask(
         pipeline,
         params=PipelineParams(
@@ -299,7 +302,7 @@
             report_only_initial_ttfb=True,
             idle_timeout=None,  # Disable idle timeout
         ),
-        observers=[RTVIObserver(rtvi, text_aggregator=rtvi_text_aggregator)],
+        observers=[RTVIObserver(rtvi, params=rtvi_params)],
         idle_timeout_secs=None,
         cancel_on_idle_timeout=False,
     )
```
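A condensed sketch of the gating this file now performs, assuming `ASR_EOU_MODELS` is a collection of EOU-capable model names exported by the `stt` module (the stand-in value below is illustrative, not the real list):

```python
from typing import Optional, Tuple

# Illustrative stand-in; the real collection is exported by
# nemo.agents.voice_agent.pipecat.services.nemo.stt as ASR_EOU_MODELS.
ASR_EOU_MODELS = {"nvidia/parakeet_realtime_eou_120m-v1"}


def resolve_turn_taking(stt_model: str, backchannel_phrases_path: Optional[str]) -> Tuple[bool, bool]:
    """Mirror the logic this commit adds to run_bot_websocket_server."""
    # EOU-capable models let the STT service drive turn taking itself.
    has_turn_taking = stt_model in ASR_EOU_MODELS
    # Without backchannel phrases, or without EOU-based turn taking, the
    # transport lets VAD create user frames and interrupt the bot immediately.
    can_create_user_frames = backchannel_phrases_path is None or not has_turn_taking
    return has_turn_taking, can_create_user_frames


# e.g. resolve_turn_taking("nvidia/nemotron-speech-streaming-en-0.6b", "phrases.txt")
# -> (False, True): a non-EOU model falls back to VAD-driven interruption.
```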

examples/voice_agent/server/server_configs/default.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -46,7 +46,7 @@ llm:
   enable_reasoning: false # it's best to turn off reasoning for the lowest latency; setting it to True will use the same config ending with `_think.yaml` instead
   # `system_prompt` is used as the system prompt to the LLM; please refer to each LLM's webpage for special functions like enabling/disabling thinking
   # system_prompt: /path/to/prompt.txt # or use the path to a txt file that contains a long prompt, for example `../example_prompts/fast_bite.txt`
-  system_prompt: "You are a helpful AI agent named Lisa. Start by greeting the user warmly and introducing yourself within one sentence. Your answer should be concise and to the point. You might also see speaker tags (<speaker_0>, <speaker_1>, etc.) in the user context. You should respond to the user based on the speaker tag and the context of that speaker. Do not include the speaker tags in your response, use them only to identify the speaker. Do not include any emoji in response."
+  system_prompt: "You are a helpful AI agent named Lisa. Start by greeting the user warmly and introducing yourself within one sentence. Your answer should be concise and to the point. You might also see speaker tags (<speaker_0>, <speaker_1>, etc.) in the user context. You should respond to the user based on the speaker tag and the context of that speaker. Do not include the speaker tags in your response, use them only to identify the speaker. Avoid using emoji in your response."
 
 tts:
   type: kokoro # choices in ['nemo', 'kokoro']
```

examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v2.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ type: vllm # Overwrite to vllm to enable tool calling, the HF backend currently
 dtype: bfloat16 # torch.dtype for LLM
 device: "cuda"
 system_role: "system" # role for system prompt, set it to `user` for models that do not support system prompt
-system_prompt_suffix: "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools. /no_think" # a string that would be appended to the system prompt, `/think` and `/no_think` are used to enable/disable thinking
+system_prompt_suffix: "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools. If you are provided with a set of tools, use them only when needed, do not limit your capabilities to the scope of the tools. If the purpose of a tool matches well with a user's request, always try to call the tool first. Conversation history should not limit your behavior on whether you can use tools. You must answer questions not related to the tools. /no_think" # a string that would be appended to the system prompt, `/think` and `/no_think` are used to enable/disable thinking
 enable_tool_calling: True # set to True since the vllm config below supports tool calling
 
 ##############################
```
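A small sketch of how these pieces plausibly combine, based only on the config comment "a string that would be appended to the system prompt" (the actual ConfigManager/server code may assemble the prompt differently):

```python
# Assumed assembly of the final system prompt; the trailing `/no_think` in the
# suffix is what disables thinking for models that honor that toggle.
def build_system_prompt(system_prompt: str, system_prompt_suffix: str) -> str:
    """Append the configured suffix (e.g. tool-usage rules + /no_think) to the prompt."""
    return f"{system_prompt.rstrip()} {system_prompt_suffix}".strip()


print(build_system_prompt(
    "You are a helpful AI agent named Lisa.",
    "Use tools only when needed. /no_think",
))
```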

examples/voice_agent/server/server_configs/tts_configs/kokoro_82M.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -12,5 +12,4 @@ extra_separator: # a list of additional punctuations to chunk LLM response into
   - "?"
   - "!"
   - ";"
-  - ":"
 think_tokens: ["<think>", "</think>"] # specify them to avoid TTS for thinking process, set to `null` to allow thinking out loud
```

examples/voice_agent/server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -11,5 +11,4 @@ extra_separator: # a list of additional punctuations to chunk LLM response into
   - "?"
   - "!"
   - ";"
-  - ":"
 think_tokens: ["<think>", "</think>"] # specify them to avoid TTS for thinking process, set to `null` to allow thinking out loud
```
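Both TTS configs drop ":" from `extra_separator`, which controls where streamed LLM text is cut into TTS segments. A toy chunker below shows the effect of that separator list; it is illustrative only (the server uses `SimpleSegmentedTextAggregator`, whose implementation may differ), and the colon example is an assumed rationale for the change:

```python
# Illustrative only -- demonstrates how a separator list chunks streamed LLM
# text into TTS-sized segments, and why ":" is an awkward separator.
def chunk_for_tts(text: str, separators: str = ".?!;") -> list:
    """Split text into segments at separator characters."""
    chunks, current = [], []
    for ch in text:
        current.append(ch)
        if ch in separators:
            chunks.append("".join(current).strip())
            current = []
    if current:  # trailing partial segment (would stay buffered while streaming)
        chunks.append("".join(current).strip())
    return chunks


# With ":" removed from the separator list, "The time is 3:30." is spoken as
# one segment instead of being split after "3:".
print(chunk_for_tts("The time is 3:30. See you then!"))
# -> ['The time is 3:30.', 'See you then!']
```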

examples/voice_agent/tests/test_config_manager.py

Lines changed: 7 additions & 4 deletions
```diff
@@ -84,14 +84,17 @@ def test_configure_stt_nemo_model(self, voice_agent_server_base_path):
         # Create necessary files
         config_manager = ConfigManager(voice_agent_server_base_path)
 
-        assert "stt_en_fastconformer" in config_manager.STT_MODEL_PATH
+        # STT_MODEL can be either a fastconformer model or an EOU model (e.g., parakeet_realtime_eou)
+        assert (
+            "stt_en_fastconformer" in config_manager.STT_MODEL or "parakeet_realtime_eou" in config_manager.STT_MODEL
+        )
         assert isinstance(config_manager.stt_params, NeMoSTTInputParams)
 
     @pytest.mark.unit
     def test_configure_stt_with_model_config(self, voice_agent_server_base_path):
         """Test STT configuration with custom model config."""
         config_manager = ConfigManager(voice_agent_server_base_path)
-        assert hasattr(config_manager, "STT_MODEL_PATH")
+        assert hasattr(config_manager, "STT_MODEL")
 
     @pytest.mark.unit
     def test_configure_diarization(self, voice_agent_server_base_path):
@@ -203,8 +206,8 @@ def test_get_vad_params(self, voice_agent_server_base_path):
 
         assert isinstance(vad_params, VADParams)
         assert isinstance(vad_params.confidence, float) and 0.0 <= vad_params.confidence <= 1.0
-        assert isinstance(vad_params.start_secs, float) and 0.0 <= vad_params.start_secs <= 1.0
-        assert isinstance(vad_params.stop_secs, float) and 0.0 <= vad_params.stop_secs <= 1.0
+        assert isinstance(vad_params.start_secs, float) and vad_params.start_secs >= 0.0
+        assert isinstance(vad_params.stop_secs, float) and vad_params.stop_secs >= 0.0
         assert isinstance(vad_params.min_volume, float) and 0.0 <= vad_params.min_volume <= 1.0
 
     @pytest.mark.unit
```
