Commit 559cc97

[voice agent] make parakeet-eou model default stt (#15069)
* make eou model default stt
* fix typo
* clean up doc

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: He Huang (Steve) <[email protected]>
1 parent 26874a5 commit 559cc97

File tree: 4 files changed (+13, -14 lines)

examples/voice_agent/README.md (9 additions, 8 deletions)
```diff
@@ -9,20 +9,20 @@ As of now, we only support English input and output, but more languages will be
 - Open-source, local deployment, and flexible customization.
 - Allow users to talk to most LLMs from HuggingFace with configurable prompts.
-- Streaming speech recognition with low latency.
+- Streaming speech recognition with low latency and end-of-utterance detection.
 - Low latency TTS for fast audio response generation.
 - Speaker diarization up to 4 speakers in different user turns.
 - WebSocket server for easy deployment.
 
 ## 💡 Upcoming Next
-- Joint ASR and EOU detection in the same model.
 - Accuracy and robustness ASR model improvements.
 - Better TTS with more natural voice (e.g., [Magpie-TTS](https://build.nvidia.com/nvidia/magpie-tts-multilingual)).
 - Combine ASR and speaker diarization model to handle overlapping speech.
 
 ## Latest Updates
+- 2025-11-14: Added support for joint ASR and EOU detection with [Parakeet-realtime-eou-120m](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) model.
 - 2025-10-10: Added support for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model.
 - 2025-10-03: Add support for serving LLM with vLLM and auto-switch between vLLM and HuggingFace, add [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) as default LLM.
 - 2025-09-05: First release of NeMo Voice Agent.
```
```diff
@@ -148,16 +148,18 @@ A lot of LLMs support thinking/reasoning mode, which is useful for complex tasks
 
 Different models may have different ways to support thinking/reasoning mode, please refer to the model's homepage for details on their thinking/reasoning mode support. Meanwhile, in many cases, they support enabling thinking/reasoning can be achieved by adding `/think` or `/no_think` to the end of the system prompt, and the thinking/reasoning content is wrapped by the tokens `["<think>", "</think>"]`. Some models may also support enabling thinking/reasoning by setting `llm.apply_chat_template_kwargs.enable_thinking=true/false` in the server config when `llm.type=hf`.
 
-If thinking/reasoning mode is enabled (e.g., in `server/server_configs/qwen3-8B_think.yaml`), the voice agnet server will print out the thinking/reasoning content so that you can see the process of the LLM thinking and still have a smooth conversation experience. The thinking/reasoning content will not go through the TTS process, so you will only hear the final answer, and this is achieved by specifying the pair of thinking tokens `tts.think_tokens=["<think>", "</think>"]` in the server config.
+If thinking/reasoning mode is enabled (e.g., in `server/server_configs/qwen3-8B_think.yaml`), the voice agent server will print out the thinking/reasoning content so that you can see the process of the LLM thinking and still have a smooth conversation experience. The thinking/reasoning content will not go through the TTS process, so you will only hear the final answer, and this is achieved by specifying the pair of thinking tokens `tts.think_tokens=["<think>", "</think>"]` in the server config.
 
 For vLLM server, if you specify `--reasoning_parser` in `vllm_server_params`, the thinking/reasoning content will be filtered out and does not show up in the output.
 
 ### 🎤 ASR
 
 We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:
-- [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms) (default)
+- [nvidia/parakeet_realtime_eou_120m-v1](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) (default)
+- [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms)
 - [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)
 
+
 ### 💬 Speaker Diarization
 
 Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn.
```
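The `tts.think_tokens=["<think>", "</think>"]` mechanism described in the hunk above keeps reasoning text out of the TTS input. A minimal sketch of how such a token pair could be used to strip thinking content before speech synthesis; `strip_think_content` is a hypothetical helper, not the actual NeMo server code:

```python
import re

def strip_think_content(text: str, think_tokens=("<think>", "</think>")) -> str:
    """Remove every <think>...</think> span so only the final answer reaches TTS.

    Hypothetical illustration of the tts.think_tokens config option; the
    real server implementation may differ.
    """
    start, end = map(re.escape, think_tokens)
    cleaned = re.sub(f"{start}.*?{end}", "", text, flags=re.DOTALL)
    # Collapse the whitespace left behind by the removed span.
    return " ".join(cleaned.split())

llm_output = "<think>The user greeted me; respond warmly.</think> Hello! How can I help?"
print(strip_think_content(llm_output))  # Hello! How can I help?
```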
```diff
@@ -173,11 +175,10 @@ Please note that in some circumstances, the diarization model might not work wel
 ### 🔉 TTS
 
 Here are the supported TTS models:
-- [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) is an NVIDIA-NeMo TTS model. It only supports English output.
-  - Please use `server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml` as the server config.
-
 - [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) is a lightweight TTS model. This model is the default speech generation backend.
   - Please use `server/server_configs/tts_configs/kokoro_82M.yaml` as the server config.
+- [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) is an NVIDIA-NeMo TTS model. It only supports English output.
+  - Please use `server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml` as the server config.
 
 We will support more TTS models in the future.
```
```diff
@@ -186,7 +187,7 @@ We will support more TTS models in the future.
 
 As the new turn-taking prediction model is not yet released, we use the VAD-based turn-taking prediction for now. You can set the `vad.stop_secs` to the desired value in `server/server_configs/default.yaml` to control the amount of silence needed to indicate the end of a user's turn.
 
-Additionally, the voice agent support ignoring back-channel phrases while the bot is talking, which it means phrases such as "uh-huh", "yeah", "okay" will not interrupt the bot while it's talking. To control the backchannel phrases to be used, you can set the `turn_taking.backchannel_phrases` in the server config to the desired list of phrases or a file path to a yaml file containing the list of phrases. By default, it will use the phrases in `server/backchannel_phrases.yaml`. Setting it to `null` will disable detecting backchannel phrases, and that the VAD will interrupt the bot immediately when the user starts speaking.
+Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which it means phrases such as "uh-huh", "yeah", "okay" will not interrupt the bot while it's talking. To control the backchannel phrases to be used, you can set the `turn_taking.backchannel_phrases` in the server config to the desired list of phrases or a file path to a yaml file containing the list of phrases. By default, it will use the phrases in `server/backchannel_phrases.yaml`. Setting it to `null` will disable detecting backchannel phrases, and that the VAD will interrupt the bot immediately when the user starts speaking.
 
 
 ## 📝 Notes & FAQ
```
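The back-channel filtering described in the hunk above can be sketched as follows. The phrase list and normalization here are illustrative assumptions (the real list lives in `server/backchannel_phrases.yaml`), and `should_interrupt_bot` is a hypothetical helper, not the agent's actual turn-taking code:

```python
# Assumed sample of backchannel phrases; the real set is configured via
# turn_taking.backchannel_phrases in the server config.
BACKCHANNEL_PHRASES = {"uh-huh", "yeah", "okay", "mm-hmm", "right"}

def is_backchannel(transcript: str) -> bool:
    # Normalize case and trailing punctuation before matching.
    normalized = transcript.lower().strip().rstrip(".!?,")
    return normalized in BACKCHANNEL_PHRASES

def should_interrupt_bot(transcript: str, bot_speaking: bool) -> bool:
    """While the bot is talking, ignore pure back-channel phrases;
    anything else interrupts it immediately (as VAD alone would)."""
    if bot_speaking and is_backchannel(transcript):
        return False
    return True

print(should_interrupt_bot("uh-huh", bot_speaking=True))      # False
print(should_interrupt_bot("wait, stop", bot_speaking=True))  # True
```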

examples/voice_agent/server/server_configs/default.yaml (0 additions, 1 deletion)
```diff
@@ -15,7 +15,6 @@ vad:
 
 stt:
   type: nemo # choices in ['nemo'] currently only NeMo is supported
-  # model: "stt_en_fastconformer_hybrid_large_streaming_80ms"
   model: "nvidia/parakeet_realtime_eou_120m-v1"
   model_config: "./server_configs/stt_configs/nemo_cache_aware_streaming.yaml"
   device: "cuda"
```

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py (2 additions, 3 deletions)
```diff
@@ -42,7 +42,7 @@ class ASRResult:
 class NemoStreamingASRService:
     def __init__(
         self,
-        model: str = "nvidia/stt_en_fastconformer_hybrid_large_streaming_multi",
+        model: str = "nvidia/parakeet_realtime_eou_120m-v1",
         att_context_size: List[int] = [70, 1],
         device: str = "cuda",
         eou_string: str = "<EOU>",
@@ -72,8 +72,6 @@ def __init__(
         self.blank_id = self.get_blank_id()
         self.chunk_size_in_secs = chunk_size_in_secs
 
-        print("NemoLegacyASRService initialized")
-
         assert len(self.att_context_size) == 2, "Att context size must be a list of two integers"
         assert (
             self.att_context_size[0] >= 0
@@ -112,6 +110,7 @@ def __init__(
         self._reset_cache()
         self._previous_hypotheses = self._get_blank_hypothesis()
         self._last_transcript_timestamp = time.time()
+        print(f"NemoStreamingASRService initialized with model `{model}` on device `{self.device}`")
 
     def _reset_cache(self):
         (
```
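The constructor above takes `eou_string: str = "<EOU>"`, the end-of-utterance marker emitted by the new default model. A minimal, hypothetical sketch of stripping such a marker from a streaming transcript; this is not the service's actual decoding logic, just an illustration of the pattern:

```python
def split_eou(transcript: str, eou_string: str = "<EOU>"):
    """Return (clean_text, is_final).

    If the transcript ends with the EOU marker, remove it and flag the
    utterance as finished. Hypothetical helper based on the eou_string
    constructor argument; NemoStreamingASRService may handle this differently.
    """
    if transcript.endswith(eou_string):
        return transcript[: -len(eou_string)].rstrip(), True
    return transcript, False

text, final = split_eou("how is the weather today <EOU>")
print(text, final)  # how is the weather today True
```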

nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py (2 additions, 2 deletions)
```diff
@@ -30,7 +30,7 @@
 class DiarizationConfig:
     """Diarization configuration parameters for inference."""
 
-    model_path: str = "nvidia/diar_sortformer_4spk-v1"
+    model_path: str = "nvidia/diar_streaming_sortformer_4spk-v2"
     device: str = "cuda"
 
     log: bool = False  # If True, log will be printed
@@ -81,7 +81,7 @@ def __init__(
         self.streaming_state = self.init_streaming_state(batch_size=1)
         self.total_preds = torch.zeros((1, 0, self.max_num_speakers), device=self.diarizer.device)
 
-        print("NeMoLegacyDiarService initialized")
+        print(f"NeMoStreamingDiarService initialized with model `{model}` on device `{self.device}`")
 
     def build_diarizer(self):
         if self.cfg.model_path.endswith(".nemo"):
```
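The hunk above updates a dataclass default (`model_path`), which callers pick up automatically unless they override it. A stripped-down sketch of that pattern, mirroring only the fields shown in the diff; the real `DiarizationConfig` has more fields:

```python
from dataclasses import dataclass

@dataclass
class DiarizationConfig:
    """Minimal mirror of the diff's DiarizationConfig, for illustration only."""
    model_path: str = "nvidia/diar_streaming_sortformer_4spk-v2"
    device: str = "cuda"
    log: bool = False  # If True, log will be printed

# The new default applies unless overridden, e.g. to load a local checkpoint
# (the diff's build_diarizer branches on a ".nemo" suffix):
cfg = DiarizationConfig(model_path="local_model.nemo", device="cpu")
print(cfg.model_path.endswith(".nemo"))  # True
```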
