As of now, we only support English input and output, but more languages will be supported in the future.

- Open-source, local deployment, and flexible customization.
- Allow users to talk to most LLMs from HuggingFace with configurable prompts.
- Streaming speech recognition with low latency and end-of-utterance detection.
- Low-latency TTS for fast audio response generation.
- Speaker diarization for up to 4 speakers across different user turns.
- WebSocket server for easy deployment.

## 💡 Upcoming Next

- Improvements to ASR model accuracy and robustness.
- Better TTS with more natural voices (e.g., [Magpie-TTS](https://build.nvidia.com/nvidia/magpie-tts-multilingual)).
- Combine ASR and speaker diarization models to handle overlapping speech.

## Latest Updates

- 2025-11-14: Added support for joint ASR and EOU detection with the [Parakeet-realtime-eou-120m](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) model.
- 2025-10-10: Added support for the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) TTS model.
- 2025-10-03: Added support for serving the LLM with vLLM, with automatic switching between vLLM and HuggingFace; added [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) as the default LLM.
- 2025-09-05: First release of NeMo Voice Agent.

Many LLMs support thinking/reasoning mode, which is useful for complex tasks.

Different models may support thinking/reasoning mode in different ways; please refer to the model's homepage for details. In many cases, thinking/reasoning can be toggled by adding `/think` or `/no_think` to the end of the system prompt, and the thinking/reasoning content is wrapped in the tokens `["<think>", "</think>"]`. Some models also support enabling thinking/reasoning by setting `llm.apply_chat_template_kwargs.enable_thinking=true/false` in the server config when `llm.type=hf`.
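
For example, enabling thinking for a HuggingFace-served model could look like the following excerpt of a server config. This is a minimal sketch: only `llm.type` and `llm.apply_chat_template_kwargs.enable_thinking` are taken from this README, and the commented-out prompt key is a hypothetical illustration.

```yaml
llm:
  type: hf
  apply_chat_template_kwargs:
    enable_thinking: true   # set to false to turn thinking/reasoning off
  # For models that toggle thinking via the prompt instead, append /think or
  # /no_think to the system prompt (key name below is hypothetical):
  # system_prompt: "You are a helpful voice assistant. /no_think"
```
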
If thinking/reasoning mode is enabled (e.g., in `server/server_configs/qwen3-8B_think.yaml`), the voice agent server will print out the thinking/reasoning content so that you can see the LLM's thinking process while still having a smooth conversation. The thinking/reasoning content does not go through the TTS process, so you will only hear the final answer; this is achieved by specifying the pair of thinking tokens `tts.think_tokens=["<think>", "</think>"]` in the server config.
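
The matching TTS-side setting, as a small sketch (both key names appear above; the yaml layout is assumed):

```yaml
tts:
  # Text wrapped in this token pair is printed by the server but skipped by
  # TTS, so only the final answer is spoken.
  think_tokens: ["<think>", "</think>"]
```
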
For the vLLM server, if you specify `--reasoning_parser` in `vllm_server_params`, the thinking/reasoning content will be filtered out and will not show up in the output.
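
As a hedged sketch, this might be wired up as follows; the exact structure of `vllm_server_params`, the `llm.type` value, and the parser name are assumptions, so pick the parser matching your model from the vLLM documentation:

```yaml
llm:
  type: vllm  # assumption: vLLM counterpart of llm.type=hf
  vllm_server_params:
    reasoning_parser: qwen3  # assumption; other parsers (e.g., deepseek_r1) exist
```
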
### 🎤 ASR
We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:

Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn.

Please note that in some circumstances, the diarization model might not work well.
### 🔉 TTS

Here are the supported TTS models:

- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) is a lightweight TTS model. This model is the default speech generation backend.
  - Please use `server/server_configs/tts_configs/kokoro_82M.yaml` as the server config.
- [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) is an NVIDIA NeMo TTS model. It only supports English output.
  - Please use `server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml` as the server config.

We will support more TTS models in the future.
As the new turn-taking prediction model is not yet released, we use VAD-based turn-taking prediction for now. You can set `vad.stop_secs` to the desired value in `server/server_configs/default.yaml` to control the amount of silence needed to indicate the end of a user's turn.
Additionally, the voice agent supports ignoring back-channel phrases while the bot is talking, which means phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in the server config to the desired list of phrases or to the path of a yaml file containing the list. By default, it uses the phrases in `server/backchannel_phrases.yaml`. Setting it to `null` disables backchannel detection, so the VAD will interrupt the bot immediately when the user starts speaking.
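
Putting the turn-taking options together, a sketch of the relevant part of `server/server_configs/default.yaml` (the numeric value and the inline phrase list are illustrative):

```yaml
vad:
  stop_secs: 0.8  # illustrative: seconds of silence that end the user's turn

turn_taking:
  # Either an inline list of phrases to ignore while the bot is talking,
  # a path to a yaml file with the list (default: server/backchannel_phrases.yaml),
  # or null to let any user speech interrupt the bot immediately.
  backchannel_phrases: ["uh-huh", "yeah", "okay"]
```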