Summary
I am working on an Arabic ASR system for Quran recitation (tajweed, formal MSA Arabic).
The model works correctly when used directly in NeMo, but fails after conversion and deployment in NVIDIA Riva, where it produces empty outputs or only very short tokens.
I am trying to determine:
1. Which ASR model architecture is correct for Quran recitation and streaming
2. Whether my chosen model is supported for Riva streaming
3. What exact configuration (model type + riva-build flags) is required
4. Why NeMo inference works but Riva inference does not
⸻
My Goal
• Language: Arabic (ar-AR)
• Domain: Quran recitation
• Use case: Streaming ASR (low latency)
• Output: Full verse-level transcription, not partial tokens
• Deployment target: NVIDIA Riva
⸻
What Works
• Tokenizer is correct and verified (see the sketch after this list):
  • SentencePiece tokenizer
  • Arabic text round-trip works (text → ids → text)
• Model inference in NeMo works correctly:
  • Full Arabic sentences are decoded
  • WER reported in the NeMo training logs is correct
• Model fine-tuning completed successfully
• The .nemo model loads without error
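For context, this is roughly how I run both checks. A minimal sketch in Python; the model and audio file names are illustrative:

# Sketch of the tokenizer round-trip and direct NeMo inference checks.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("Speech_To_Text_Finetuning.nemo")

# Tokenizer round-trip: text -> ids -> text
text = "بسم الله الرحمن الرحيم"
ids = model.tokenizer.text_to_ids(text)
print(model.tokenizer.ids_to_text(ids))  # matches the input text

# Direct NeMo inference on a 16 kHz mono WAV
print(model.transcribe(["sample_verse.wav"]))  # full Arabic sentence is returned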
⸻
What Fails
After converting the same model to Riva and deploying:
• Streaming and offline Riva pipelines return:
  • Empty transcripts
  • Or a single repeated token (e.g. “وَ”)
• No runtime crash
• Riva server starts successfully
• Models appear loaded, but inference output is unusable
⸻
Model Details
• Model type: EncDecHybridRNNTCTCBPEModel (loading/decoder-switch sketch below)
• Encoder: Conformer / FastConformer
• Decoder: RNNT + CTC
• Tokenizer: SentencePiece (1024 vocab)
• Language: Arabic (ar-AR)
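In case it matters for the answer (for example, if only the CTC branch is viable for Riva streaming), this is how I can switch the hybrid model between its two decoders in NeMo before export. A minimal sketch, assuming the decoder_type argument of change_decoding_strategy available in recent NeMo versions; paths are illustrative:

# Sketch: exercising the RNNT and CTC branches of the hybrid model in NeMo.
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

model = EncDecHybridRNNTCTCBPEModel.restore_from("Speech_To_Text_Finetuning.nemo")

# Decode with the CTC branch instead of the default RNNT branch
model.change_decoding_strategy(decoder_type="ctc")
print(model.transcribe(["sample_verse.wav"]))

# Switch back to the RNNT branch
model.change_decoding_strategy(decoder_type="rnnt")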
⸻
Conversion & Deployment Steps Used
NeMo → Riva
nemo2riva \
  --out Speech_To_Text_Finetuning.riva \
  --max-dim 5000 \
  --max-batch 4 \
  --device cuda \
  Speech_To_Text_Finetuning.nemo
Riva Build (Streaming)
riva-build speech_recognition \
  asr_streaming.rmir \
  Speech_To_Text_Finetuning.riva \
  --streaming=true \
  --decoder_type=greedy \
  --ms_per_timestep=40 \
  --chunk_size=4.8 \
  --left_padding_size=1.6 \
  --right_padding_size=1.6 \
  --max_batch_size=4 \
  --featurizer.use_utterance_norm_params=False \
  --featurizer.precalc_norm_time_steps=0 \
  --featurizer.precalc_norm_params=False \
  --language_code=ar-AR
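Deployment then follows the standard riva-deploy step, roughly as below (the model repository path is illustrative):

riva-deploy -f asr_streaming.rmir /data/models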
The same issue occurs with the offline pipeline.
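For reference, this is roughly how I observe the empty transcripts from the deployed server. A minimal sketch using the nvidia-riva-client Python package; the server address and audio file name are illustrative:

# Sketch: offline recognition request against the deployed Riva server.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="ar-AR",
    max_alternatives=1,
)

with open("sample_verse.wav", "rb") as f:
    audio = f.read()

response = asr.offline_recognize(audio, config)
# In my case this prints nothing, or a single short token such as "وَ".
for result in response.results:
    print(repr(result.alternatives[0].transcript))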
⸻
Known Related Threads (Symptoms Match Exactly)
• “Finetuned ASR conformer returns only empty transcripts”
• “Issue Deploying Fine-Tuned Arabic Conformer Model in NVIDIA Riva”
• “Riva providing empty transcriptions but NeMo does not”
• “Known issue with conformer models – try --nn.use_trt_fp32”
• FastConformer RNNT models reported as not officially supported for Riva streaming
⸻
Key Observations
1. NeMo works, Riva does not
2. Empty or near-empty output is a known Riva failure mode
3. Multiple threads suggest:
  • Conformer / FastConformer RNNT streaming is fragile or unsupported
  • TRT FP16 causes silent decoding failures (see the candidate build variant after this list)
4. Canary models are offline-only
5. Parakeet models are designed for streaming but have limited Arabic coverage
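If observation 3 is the cause here, the variant I am prepared to test simply adds the FP32 flag mentioned in the linked thread to the same build, with everything else unchanged:

riva-build speech_recognition \
  asr_streaming.rmir \
  Speech_To_Text_Finetuning.riva \
  --nn.use_trt_fp32 \
  [remaining flags identical to the build above]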
⸻
Questions (Core of This Issue)
1. Which ASR model architecture is officially supported for Arabic streaming ASR in Riva?
  • Conformer-CTC?
  • Citrinet?
  • Parakeet?
  • Something else?
2. Is EncDecHybridRNNTCTCBPEModel supported for streaming in Riva?
  • If not, what is the recommended alternative?
3. Is Quran recitation a valid use case for Riva streaming ASR, or is offline decoding required?
4. Which riva-build flags are mandatory to avoid empty outputs?
  • --nn.use_trt_fp32?
  • Disabling VAD?
  • Different chunk/padding constraints?
5. Is there an official reference pipeline for Arabic ASR deployment in Riva?
⸻
Environment
• OS: Ubuntu 22.04
• GPU: RTX 3060 (6GB)
• CUDA: 12.x
• NeMo: recent version
• Riva: 2.x
• Audio: 16 kHz mono WAV (format check below)
• Language code: ar-AR
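To rule out an input-format mismatch, this is the quick check I run on the test audio (file name illustrative):

ffprobe -hide_banner sample_verse.wav   # should report 16000 Hz and 1 channel (mono)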
⸻
What I Am Looking For
• A clear recommendation:
  • Correct model
  • Correct decoding mode (streaming vs offline)
  • Correct Riva configuration
• Confirmation whether my current approach is fundamentally incompatible
• A known-good Arabic Riva ASR deployment example
⸻
Thank you for your time.
I am happy to provide logs, configs, or a minimal reproduction if needed.