Hi NVIDIA NeMo team,
I'm working on a real-time / livestream-capable ASR system using NVIDIA NeMo, and I would like both:
- help debugging a concrete training crash, and
- a sanity check on whether our overall configuration and training approach is aligned with NeMo best practices.
This is a research / prototype setup (no enterprise support, no NGC enterprise subscription).
────────────────────────
Goal / Use Case
────────────────────────
Our long-term goal is a low-latency streaming ASR system with fine-grained feedback:
- Streaming-capable ASR (low latency)
- RNNT + CTC hybrid model
- Offline evaluation (WER, alignment) — NOT during training_step
- Forced alignment (CTM) for fine-grained error detection
- Future live feedback during recitation / speech (phoneme / tajweed-style errors)
The current focus is only on stable training.
────────────────────────
Model & Software Stack
────────────────────────
- NeMo version: 2.x
- PyTorch Lightning: 2.x
- PyTorch: 2.x
- Model: EncDecHybridRNNTCTCBPEModel
- Precision: AMP (16-mixed)
- Strategy: DDP (the issue reproduces with both 1 GPU and 2 GPUs)
- Decoding / WER:
  - No decoding during training_step
  - No WER metrics in training_step
  - Offline evaluation only (post-checkpoint)
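For reference, the trainer is configured roughly as follows (a minimal sketch; device count, epochs, and the restore path are placeholders):

```python
import lightning.pytorch as pl
import nemo.collections.asr as nemo_asr

# Restore the hybrid model from a local checkpoint (placeholder path).
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    restore_path="checkpoints/hybrid_rnnt_ctc_bpe.nemo"
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,              # the crash reproduces with devices=1 as well
    strategy="ddp",
    precision="16-mixed",   # AMP
    max_epochs=100,         # placeholder
)
trainer.fit(model)
```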
────────────────────────
Data
────────────────────────
- Custom ASR dataset
- 100+ hours of audio
- SentencePiece BPE tokenizer
- Classical / Quranic Arabic-style speech (though the issue appears to be framework-level, not language-specific)
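For context, NeMo ASR consumes JSON-lines manifests; our entries are generated roughly like this (path and transcript are placeholders):

```python
import json

# One JSON object per line, following the standard NeMo ASR manifest schema.
entry = {
    "audio_filepath": "/data/audio/utt_0001.wav",  # placeholder path
    "duration": 3.2,                               # clip length in seconds
    "text": "transcript of the utterance",         # placeholder transcript
}
with open("train_manifest.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```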
────────────────────────
What We Are Trying to Do (Training Logic)
────────────────────────
We are overriding training_step in order to do the following (a condensed sketch of the override appears after the next list):
- call the model forward pass explicitly,
- compute the RNNT loss and the CTC loss,
- combine them into a single total loss, and
- avoid any decoding or metric computation during training.
We intentionally do NOT:
- compute WER during training,
- call greedy / beam decoding, or
- use TorchMetrics all_gather during training.
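Here is the condensed override (a sketch based on our reading of the NeMo 2.x hybrid model source; the class name is ours, and the attributes self.decoder, self.joint, self.loss, self.ctc_decoder, self.ctc_loss, and self.ctc_loss_weight are our assumptions about the model internals, which may differ between versions):

```python
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel


class RecitationASRModel(EncDecHybridRNNTCTCBPEModel):  # hypothetical subclass
    def training_step(self, batch, batch_idx):
        # Standard NeMo ASR batch: raw audio, lengths, BPE targets, target lengths.
        signal, signal_len, transcript, transcript_len = batch

        # Encoder-only forward pass; no decoding, no WER.
        # NOTE: the model-level typecheck expects `input_signal`, not `audio_signal`.
        encoded, encoded_len = self.forward(
            input_signal=signal, input_signal_length=signal_len
        )

        # RNNT branch: prediction network + joint, then transducer loss.
        decoder_out, target_len, _ = self.decoder(
            targets=transcript, target_length=transcript_len
        )
        joint_out = self.joint(encoder_outputs=encoded, decoder_outputs=decoder_out)
        rnnt_loss = self.loss(
            log_probs=joint_out,
            targets=transcript,
            input_lengths=encoded_len,
            target_lengths=target_len,
        )

        # CTC branch (assumed attribute names on the hybrid model).
        ctc_log_probs = self.ctc_decoder(encoder_output=encoded)
        ctc_loss = self.ctc_loss(
            log_probs=ctc_log_probs,
            targets=transcript,
            input_lengths=encoded_len,
            target_lengths=transcript_len,
        )

        # Weighted combination, mirroring the built-in hybrid loss weighting.
        loss = (1.0 - self.ctc_loss_weight) * rnnt_loss \
            + self.ctc_loss_weight * ctc_loss
        self.log("train_loss", loss, sync_dist=True)
        return loss
```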
────────────────────────
Observed Problem
────────────────────────
Training crashes immediately in training_step with an input signature mismatch.
Error (from both rank 0 and rank 1):

```
RuntimeError: Model forward failed with all known kwarg sets.
Last error:
Input argument audio_signal has no corresponding input_type match.
Existing input_types = dict_keys([
    'input_signal',
    'input_signal_length',
    'processed_signal',
    'processed_signal_length'
])
```
We verified the model input signature via:

```python
print(model.input_types)
```

which returns:

```
input_signal
input_signal_length
processed_signal
processed_signal_length
```
However, somewhere in the forward path (internally, or via our override) the model still receives audio_signal, which triggers the mismatch.
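To make the mismatch concrete, this is our current understanding of the failing vs. the working call (the audio_signal kwarg is what we suspect our wrapper passes; as far as we can tell, audio_signal / length are encoder-level input names, e.g. for ConformerEncoder, not model-level ones):

```python
# Fails: `audio_signal` / `length` appear to be the *encoder's* input names,
# so the model-level typecheck rejects them against input_types.
encoded, encoded_len = model.forward(
    audio_signal=signal, length=signal_len
)

# Works: the model-level forward expects `input_signal` / `input_signal_length`
# for raw audio, or `processed_signal` / `processed_signal_length` for
# precomputed features.
encoded, encoded_len = model.forward(
    input_signal=signal, input_signal_length=signal_len
)
```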
────────────────────────
Why We Are Unsure
────────────────────────
We are unsure about the correct and recommended way to:
- override training_step safely for EncDecHybridRNNTCTCBPEModel,
- call the model forward pass without triggering decoding or metrics,
- pass raw audio vs. preprocessed signals in NeMo 2.x, and
- align custom training loops with NeMo's internal expectations.
We would also appreciate confirmation on whether:
- using EncDecHybridRNNTCTCBPEModel for streaming + offline alignment is appropriate,
- avoiding WER/decoding during training is the correct approach (a sketch of our planned offline evaluation follows this list), and
- our overall setup is aligned with how NeMo is intended to be used.
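For context, the post-checkpoint evaluation we have in mind looks roughly like this (a sketch; the transcribe() signature and return type vary across NeMo versions, and the file list and references are placeholders):

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder evaluation set.
audio_files = ["eval/utt_0001.wav", "eval/utt_0002.wav"]
references = ["first reference transcript", "second reference transcript"]

# transcribe() may return plain strings or Hypothesis objects depending on
# the NeMo version and decoding config; normalize to strings either way.
raw_hyps = model.transcribe(audio_files)
hypotheses = [h.text if hasattr(h, "text") else h for h in raw_hyps]

print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```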
────────────────────────
What We Are Asking
────────────────────────
1. What is the correct forward-call pattern for EncDecHybridRNNTCTCBPEModel in NeMo 2.x when overriding training_step?
2. Should raw audio always be passed as input_signal, and is audio_signal ever valid in this context?
3. Is overriding training_step recommended at all for this model, or should we rely strictly on the built-in loss computation?
4. Are we on the right architectural path for a streaming RNNT + CTC system with offline alignment-based evaluation?
Any guidance, pointers to examples, or confirmation would be greatly appreciated.
Thank you for your time and for maintaining NeMo.