Hi NVIDIA NeMo team,
I'm working on a real-time / livestream-capable ASR system using NVIDIA NeMo, and I would like both:
- help debugging a concrete training crash, and
- a sanity check on whether our overall configuration and training approach is aligned with NeMo best practices.
This is a research / prototype setup (no enterprise support, no NGC enterprise subscription).
────────────────────────
Goal / Use Case
────────────────────────
Our long-term goal is a low-latency streaming ASR system with fine-grained feedback:
- Streaming-capable ASR (low latency)
- RNNT + CTC hybrid model
- Offline evaluation (WER, alignment) — NOT during training_step
- Forced alignment (CTM) for fine-grained error detection
- Future live feedback during recitation / speech (phoneme / tajweed-style errors)
The current focus is only on stable training.
────────────────────────
Model & Software Stack
────────────────────────
- NeMo version: 2.x
- PyTorch Lightning: 2.x
- PyTorch: 2.x
- Model: EncDecHybridRNNTCTCBPEModel
- Precision: AMP (16-mixed)
- Strategy: DDP (the issue reproduces with both 1 GPU and 2 GPUs)
- Decoding / WER:
  - No decoding during training_step
  - No WER metrics in training_step
  - Offline evaluation only (post-checkpoint)
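For reference, the trainer is configured roughly as follows (a minimal sketch; device count, epochs, and the restore path are placeholders):

```python
import lightning.pytorch as pl
import nemo.collections.asr as nemo_asr

# Restore the hybrid model from a local checkpoint (placeholder path).
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    restore_path="checkpoints/hybrid_rnnt_ctc_bpe.nemo"
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,              # the crash reproduces with devices=1 as well
    strategy="ddp",
    precision="16-mixed",   # AMP
    max_epochs=100,         # placeholder
)
trainer.fit(model)
```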
────────────────────────
Data
────────────────────────
- Custom ASR dataset
- 100+ hours of audio
- SentencePiece BPE tokenizer
- Classical / Quranic Arabic-style speech (though the issue appears to be framework-level, not language-specific)
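For context, NeMo ASR consumes JSON-lines manifests; our entries are generated roughly like this (path and transcript are placeholders):

```python
import json

# One JSON object per line, following the standard NeMo ASR manifest schema.
entry = {
    "audio_filepath": "/data/audio/utt_0001.wav",  # placeholder path
    "duration": 3.2,                               # clip length in seconds
    "text": "transcript of the utterance",         # placeholder transcript
}
with open("train_manifest.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```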
────────────────────────
What We Are Trying to Do (Training Logic)
────────────────────────
We are overriding training_step in order to do the following (a condensed sketch of the override appears after the next list):
- call the model forward pass explicitly,
- compute the RNNT loss and the CTC loss,
- combine them into a single total loss, and
- avoid any decoding or metric computation during training.
We intentionally do NOT:
- compute WER during training,
- call greedy / beam decoding, or
- use TorchMetrics all_gather during training.
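Here is the condensed override (a sketch based on our reading of the NeMo 2.x hybrid model source; the class name is ours, and the attributes self.decoder, self.joint, self.loss, self.ctc_decoder, self.ctc_loss, and self.ctc_loss_weight are our assumptions about the model internals, which may differ between versions):

```python
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel


class RecitationASRModel(EncDecHybridRNNTCTCBPEModel):  # hypothetical subclass
    def training_step(self, batch, batch_idx):
        # Standard NeMo ASR batch: raw audio, lengths, BPE targets, target lengths.
        signal, signal_len, transcript, transcript_len = batch

        # Encoder-only forward pass; no decoding, no WER.
        # NOTE: the model-level typecheck expects `input_signal`, not `audio_signal`.
        encoded, encoded_len = self.forward(
            input_signal=signal, input_signal_length=signal_len
        )

        # RNNT branch: prediction network + joint, then transducer loss.
        decoder_out, target_len, _ = self.decoder(
            targets=transcript, target_length=transcript_len
        )
        joint_out = self.joint(encoder_outputs=encoded, decoder_outputs=decoder_out)
        rnnt_loss = self.loss(
            log_probs=joint_out,
            targets=transcript,
            input_lengths=encoded_len,
            target_lengths=target_len,
        )

        # CTC branch (assumed attribute names on the hybrid model).
        ctc_log_probs = self.ctc_decoder(encoder_output=encoded)
        ctc_loss = self.ctc_loss(
            log_probs=ctc_log_probs,
            targets=transcript,
            input_lengths=encoded_len,
            target_lengths=transcript_len,
        )

        # Weighted combination, mirroring the built-in hybrid loss weighting.
        loss = (1.0 - self.ctc_loss_weight) * rnnt_loss \
            + self.ctc_loss_weight * ctc_loss
        self.log("train_loss", loss, sync_dist=True)
        return loss
```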
────────────────────────
Observed Problem
────────────────────────
Training crashes immediately in training_step with an input signature mismatch.
Error (from both rank 0 and rank 1):

```
RuntimeError: Model forward failed with all known kwarg sets.
Last error:
Input argument audio_signal has no corresponding input_type match.
Existing input_types = dict_keys([
    'input_signal',
    'input_signal_length',
    'processed_signal',
    'processed_signal_length'
])
```
We verified the model input signature via:

```python
print(model.input_types)
```

which returns:

```
input_signal
input_signal_length
processed_signal
processed_signal_length
```
However, somewhere in the forward path (internally, or via our override) the model still receives audio_signal, which triggers the mismatch.
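To make the mismatch concrete, this is our current understanding of the failing vs. the working call (the audio_signal kwarg is what we suspect our wrapper passes; as far as we can tell, audio_signal / length are encoder-level input names, e.g. for ConformerEncoder, not model-level ones):

```python
# Fails: `audio_signal` / `length` appear to be the *encoder's* input names,
# so the model-level typecheck rejects them against input_types.
encoded, encoded_len = model.forward(
    audio_signal=signal, length=signal_len
)

# Works: the model-level forward expects `input_signal` / `input_signal_length`
# for raw audio, or `processed_signal` / `processed_signal_length` for
# precomputed features.
encoded, encoded_len = model.forward(
    input_signal=signal, input_signal_length=signal_len
)
```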
────────────────────────
Why We Are Unsure
────────────────────────
We are unsure about the correct and recommended way to:
- override training_step safely for EncDecHybridRNNTCTCBPEModel,
- call the model forward pass without triggering decoding or metrics,
- pass raw audio vs. preprocessed signals in NeMo 2.x, and
- align custom training loops with NeMo's internal expectations.
We would also appreciate confirmation on whether:
- using EncDecHybridRNNTCTCBPEModel for streaming + offline alignment is appropriate,
- avoiding WER/decoding during training is the correct approach (a sketch of our planned offline evaluation follows this list), and
- our overall setup is aligned with how NeMo is intended to be used.
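For context, the post-checkpoint evaluation we have in mind looks roughly like this (a sketch; the transcribe() signature and return type vary across NeMo versions, and the file list and references are placeholders):

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder evaluation set.
audio_files = ["eval/utt_0001.wav", "eval/utt_0002.wav"]
references = ["first reference transcript", "second reference transcript"]

# transcribe() may return plain strings or Hypothesis objects depending on
# the NeMo version and decoding config; normalize to strings either way.
raw_hyps = model.transcribe(audio_files)
hypotheses = [h.text if hasattr(h, "text") else h for h in raw_hyps]

print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```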
────────────────────────
What We Are Asking
────────────────────────
1. What is the correct forward-call pattern for EncDecHybridRNNTCTCBPEModel in NeMo 2.x when overriding training_step?
2. Should raw audio always be passed as input_signal, and is audio_signal ever valid in this context?
3. Is overriding training_step recommended at all for this model, or should we rely strictly on the built-in loss computation?
4. Are we on the right architectural path for a streaming RNNT + CTC system with offline alignment-based evaluation?
Any guidance, pointers to examples, or confirmation would be greatly appreciated.
Thank you for your time and for maintaining NeMo.