Conversation


arushidNV (Collaborator) commented Dec 15, 2025

What does this PR do?

Adds FeatureBuffer support to the Cache-Aware RNNT streaming pipeline, enabling both raw audio (Frame-based) and pre-computed feature (FeatureBuffer-based) streaming inference.

Collection: ASR

Changelog

  • Implemented the transcribe_step_for_feature_buffers() method in CacheAwareRNNTPipeline and CacheAwareCTCPipeline to handle pre-computed features
  • Updated type hints in run_greedy_decoder() and cache_aware_transcribe_step() to support both Frame and FeatureBuffer types

Usage

This enables flexible deployment scenarios where features can be pre-computed on different hardware (e.g., TensorRT encoder) before streaming to the RNNT decoder:

from nemo.collections.asr.inference.factory import CacheAwarePipelineBuilder

# Initialize pipeline with FeatureBuffer support
cfg.streaming.request_type = "feature_buffer"  # or "frame" for raw audio
pipeline = CacheAwarePipelineBuilder.build(cfg)

# Use with pre-computed features
feature_buffers = [FeatureBuffer(
    features=computed_features,  # [n_feat, T]
    stream_id=stream_id,
    is_last=is_final_chunk
)]

# Process features
pipeline.transcribe_step(feature_buffers)
results = pipeline.get_results()
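
For reference, here is a minimal sketch of one way computed_features could be produced offline before being wrapped in a FeatureBuffer. It assumes the standard NeMo ASR preprocessor interface (model.preprocessor(input_signal=..., length=...)) and uses random audio as a stand-in; it is illustrative only and not part of this PR:

import torch
import nemo.collections.asr as nemo_asr

# Any model whose preprocessor matches the streaming model's feature config
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")

audio = torch.randn(1, 16000)               # [B, samples]; stand-in for 1 s of 16 kHz audio
audio_len = torch.tensor([audio.shape[1]])

# Mel-spectrogram features with shape [B, n_feat, T]
features, feature_len = asr_model.preprocessor(input_signal=audio, length=audio_len)
computed_features = features[0]             # [n_feat, T], matching the FeatureBuffer example above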

WER Comparison

WER comparison between Frame-based and FeatureBuffer-based streaming on the open-asr-leaderboard datasets:
Model: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
Attention Context Size: [70,3]
EOU Threshold: 800
Residue Tokens: 0

| Dataset | Frame WER | Feature Buffer WER |
|---|---|---|
| ami-test | 31.2132 | 31.221 |
| common_voice-test | 15.9703 | 15.9742 |
| earnings22-test | 21.0542 | 21.1941 |
| gigaspeech-test | 19.6682 | 19.6876 |
| librispeech-test.clean | 5.4333 | 5.4596 |
| librispeech-test.other | 9.322 | 9.3484 |
| spgispeech-test | 7.8701 | 7.8541 |
| tedlium-test | 13.0503 | 12.998 |
| voxpopuli-test | 10.5082 | 10.5126 |
| Average | 14.8989 | 14.9166 |

GitHub Actions CI

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
ASR contributors and maintainers can review this PR.

Additional Information

github-actions bot added the ASR label on Dec 15, 2025
arushidNV changed the title from "Add FeatureBuffer support to Cache-Aware RNNT streaming pipeline" to "Add FeatureBuffer support to Cache-Aware streaming pipeline" on Dec 15, 2025
nithinraok requested a review from naymaraq on December 15, 2025 at 17:52
Args:
state: (CacheAwareCTCStreamingState) The state of the stream
frame: (Frame) The current frame
frame: (Frame | FeatureBuffer) The current frame or feature buffer

Please use Request type instead of Frame | FeatureBuffer
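A sketch of what this suggestion could look like, assuming a simple alias (if the inference module already defines a Request type, it should be imported instead; import paths for Frame and FeatureBuffer are omitted here):

from typing import Union

# Assumed alias so signatures and docstrings can name a single request type
Request = Union[Frame, FeatureBuffer]

# Docstrings and type hints would then read, for example:
#     frame: (Request) The current frame or feature buffer
#     frames: (list[Request]) List of requests to transcribe.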

Decode the log probabilities and update the state
Args:
frames: (list[Frame]) List of frames to transcribe.
frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe.

Please use Request type instead of Frame | FeatureBuffer

Args:
frames: (list[Frame]) List of frames to transcribe.
frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe.

Please use Request type instead of Frame | FeatureBuffer

Args:
state: (CacheAwareRNNTStreamingState) The state of the stream
frame: (Frame) The current frame
frame: (Frame | FeatureBuffer) The current frame or feature buffer

Please use Request type instead of Frame | FeatureBuffer

"""
Cache Aware Transcribe Step
It receives a list of frames and features and do the following:
It receives a list of frames (Frame or FeatureBuffer) and features and do the following:

Please use Request type instead of Frame | FeatureBuffer

8. Update the ready states to indicate that the state is ready for text post-processing
Args:
frames: (list[Frame]) List of frames to transcribe.
frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe.

Please use Request type instead of Frame | FeatureBuffer

# update the previous hypothesis and reset the previous hypothesis for the streams that has ended
for state, hyp, eos in zip(states, best_hyp, eos_flags):
for i, (state, hyp, eos) in enumerate(zip(states, best_hyp, eos_flags)):
hyp_len = len(hyp.y_sequence) if hyp is not None and hasattr(hyp, 'y_sequence') else 0

Seems hyp_len is not used, if so, no need to enumerate
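If hyp_len really is unused, the loop can simply revert to its previous form (a sketch; the loop body is elided):

# No index needed once hyp_len is dropped
for state, hyp, eos in zip(states, best_hyp, eos_flags):
    ...  # update/reset the previous hypothesis as before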

arushidNV force-pushed the cache-aware-feature-support branch from aa477a3 to 25530b6 on December 18, 2025 at 04:23
batch_size: 64 # Number of audio frames per batch
word_boundary_tolerance: 4 # Tolerance for word boundaries
att_context_size: [70,13] # Attention context size: [70,13],[70,6],[70,1],[70,0]
att_context_size: [70,3] # Attention context size: [70,13],[70,6],[70,1],[70,0]

Let's keep [70, 13] for default


It would be better to change the default model to nvidia/nemotron-speech-streaming-en-0.6b.

for fbuffer in fbuffers:
feature = fbuffer.features
# Trim to expected feature buffer length (safeguard for external feature buffer inputs)
feature = drop_trailing_features(feature.unsqueeze(0), self.expected_feature_buffer_len).squeeze(0)

Just a suggestion: I would suggest to drop trailing features in the preprocess method if possible
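A rough sketch of that suggestion; drop_trailing_features, expected_feature_buffer_len, and the FeatureBuffer fields come from the diff above, while the preprocessing hook itself (name and call site) is hypothetical:

def _preprocess_feature_buffer(self, fbuffer):
    # Trim once during preprocessing so later steps can assume well-sized feature buffers
    fbuffer.features = drop_trailing_features(
        fbuffer.features.unsqueeze(0), self.expected_feature_buffer_len
    ).squeeze(0)
    return fbuffer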

Signed-off-by: arushid <[email protected]>