Add FeatureBuffer support to Cache-Aware streaming pipeline #15188
| Args: | ||
| state: (CacheAwareCTCStreamingState) The state of the stream | ||
| frame: (Frame) The current frame | ||
| frame: (Frame | FeatureBuffer) The current frame or feature buffer |
Please use Request type instead of Frame | FeatureBuffer
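A minimal sketch of what this suggestion amounts to, assuming `Request` is a union alias covering both input types. The two dataclasses and the `describe` helper below are hypothetical stand-ins for illustration, not the project's actual definitions:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for the pipeline's input types; the real classes
# are assumed to carry more fields than shown here.
@dataclass
class Frame:
    samples: list[float]  # raw audio samples
    stream_id: int


@dataclass
class FeatureBuffer:
    features: list[list[float]]  # pre-computed features
    stream_id: int


# Assumed alias: one name for "either kind of input", so docstrings and
# signatures can say `Request` instead of `Frame | FeatureBuffer`.
Request = Union[Frame, FeatureBuffer]


def describe(requests: list[Request]) -> list[str]:
    """Return the concrete type name of each request, in order."""
    return [type(r).__name__ for r in requests]
```

With an alias like this, a docstring entry can read `frames: (list[Request])` and stay valid if further input types are added later.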
| Decode the log probabilities and update the state | ||
| Args: | ||
| frames: (list[Frame]) List of frames to transcribe. | ||
| frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe. |
Please use Request type instead of Frame | FeatureBuffer
| Args: | ||
| frames: (list[Frame]) List of frames to transcribe. | ||
| frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe. |
Please use Request type instead of Frame | FeatureBuffer
| Args: | ||
| state: (CacheAwareRNNTStreamingState) The state of the stream | ||
| frame: (Frame) The current frame | ||
| frame: (Frame | FeatureBuffer) The current frame or feature buffer |
Please use Request type instead of Frame | FeatureBuffer
| """ | ||
| Cache Aware Transcribe Step | ||
| It receives a list of frames and features and do the following: | ||
| It receives a list of frames (Frame or FeatureBuffer) and features and do the following: |
Please use Request type instead of Frame | FeatureBuffer
| 8. Update the ready states to indicate that the state is ready for text post-processing | ||
| Args: | ||
| frames: (list[Frame]) List of frames to transcribe. | ||
| frames: (list[Frame | FeatureBuffer]) List of frames or feature buffers to transcribe. |
Please use Request type instead of Frame | FeatureBuffer
| # update the previous hypothesis and reset the previous hypothesis for the streams that has ended | ||
| for state, hyp, eos in zip(states, best_hyp, eos_flags): | ||
| for i, (state, hyp, eos) in enumerate(zip(states, best_hyp, eos_flags)): | ||
| hyp_len = len(hyp.y_sequence) if hyp is not None and hasattr(hyp, 'y_sequence') else 0 |
It seems hyp_len is not used; if so, there is no need to enumerate.
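A small sketch of the loop without `enumerate`, assuming neither the index nor `hyp_len` is needed. The dictionaries and strings are toy data standing in for the real stream states and hypotheses, and the reset rule is inferred from the code comment in the diff:

```python
# Toy data standing in for stream states, best hypotheses, and end-of-stream flags.
states = [{"prev": None}, {"prev": None}]
best_hyp = ["hello", "world"]
eos_flags = [False, True]

# Plain zip is enough when the loop body never uses the position.
for state, hyp, eos in zip(states, best_hyp, eos_flags):
    # Reset the previous hypothesis once a stream has ended;
    # otherwise remember the latest one.
    state["prev"] = None if eos else hyp
```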
aa477a3 to
25530b6
Compare
b03778d to
c7b373e
Compare
c7b373e to
aaa5091
Compare
Signed-off-by: arushid <[email protected]>
Signed-off-by: arushidNV <[email protected]>
39bd118 to
2d7c394
Compare
Signed-off-by: arushid <[email protected]>
| batch_size: 64 # Number of audio frames per batch | ||
| word_boundary_tolerance: 4 # Tolerance for word boundaries | ||
| att_context_size: [70,13] # Attention context size: [70,13],[70,6],[70,1],[70,0] | ||
| att_context_size: [70,3] # Attention context size: [70,13],[70,6],[70,1],[70,0] |
Let's keep [70, 13] for default
It would be better to change the default model to nvidia/nemotron-speech-streaming-en-0.6b.
| for fbuffer in fbuffers: | ||
| feature = fbuffer.features | ||
| # Trim to expected feature buffer length (safeguard for external feature buffer inputs) | ||
| feature = drop_trailing_features(feature.unsqueeze(0), self.expected_feature_buffer_len).squeeze(0) |
Just a suggestion: drop the trailing features in the preprocess method if possible.
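One way to read this suggestion, as a sketch only: trim once when the buffers are preprocessed, so the per-buffer loop no longer needs the safeguard. `trim_trailing` and `preprocess` are hypothetical helpers written for this example (not the project's `drop_trailing_features`), and the layout of one row per feature dimension with time along the row is an assumption:

```python
def trim_trailing(features, expected_len):
    """Keep at most `expected_len` time steps from each feature row.

    Assumes `features` is a list of rows, one row per feature dimension,
    with time running along each row.
    """
    return [row[:expected_len] for row in features]


def preprocess(feature_buffers, expected_len):
    """Trim every incoming buffer once, before any decoding step sees it."""
    return [trim_trailing(features, expected_len) for features in feature_buffers]
```

Buffers that are already at or below the expected length pass through unchanged, so the trim is safe to apply unconditionally.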
Signed-off-by: arushid <[email protected]>
What does this PR do?
Adds FeatureBuffer support to the Cache-Aware RNNT streaming pipeline, enabling both raw audio (Frame-based) and pre-computed feature (FeatureBuffer-based) streaming inference.
Collection: ASR
Changelog
- Add transcribe_step_for_feature_buffers() method in CacheAwareRNNTPipeline and CacheAwareCTCPipeline to handle pre-computed features
- Update run_greedy_decoder() and cache_aware_transcribe_step() to support both Frame and FeatureBuffer types
Usage
This enables flexible deployment scenarios where features can be pre-computed on different hardware (e.g., TensorRT encoder) before streaming to the RNNT decoder.
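The two input paths can be illustrated with a toy sketch. This is not the PR's usage example: the dataclasses, `toy_extract_features`, and `to_features` are hypothetical names, and the window-mean "feature extractor" is only a placeholder for a real preprocessor:

```python
from dataclasses import dataclass


# Hypothetical, simplified input types for illustration only.
@dataclass
class Frame:
    samples: list  # raw audio samples


@dataclass
class FeatureBuffer:
    features: list  # features computed elsewhere, e.g. on other hardware


def toy_extract_features(samples, hop=2):
    """Placeholder extractor: the mean of each hop-sized window of samples."""
    windows = [samples[i:i + hop] for i in range(0, len(samples), hop)]
    return [sum(w) / len(w) for w in windows]


def to_features(request):
    """Return features for either input kind.

    A FeatureBuffer already holds features, so extraction is skipped;
    a Frame holds raw audio and goes through the extractor first.
    """
    if isinstance(request, FeatureBuffer):
        return request.features
    return toy_extract_features(request.samples)
```

Both kinds of input end up as features for the same downstream step, which is what allows the feature computation to run somewhere else.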
WER Comparison
The WER comparison results with Frame and Feature Buffer for open-asr-leaderboard datasets:
Model: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
Attention Context Size: [70,3]
EOU Threshold: 800
Residue Tokens: 0
GitHub Actions CI
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
ASR contributors and maintainers can review this PR.