Implement Duplex Speech-to-text model and rebase #15092

kevinhu-nv · 2025-11-19T23:31:39Z

What does this PR do?

Merge duplex STT changes to NeMo main.

Collection: speechlm2

Changelog

Added training support for using nano-9b as LLM backbone
Added prompt tokens support
Added streaming ASR support
Refactored code, added unit tests, and made other minor changes

Pre checks:

Make sure you read and followed Contributor guidelines
[X] Did you write any new necessary tests?
[X] Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)


github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

zhehuaichen

Thanks for getting this started!

docs/source/speechlm2/configs.rst

nemo/collections/common/data/lhotse/cutset.py

zhehuaichen · 2025-11-26T19:57:02Z

nemo/collections/common/data/lhotse/cutset.py

+
+
+@data_type_parser(["s2s_duplex_overlap_as_s2s_duplex"])
+def read_s2s_duplex_overlap_as_s2s_duplex(config) -> tuple[CutSet, bool]:


@pzelasko, do you think it is OK to keep growing this file, or should we create a separate file for the s2s specifics?

pzelasko · 2026-01-07

Let's grow it for now and refactor it into smaller files later.
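For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a decorator-based parser registry. It is not NeMo's implementation: the `PARSERS` dict, the `parse` helper, the placeholder function body, and the `is_tarred` reading of the returned bool are all assumptions made for illustration, and a plain list stands in for a lhotse `CutSet`. Only the decorator name and the parser name come from the diff above.

```python
from typing import Callable

# Hypothetical registry mapping a data-type name to its parser function.
PARSERS: dict[str, Callable[[dict], tuple[list, bool]]] = {}


def data_type_parser(names: list[str]):
    """Register the decorated function under every name in `names`."""

    def decorator(fn):
        for name in names:
            if name in PARSERS:
                raise ValueError(f"duplicate parser for data type {name!r}")
            PARSERS[name] = fn
        return fn

    return decorator


@data_type_parser(["s2s_duplex_overlap_as_s2s_duplex"])
def read_s2s_duplex_overlap_as_s2s_duplex(config: dict) -> tuple[list, bool]:
    # Placeholder body: a real parser would build a lhotse CutSet here.
    cuts = list(config.get("cuts", []))
    # Assumption: the bool says whether the data is tarred.
    is_tarred = bool(config.get("tarred", False))
    return cuts, is_tarred


def parse(config: dict) -> tuple[list, bool]:
    """Dispatch on config["type"] to the registered parser."""
    return PARSERS[config["type"]](config)
```

With a registry like this, parsers are looked up by name, so moving the s2s-specific ones into a separate module later would only require that the module be imported before dispatch.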

nemo/collections/speechlm2/data/s2s_dataset.py

nemo/collections/speechlm2/models/duplex_stt_model.py


kevinhu-nv · 2025-12-19T17:31:38Z

Resolved the comments and made another pass to brush up the code by (1) removing debug code and (2) adding necessary updates made since the last rebase (e.g. the speech cutoff fix).

Some areas to discuss before I make changes:

How to define agent_bos, agent_eos, user_bos, and user_eos
The most recent changes (e.g. early interruption) are not incorporated, since they are still experimental

PTAL @zhehuaichen


kevinhu-nv · 2025-12-19T18:25:04Z

@pzelasko @Edresson Can you start taking a look and maybe leave some high-level comments first?

Edresson and others added 30 commits May 27, 2025 05:39

Add Speech decoder new parameters

03c2267

Add target_first_turn_audio in dataloader and bug fixes

6aaf7f3

Bug fix in modality adaptor embedding

c11b902

Add tts repeat after me dataloader

48147e4

Add modality adapter input quantizer

e88390d

Add pretrained_tts parameter

721fce3

Save audio samples during evaluation

3878b58

Bug fix on repeat after me dataloader

f83095d

Bug fix on checkpoint loading

9e2ec7c

Fix loss mask to ignore sequence padding

9ee11c5

Bug fix on repeat after me dataloader

99a6587

Add init_speec pretrained_tts_from_s2s parameter

d1402a2

Update

d72a66e

Update

21bae41

Bug fix on speaker embedding conditioning

a5a30c6

Rebase bug fix

d40db11

Bug fix on real conv data dataloader

d55d456

Fix some formatting

1239d21

Add use_cas_cache parameter

a228e03

Bug fix on pretrained_tts_from_s2s

ce6226a

Add gated fusion

90b15db

Add custom speech token parameters and restore from s2s option

ae24124

Move rs2s checkpoint loading to cpu to avoid OOM in multi node training

e15ba5c

Rectify BLEU evaluation

1699681

Signed-off-by: Ankita Pasad <[email protected]>

Same fix for ASRBLEU; previous implementation will give incorrect and…

6546dd7

… possibly inflated results Signed-off-by: Ankita Pasad <[email protected]>

Add token accuracy metrics

4a6f4df

Limit audio id size to avoid File name too long error

45a16b9

Add new lhotse formatters

22f7808

Remove old formatters

e02d324

Add asr emb

5360e22

Chen Chen and others added 17 commits November 5, 2025 20:48

add noise-aug

0cb29ae

merge edits for streaming asr, noise aug, etc

0863229

Resolve conflicts

c770193

Signed-off-by: kevinhu <[email protected]>

fix noise-aug

c5efff8

Signed-off-by: kevinhu <[email protected]>

Merge branch 'duplex-s2s-aug' into duplex-s2s-aug-with-sysprompt

ec3bc13

Signed-off-by: Ankita Pasad <[email protected]>

Merge pull request NVIDIA-NeMo#3 from ankitapasad/duplex-s2s-aug-with…

df963d0

…-sysprompt support training and inference for data with system prompt

simple merge of prompt feature

89d91a2

Signed-off-by: kevinhu <[email protected]>

fix

f499894

Signed-off-by: kevinhu <[email protected]>

Make prompt tokens compatible with streaming ASR

5475922

Signed-off-by: kevinhu <[email protected]>

remove half duplex asr training code

d231d57

Signed-off-by: kevinhu <[email protected]>

remove eou augmentation training code

caa5fbc

Signed-off-by: kevinhu <[email protected]>

consolidate several flags into a single --predict_user_text and remov…

86e766f

…e unnecessary code Signed-off-by: kevinhu <[email protected]>

convert to duplex stt

502699a

Signed-off-by: kevinhu <[email protected]>

fix a few tests

1082753

Signed-off-by: kevinhu <[email protected]>

refactor offline inference

f9d614a

Signed-off-by: kevinhu <[email protected]>

minor fixes

76bd9d8

Signed-off-by: kevinhu <[email protected]>

Rebase and a few fixes

0354328

Signed-off-by: kevinhu <[email protected]>

github-actions bot added the common label Nov 19, 2025

github-advanced-security bot found potential problems Nov 19, 2025


zhehuaichen reviewed Nov 27, 2025


resolve comments and brush up the code by removing debugging code and…

43cf48e

… add recent fixes Signed-off-by: kevinhu <[email protected]>

kevinhu-nv added 2 commits December 19, 2025 13:17

resolve conflicts

f0df56c

Signed-off-by: kevinhu <[email protected]>

Merge branch 'main' into duplex-stt-rebased

38f6334

Signed-off-by: kevinhu-nv <[email protected]>

kevinhu-nv marked this pull request as ready for review December 19, 2025 18:19

Apply isort and black reformatting

8cf5b2a

Signed-off-by: kevinhu-nv <[email protected]>

kevinhu-nv changed the title ~~Duplex stt rebased~~ Implement Duplex Speech-to-text model and rebase Dec 19, 2025

zhehuaichen requested review from Edresson and pzelasko December 19, 2025 19:50

pzelasko Jan 7, 2026

7 participants


