[TTS][MagpieTTS] Longform TTS using MagpieTTS #15210
Conversation
Pull request overview
This PR adds longform TTS support to MagpieTTS, enabling the model to generate speech from long text inputs by processing them sentence-by-sentence with chunked inference.
Key Changes:
- Implements sentence-level text chunking with new utility functions (`split_by_sentence`, `chunk_and_tokenize_text_by_sentence`)
- Introduces `LongFormTTSInferenceDataset` for preparing longform data and `LongFormInferenceRunner` for orchestrating batch inference
- Adds a `generate_long_form_speech` method to `MagpieTTSModel` with sliding window logic, attention prior construction, and chunk-based decoding (see the sketch after this list)
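The thread does not show the chunking pipeline end to end; as a rough, self-contained illustration of the idea (not the PR's implementation, whose `split_by_sentence` and `chunk_and_tokenize_text_by_sentence` are more careful about separators and tokenization):

```python
# Illustrative only: group naively split sentences into chunks under a word
# budget, the way longform inference feeds the decoder one chunk at a time.
def chunk_text(text: str, max_words: int = 50) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # naive stand-in splitter
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_text("First sentence. Second sentence. Third one here.", max_words=5))
# ['First sentence Second sentence', 'Third one here.']
# (the naive stand-in splitter drops the periods; the PR's utilities do not)
```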
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| nemo/collections/tts/parts/utils/tts_dataset_utils.py | Adds utility functions for splitting text into sentences and tokenizing chunks with EOS tokens |
| nemo/collections/tts/modules/magpietts_inference/inference.py | Implements LongFormInferenceRunner class for managing longform batch inference with code accumulation across chunks |
| nemo/collections/tts/models/magpietts.py | Adds core longform generation logic including LongformDecoderState dataclass, generate_long_form_speech method, and attention prior helper methods |
| nemo/collections/tts/data/text_to_speech_dataset.py | Implements LongFormTTSInferenceDataset for loading and preprocessing longform samples with sentence chunking |
| examples/tts/magpietts_inference.py | Adds --longform CLI argument and integrates longform runner into inference pipeline |
```python
for i, char in enumerate(paragraph):
    # Check if current char is a separator and next char is a space
    # This avoids splitting abbreviations like "Dr." or "a.m."
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    if char in sentence_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
```
Copilot AI (Dec 19, 2025):
The condition checks whether `char` is in `sentence_separators`, but `sentence_separators` is a list containing strings like `'.'`, `'?'`, `'!'`, `'...'`. When iterating through characters, `char` is a single-character string, while `'...'` is a 3-character string and will never match a single character. This means the ellipsis separator (`'...'`) will not be detected correctly.
To fix this, you should either:
- Check for multi-character separators first (longest match first)
- Remove '...' from the default list since it cannot match single characters
- Use a different approach like regex or string methods for matching
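A quick check makes the mismatch concrete; iterating a string yields single characters, so `"..."` can never equal `char`:

```python
# Iterating a string yields single characters, so the 3-character
# separator "..." is only ever seen as three separate '.' characters.
sentence_separators = [".", "?", "!", "..."]
paragraph = "Wait... then it happened. The end."
hits = [c for c in paragraph if c in sentence_separators]
print(hits)  # ['.', '.', '.', '.', '.'] -- the ellipsis is never matched as a unit
```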
Suggested change:

```python
# Separate single- and multi-character separators so we can correctly
# handle cases like "..." while preserving the original behavior.
multi_char_separators = sorted(
    [sep for sep in sentence_separators if len(sep) > 1],
    key=len,
    reverse=True,
)
single_char_separators = [sep for sep in sentence_separators if len(sep) == 1]
i = 0
while i < len(paragraph):
    # First, check for multi-character separators starting at position i.
    matched_multi = False
    for sep in multi_char_separators:
        if paragraph.startswith(sep, i):
            end_idx = i + len(sep) - 1
            next_char_idx = end_idx + 1
            next_char = paragraph[next_char_idx] if next_char_idx < len(paragraph) else ""
            # Only split when the separator is followed by a space, to avoid
            # splitting within abbreviations or numbers.
            if next_char == " ":
                sentences.append(paragraph[last_sep_idx + 1 : end_idx + 1].strip())
                last_sep_idx = end_idx + 1
                i = end_idx + 1
                matched_multi = True
                break
    if matched_multi:
        continue
    # Fallback to single-character separator handling.
    char = paragraph[i]
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    # Check if current char is a separator and next char is a space.
    # This avoids splitting abbreviations like "Dr." or "a.m."
    if char in single_char_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
    i += 1
```
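The third option Copilot lists (regex) is also workable; a minimal sketch, not part of the PR, that splits at whitespace following sentence-final punctuation so a run like `"..."` stays attached to its sentence:

```python
import re

# Split at whitespace that follows sentence-final punctuation. The split
# point is the whitespace itself, so "..." is kept intact without any
# special-casing of multi-character separators. (Like the loop above, an
# abbreviation followed by a space still triggers a split.)
def split_by_sentence_regex(paragraph: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", paragraph) if s.strip()]

print(split_by_sentence_regex("Wait... then it happened. The end!"))
# ['Wait...', 'then it happened.', 'The end!']
```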
| "de": ["german_phoneme", "german"], | ||
| "es": ["spanish_phoneme", "spanish"], | ||
| "fr": ["french_phoneme", "french"], | ||
| "fr": ["french_chartokenizer", "french"], |
Copilot AI (Dec 19, 2025):
The tokenizer for French has been changed from "french_phoneme" to "french_chartokenizer". This appears to be an unrelated change that shouldn't be part of a PR focused on longform TTS functionality. If this is an intentional fix, it should either be in a separate PR or explicitly mentioned in the PR description. If unintended, this change should be reverted.
| "fr": ["french_chartokenizer", "french"], | |
| "fr": ["french_phoneme", "french"], |
Can you merge with main?
blisc left a comment:
Let's clean up some of the classes, especially LongFormInferenceRunner. Additionally, I want a design for longform that is seamless to the user. Users should not need to decide a priori whether they need longform generation or the non-longform path.
examples/tts/magpietts_inference.py (Outdated)
```python
# Create appropriate inference runner based on longform flag
if longform:
    logging.info("Using longform inference mode (sentence-by-sentence processing)")
    runner = LongFormInferenceRunner(model, inference_config)
else:
    runner = MagpieInferenceRunner(model, inference_config)
```
Do we need this? Can we natively switch to longform once we go over the 20s of decoder generation?
We have to predetermine LongForm or standard generation, because for LongForm we need to save and update history variables, which are necessary for the window mechanism. However, to make the user experience seamless, I can try to write logic that determines LongForm or standard generation from the number of words in the input text (~40-50 words in 20 sec). What do you think?
Is it not possible to initialize these parameters on the fly?
This does batch processing, so determining whether each datapoint is longform or short and running the corresponding runner would be complicated. MagpieInferenceRunner cannot do longform, and MagpieTTSDataset cannot be used for longform. But it might be possible to do both standard and longform with LongFormInferenceRunner.
Let me try:
- Auto-detect logic: based on the input manifest, if any of the entries is longform (`len(text) > 50` words) we use longform, else standard inference (see the sketch below).
- Merge the two inference runners into one single runner.
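A hypothetical sketch of that heuristic (the helper name and manifest shape are assumptions, not the PR's code):

```python
# Route the whole batch through the longform path if any manifest entry
# exceeds the word budget (~40-50 words corresponds to roughly 20 s of
# speech, per the discussion above).
def manifest_needs_longform(entries: list[dict], word_threshold: int = 50) -> bool:
    return any(len(entry.get("text", "").split()) > word_threshold for entry in entries)

entries = [{"text": "A short sentence."}, {"text": "word " * 80}]
print(manifest_needs_longform(entries))  # True -> use the longform path
```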
@blisc Please check the implementation now. It is much cleaner and reuses most of the existing code. The user experience is also seamless, as users do not need to decide between the longform and standard paths.
| "de": ["german_phoneme", "german"], | ||
| "es": ["spanish_phoneme", "spanish"], | ||
| "fr": ["french_phoneme", "french"], | ||
| "fr": ["french_chartokenizer", "french"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge with main?
blisc left a comment:
Please add back https://github.com/NVIDIA-NeMo/NeMo/blob/magpietts_2508/tests/functional_tests/L2_TTS_InferEvaluateStreaming_Magpietts_ZeroShot.sh and then we can merge
```python
def _penalize_attention_sinks(
    self,
    attn_prior: torch.Tensor,
    batch_idx: int,
    attended_timestep_counter: Dict[int, int],
    left_offset: int,
    eps_sq: float,
) -> None:
    """
    Penalize timesteps that have been over-attended (attention sinks).

    When a position is attended more than the threshold, suppress all
    positions up to and including it to force the model to move forward.

    Args:
        attn_prior: Prior tensor to modify in-place. Shape: (B, 1, T_text).
        batch_idx: Index of current batch item.
        attended_timestep_counter: Dict tracking attention counts per timestep.
        left_offset: Chunk offset for this batch item.
        eps_sq: Squared epsilon for strong suppression.
    """
    threshold = self.longform_config.attention_sink_threshold

    for timestep, count in attended_timestep_counter.items():
        if timestep > left_offset and count >= threshold:
            logging.debug(f"Attention sink at timestep {timestep} for batch {batch_idx}, count: {count}")
            relative_pos = timestep - left_offset
            attn_prior[batch_idx, 0, : relative_pos + 1] = eps_sq

def _update_text_completion_state(
    self,
    batch_idx: int,
    attended_pos: int,
    text_len: int,
    is_finished: bool,
    unfinished_texts: Dict[int, bool],
    finished_texts_counter: Dict[int, int],
) -> None:
    """
    Update tracking state for text completion detection.

    A text is considered "near end" when the attended position is within
    `longform_near_end_threshold` positions of the text end.

    Args:
        batch_idx: Index of current batch item.
        attended_pos: Currently attended text position (chunk-relative).
        text_len: Length of text for this batch item.
        is_finished: Whether this batch item has already finished.
        unfinished_texts: Dict to update in-place.
        finished_texts_counter: Dict to update in-place.
    """
    is_near_end = attended_pos >= text_len - self.longform_config.near_end_threshold

    # Text is unfinished if not near end AND not already marked finished
    unfinished_texts[batch_idx] = not is_near_end and not is_finished

    # Start counting when near end or already finished
    if is_near_end or is_finished:
        finished_texts_counter.setdefault(batch_idx, 0)
```
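These helpers depend on model state, but the attention-sink penalty can be illustrated in isolation (a standalone toy, not the PR's code):

```python
import torch

# Once a text position has been attended `threshold` times, suppress every
# position up to and including it so alignment is forced to move forward.
threshold = 8
eps_sq = 1e-8
left_offset = 0
attended_timestep_counter = {3: 2, 5: 9}  # position 5 is an attention sink
attn_prior = torch.ones(1, 1, 10)         # (B, 1, T_text)

for timestep, count in attended_timestep_counter.items():
    if timestep > left_offset and count >= threshold:
        attn_prior[0, 0, : timestep - left_offset + 1] = eps_sq

print(attn_prior[0, 0])  # positions 0..5 are now ~0, pushing attention past the sink
```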
These two functions look very similar to the non-longform counterparts. While not a blocker for this PR, we should attempt to merge both codepaths together when possible.
I will try to incorporate this in the next PR.
Added.
[🤖]: Hi @subhankar-ghosh 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add Longform TTS support in MagpieTTS.
Collection: TTS
Changelog
- Adds `generate_long_form_speech` and `construct_longform_inference_prior`, plus helper methods for managing state across chunks.

Usage
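The template's snippet was left unfilled in the thread; a hypothetical sketch only (the import path and signature are assumptions, the method name comes from the Changelog above). In practice the entry point is examples/tts/magpietts_inference.py, which per the review discussion auto-detects longform entries from the input manifest.

```python
# Hypothetical usage sketch: the exact signature of generate_long_form_speech
# is not shown in this thread, and the import path is assumed.
from nemo.collections.tts.models import MagpieTTSModel

model = MagpieTTSModel.restore_from("<magpietts_checkpoint.nemo>")
audio = model.generate_long_form_speech(text="<several paragraphs of input text>")
```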
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information