Conversation

@subhankar-ghosh
Collaborator

@subhankar-ghosh subhankar-ghosh commented Dec 19, 2025

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Add Longform TTS support in MagpieTTS.

Collection: TTS

Changelog

  • LongFormTTSInferenceDataset - A new dataset class, a subclass of MagpieTTSDataset, that splits input text into chunks based on sentence delimiters. This minimal class handles inference-only data: context_audio_codes/audio, speaker_indices, and chunked text.
  • --longform_mode option - Added to examples/tts/magpietts_inference.py to enable longform inference with auto-detection based on text length. Nothing changes for users running the inference script.
  • Longform methods in MagpieTTSModel - Core logic in generate_long_form_speech and construct_longform_inference_prior, plus helper methods for managing state across chunks.
  • Dataclass-based state management - Uses LongformConfig for immutable constants and LongformChunkState for mutable state that persists across chunk iterations (per @blisc's suggestion).
  • Unified inference runner - MagpieInferenceRunner now handles both standard and longform inference with auto-detection, eliminating the need for a separate LongFormInferenceRunner class.
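
As a rough illustration of the dataclass-based state split described above, the pattern might look like the following sketch. The class names come from this PR, but the fields shown here are hypothetical placeholders, not the actual NeMo attributes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LongformConfig:
    # Immutable constants shared by all chunk iterations (illustrative fields).
    attention_sink_threshold: int = 8
    near_end_threshold: int = 3


@dataclass
class LongformChunkState:
    # Mutable state carried from one chunk iteration to the next.
    left_offset: int = 0
    accumulated_codes: List[int] = field(default_factory=list)


config = LongformConfig()
state = LongformChunkState()
state.accumulated_codes.extend([101, 102, 103])  # codes from the first chunk
state.left_offset += 3
```

Freezing the config makes accidental mutation of shared constants a runtime error, while the per-chunk state stays a plain mutable dataclass.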

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: subhankar-ghosh <[email protected]>
subhankar-ghosh and others added 4 commits December 19, 2025 02:43
Signed-off-by: subhankar-ghosh <[email protected]>
…on raised in special method

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Subhankar Ghosh <[email protected]>
Contributor

Copilot AI left a comment

Pull request overview

This PR adds longform TTS support to MagpieTTS, enabling the model to generate speech from long text inputs by processing them sentence-by-sentence with chunked inference.

Key Changes:

  • Implements sentence-level text chunking with new utility functions (split_by_sentence, chunk_and_tokenize_text_by_sentence)
  • Introduces LongFormTTSInferenceDataset for preparing longform data and LongFormInferenceRunner for orchestrating batch inference
  • Adds generate_long_form_speech method to MagpieTTSModel with sliding window logic, attention prior construction, and chunk-based decoding
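
The sentence-level chunking idea above can be sketched roughly as follows. This is a simplified stand-in under assumed behavior, not the actual split_by_sentence/chunk_and_tokenize_text_by_sentence implementations (the real ones also handle tokenization and EOS tokens):

```python
from typing import List


def split_sentences(paragraph: str, separators: str = ".?!") -> List[str]:
    # Split on a separator only when it is followed by a space (or end of text),
    # which avoids breaking abbreviations like "Dr." mid-sentence.
    sentences, start = [], 0
    for i, ch in enumerate(paragraph):
        nxt = paragraph[i + 1] if i + 1 < len(paragraph) else " "
        if ch in separators and nxt == " ":
            sentences.append(paragraph[start : i + 1].strip())
            start = i + 1
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def chunk_sentences(sentences: List[str], max_words: int = 50) -> List[str]:
    # Greedily pack whole sentences into chunks under a word budget.
    chunks, current = [], []
    for s in sentences:
        if current and len(" ".join(current + [s]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences whole inside each chunk is what lets the model produce natural prosody at chunk boundaries.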

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
nemo/collections/tts/parts/utils/tts_dataset_utils.py Adds utility functions for splitting text into sentences and tokenizing chunks with EOS tokens
nemo/collections/tts/modules/magpietts_inference/inference.py Implements LongFormInferenceRunner class for managing longform batch inference with code accumulation across chunks
nemo/collections/tts/models/magpietts.py Adds core longform generation logic including LongformDecoderState dataclass, generate_long_form_speech method, and attention prior helper methods
nemo/collections/tts/data/text_to_speech_dataset.py Implements LongFormTTSInferenceDataset for loading and preprocessing longform samples with sentence chunking
examples/tts/magpietts_inference.py Adds --longform CLI argument and integrates longform runner into inference pipeline


Comment on lines +402 to +409
for i, char in enumerate(paragraph):
    # Check if current char is a separator and next char is a space
    # This avoids splitting abbreviations like "Dr." or "a.m."
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    if char in sentence_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1

Copilot AI Dec 19, 2025

The condition checks if char is in sentence_separators, but sentence_separators is a list containing strings like '.', '?', '!', '...'. When iterating through characters, char is a single character string. However, '...' is a 3-character string and will never match a single character. This means the ellipsis separator ('...') will not be detected correctly.

To fix this, you should either:

  1. Check for multi-character separators first (longest match first)
  2. Remove '...' from the default list since it cannot match single characters
  3. Use a different approach like regex or string methods for matching
Suggested change
for i, char in enumerate(paragraph):
    # Check if current char is a separator and next char is a space
    # This avoids splitting abbreviations like "Dr." or "a.m."
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    if char in sentence_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
# Separate single- and multi-character separators so we can correctly
# handle cases like "..." while preserving the original behavior.
multi_char_separators = sorted(
    [sep for sep in sentence_separators if len(sep) > 1],
    key=len,
    reverse=True,
)
single_char_separators = [sep for sep in sentence_separators if len(sep) == 1]
i = 0
while i < len(paragraph):
    # First, check for multi-character separators starting at position i.
    matched_multi = False
    for sep in multi_char_separators:
        if paragraph.startswith(sep, i):
            end_idx = i + len(sep) - 1
            next_char_idx = end_idx + 1
            next_char = paragraph[next_char_idx] if next_char_idx < len(paragraph) else ""
            # Only split when the separator is followed by a space, to avoid
            # splitting within abbreviations or numbers.
            if next_char == " ":
                sentences.append(paragraph[last_sep_idx + 1 : end_idx + 1].strip())
                last_sep_idx = end_idx + 1
                i = end_idx + 1
                matched_multi = True
                break
    if matched_multi:
        continue
    # Fallback to single-character separator handling.
    char = paragraph[i]
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    # Check if current char is a separator and next char is a space.
    # This avoids splitting abbreviations like "Dr." or "a.m."
    if char in single_char_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
    i += 1

"de": ["german_phoneme", "german"],
"es": ["spanish_phoneme", "spanish"],
"fr": ["french_phoneme", "french"],
"fr": ["french_chartokenizer", "french"],

Copilot AI Dec 19, 2025

The tokenizer for French has been changed from "french_phoneme" to "french_chartokenizer". This appears to be an unrelated change that shouldn't be part of a PR focused on longform TTS functionality. If this is an intentional fix, it should either be in a separate PR or explicitly mentioned in the PR description. If unintended, this change should be reverted.

Suggested change
"fr": ["french_chartokenizer", "french"],
"fr": ["french_phoneme", "french"],

Collaborator

Can you merge with main?

subhankar-ghosh and others added 3 commits December 19, 2025 06:33
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Subhankar Ghosh <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Subhankar Ghosh <[email protected]>
Collaborator

@blisc blisc left a comment

Let's clean up some of the classes, especially LongFormInferenceRunner. Additionally, I want a design for longform that is seamless to the user. Users should not have to decide a priori whether they need longform generation or the non-longform path.

Comment on lines 175 to 180
# Create appropriate inference runner based on longform flag
if longform:
    logging.info("Using longform inference mode (sentence-by-sentence processing)")
    runner = LongFormInferenceRunner(model, inference_config)
else:
    runner = MagpieInferenceRunner(model, inference_config)
Collaborator

Do we need this? Can we natively switch to longform once we go over the 20s of decoder generation?

Collaborator Author

We have to predetermine longform or standard generation because longform needs to save and update history variables, which are necessary for the window mechanism. However, to make the user experience seamless, I can try to write logic that determines longform vs. standard generation from the number of words in the input text (~40-50 words in 20 sec). What do you think?

Collaborator

Is it not possible to initialize these parameters on the fly?

Collaborator Author

This does batch processing, so determining whether each datapoint is longform or short and running the corresponding runner would be complicated. MagpieInferenceRunner cannot do longform, and MagpieTTSDataset cannot be used for longform. But it might be possible to handle both standard and longform with LongFormInferenceRunner.

Collaborator Author

Let me try:

  1. Auto-detect logic: based on the input manifest, if any entry is longform (more than ~50 words of text), use longform inference; otherwise, use standard inference.
  2. Merge the two inference runners into one single runner.
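
The auto-detect step above could look roughly like this. The threshold and the manifest's "text" field are illustrative, based on the ~50-word heuristic mentioned in the discussion, not the actual implementation:

```python
from typing import Dict, List

# ~40-50 words fit in roughly 20 seconds of speech (hypothetical constant).
LONGFORM_WORD_THRESHOLD = 50


def needs_longform(manifest_entries: List[Dict[str, str]]) -> bool:
    # Use longform inference if any entry's text exceeds the word budget,
    # so a single runner can serve the whole batch.
    return any(
        len(entry.get("text", "").split()) > LONGFORM_WORD_THRESHOLD
        for entry in manifest_entries
    )
```

Deciding per manifest (rather than per entry) keeps the batch on a single codepath, which matches the "one unified runner" design.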

Collaborator Author

@blisc Please check implementation now. It is much cleaner and reusing most of the existing code. User experience is also seamless as they do not need to decide between longform path or standard path.

"de": ["german_phoneme", "german"],
"es": ["spanish_phoneme", "spanish"],
"fr": ["french_phoneme", "french"],
"fr": ["french_chartokenizer", "french"],
Collaborator

Can you merge with main?

Collaborator

@blisc blisc left a comment

Comment on lines +3630 to +3689
def _penalize_attention_sinks(
    self,
    attn_prior: torch.Tensor,
    batch_idx: int,
    attended_timestep_counter: Dict[int, int],
    left_offset: int,
    eps_sq: float,
) -> None:
    """
    Penalize timesteps that have been over-attended (attention sinks).

    When a position is attended more than the threshold, suppress all
    positions up to and including it to force the model to move forward.

    Args:
        attn_prior: Prior tensor to modify in-place. Shape: (B, 1, T_text).
        batch_idx: Index of current batch item.
        attended_timestep_counter: Dict tracking attention counts per timestep.
        left_offset: Chunk offset for this batch item.
        eps_sq: Squared epsilon for strong suppression.
    """
    threshold = self.longform_config.attention_sink_threshold

    for timestep, count in attended_timestep_counter.items():
        if timestep > left_offset and count >= threshold:
            logging.debug(f"Attention sink at timestep {timestep} for batch {batch_idx}, count: {count}")
            relative_pos = timestep - left_offset
            attn_prior[batch_idx, 0, : relative_pos + 1] = eps_sq

def _update_text_completion_state(
    self,
    batch_idx: int,
    attended_pos: int,
    text_len: int,
    is_finished: bool,
    unfinished_texts: Dict[int, bool],
    finished_texts_counter: Dict[int, int],
) -> None:
    """
    Update tracking state for text completion detection.

    A text is considered "near end" when the attended position is within
    `longform_near_end_threshold` positions of the text end.

    Args:
        batch_idx: Index of current batch item.
        attended_pos: Currently attended text position (chunk-relative).
        text_len: Length of text for this batch item.
        is_finished: Whether this batch item has already finished.
        unfinished_texts: Dict to update in-place.
        finished_texts_counter: Dict to update in-place.
    """
    is_near_end = attended_pos >= text_len - self.longform_config.near_end_threshold

    # Text is unfinished if not near end AND not already marked finished
    unfinished_texts[batch_idx] = not is_near_end and not is_finished

    # Start counting when near end or already finished
    if is_near_end or is_finished:
        finished_texts_counter.setdefault(batch_idx, 0)
Collaborator


These two functions look very similar to the non-longform counterparts. While not a blocker for this PR, we should attempt to merge both codepaths together when possible.

Collaborator Author

I will try to incorporate this in the next PR.

Signed-off-by: subhankar-ghosh <[email protected]>

@subhankar-ghosh subhankar-ghosh enabled auto-merge (squash) December 22, 2025 21:14
@github-actions github-actions bot removed the Run CICD label Dec 23, 2025
@github-actions
Contributor

[🤖]: Hi @subhankar-ghosh 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

@subhankar-ghosh subhankar-ghosh merged commit 6442018 into main Dec 23, 2025
165 of 167 checks passed
@subhankar-ghosh subhankar-ghosh deleted the magpietts_os_longform branch December 23, 2025 23:40