[TTS][MagpieTTS] Longform TTS using MagpieTTS #15210
Conversation
Pull request overview
This PR adds longform TTS support to MagpieTTS, enabling the model to generate speech from long text inputs by processing them sentence-by-sentence with chunked inference.
Key Changes:
- Implements sentence-level text chunking with new utility functions (`split_by_sentence`, `chunk_and_tokenize_text_by_sentence`)
- Introduces `LongFormTTSInferenceDataset` for preparing longform data and `LongFormInferenceRunner` for orchestrating batch inference
- Adds a `generate_long_form_speech` method to `MagpieTTSModel` with sliding window logic, attention prior construction, and chunk-based decoding (see the sketch after this list)
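The thread does not show the chunking pipeline end to end; as a rough, self-contained illustration of the idea (not the PR's implementation, whose `split_by_sentence` and `chunk_and_tokenize_text_by_sentence` are more careful about separators and tokenization):

```python
# Illustrative only: group naively split sentences into chunks under a word
# budget, the way longform inference feeds the decoder one chunk at a time.
def chunk_text(text: str, max_words: int = 50) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # naive stand-in splitter
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_text("First sentence. Second sentence. Third one here.", max_words=5))
# ['First sentence Second sentence', 'Third one here.']
# (the naive stand-in splitter drops the periods; the PR's utilities do not)
```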
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| nemo/collections/tts/parts/utils/tts_dataset_utils.py | Adds utility functions for splitting text into sentences and tokenizing chunks with EOS tokens |
| nemo/collections/tts/modules/magpietts_inference/inference.py | Implements LongFormInferenceRunner class for managing longform batch inference with code accumulation across chunks |
| nemo/collections/tts/models/magpietts.py | Adds core longform generation logic including LongformDecoderState dataclass, generate_long_form_speech method, and attention prior helper methods |
| nemo/collections/tts/data/text_to_speech_dataset.py | Implements LongFormTTSInferenceDataset for loading and preprocessing longform samples with sentence chunking |
| examples/tts/magpietts_inference.py | Adds --longform CLI argument and integrates longform runner into inference pipeline |
```python
for i, char in enumerate(paragraph):
    # Check if current char is a separator and next char is a space
    # This avoids splitting abbreviations like "Dr." or "a.m."
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    if char in sentence_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
```
Copilot AI (Dec 19, 2025):
The condition checks whether `char` is in `sentence_separators`, but `sentence_separators` is a list containing strings like `'.'`, `'?'`, `'!'`, `'...'`. When iterating through characters, `char` is a single-character string, while `'...'` is a 3-character string and will never match a single character. This means the ellipsis separator (`'...'`) will not be detected correctly.
To fix this, you should either:
- Check for multi-character separators first (longest match first)
- Remove '...' from the default list since it cannot match single characters
- Use a different approach like regex or string methods for matching
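A quick check makes the mismatch concrete; iterating a string yields single characters, so `"..."` can never equal `char`:

```python
# Iterating a string yields single characters, so the 3-character
# separator "..." is only ever seen as three separate '.' characters.
sentence_separators = [".", "?", "!", "..."]
paragraph = "Wait... then it happened. The end."
hits = [c for c in paragraph if c in sentence_separators]
print(hits)  # ['.', '.', '.', '.', '.'] -- the ellipsis is never matched as a unit
```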
Suggested change:

```python
# Separate single- and multi-character separators so we can correctly
# handle cases like "..." while preserving the original behavior.
multi_char_separators = sorted(
    [sep for sep in sentence_separators if len(sep) > 1],
    key=len,
    reverse=True,
)
single_char_separators = [sep for sep in sentence_separators if len(sep) == 1]
i = 0
while i < len(paragraph):
    # First, check for multi-character separators starting at position i.
    matched_multi = False
    for sep in multi_char_separators:
        if paragraph.startswith(sep, i):
            end_idx = i + len(sep) - 1
            next_char_idx = end_idx + 1
            next_char = paragraph[next_char_idx] if next_char_idx < len(paragraph) else ""
            # Only split when the separator is followed by a space, to avoid
            # splitting within abbreviations or numbers.
            if next_char == " ":
                sentences.append(paragraph[last_sep_idx + 1 : end_idx + 1].strip())
                last_sep_idx = end_idx + 1
                i = end_idx + 1
                matched_multi = True
                break
    if matched_multi:
        continue
    # Fallback to single-character separator handling.
    char = paragraph[i]
    next_char = paragraph[i + 1] if i + 1 < len(paragraph) else ""
    # Check if current char is a separator and next char is a space.
    # This avoids splitting abbreviations like "Dr." or "a.m."
    if char in single_char_separators and next_char == " ":
        sentences.append(paragraph[last_sep_idx + 1 : i + 1].strip())
        last_sep_idx = i + 1
    i += 1
```
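The third option Copilot lists (regex) is also workable; a minimal sketch, not part of the PR, that splits at whitespace following sentence-final punctuation so a run like `"..."` stays attached to its sentence:

```python
import re

# Split at whitespace that follows sentence-final punctuation. The split
# point is the whitespace itself, so "..." is kept intact without any
# special-casing of multi-character separators. (Like the loop above, an
# abbreviation followed by a space still triggers a split.)
def split_by_sentence_regex(paragraph: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", paragraph) if s.strip()]

print(split_by_sentence_regex("Wait... then it happened. The end!"))
# ['Wait...', 'then it happened.', 'The end!']
```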
| "de": ["german_phoneme", "german"], | ||
| "es": ["spanish_phoneme", "spanish"], | ||
| "fr": ["french_phoneme", "french"], | ||
| "fr": ["french_chartokenizer", "french"], |
Copilot AI (Dec 19, 2025):
The tokenizer for French has been changed from "french_phoneme" to "french_chartokenizer". This appears to be an unrelated change that shouldn't be part of a PR focused on longform TTS functionality. If this is an intentional fix, it should either be in a separate PR or explicitly mentioned in the PR description. If unintended, this change should be reverted.
| "fr": ["french_chartokenizer", "french"], | |
| "fr": ["french_phoneme", "french"], |
Can you merge with main?
blisc left a comment:
Let's clean up some of the classes, especially LongFormInferenceRunner. Additionally, I want a design for longform that is seamless to the user. Users should not need to decide a priori whether they need longform generation or the non-longform path.
examples/tts/magpietts_inference.py (Outdated)
```python
# Create appropriate inference runner based on longform flag
if longform:
    logging.info("Using longform inference mode (sentence-by-sentence processing)")
    runner = LongFormInferenceRunner(model, inference_config)
else:
    runner = MagpieInferenceRunner(model, inference_config)
```
Do we need this? Can we natively switch to longform once we go over the 20s of decoder generation?
We have to predetermine LongForm or standard generation, because for LongForm we need to save and update history variables, which are necessary for the window mechanism. However, to make the user experience seamless, I can try to write logic that determines LongForm or standard generation from the number of words in the input text (~40-50 words in 20 sec). What do you think?
Is it not possible to initialize these parameters on the fly?
This does batch processing, so determining whether each datapoint is longform or short and running the corresponding runner would be complicated. MagpieInferenceRunner cannot do longform, and MagpieTTSDataset cannot be used for longform. But it might be possible to do both standard and longform with LongFormInferenceRunner.
Let me try:
- Auto-detect logic: based on the input manifest, if any of the entries is longform (`len(text) > 50` words) we use longform, else standard inference (see the sketch below).
- Merge the two inference runners into one single runner.
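A hypothetical sketch of that heuristic (the helper name and manifest shape are assumptions, not the PR's code):

```python
# Route the whole batch through the longform path if any manifest entry
# exceeds the word budget (~40-50 words corresponds to roughly 20 s of
# speech, per the discussion above).
def manifest_needs_longform(entries: list[dict], word_threshold: int = 50) -> bool:
    return any(len(entry.get("text", "").split()) > word_threshold for entry in entries)

entries = [{"text": "A short sentence."}, {"text": "word " * 80}]
print(manifest_needs_longform(entries))  # True -> use the longform path
```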
@blisc Please check the implementation now. It is much cleaner and reuses most of the existing code. The user experience is also seamless, as users do not need to decide between the longform and standard paths.
| "de": ["german_phoneme", "german"], | ||
| "es": ["spanish_phoneme", "spanish"], | ||
| "fr": ["french_phoneme", "french"], | ||
| "fr": ["french_chartokenizer", "french"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge with main?
blisc left a comment:
Please add back https://github.com/NVIDIA-NeMo/NeMo/blob/magpietts_2508/tests/functional_tests/L2_TTS_InferEvaluateStreaming_Magpietts_ZeroShot.sh and then we can merge
```python
def _penalize_attention_sinks(
    self,
    attn_prior: torch.Tensor,
    batch_idx: int,
    attended_timestep_counter: Dict[int, int],
    left_offset: int,
    eps_sq: float,
) -> None:
    """
    Penalize timesteps that have been over-attended (attention sinks).

    When a position is attended more than the threshold, suppress all
    positions up to and including it to force the model to move forward.

    Args:
        attn_prior: Prior tensor to modify in-place. Shape: (B, 1, T_text).
        batch_idx: Index of current batch item.
        attended_timestep_counter: Dict tracking attention counts per timestep.
        left_offset: Chunk offset for this batch item.
        eps_sq: Squared epsilon for strong suppression.
    """
    threshold = self.longform_config.attention_sink_threshold

    for timestep, count in attended_timestep_counter.items():
        if timestep > left_offset and count >= threshold:
            logging.debug(f"Attention sink at timestep {timestep} for batch {batch_idx}, count: {count}")
            relative_pos = timestep - left_offset
            attn_prior[batch_idx, 0, : relative_pos + 1] = eps_sq

def _update_text_completion_state(
    self,
    batch_idx: int,
    attended_pos: int,
    text_len: int,
    is_finished: bool,
    unfinished_texts: Dict[int, bool],
    finished_texts_counter: Dict[int, int],
) -> None:
    """
    Update tracking state for text completion detection.

    A text is considered "near end" when the attended position is within
    `longform_near_end_threshold` positions of the text end.

    Args:
        batch_idx: Index of current batch item.
        attended_pos: Currently attended text position (chunk-relative).
        text_len: Length of text for this batch item.
        is_finished: Whether this batch item has already finished.
        unfinished_texts: Dict to update in-place.
        finished_texts_counter: Dict to update in-place.
    """
    is_near_end = attended_pos >= text_len - self.longform_config.near_end_threshold

    # Text is unfinished if not near end AND not already marked finished
    unfinished_texts[batch_idx] = not is_near_end and not is_finished

    # Start counting when near end or already finished
    if is_near_end or is_finished:
        finished_texts_counter.setdefault(batch_idx, 0)
```
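These helpers depend on model state, but the attention-sink penalty can be illustrated in isolation (a standalone toy, not the PR's code):

```python
import torch

# Once a text position has been attended `threshold` times, suppress every
# position up to and including it so alignment is forced to move forward.
threshold = 8
eps_sq = 1e-8
left_offset = 0
attended_timestep_counter = {3: 2, 5: 9}  # position 5 is an attention sink
attn_prior = torch.ones(1, 1, 10)         # (B, 1, T_text)

for timestep, count in attended_timestep_counter.items():
    if timestep > left_offset and count >= threshold:
        attn_prior[0, 0, : timestep - left_offset + 1] = eps_sq

print(attn_prior[0, 0])  # positions 0..5 are now ~0, pushing attention past the sink
```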
These two functions look very similar to the non-longform counterparts. While not a blocker for this PR, we should attempt to merge both codepaths together when possible.
I will try to incorporate this in the next PR.
Added.
[🤖]: Hi @subhankar-ghosh 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add Longform TTS support in MagpieTTS.
Collection: TTS
Changelog
- Adds `generate_long_form_speech` and `construct_longform_inference_prior`, plus helper methods for managing state across chunks.

Usage
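The template's snippet was left unfilled in the thread; a hypothetical sketch only (the import path and signature are assumptions, the method name comes from the Changelog above). In practice the entry point is examples/tts/magpietts_inference.py, which per the review discussion auto-detects longform entries from the input manifest.

```python
# Hypothetical usage sketch: the exact signature of generate_long_form_speech
# is not shown in this thread, and the import path is assumed.
from nemo.collections.tts.models import MagpieTTSModel

model = MagpieTTSModel.restore_from("<magpietts_checkpoint.nemo>")
audio = model.generate_long_form_speech(text="<several paragraphs of input text>")
```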
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information