Merged · 21 commits
44c4644
Longform TTS using magpietts
subhankar-ghosh Dec 19, 2025
335007f
Apply isort and black reformatting
subhankar-ghosh Dec 19, 2025
e3584d8
Using LongformDecoderState in LongForm Magpietts
subhankar-ghosh Dec 19, 2025
eff4703
Using LongformDecoderState in LongForm Magpietts
subhankar-ghosh Dec 19, 2025
73ddc3d
Apply isort and black reformatting
subhankar-ghosh Dec 19, 2025
d2714a7
Potential fix for code scanning alert no. 16815: Non-standard excepti…
subhankar-ghosh Dec 19, 2025
f7fce00
Update nemo/collections/tts/data/text_to_speech_dataset.py
subhankar-ghosh Dec 19, 2025
132434b
Update nemo/collections/tts/models/magpietts.py
subhankar-ghosh Dec 19, 2025
8a4e144
Update nemo/collections/tts/models/magpietts.py
subhankar-ghosh Dec 19, 2025
0f41d39
Update nemo/collections/tts/data/text_to_speech_dataset.py
subhankar-ghosh Dec 20, 2025
312f127
Combining Inference runner, using data classes in longform
subhankar-ghosh Dec 20, 2025
bd187b5
Merge branch 'magpietts_os_longform' of github.com:NVIDIA-NeMo/NeMo i…
subhankar-ghosh Dec 20, 2025
577eff9
Apply isort and black reformatting
subhankar-ghosh Dec 20, 2025
a8610d8
make LongFormTTSInferenceDataset a subclass of MagpieTTSDataset
subhankar-ghosh Dec 20, 2025
9e139c4
make LongFormTTSInferenceDataset a subclass of MagpieTTSDataset
subhankar-ghosh Dec 20, 2025
fe64ea8
Apply isort and black reformatting
subhankar-ghosh Dec 20, 2025
e6a248d
Update nemo/collections/tts/models/magpietts.py
subhankar-ghosh Dec 20, 2025
294439e
Update nemo/collections/tts/models/magpietts.py
subhankar-ghosh Dec 20, 2025
1e9004b
Remove redundant code from inference.py
subhankar-ghosh Dec 20, 2025
99ffb47
Merge branch 'magpietts_os_longform' of github.com:NVIDIA-NeMo/NeMo i…
subhankar-ghosh Dec 20, 2025
7faab9e
Adding longform test cases.
subhankar-ghosh Dec 22, 2025
33 changes: 30 additions & 3 deletions examples/tts/magpietts_inference.py
@@ -58,7 +58,11 @@
compute_mean_with_confidence_interval,
evaluate_generated_audio_dir,
)
from nemo.collections.tts.modules.magpietts_inference.inference import InferenceConfig, MagpieInferenceRunner
from nemo.collections.tts.modules.magpietts_inference.inference import (
InferenceConfig,
LongFormInferenceRunner,
MagpieInferenceRunner,
)
from nemo.collections.tts.modules.magpietts_inference.utils import (
ModelLoadConfig,
get_experiment_name_from_checkpoint_path,
@@ -129,6 +133,7 @@ def run_inference_and_evaluation(
log_exp_name: bool = False,
clean_up_disk: bool = False,
skip_evaluation: bool = False,
longform: bool = False,
) -> Tuple[Optional[float], Optional[float]]:
"""Run inference and optional evaluation on specified datasets.

@@ -144,6 +149,7 @@
log_exp_name: Whether to include experiment name in output paths.
clean_up_disk: Whether to clean up output directory after completion.
skip_evaluation: Whether to skip evaluation (inference only mode).
longform: Whether to use longform inference (processes text sentence-by-sentence).

Returns:
Tuple of (mean CER across datasets, mean SSIM across datasets).
@@ -166,8 +172,12 @@
# Build full checkpoint identifier
full_checkpoint_name = f"{checkpoint_name}_{inference_config.build_identifier()}_SV_{eval_config.sv_model}"

# Create inference runner
runner = MagpieInferenceRunner(model, inference_config)
# Create appropriate inference runner based on longform flag
if longform:
logging.info("Using longform inference mode (sentence-by-sentence processing)")
runner = LongFormInferenceRunner(model, inference_config)
else:
runner = MagpieInferenceRunner(model, inference_config)
Collaborator:
Do we need this? Can we natively switch to longform once we go over the 20s of decoder generation?

Collaborator (Author):

We have to predetermine LongForm vs. standard generation because LongForm needs to save and update history variables, which the window mechanism requires. However, to make the user experience seamless, I can try to add logic that chooses LongForm or standard generation based on the number of words in the input text (~40-50 words fit in 20 s of audio). What do you think?

Collaborator:

Is it not possible to initialize these parameters on the fly?

Collaborator (Author):

This does batch processing, so deciding per datapoint whether it is longform or short and dispatching to the corresponding runner would be complicated: MagpieInferenceRunner cannot do longform, and MagpieTTSDataset cannot be used for longform. But it might be possible to handle both standard and longform with LongFormInferenceRunner.

Collaborator (Author):

Let me try:

  1. Auto-detect logic: based on the input manifest, if any entry is longform (len(text) > 50 words), use longform inference; otherwise use standard inference.
  2. Merge the two inference runners into a single runner.
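The auto-detect heuristic proposed above could be sketched as follows (a minimal sketch only; `is_longform_manifest`, the 50-word threshold, and the manifest entry shape are assumptions taken from this thread, not the merged implementation):

```python
# Assumed cutoff from the discussion: ~40-50 words fit in 20 s of audio.
LONGFORM_WORD_THRESHOLD = 50


def is_longform_manifest(entries: list[dict]) -> bool:
    """Return True if any manifest entry's text exceeds the word threshold,
    in which case the whole batch would be routed to longform inference."""
    return any(len(e["text"].split()) > LONGFORM_WORD_THRESHOLD for e in entries)
```

One batch-level flag (rather than a per-datapoint decision) sidesteps the mixed-batch dispatch problem raised earlier in the thread.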

Collaborator (Author):

@blisc Please check the implementation now. It is much cleaner and reuses most of the existing code. The user experience is also seamless, since users do not need to decide between the longform and standard paths.


# Tracking metrics across datasets
datasets = list(dataset_meta_info.keys())
@@ -396,6 +406,17 @@ def create_argument_parser() -> argparse.ArgumentParser:
infer_group.add_argument('--batch_size', type=int, default=32)
infer_group.add_argument('--use_cfg', action='store_true', help='Enable classifier-free guidance')
infer_group.add_argument('--cfg_scale', type=float, default=2.5)
infer_group.add_argument(
'--longform',
action='store_true',
help='Enable longform inference for long text inputs (processes text sentence-by-sentence)',
)
infer_group.add_argument(
'--longform_max_decoder_steps',
type=int,
default=50000,
help='Maximum decoder steps for longform inference',
)

# Attention prior arguments
prior_group = parser.add_argument_group('Attention Prior')
@@ -495,12 +516,16 @@ def main():
parser.error("You must provide either:\n" " 1. --hparams_files and --checkpoint_files\n" " 2. --nemo_files")

# Build configurations
# Use higher max_decoder_steps for longform inference
max_decoder_steps = args.longform_max_decoder_steps if args.longform else 440

inference_config = InferenceConfig(
temperature=args.temperature,
topk=args.topk,
batch_size=args.batch_size,
use_cfg=args.use_cfg,
cfg_scale=args.cfg_scale,
max_decoder_steps=max_decoder_steps,
apply_attention_prior=args.apply_attention_prior,
attention_prior_epsilon=args.attention_prior_epsilon,
attention_prior_lookahead_window=args.attention_prior_lookahead_window,
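The step-cap selection in the hunk above reduces to a small helper (a minimal sketch assuming the diff's defaults of 440 standard decoder steps and a 50000-step longform cap; `pick_max_decoder_steps` is a hypothetical name, not part of the PR):

```python
def pick_max_decoder_steps(longform: bool, longform_max_decoder_steps: int = 50000) -> int:
    """Return the decoder step budget: longform runs get a much larger
    cap than the 440-step standard default, mirroring the diff's logic."""
    return longform_max_decoder_steps if longform else 440
```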
@@ -556,6 +581,7 @@ def main():
log_exp_name=args.log_exp_name,
clean_up_disk=args.clean_up_disk,
skip_evaluation=not args.run_evaluation,
longform=args.longform,
)

else: # nemo mode
@@ -581,6 +607,7 @@
log_exp_name=args.log_exp_name,
clean_up_disk=args.clean_up_disk,
skip_evaluation=not args.run_evaluation,
longform=args.longform,
)

# Check quality targets