@XuesongYang XuesongYang commented Dec 17, 2025

Summary

This PR introduces temperature-based dataset reweighting to the Lhotse dataloader, enabling dynamic adjustment of sampling probabilities across datasets without manually recalculating weights. Simply specify the desired temperature, and the weights are adjusted automatically at runtime, so there is no need to update dataset weights when adding or removing datasets.

Key Changes

1. New temperature_reweighting() function

Applies temperature scaling to dataset weights using the formula:

ŵᵢ = wᵢ^τ / Σⱼ wⱼ^τ.

  • τ = 1.0: Preserves original weight ratios (neutral)
  • τ = 0.0: Equalizes all datasets regardless of original weights
  • 0 < τ < 1.0: Over-samples smaller datasets relative to larger ones
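A minimal sketch of this scaling, assuming weights are plain floats (the actual `temperature_reweighting()` signature in the PR may differ):

```python
import numpy as np

def temperature_reweighting(weights, temperature):
    # Raise each weight to the power tau, then renormalize so the result sums to 1.
    scaled = np.asarray(weights, dtype=float) ** temperature
    return (scaled / scaled.sum()).tolist()

# tau = 1.0 preserves the original ratios; tau = 0.0 equalizes everything.
print(temperature_reweighting([3.0, 1.0], 1.0))  # [0.75, 0.25]
print(temperature_reweighting([3.0, 1.0], 0.0))  # [0.5, 0.5]
```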

2. Flexible reweight_temperature config option

Supports multiple input formats with automatic normalization:

  • Scalar: Broadcasts to all nesting levels (with warning)
  • List matching depth: Applied per-level as specified
  • List too short: Extended by repeating last value (with warning)
  • List too long: Trimmed to max depth (with warning)
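The normalization rules above can be sketched as follows (a hypothetical helper; the real logic lives in the dataloader config handling and may differ in names and warning text):

```python
import warnings

def normalize_reweight_temperature(value, max_depth):
    # Scalar: broadcast the same temperature to every nesting level.
    if isinstance(value, (int, float)):
        warnings.warn("scalar temperature broadcast to all nesting levels")
        return [float(value)] * max_depth
    temps = [float(v) for v in value]
    if len(temps) < max_depth:
        # Too short: repeat the last value until the list reaches max_depth.
        warnings.warn("temperature list extended by repeating last value")
        temps += [temps[-1]] * (max_depth - len(temps))
    elif len(temps) > max_depth:
        # Too long: trim the list to max_depth.
        warnings.warn("temperature list trimmed to max depth")
        temps = temps[:max_depth]
    return temps
```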

Example - preserve top-level ratios but equalize within sub-groups:

reweight_temperature: [1.0, 0.0]  # Level 1: preserve, Level 2: equalize
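For instance, with two top-level groups weighted 0.7/0.3 and unequal datasets inside each, `[1.0, 0.0]` keeps the 0.7/0.3 split at level 1 while splitting evenly inside each group. A worked sketch (names and numbers are illustrative, not from the PR):

```python
def reweight(weights, tau):
    # The temperature formula from the summary: w_i**tau / sum_j(w_j**tau).
    scaled = [w ** tau for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]

# Level 1 (tau = 1.0): group ratios preserved.
lang_weights = reweight([0.7, 0.3], 1.0)  # ~[0.7, 0.3] up to float rounding
# Level 2 (tau = 0.0): datasets within a group equalized.
en_weights = reweight([0.9, 0.1], 0.0)    # [0.5, 0.5]
# Effective weight of the first dataset in the first group: 0.7 * 0.5 = 0.35.
```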

3. Comprehensive test coverage

  • Unit tests for temperature_reweighting() and count_input_cfg_levels()
  • Integration tests for dataloader with various temperature configurations
  • Validation tests for scalar/list normalization behavior

Documentation

See the new "Dataset Reweighting with Temperature" section in the audio configuration docs: docs/source/audio/configs.rst

Copilot AI review requested due to automatic review settings December 17, 2025 00:30
Copilot AI left a comment

Pull request overview

This PR introduces temperature-based dataset reweighting to the Lhotse dataloader, enabling dynamic adjustment of sampling probabilities across datasets through a configurable temperature parameter. This eliminates the need for manual weight recalculation when combining datasets with different sizes.

Key changes:

  • New temperature_reweighting() function that applies temperature scaling to weights using the formula (w_i ^ temp) / sum(w_j ^ temp)
  • New reweight_temperature configuration option that supports hierarchical temperature application across nested dataset groups
  • Comprehensive test suite with 19 tests covering various input types and edge cases

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Changed files:

  • nemo/collections/common/data/lhotse/cutset.py: Adds the temperature_reweighting() function and integrates it into the dataset loading pipeline with temperature propagation through nested groups
  • examples/tts/conf/magpietts/magpietts_lhotse.yaml: Adds an example configuration demonstrating hierarchical temperature usage with [1.0, 0.0]
  • tests/collections/common/test_lhotse_temperature_reweighting.py: Adds comprehensive unit and integration tests for the temperature reweighting functionality


Copilot AI left a comment
Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Xuesong Yang and others added 6 commits January 12, 2026 21:00
…ights predefined in train_ds.dataset.input_cfg YAML configs. This feature saves the effort of flattening the dataset distribution every time a new dataset is added.

Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…onfig. Otherwise, it would not be passed to propagate_attrs.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…t length could be shorter or longer than the max depth of the recursion group. Added tests.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

nemo/collections/common/data/lhotse/dataloader.py:19

  • Import of 'List' is not used.
from typing import Any, Optional, Sequence, Union


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
shuffle: true
num_workers: 6
pin_memory: true
reweight_temperature: [1.0, 0.0] # Temperature for re-weighting datasets. 1 is a neutral value. Lower temperature over-samples smaller datasets, and vice versa.
Collaborator
This is my understanding, correct me if I am wrong: Level 1 (language) preserves the weights; Level 2 (datasets within each language) equalizes them.
Based on this understanding, don't we want to normalize the weights by the number of shar files / the size of the datasets at level 2?

Collaborator Author

You're right, but this is just an example demonstrating how to use the new param. In practice, you should override it based on your needs.
