[lhotse] Added support for re-weighting datasets with temperature on the fly. #15200
Pull request overview
This PR introduces temperature-based dataset reweighting to the Lhotse dataloader, enabling dynamic adjustment of sampling probabilities across datasets through a configurable temperature parameter. This eliminates the need for manual weight recalculation when combining datasets with different sizes.
Key changes:
- New `temperature_reweighting()` function that applies temperature scaling to weights using the formula `(w_i ^ temp) / sum(w_j ^ temp)`
- New `reweight_temperature` configuration option that supports hierarchical temperature application across nested dataset groups
- Comprehensive test suite with 19 tests covering various input types and edge cases
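The scaling formula listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code added in `cutset.py`: the function name `temperature_reweight_sketch` is hypothetical, and the requirement that weights be strictly positive is an assumption made here to sidestep edge cases such as `0 ** 0`.

```python
from typing import List, Sequence


def temperature_reweight_sketch(weights: Sequence[float], temperature: float) -> List[float]:
    """Return w_i ** temperature / sum_j(w_j ** temperature).

    Hypothetical helper for illustration only; the PR's own
    temperature_reweighting() may validate inputs differently.
    """
    if not weights:
        return []
    # Assumption: weights are strictly positive, so the sum below is never zero.
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    scaled = [w ** temperature for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]


# temperature = 1.0 keeps the original ratios; temperature = 0.0 makes all weights equal.
preserved = temperature_reweight_sketch([1.0, 3.0], 1.0)
equalized = temperature_reweight_sketch([1.0, 3.0], 0.0)
```

A temperature between 0 and 1 lands between these two extremes, pulling small datasets up without fully equalizing them.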
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `nemo/collections/common/data/lhotse/cutset.py` | Adds the `temperature_reweighting()` function and integrates it into the dataset loading pipeline with temperature propagation through nested groups |
| `examples/tts/conf/magpietts/magpietts_lhotse.yaml` | Adds example configuration demonstrating hierarchical temperature usage with `[1.0, 0.0]` |
| `tests/collections/common/test_lhotse_temperature_reweighting.py` | Adds comprehensive unit and integration tests for the temperature reweighting functionality |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
…ights predefined in train_ds.dataset.input_cfg YAML configs. This feature saves the effort of flattening the dataset distribution every time new datasets are added. Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com>
Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…onfig. Otherwise, it would not be passed to propagate_attrs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…t length could be shorter or longer than the max depth of recursion group. added tests. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
8c81ce9 to
f524347
Compare
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
nemo/collections/common/data/lhotse/dataloader.py:19
- Import of 'List' is not used.
from typing import Any, Optional, Sequence, Union
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
shuffle: true
num_workers: 6
pin_memory: true
reweight_temperature: [1.0, 0.0]  # Temperature for re-weighting datasets. 1 is a neutral value. Lower temperature over-samples smaller datasets, and vice versa.
This is my understanding, correct me if I am wrong: Level 1 (Language): preserves the weight, Level 2 (datasets within each language): equalizes.
Based on this understanding, don't we want to normalize the weights based on number of shar files/size of the datasets at level 2?
You're right, but this is only an example demonstrating how to use the new parameter. In practice, you should override it based on your needs.
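To make the exchange above concrete, here is a rough sketch of what `[1.0, 0.0]` does to a two-level configuration. The group names and weights are invented for illustration, and the sketch assumes that a leaf dataset's effective sampling probability is the product of the normalized weights along its path; the actual pipeline in `cutset.py` may compose nested groups differently.

```python
from typing import Dict, List, Sequence, Tuple


def reweight(weights: Sequence[float], temperature: float) -> List[float]:
    """Temperature scaling: w_i ** t / sum_j(w_j ** t), for positive weights."""
    scaled = [w ** temperature for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical config: language -> (language weight, {dataset name: dataset weight}).
groups: Dict[str, Tuple[float, Dict[str, float]]] = {
    "en": (0.8, {"en_large": 900.0, "en_small": 100.0}),
    "de": (0.2, {"de_a": 50.0, "de_b": 50.0}),
}


def effective_probabilities(
    groups: Dict[str, Tuple[float, Dict[str, float]]],
    temperatures: Sequence[float],
) -> Dict[str, float]:
    """Apply temperatures[0] across languages and temperatures[1] within each language."""
    level1_t, level2_t = temperatures
    names = list(groups)
    language_probs = reweight([groups[name][0] for name in names], level1_t)
    result: Dict[str, float] = {}
    for name, language_p in zip(names, language_probs):
        datasets = groups[name][1]
        dataset_probs = reweight(list(datasets.values()), level2_t)
        for dataset_name, dataset_p in zip(datasets, dataset_probs):
            # Assumption: nested probabilities multiply along the path to the leaf.
            result[dataset_name] = language_p * dataset_p
    return result


probs = effective_probabilities(groups, [1.0, 0.0])
```

Under these assumptions the language split stays at 80/20, while `en_small` is drawn as often as `en_large` even though its weight is nine times smaller. That is the equalizing effect the reviewer points out, and it is why the second value should be chosen per use case.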
Summary
This PR introduces temperature-based dataset reweighting to the Lhotse dataloader, enabling dynamic adjustment of sampling probabilities across datasets without manually recalculating weights. It removes the need to update dataset weights when adding or removing datasets. Simply specify the desired temperature, and the weights are automatically adjusted at runtime.
Key Changes
1. New `temperature_reweighting()` function
Applies temperature scaling to dataset weights using the formula:
ŵᵢ = wᵢ^τ / Σⱼ wⱼ^τ.
- τ = 1.0: Preserves original weight ratios (neutral)
- τ = 0.0: Equalizes all datasets regardless of original weights
- 0 < τ < 1.0: Over-samples smaller datasets relative to larger ones
2. Flexible `reweight_temperature` config option
Supports multiple input formats with automatic normalization:
Example - preserve top-level ratios but equalize within sub-groups:
3. Comprehensive test coverage
`temperature_reweighting()` and `count_input_cfg_levels()`
Documentation
See the new "Dataset Reweighting with Temperature" section in the audio configuration docs:
docs/source/audio/configs.rst
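As a hand check of the formula in the summary, consider two datasets with weights w = (1, 9). These numbers are chosen for illustration and do not come from the PR.

```latex
\begin{align*}
\tau = 1.0 &: \quad \hat{w} = \left(\tfrac{1}{1+9},\ \tfrac{9}{1+9}\right) = (0.1,\ 0.9) \\
\tau = 0.5 &: \quad \hat{w} = \left(\tfrac{\sqrt{1}}{\sqrt{1}+\sqrt{9}},\ \tfrac{\sqrt{9}}{\sqrt{1}+\sqrt{9}}\right) = (0.25,\ 0.75) \\
\tau = 0.0 &: \quad \hat{w} = \left(\tfrac{1}{1+1},\ \tfrac{1}{1+1}\right) = (0.5,\ 0.5)
\end{align*}
```

The smaller dataset's share rises from 10% to 25% to 50% as τ falls from 1 to 0, matching the three cases listed in the summary.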