Summary
Weighted sampling currently builds per-sample weights from the full training CSV, but the training dataset may be subsampled when --percent_data is below 1.0. This mismatch can cause the weighted sampler to request indices that are out of bounds for the reduced dataset, leading to crashes during training.
Steps to Reproduce
- Use a training CSV with a `split` column and imbalanced classes.
- Run `run_class_finetuning.py` with both `--percent_data 0.3` (or any value less than 1) and `--weights` enabled.
- Start training; the process will eventually fail with an `IndexError: single positional indexer is out-of-bounds` originating from `dataset_train` access in the data loader workers.
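The mismatch can be demonstrated with a minimal stand-in: plain Python with `random.choices` playing the role of PyTorch's `WeightedRandomSampler`, and illustrative class counts (the 900/100 split and helper names below are assumptions, not the project's code).

```python
import random

full_labels = [0] * 900 + [1] * 100             # imbalanced full training CSV
percent_data = 0.3
subsampled_len = int(len(full_labels) * percent_data)  # dataset_train keeps 300 rows

# Bug: per-sample weights are built from ALL labels in the CSV...
counts = {c: full_labels.count(c) for c in set(full_labels)}
weights = [1.0 / counts[y] for y in full_labels]       # length 1000, not 300

# ...so the sampler draws indices in [0, 1000) while the dataset has 300 rows.
drawn = random.choices(range(len(weights)), weights=weights, k=subsampled_len)
out_of_bounds = [i for i in drawn if i >= subsampled_len]
# Any such index raises "IndexError: single positional indexer is out-of-bounds"
# once the data loader indexes into the subsampled dataset_train.
```

Because the weight vector is longer than the dataset, out-of-range indices are drawn with high probability on every epoch, which is why runs fail early rather than intermittently.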
Expected Behavior
- Weighted sampling should draw indices that are valid for the subsampled training dataset.
- `--percent_data` and `--weights` should be safe to use together without manual intervention.
Actual Behavior
- The sampler builds weights from the full set of training labels and can emit indices beyond the length of the subsampled dataset, causing an IndexError early in training.
Impact
- Users combining partial-data training with class rebalancing cannot train models; runs terminate immediately due to the out-of-bounds access.
Proposed Fix
- Derive sample weights from the already-subsampled `dataset_train` so that the weighted sampler and dataset remain aligned regardless of the `--percent_data` value.
- Add regression coverage or assertions to ensure weighted sampling stays consistent with dataset length after subsampling.
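A sketch of the proposed fix, with hypothetical helper and variable names (only `dataset_train` and `--percent_data` come from the codebase): compute inverse-frequency weights from the labels of the subsampled dataset, so the weight vector and the dataset always have the same length.

```python
import random
from collections import Counter

def make_sample_weights(labels):
    """Inverse-frequency weight per sample, computed over `labels` only."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

full_labels = [0] * 900 + [1] * 100             # illustrative full CSV labels
percent_data = 0.3
random.seed(0)
# Stand-in for dataset_train after --percent_data subsampling:
subsampled_labels = random.sample(full_labels, int(len(full_labels) * percent_data))

weights = make_sample_weights(subsampled_labels)
# Sampler and dataset now agree on length, so every drawn index is valid.
drawn = random.choices(range(len(weights)), weights=weights, k=64)
```

The `assert len(weights) == len(dataset_train)` invariant implied here is also a natural candidate for the regression assertion suggested above.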