Weighted sampler misalignment when using --percent_data #12

@somir-khan

Description

Summary

Weighted sampling currently builds per-sample weights from the full training CSV, but the training dataset may be subsampled when --percent_data is below 1.0. This mismatch can cause the weighted sampler to request indices that are out of bounds for the reduced dataset, leading to crashes during training.

Steps to Reproduce

  1. Use a training CSV with a split column and imbalanced classes.
  2. Run run_class_finetuning.py with both --percent_data 0.3 (or any value less than 1) and --weights enabled.
  3. Start training; the process fails with IndexError: single positional indexer is out-of-bounds, raised from dataset_train access in the data loader workers.
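The mismatch can be illustrated without the full training script. The sketch below uses hypothetical names (full_labels, subset, percent_data) and the standard library only; the repository builds its weights from the CSV, but the length arithmetic is the same: weights are computed over all rows while the dataset keeps only a fraction of them.

```python
# Minimal sketch of the misalignment (hypothetical names, stdlib only).
import random

full_labels = [0] * 90 + [1] * 10                 # full CSV: 100 rows, imbalanced
percent_data = 0.3
subset = full_labels[: int(len(full_labels) * percent_data)]  # dataset_train: 30 rows

# Weights are built from the FULL label list...
counts = {c: full_labels.count(c) for c in set(full_labels)}
weights = [1.0 / counts[c] for c in full_labels]  # len(weights) == 100

# ...so a sampler drawing from range(len(weights)) can emit indices that
# exceed len(subset), which is exactly the out-of-bounds access reported.
random.seed(0)
drawn = random.choices(range(len(weights)), weights=weights, k=10)
out_of_bounds = [i for i in drawn if i >= len(subset)]

assert len(weights) > len(subset)                 # 100 vs 30: the core mismatch
```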

Expected Behavior

  • Weighted sampling should draw indices that are valid for the subsampled training dataset.
  • --percent_data and --weights should be safe to use together without manual intervention.

Actual Behavior

  • The sampler builds weights from the full set of training labels and can emit indices beyond the length of the subsampled dataset, causing an IndexError early in training.

Impact

  • Users combining partial-data training with class rebalancing cannot train models; runs terminate immediately due to the out-of-bounds access.

Proposed Fix

  • Derive sample weights from the already-subsampled dataset_train so that the weighted sampler and dataset remain aligned regardless of the --percent_data value.
  • Add regression coverage or assertions to ensure weighted sampling stays consistent with dataset length after subsampling.
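A minimal sketch of the first bullet, assuming an inverse-class-frequency weighting scheme and a hypothetical helper name (make_sample_weights); the key property is that the weights are derived from the labels of the already-subsampled dataset, so their length always matches the dataset length. In a PyTorch pipeline these weights would typically be handed to torch.utils.data.WeightedRandomSampler with num_samples=len(dataset_train).

```python
# Hedged sketch of the proposed fix: compute per-sample weights from the
# subsampled labels, never from the full CSV (hypothetical helper name).
from collections import Counter

def make_sample_weights(labels):
    """Inverse-frequency weight for each sample in `labels`."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]

# Labels of dataset_train AFTER --percent_data subsampling (illustrative data).
subset_labels = [0] * 25 + [1] * 5

weights = make_sample_weights(subset_labels)
assert len(weights) == len(subset_labels)   # sampler indices stay in range
```

The in-range assertion above doubles as the kind of cheap runtime check the second bullet asks for: asserting len(weights) == len(dataset_train) right before constructing the sampler would have turned this crash into an immediate, diagnosable failure.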

Metadata

Assignees: No one assigned
Labels: No labels
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests