Weighted sampler misalignment when using --percent_data #12

@somir-khan

Description

Summary

Weighted sampling currently builds per-sample weights from the full training CSV, but the training dataset may be subsampled when --percent_data is below 1.0. This mismatch can cause the weighted sampler to request indices that are out of bounds for the reduced dataset, leading to crashes during training.

Steps to Reproduce

  1. Use a training CSV with a split column and imbalanced classes.
  2. Run run_class_finetuning.py with both --percent_data 0.3 (or any value less than 1) and --weights enabled.
  3. Start training; the process fails with IndexError: single positional indexer is out-of-bounds, raised from dataset_train access in the data loader workers.
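The mismatch can be illustrated without the full training script. The sketch below uses hypothetical names (full_labels, subset, percent_data) and the standard library only; the repository builds its weights from the CSV, but the length arithmetic is the same: weights are computed over all rows while the dataset keeps only a fraction of them.

```python
# Minimal sketch of the misalignment (hypothetical names, stdlib only).
import random

full_labels = [0] * 90 + [1] * 10                 # full CSV: 100 rows, imbalanced
percent_data = 0.3
subset = full_labels[: int(len(full_labels) * percent_data)]  # dataset_train: 30 rows

# Weights are built from the FULL label list...
counts = {c: full_labels.count(c) for c in set(full_labels)}
weights = [1.0 / counts[c] for c in full_labels]  # len(weights) == 100

# ...so a sampler drawing from range(len(weights)) can emit indices that
# exceed len(subset), which is exactly the out-of-bounds access reported.
random.seed(0)
drawn = random.choices(range(len(weights)), weights=weights, k=10)
out_of_bounds = [i for i in drawn if i >= len(subset)]

assert len(weights) > len(subset)                 # 100 vs 30: the core mismatch
```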

Expected Behavior

  • Weighted sampling should draw indices that are valid for the subsampled training dataset.
  • --percent_data and --weights should be safe to use together without manual intervention.

Actual Behavior

  • The sampler builds weights from the full set of training labels and can emit indices beyond the length of the subsampled dataset, causing an IndexError early in training.

Impact

  • Users combining partial-data training with class rebalancing cannot train models; runs terminate immediately due to the out-of-bounds access.

Proposed Fix

  • Derive sample weights from the already-subsampled dataset_train so that the weighted sampler and dataset remain aligned regardless of the --percent_data value.
  • Add regression coverage or assertions to ensure weighted sampling stays consistent with dataset length after subsampling.
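A minimal sketch of the first bullet, assuming an inverse-class-frequency weighting scheme and a hypothetical helper name (make_sample_weights); the key property is that the weights are derived from the labels of the already-subsampled dataset, so their length always matches the dataset length. In a PyTorch pipeline these weights would typically be handed to torch.utils.data.WeightedRandomSampler with num_samples=len(dataset_train).

```python
# Hedged sketch of the proposed fix: compute per-sample weights from the
# subsampled labels, never from the full CSV (hypothetical helper name).
from collections import Counter

def make_sample_weights(labels):
    """Inverse-frequency weight for each sample in `labels`."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]

# Labels of dataset_train AFTER --percent_data subsampling (illustrative data).
subset_labels = [0] * 25 + [1] * 5

weights = make_sample_weights(subset_labels)
assert len(weights) == len(subset_labels)   # sampler indices stay in range
```

The in-range assertion above doubles as the kind of cheap runtime check the second bullet asks for: asserting len(weights) == len(dataset_train) right before constructing the sampler would have turned this crash into an immediate, diagnosable failure.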

Metadata

Assignees: No one assigned
Labels: No labels
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests