@TueVNguyen TueVNguyen commented Jun 4, 2025

With the rmpad_with_pos_ids or rmpad mode, training covers only a subset of the full dataset.
The cause is UnpackDataCollator(), which keeps only the first sample of each training batch and discards the rest.
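
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the actual UnpackDataCollator code, which this description does not show: the function names `collate_first_only` and `collate_packed`, the dictionary keys, and the assumption that the fix concatenates every sample with position ids restarting per sample are all illustrative.

```python
from typing import Dict, List

Sample = Dict[str, List[int]]


def collate_first_only(batch: List[Sample]) -> Sample:
    """Hypothetical buggy pattern: return the first sample, drop the others."""
    return dict(batch[0])


def collate_packed(batch: List[Sample]) -> Sample:
    """Hypothetical fix: concatenate all samples without padding.

    Position ids restart at 0 for every sample so that sample
    boundaries remain recoverable from the packed sequence.
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    for sample in batch:
        ids = sample["input_ids"]
        input_ids.extend(ids)
        position_ids.extend(range(len(ids)))
    return {"input_ids": input_ids, "position_ids": position_ids}


# Toy batch of two samples (made-up token ids).
batch = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
buggy = collate_first_only(batch)
fixed = collate_packed(batch)
print(buggy["input_ids"])     # [1, 2, 3]
print(fixed["input_ids"])     # [1, 2, 3, 4, 5]
print(fixed["position_ids"])  # [0, 1, 2, 0, 1]
```

In the first variant the second sample never reaches the model, which is the kind of silent data loss described above.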

Please see the comments in the script test_rm_pad_dataset.py:


# Before fixing the bug, the result is:
# Total samples in the data loader: 1226.0
# Total samples in the dataset: 1000
# Total unique ids got from dataset: 248

# After fixing the bug, the result is:
# Total samples in the data loader: 1237.0
# Total samples in the dataset: 1000
# Total unique ids got from dataset: 992
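
A coverage check of this kind can be sketched as follows. This is not the content of test_rm_pad_dataset.py, which is not reproduced here; `unique_ids_seen`, `first_only`, `keep_all`, the `id` field, and the batch size of 4 are assumptions chosen for illustration, and the toy counts are not meant to reproduce the numbers above.

```python
from typing import Callable, Dict, List

Sample = Dict[str, int]
Collator = Callable[[List[Sample]], Dict[str, List[int]]]


def unique_ids_seen(dataset: List[Sample], batch_size: int, collate: Collator) -> int:
    """Count how many distinct sample ids survive collation."""
    seen = set()
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        seen.update(collate(batch)["ids"])
    return len(seen)


def first_only(batch: List[Sample]) -> Dict[str, List[int]]:
    """Hypothetical buggy collator: keeps only the first sample's id."""
    return {"ids": [batch[0]["id"]]}


def keep_all(batch: List[Sample]) -> Dict[str, List[int]]:
    """Hypothetical fixed collator: keeps every sample's id."""
    return {"ids": [sample["id"] for sample in batch]}


# Toy dataset of 1000 samples, each carrying a unique id.
dataset = [{"id": i} for i in range(1000)]
print(unique_ids_seen(dataset, 4, first_only))  # 250
print(unique_ids_seen(dataset, 4, keep_all))    # 1000
```

Comparing the number of unique ids that reach the model with the dataset size is a cheap way to catch a collator that drops samples.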
