
Improve sample packing #347

Open
sfc-gh-truwase wants to merge 15 commits into main from sfc-gh-truwase/sample_packing

Conversation

sfc-gh-truwase (Collaborator) commented Jan 29, 2026

Add options for

  • Packing algorithms
  • Global sorting
  • Enable/disable data shuffling

Builds on #327
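For orientation, a minimal sketch of how these options might look as a config; the grouping into a single pydantic-style class and the docstrings are assumptions, while the field names and defaults are the ones quoted in the review threads below:

from typing import Literal

from pydantic import BaseModel


class SamplePackingConfig(BaseModel):
    # Hypothetical grouping of the fields discussed in this PR.
    pack_samples_mode: Literal["naive", "balance_length"] = "naive"
    """Packing algorithm: greedy/naive packing or length-balanced packing."""

    shuffle_samples: bool = True
    """Whether the DataLoader sampler shuffles samples (naming discussed below)."""

    sort_packed_samples_order: Literal["ascend", "descend"] = "descend"
    """Sorting order for packed samples."""

    sort_packed_samples_scope: Literal["local", "global"] = "local"
    """Sort packed samples within chunks ("local") or across the whole dataset ("global")."""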


pack_samples_mode: Literal["naive", "balance_length"] = "naive"

shuffle_samples: bool = True
sfc-gh-sbekman (Collaborator) commented Jan 29, 2026
This name is ambiguous, as it doesn't say when samples get shuffled - and we do want them to get shuffled always! If we don't, we should fix that.

I think this should say dl_shuffle_samples or something similar, to indicate that it's the DL that is being controlled.

Collaborator Author:
Will fix naming.

> and we do want them to get shuffled always! if we don't we should fix that.

Can you clarify why we want shuffling always, given that it is a major cause of imbalance across the ranks?

Collaborator:

I was trying to say that the dataset must always be shuffled for proper training.

So if it doesn't get shuffled at the DL level, then we need to shuffle it at the dataset level before we sort or pack. So for the work of this PR, we have 2 choices:

  1. first shuffle the dataset, then pack, then sort
  2. pack first, then shuffle the packed samples, then sort

I think (1) makes the most sense, since we want to shuffle at the source level; packing first is likely to result in less randomization (see the sketch below).

And of course, as flagged by your experiments, we have an issue with multiple datasets being concatenated rather than blended, which leads to loss spikes when the domain changes abruptly. But that is out of scope for this PR.
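A minimal sketch of option (1), assuming greedy ("naive") packing; the helper name and details are illustrative, not this PR's implementation:

import random
from typing import List

Sample = List[int]  # token ids


def shuffle_then_pack(samples: List[Sample], max_len: int, seed: int = 0) -> List[List[Sample]]:
    # (1) Shuffle at the source level so each pack mixes unrelated samples.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)

    # (2) Pack greedily: start a new pack when max_len would be exceeded.
    packs: List[List[Sample]] = []
    current: List[Sample] = []
    current_len = 0
    for s in shuffled:
        if current and current_len + len(s) > max_len:
            packs.append(current)
            current, current_len = [], 0
        current.append(s)
        current_len += len(s)
    if current:
        packs.append(current)

    # (3) Sort packs by total length; whether this sort is global or chunked
    # ("local") is the scope question discussed later in this thread.
    packs.sort(key=lambda p: sum(len(s) for s in p), reverse=True)
    return packs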

Collaborator Author:

Got it. Thanks for the clarification. I think we are talking about shuffling from different angles. The new dl_shuffle_samples is meant to give the user a way to work around the default shuffling behavior of the torch distributed sampler. In other words, regardless of which of your algorithm proposals we adopt in AT, it seems that when data is loaded for training it will be shuffled here.
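For context, a minimal sketch of how such a flag might be plumbed through to torch's DistributedSampler, whose shuffle argument defaults to True; the function name and wiring are assumptions, not this codebase's actual _create_dataloader:

from torch.utils.data import DataLoader, Dataset, DistributedSampler


def build_dataloader(dataset: Dataset, batch_size: int, dl_shuffle_samples: bool = True) -> DataLoader:
    # DistributedSampler shuffles by default; exposing the flag lets a
    # pre-packed/sorted order survive into training when shuffling is disabled.
    sampler = DistributedSampler(dataset, shuffle=dl_shuffle_samples)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)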

sfc-gh-truwase (Collaborator Author) commented Jan 30, 2026

> so if we are going for this work of this PR, we have 2 choices:
>
>   • first shuffle the dataset, then pack, then sort
>   • pack first, then shuffle the packed samples, then sort

My understanding is that with either option, the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?

Collaborator:

> I think we are talking about shuffling from different angles.

Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

> the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?

That's correct, we cache the result. Only if hparams change will it get rebuilt.

Collaborator Author:

> Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

Got it. That means this PR's perf-balancing recommendation to avoid shuffling is not a practical solution.

Collaborator:

If the dataset happens to be pre-shuffled enough, then the training quality shouldn't be impacted, but if it isn't, all sorts of skewed learning may occur. So I'd check with the modeling guys first.

But it should be easy to overcome: if dl_shuffle is false, do ds.shuffle first before doing the packing, no?

Collaborator Author:

My concern is how packing/sorting affects the random order of the pre-shuffled dataset. I realize your proposal to sort within chunks, as opposed to globally, should help to retain some randomness. But it is unclear to me how effective that would be. I will get more data and follow up with you. Thanks!

Collaborator:

Remember that when shuffled data is packed, even just a few small samples at a time, each new long sample's content is already shuffled, so sorting these longer samples will not undo the shuffling effect. Does that make sense?

The randomness will only be lost if most samples are already of max_len, which is highly unlikely.

Regardless, it massively beats global sorting.

So shuffle first then pack should be a solid strategy to retain randomness.

sfc-gh-sbekman (Collaborator):
This looks good, Tunji.

My main concern is that this PR is introducing curriculum effects that the operator is unlikely to be aware of - I think by default it shouldn't sort the whole dataset, but should do it in chunks of a user-defined number of packed samples. My intuition suggests ~500 as a default (see the sketch below).
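A minimal sketch of that chunked ("local") sort; the helper name and the chunk_size default of 500 follow the suggestion above and are not part of this PR's code:

from typing import List, Sequence


def chunked_sort_indices(packed_lengths: Sequence[int], chunk_size: int = 500,
                         descend: bool = True) -> List[int]:
    # Sort packed samples by length within fixed-size chunks only, so any
    # curriculum effect is bounded to a window of ~chunk_size packed samples.
    order: List[int] = []
    for start in range(0, len(packed_lengths), chunk_size):
        chunk = list(range(start, min(start + chunk_size, len(packed_lengths))))
        chunk.sort(key=lambda i: packed_lengths[i], reverse=descend)
        order.extend(chunk)
    return order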

sfc-gh-truwase and others added 2 commits January 29, 2026 17:34
sfc-gh-truwase (Collaborator Author):

> My main concern is that this PR is introducing curriculum effects that the operator is unlikely to be aware of

Yes, this is a good point. I will take a stab at local sorting in my second pass.

Comment on lines +248 to +251
@callback_wrapper("create_dataloader_no_shuffle")
def create_dataloader_no_shuffle(self, dataset: DatasetType) -> DataLoader:
    """Create a torch DataLoader from the dataset."""
    return self._create_dataloader(dataset, sampler_shuffle=False)
Collaborator:

Do we need a new method for this? Why not extend the existing create_dataloader? That would avoid having to add this new method to each data factory (_validate_class_method(cls, "create_dataloader_no_shuffle", ["self", "dataset"]))

def create_dataloader(self, dataset: DatasetType, shuffle: bool = True):
    return self._create_dataloader(dataset, sampler_shuffle=shuffle)

Collaborator Author:

I previously extended create_dataloader with an optional shuffle flag. However, this caused a UT failure on

_validate_class_method(cls, "create_dataloader", ["self", "dataset"])

It seems the validation does not support optional args, or at least I don't know how to achieve that.

I also didn't want shuffle to be mandatory for create_dataloader. But if this is preferred, I can make that change.

What do you prefer?
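For illustration, a hedged sketch of a signature check that tolerates optional keyword arguments; this is an assumed reimplementation of _validate_class_method, not the existing helper:

import inspect


def _validate_class_method(cls, name: str, required_params: list) -> None:
    # Require the method to exist and to take exactly `required_params` as its
    # required (no-default) parameters; optional args like `shuffle: bool = True`
    # are ignored, so extending create_dataloader would still validate.
    method = getattr(cls, name, None)
    if method is None:
        raise ValueError(f"{cls.__name__} is missing required method {name}")
    params = inspect.signature(method).parameters.values()
    required = [
        p.name
        for p in params
        if p.default is inspect.Parameter.empty
        and p.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    ]
    if required != list(required_params):
        raise ValueError(f"{cls.__name__}.{name} must require exactly {required_params}, got {required}")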

sort_packed_samples_order: Literal["ascend", "descend"] = "descend"
""" Sorting order for packed samples. """

sort_packed_samples_scope: Literal["local", "global"] = "local"
sfc-gh-sbekman (Collaborator) commented Feb 3, 2026

I'm not sure local vs global is an intuitive, self-documenting mnemonic in this context - it requires doc reading to understand what each implies.

Perhaps batched vs all? That is, sort each batch separately vs. sort all?

Collaborator:

local to me implies GPU/rank-local or some such.
