Conversation
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
…into sfc-gh-truwase/ds_wall_clock_metrics
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
arctic_training/data/sft_factory.py (outdated diff)

    pack_samples_mode: Literal["naive", "balance_length"] = "naive"
    shuffle_samples: bool = True
This name is ambiguous, as it doesn't say when the samples get shuffled - and we do want them to get shuffled always! If we don't, we should fix that.
I think this should be dl_shuffle_samples or something similar, to indicate that it's the DL that is being controlled.
Will fix naming.
> and we do want them to get shuffled always! If we don't, we should fix that.

Can you clarify why we always want shuffling, since it is a major cause of imbalance across the ranks?
I was trying to say that the dataset must always be shuffled for proper training.
So if it doesn't get shuffled at the DL level, then we need to shuffle it at the dataset level before we sort or pack. So if we are going ahead with the approach of this PR, we have 2 choices:
- first shuffle the dataset, then pack, then sort
- pack first, then shuffle the packed samples, then sort
I think (1) makes the most sense, since we want to shuffle at the source level; packing first is likely to result in less randomization.
And of course, as flagged by your experiments, we have an issue with multiple datasets not being blended but concatenated, which leads to loss spikes when the domain changes abruptly. But that is out of the scope of this PR.
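To make the ordering concrete, here is a minimal sketch of option (1): shuffle at the source, pack greedily, then sort the packed samples. The helper below is hypothetical (not an ArcticTraining API) and assumes `samples` is a list of tokenized sequences:

```python
import random

def shuffle_pack_sort(samples, max_len, seed=42):
    """Hypothetical sketch of option (1): shuffle raw samples, pack, then sort the packs."""
    rng = random.Random(seed)
    rng.shuffle(samples)  # shuffle at the source level, before any packing/sorting

    # naive greedy packing: keep appending samples until max_len would be exceeded
    packs, cur, cur_len = [], [], 0
    for s in samples:
        if cur and cur_len + len(s) > max_len:
            packs.append(cur)
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += len(s)
    if cur:
        packs.append(cur)

    # sort the packed samples by total length, longest first
    packs.sort(key=lambda p: sum(len(s) for s in p), reverse=True)
    return packs
```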
Got it. Thanks for the clarification. I think we are talking about shuffling from different angles. The new dl_shuffle_samples is meant to give the user a way to work around the default shuffling behavior of the torch distributed sampler. In other words, regardless of which of your algorithm proposals we adopt in AT, it seems that when data is loaded for training it will be shuffled here.
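For context, here is roughly what that sampler-level shuffle looks like; this is a hedged sketch assuming the dataloader sits on top of torch's DistributedSampler (whose shuffle defaults to True), with the proposed dl_shuffle_samples flag threaded into it:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_dataloader(dataset, micro_batch_size, dl_shuffle_samples=True):
    # DistributedSampler reshuffles every epoch (seeded) unless shuffle=False is passed;
    # num_replicas/rank are inferred from the initialized process group.
    sampler = DistributedSampler(dataset, shuffle=dl_shuffle_samples)
    return DataLoader(dataset, batch_size=micro_batch_size, sampler=sampler)
```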
> So if we are going ahead with the approach of this PR, we have 2 choices:
> - first shuffle the dataset, then pack, then sort
> - pack first, then shuffle the packed samples, then sort

My understanding is that with either option, the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?
> I think we are talking about shuffling from different angles.

Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

> the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?

That's correct, we cache the result. Only if the hparams change will it get rebuilt.
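Not the actual caching code, but as a sketch of the behavior described (a cache keyed on the preprocessing hparams, rebuilt only when they change), assuming an HF datasets-style Dataset with save_to_disk/load_from_disk and a hypothetical build_fn that does the shuffle/pack/sort:

```python
import hashlib
import json
import os

from datasets import Dataset, load_from_disk

def load_or_build(raw: Dataset, hparams: dict, cache_root: str, build_fn) -> Dataset:
    # Key the cache on the preprocessing hyperparameters, so any hparam change rebuilds it.
    key = hashlib.sha256(json.dumps(hparams, sort_keys=True).encode()).hexdigest()[:16]
    path = os.path.join(cache_root, key)
    if os.path.isdir(path):
        return load_from_disk(path)
    processed = build_fn(raw, **hparams)  # shuffle / pack / sort happens in here
    processed.save_to_disk(path)
    return processed
```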
> Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

Got it. That means the perf balancing recommendation of this PR to avoid shuffling is not a practical solution.
If the dataset happens to be pre-shuffled enough, then the training quality shouldn't be impacted, but if it isn't, all sorts of skewed learning may occur. So I'd check with the modeling guys first.
But it should be easy to overcome: if dl_shuffle is false, do ds.shuffle first before doing the packing, no?
My concern is how packing/sorting affects the random order of the pre-shuffled dataset. I realize your proposal to sort within chunks, as opposed to globally, should help to retain some randomness. But it is unclear to me how effective that would be. I will get more data and follow up with you. Thanks!
Remember that when shuffled data is packed, even just a few small samples together, the new longer sample's content is already shuffled, so sorting these longer samples will not undo the shuffling effect. Does that make sense?
The randomness will only be lost if most samples are already at max_len, which is highly unlikely.
Regardless, it massively beats the global sorting.
So shuffle first, then pack, should be a solid strategy to retain randomness.
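A tiny illustration of that point, with hypothetical packs built from an already-shuffled dataset (the integers stand for original sample ids): sorting the packs only permutes whole packs, while the mixed ids inside each pack stay mixed.

```python
# Each inner list is one packed sample assembled from a pre-shuffled dataset.
packs = [[7, 2, 91], [44, 3], [58, 12, 30, 5], [66]]

# Sorting by pack size (a stand-in for token count) reorders whole packs only;
# the shuffled composition inside every pack is untouched.
print(sorted(packs, key=len, reverse=True))
# [[58, 12, 30, 5], [7, 2, 91], [44, 3], [66]]
```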
This looks good, Tunji. My main concern is that this PR is introducing curriculum effects which the operator is unlikely to be aware of - I think by default it shouldn't sort the whole dataset, but do it in chunks of a user-defined number of packed samples. My intuition suggests ~500 as a default.
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Yes, this is a good point. I will take a stab at local sorting in my second pass.
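A minimal sketch of what that chunked ("local") sort could look like, using the ~500 default suggested above; the names are illustrative, not the PR's actual API, and packed_samples is assumed to be a list of packs:

```python
def sort_in_chunks(packed_samples, key, chunk_size=500, descending=True):
    """Sort packed samples within consecutive fixed-size chunks instead of globally,
    which limits curriculum effects while still grouping similar lengths locally."""
    out = []
    for i in range(0, len(packed_samples), chunk_size):
        chunk = packed_samples[i:i + chunk_size]
        out.extend(sorted(chunk, key=key, reverse=descending))
    return out

# e.g. sort each window of 500 packs by total length, longest first:
# packed = sort_in_chunks(packed, key=lambda p: sum(len(s) for s in p))
```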
    @callback_wrapper("create_dataloader_no_shuffle")
    def create_dataloader_no_shuffle(self, dataset: DatasetType) -> DataLoader:
        """Create a torch DataLoader from the dataset."""
        return self._create_dataloader(dataset, sampler_shuffle=False)
Do we need to require a new method for this? Why not extend the existing create_dataloader? This would avoid the need to add this new method for each data factory (_validate_class_method(cls, "create_dataloader_no_shuffle", ["self", "dataset"]))
    def create_dataloader(self, dataset: DatasetType, shuffle: bool = True):
        return self._create_dataloader(dataset, sampler_shuffle=shuffle)
I previously extended create_dataloader with an optional shuffle flag. However, this caused a UT failure on
It seems the validation does not support optional args, or at least I don't know how to achieve that.
I also didn't want shuffle to be mandatory for create_dataloader. But if this is preferred, I can make that change.
What do you prefer?
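If it helps, one way the validation could tolerate an optional flag (a sketch only; I haven't checked how _validate_class_method is actually implemented) is to compare just the parameters that have no default:

```python
import inspect

def validate_class_method(cls, name, expected_required):
    """Check that a method exposes the expected required params, ignoring kwargs with defaults."""
    params = inspect.signature(getattr(cls, name)).parameters
    required = [
        p.name
        for p in params.values()
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]
    assert required == expected_required, f"{name}: expected {expected_required}, got {required}"

# With this check, create_dataloader(self, dataset, shuffle: bool = True) would still
# validate against ["self", "dataset"].
```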
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
arctic_training/data/sft_factory.py (outdated diff)

    sort_packed_samples_order: Literal["ascend", "descend"] = "descend"
    """ Sorting order for packed samples. """

    sort_packed_samples_scope: Literal["local", "global"] = "local"
I'm not sure local vs global is an intuitive, self-documenting mnemonic in this context - it requires doc reading to understand what each implies.
Perhaps batched vs all? That is, sort each batch separately vs. sort all?
local to me implies gpu/rank-local or some such.
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Add options for
Builds on #327