Conversation
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
…into sfc-gh-truwase/ds_wall_clock_metrics
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
arctic_training/data/sft_factory.py (outdated diff)

    pack_samples_mode: Literal["naive", "balance_length"] = "naive"
    shuffle_samples: bool = True
This name is ambiguous, as it doesn't say when the samples get shuffled - and we do want them to get shuffled always! If we don't, we should fix that.
I think this should be dl_shuffle_samples or something similar, to indicate that it's the DL that is being controlled.
Will fix naming.
> and we do want them to get shuffled always! If we don't, we should fix that.

Can you clarify why we always want shuffling, since it is a major cause of imbalance across the ranks?
I was trying to say that the dataset must always be shuffled for proper training.
So if it doesn't get shuffled at the DL level, then we need to shuffle it at the dataset level before we sort or pack. So if we are going ahead with the approach of this PR, we have 2 choices:
- first shuffle the dataset, then pack, then sort
- pack first, then shuffle the packed samples, then sort
I think (1) makes the most sense, since we want to shuffle at the source level; packing first is likely to result in less randomization.
And of course, as flagged by your experiments, we have an issue with multiple datasets not being blended but concatenated, which leads to loss spikes when the domain changes abruptly. But that is out of the scope of this PR.
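To make the ordering concrete, here is a minimal sketch of option (1): shuffle at the source, pack greedily, then sort the packed samples. The helper below is hypothetical (not an ArcticTraining API) and assumes `samples` is a list of tokenized sequences:

```python
import random

def shuffle_pack_sort(samples, max_len, seed=42):
    """Hypothetical sketch of option (1): shuffle raw samples, pack, then sort the packs."""
    rng = random.Random(seed)
    rng.shuffle(samples)  # shuffle at the source level, before any packing/sorting

    # naive greedy packing: keep appending samples until max_len would be exceeded
    packs, cur, cur_len = [], [], 0
    for s in samples:
        if cur and cur_len + len(s) > max_len:
            packs.append(cur)
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += len(s)
    if cur:
        packs.append(cur)

    # sort the packed samples by total length, longest first
    packs.sort(key=lambda p: sum(len(s) for s in p), reverse=True)
    return packs
```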
Got it. Thanks for the clarification. I think we are talking about shuffling from different angles. The new dl_shuffle_samples is meant to give the user a way to work around the default shuffling behavior of the torch distributed sampler. In other words, regardless of which of your algorithm proposals we adopt in AT, it seems that when data is loaded for training it will be shuffled here.
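For context, here is roughly what that sampler-level shuffle looks like; this is a hedged sketch assuming the dataloader sits on top of torch's DistributedSampler (whose shuffle defaults to True), with the proposed dl_shuffle_samples flag threaded into it:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_dataloader(dataset, micro_batch_size, dl_shuffle_samples=True):
    # DistributedSampler reshuffles every epoch (seeded) unless shuffle=False is passed;
    # num_replicas/rank are inferred from the initialized process group.
    sampler = DistributedSampler(dataset, shuffle=dl_shuffle_samples)
    return DataLoader(dataset, batch_size=micro_batch_size, sampler=sampler)
```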
> So if we are going ahead with the approach of this PR, we have 2 choices:
> - first shuffle the dataset, then pack, then sort
> - pack first, then shuffle the packed samples, then sort

My understanding is that with either option, the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?
> I think we are talking about shuffling from different angles.

Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

> the preprocessed dataset will first be saved to disk before loading for the actual training. Is that correct?

That's correct, we cache the result. Only if the hparams change will it get rebuilt.
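Not the actual caching code, but as a sketch of the behavior described (a cache keyed on the preprocessing hparams, rebuilt only when they change), assuming an HF datasets-style Dataset with save_to_disk/load_from_disk and a hypothetical build_fn that does the shuffle/pack/sort:

```python
import hashlib
import json
import os

from datasets import Dataset, load_from_disk

def load_or_build(raw: Dataset, hparams: dict, cache_root: str, build_fn) -> Dataset:
    # Key the cache on the preprocessing hyperparameters, so any hparam change rebuilds it.
    key = hashlib.sha256(json.dumps(hparams, sort_keys=True).encode()).hexdigest()[:16]
    path = os.path.join(cache_root, key)
    if os.path.isdir(path):
        return load_from_disk(path)
    processed = build_fn(raw, **hparams)  # shuffle / pack / sort happens in here
    processed.save_to_disk(path)
    return processed
```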
> Yes and no. If you're taking away the shuffle at the DL level, we need to make sure shuffling happens elsewhere.

Got it. That means the perf balancing recommendation of this PR to avoid shuffling is not a practical solution.
If the dataset happens to be pre-shuffled enough, then the training quality shouldn't be impacted, but if it isn't, all sorts of skewed learning may occur. So I'd check with the modeling guys first.
But it should be easy to overcome: if dl_shuffle is false, do ds.shuffle first before doing the packing, no?
My concern is how packing/sorting affects the random order of the pre-shuffled dataset. I realize your proposal to sort within chunks, as opposed to globally, should help to retain some randomness. But it is unclear to me how effective that would be. I will get more data and follow up with you. Thanks!
Remember that when shuffled data is packed, even just a few small samples together, the new longer sample's content is already shuffled, so sorting these longer samples will not undo the shuffling effect. Does that make sense?
The randomness will only be lost if most samples are already at max_len, which is highly unlikely.
Regardless, it massively beats the global sorting.
So shuffle first, then pack, should be a solid strategy to retain randomness.
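A tiny illustration of that point, with hypothetical packs built from an already-shuffled dataset (the integers stand for original sample ids): sorting the packs only permutes whole packs, while the mixed ids inside each pack stay mixed.

```python
# Each inner list is one packed sample assembled from a pre-shuffled dataset.
packs = [[7, 2, 91], [44, 3], [58, 12, 30, 5], [66]]

# Sorting by pack size (a stand-in for token count) reorders whole packs only;
# the shuffled composition inside every pack is untouched.
print(sorted(packs, key=len, reverse=True))
# [[58, 12, 30, 5], [7, 2, 91], [44, 3], [66]]
```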
This looks good, Tunji. My main concern is that this PR is introducing curriculum effects which the operator is unlikely to be aware of - I think by default it shouldn't sort the whole dataset, but do it in chunks of a user-defined number of packed samples. My intuition suggests ~500 as a default.
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Yes, this is a good point. I will take a stab at local sorting in my second pass.
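A minimal sketch of what that chunked ("local") sort could look like, using the ~500 default suggested above; the names are illustrative, not the PR's actual API, and packed_samples is assumed to be a list of packs:

```python
def sort_in_chunks(packed_samples, key, chunk_size=500, descending=True):
    """Sort packed samples within consecutive fixed-size chunks instead of globally,
    which limits curriculum effects while still grouping similar lengths locally."""
    out = []
    for i in range(0, len(packed_samples), chunk_size):
        chunk = packed_samples[i:i + chunk_size]
        out.extend(sorted(chunk, key=key, reverse=descending))
    return out

# e.g. sort each window of 500 packs by total length, longest first:
# packed = sort_in_chunks(packed, key=lambda p: sum(len(s) for s in p))
```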
    @callback_wrapper("create_dataloader_no_shuffle")
    def create_dataloader_no_shuffle(self, dataset: DatasetType) -> DataLoader:
        """Create a torch DataLoader from the dataset."""
        return self._create_dataloader(dataset, sampler_shuffle=False)
Do we need to require a new method for this? Why not extend the existing create_dataloader? This would avoid the need to add this new method for each data factory (_validate_class_method(cls, "create_dataloader_no_shuffle", ["self", "dataset"]))
    def create_dataloader(self, dataset: DatasetType, shuffle: bool = True):
        return self._create_dataloader(dataset, sampler_shuffle=shuffle)
I previously extended create_dataloader with an optional shuffle flag. However, this caused a UT failure on
It seems the validation does not support optional args, or at least I don't know how to achieve that.
I also didn't want shuffle to be mandatory for create_dataloader. But if this is preferred, I can make that change.
What do you prefer?
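If it helps, one way the validation could tolerate an optional flag (a sketch only; I haven't checked how _validate_class_method is actually implemented) is to compare just the parameters that have no default:

```python
import inspect

def validate_class_method(cls, name, expected_required):
    """Check that a method exposes the expected required params, ignoring kwargs with defaults."""
    params = inspect.signature(getattr(cls, name)).parameters
    required = [
        p.name
        for p in params.values()
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]
    assert required == expected_required, f"{name}: expected {expected_required}, got {required}"

# With this check, create_dataloader(self, dataset, shuffle: bool = True) would still
# validate against ["self", "dataset"].
```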
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
arctic_training/data/sft_factory.py (outdated diff)

    sort_packed_samples_order: Literal["ascend", "descend"] = "descend"
    """ Sorting order for packed samples. """

    sort_packed_samples_scope: Literal["local", "global"] = "local"
I'm not sure local vs global is an intuitive, self-documenting mnemonic in this context - it requires doc reading to understand what each implies.
Perhaps batched vs all? That is, sort each batch separately vs. sort all?
local to me implies gpu/rank-local or some such.
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Add options for
Builds on #327