Sharding for multi-GPU training #1634

@albertz

Description

Now, as a follow-up of #1630: a very nice next step/feature would be if we can use this sharding feature in general for any kind of multi-GPU training. This is similar to the horovod_dataset_distribution="shard" option we had for TF (which was implemented very inefficiently though, by just selecting every Nth seq from the dataset, i.e. the dataset still iterated through all the data). So maybe, to distinguish it, or to make it more explicit where the sharding is done, we should not just call this "shard" but "dataset_sharding" or so.
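
For illustration, a minimal sketch of how that old TF-era "shard" distribution behaved (the helper function here is made up for this example, not actual RETURNN code):

```python
def naive_shard(seq_iter, rank: int, num_workers: int):
    """Hypothetical sketch: yield only every num_workers-th sequence,
    starting at offset `rank`. Inefficient, because every worker still
    iterates (loads/decodes) the full dataset and just drops most of it."""
    for i, seq in enumerate(seq_iter):
        if i % num_workers == rank:
            yield seq
```

With proper dataset sharding (as in #1630), the dataset itself would only produce the worker's own slice instead.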

We should also not reuse the horovod_dataset_distribution option (which is intended only for Horovod), but maybe something generic like distributed_dataset_distribution? Or it could be part of torch_distributed, i.e. just dataset_distribution inside it? (In principle, we could reuse the feature later also for TF or other backend engines, but having it in torch_distributed for now is also fine.)
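
A rough config sketch of the two proposed variants (these keys are only part of this proposal, they do not exist yet):

```python
# Variant 1: generic top-level option (name is just a proposal).
# distributed_dataset_distribution = "shard"

# Variant 2: as part of the existing torch_distributed dict.
torch_distributed = {
    "dataset_distribution": "shard",  # proposed new key; value could also be "dataset_sharding"
}
```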

The dataset_distribution default would be "random_seed_offset" (i.e. like horovod_dataset_distribution="random_seed_offset"), which is the current behavior of PyTorch distributed training. (We could change the default via a new behavior version if we want to...)
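
To illustrate the difference: with "random_seed_offset", every worker still iterates the full dataset, only the shuffling seed gets a rank-dependent offset. A minimal sketch (the exact seed formula is illustrative, not what RETURNN actually uses):

```python
import random

def shuffled_for_rank(seqs, base_seed: int, rank: int):
    """Hypothetical sketch of the "random_seed_offset" behavior: all workers
    see the full data, just in a different (rank-offset) random order."""
    seqs = list(seqs)
    random.Random(base_seed + rank).shuffle(seqs)
    return seqs
```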

(Also note, similar to the comment I made in #1612: there are some implicit assumptions here, namely that the worker index and rank are static. This might not always be the case. But it might be possible to just update the shard index / num shards dynamically for the next sub-epoch. Just something to keep in mind; I don't think we need to take care of this now.)
