Sharding for multi-GPU training #1634

@albertz

Description

Now, as a follow-up of #1630: a very nice next step/feature would be if we can use this sharding feature in general for any kind of multi-GPU training. This is similar to the horovod_dataset_distribution="shard" option we had for TF (which was implemented very inefficiently though, by just selecting every Nth seq from the dataset, i.e. the dataset still iterated through all the data). So maybe, to distinguish it, or to make it more explicit where the sharding is done, we should not just call this "shard" but "dataset_sharding" or so.
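
For illustration, a minimal sketch of how that old TF-era "shard" distribution behaved (the helper function here is made up for this example, not actual RETURNN code):

```python
def naive_shard(seq_iter, rank: int, num_workers: int):
    """Hypothetical sketch: yield only every num_workers-th sequence,
    starting at offset `rank`. Inefficient, because every worker still
    iterates (loads/decodes) the full dataset and just drops most of it."""
    for i, seq in enumerate(seq_iter):
        if i % num_workers == rank:
            yield seq
```

With proper dataset sharding (as in #1630), the dataset itself would only produce the worker's own slice instead.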

We should also not reuse the horovod_dataset_distribution option (which is intended only for Horovod), but maybe something generic like distributed_dataset_distribution? Or it could be part of torch_distributed, i.e. just dataset_distribution inside it? (In principle, we could reuse the feature later also for TF or other backend engines, but having it in torch_distributed for now is also fine.)
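
A rough config sketch of the two proposed variants (these keys are only part of this proposal, they do not exist yet):

```python
# Variant 1: generic top-level option (name is just a proposal).
# distributed_dataset_distribution = "shard"

# Variant 2: as part of the existing torch_distributed dict.
torch_distributed = {
    "dataset_distribution": "shard",  # proposed new key; value could also be "dataset_sharding"
}
```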

The dataset_distribution default would be "random_seed_offset" (i.e. like horovod_dataset_distribution="random_seed_offset"), which is the current behavior of PyTorch distributed training. (We could change the default via a new behavior version if we want to...)
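
To illustrate the difference: with "random_seed_offset", every worker still iterates the full dataset, only the shuffling seed gets a rank-dependent offset. A minimal sketch (the exact seed formula is illustrative, not what RETURNN actually uses):

```python
import random

def shuffled_for_rank(seqs, base_seed: int, rank: int):
    """Hypothetical sketch of the "random_seed_offset" behavior: all workers
    see the full data, just in a different (rank-offset) random order."""
    seqs = list(seqs)
    random.Random(base_seed + rank).shuffle(seqs)
    return seqs
```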

(Also note, similar to the comment I made in #1612: there are some implicit assumptions here, namely that the worker index and rank are static. This might not always be the case. But it might be possible to just update the shard index / num shards dynamically for the next sub-epoch. Just something to keep in mind; I don't think we need to take care of this now.)
