Dataset: implement global `dataset_distribution` option #1676

NeoLegends · 2025-01-16T13:30:19Z

NeoLegends · 2025-02-06T10:47:16Z

@albertz Do you think this needs a test around the config processing?

albertz · 2025-02-06T11:27:17Z

returnn/datasets/basic.py

+        """
+        from returnn.config import get_global_config
+
+        config = get_global_config(raise_exception=False)


I don't like that we access the global config here. I know this follows similar code as _get_default_random_seed_offset but I also don't like it there. Why is this needed? This should come from outside, or not? Specifically at the place where we call init_dataset. E.g. in the __main__. There we also call Dataset.kwargs_update_from_config.

Also, the code is wrong. Distributed training is only one possible source which defines/influences the shard index and num shards. But there are other reasons, for example the MultiProcDataset, or PyTorch DataLoader num_workers.

MultiProcDataset

This is already setting the num_shards and shard index for its children. The code always was designed in such a way that it would only look at the global config in case there was no value already set. But I agree, now it's better.

PyTorch DataLoader num_workers

I think this is actually not that trivial to implement because the torch engine already is given initialized datasets and it's difficult to change the sharding config after having initialized a dataset. So factoring in the PyTorch num_workers needs to be done during the initial dataset initialization, which mixes up pytorch code with data initialization code a bit, and I feel that is going to be a bit messy. Do you know a good way? Maybe this is fine after all.

I think we cannot achieve both

But e.g. there could be a test with PyTorch DataLoader num_workers=2 which checks that all data from the dataset was properly covered.

and

I know this follows similar code as _get_default_random_seed_offset but I also don't like it there. Why is this needed? This should come from outside, or not?

because of

I think this is actually not that trivial to implement because the torch engine already is given initialized datasets and it's difficult to change the sharding config after having initialized a dataset

However, I think it's worth it to have proper support for torch_dataloader_opts = {"num_workers": n} with n > 1 because this makes it much simpler for the end user to have multi-process data loading and this feature can replace MultiProcDataset for simple use cases. So I think I need to revert back on the changes where num_shards and shard_index are set from the outside, and rather fetch them from inside the dataset, when they are needed. At that point the torch worker info is also available, which means that we can take it into account properly.

I think this is actually not that trivial to implement because the torch engine already is given initialized datasets and it's difficult to change the sharding config after having initialized a dataset.

Why difficult? We maybe just need a clean dataset API for that, some setter function set_num_shards_and_shard_idx or so. And then in ReturnnDatasetIterDataPipe.reset or so we just need to call that.

Hm, I was originally not a fan of the mutability of these properties, but it seems ok now.

I think we cannot avoid such API like set_num_shards_and_shard_idx because of how the PyTorch data pipeline works.

albertz · 2025-02-06T11:29:23Z

@albertz Do you think this needs a test around the config processing?

I'm not exactly sure what you mean by that.

But e.g. there could be a test with PyTorch DataLoader num_workers=2 which checks that all data from the dataset was properly covered.

Icemole

Just a few minor comments from my side. The functionality looks good!

Icemole · 2025-07-16T09:26:34Z

returnn/datasets/basic.py

+        assert 0 <= shard_index < num_shards
+        self.num_shards = num_shards
+        self.shard_index = shard_index


Suggested change

assert 0 <= shard_index < num_shards

self.num_shards = num_shards

self.shard_index = shard_index

self.set_shard_idx_and_num_shards(shard_index, num_shards)

Slightly cleaner and reuses code, but it's also fine as is. Your choice.

That function has additional asserts that rely on correct initialization. Not for now.

That function has additional asserts that rely on correct initialization. Not for now.

I don't understand the comment. What correct initialization? What do you mean by "not for now"?

Icemole · 2025-07-16T09:42:30Z

returnn/datasets/distrib_files.py

+            # RETURNN will set sharding info on the dataset if the global config is set.
+            # If it's not set, however, we need to respect the existing `distrib_shard_files` property
+            # for backwards compatibility and load the sharding info ourselves.
+            return CachedDataset2._get_sharding_rank_and_size(config)


Shouldn't you call the local _get_rank_and_size as per the previous change in https://github.com/rwth-i6/returnn/pull/1676/files#diff-44adfde5339aa94bf0770b09138330d5ea06d6b8e3f3b975bf270058f8c0c4baL188?

Icemole · 2025-07-16T09:45:14Z

returnn/datasets/multi_proc.py

-                sub_dataset = {**self.dataset, "_num_shards": self.num_workers, "_shard_index": i}
+                sub_dataset = {
+                    **self.dataset,
+                    "num_shards": self.num_workers * self.num_shards,
+                    "shard_index": (self.shard_index * self.num_workers) + i,
+                }


I might not know the context of this comment. Isn't sharding already allowed as per your change here?

returnn/torch/data/returnn_dataset_wrapper.py

albertz · 2025-07-17T10:08:09Z

Sorry for introducing the small conflict, but my change should fix #1678 and #1737 already, and shouldn't really cause any issues to merge with the PR here.

albertz · 2025-07-17T10:08:37Z

Btw, also see #1738. Not sure if this is relevant here.

albertz · 2025-07-17T16:26:10Z

Can you summarize what this PR does? I will also try to write some summarizes here myself, but please edit your main description of the PR to cover that as well.

albertz · 2025-07-17T16:30:19Z

(Summarize) Added feature: when torch.utils.data.DataLoader is used with num_workers>1, this will set the sharding accordingly. (This is independent of the newly introduced global dataset_distribution option.)

Btw, some questions regarding this:

Just to confirm: this is independent of the newly introduced global dataset_distribution option?

What happens when this is used together with distributed training? Will it set num_shards = distrib_world_size * dataloader_num_workers then?

Is the order of seqs you get from the DataLoader deterministic?

Will it always be complete? E.g. if one worker returns more seqs than the other (e.g. total num seqs is 11, and 2 workers), will the DataLoader finish only until all the workers have finished?

albertz · 2025-07-17T16:34:31Z

tests/test_Dataset.py

+def test_dataset_sharding():
+    from returnn.datasets.audio import OggZipDataset
+
+    with create_ogg_zip_txt_only_dataset_mult_seqs_opts(num_seqs=10) as dataset_opts:


I think the test would be a bit nicer if the num_seqs is uneven, not divisible by num_shards.

Suggested change

with create_ogg_zip_txt_only_dataset_mult_seqs_opts(num_seqs=10) as dataset_opts:

with create_ogg_zip_txt_only_dataset_mult_seqs_opts(num_seqs=11) as dataset_opts:

albertz · 2025-07-17T16:53:48Z

returnn/torch/data/returnn_dataset_wrapper.py

+                self.dataset.set_shard_idx_and_num_shards(
+                    self.dataset.shard_index + worker_info.id, self.dataset.num_shards * worker_info.num_workers
+                )


I don't understand. Why does this consider the existing shard_index/num_shards? I would expect that you overwrite those here.

Suggested change

self.dataset.set_shard_idx_and_num_shards(

self.dataset.shard_index + worker_info.id, self.dataset.num_shards * worker_info.num_workers

)

self.dataset.set_shard_idx_and_num_shards(worker_info.id, worker_info.num_workers)

Or do you expect that the existing shard_index/num_shards where set with the distributed rank/size, and this here additionally adds further sharding for workerid/num_workers?

But then this code written in this way is very confusing... I think this should be done differently somehow. Not sure how...

We could also do the distributed logic directly here. We pass on the rank/size info anyway to the subproc children via the env var _RETURNN_TORCH_DISTRIBUTED_INIT_INFO.

The question is, how to handle dataset_distribution. We could also just check the global config here at this point (even though I don't like accessing the global config too much... maybe I get a better idea). Then all the logic about what sharding options to set (or whether to set it at all) is all here in this place, together with the dataloader num workers.

albertz · 2025-07-17T16:55:28Z

returnn/datasets/basic.py

+        num_shards: int = 1,
+        shard_index: int = 0,


I don't understand: Why are those public (not prefixed with _)? These are never supposed to be set by the user. Those are either internally set via DataLoader num_workers, or via parent dataset logic like MultiProcDataset, or via distributed training logic somehow, or so. But never directly by the user.

albertz · 2025-07-17T16:57:36Z

returnn/datasets/basic.py

        state = {attr: getattr(self, attr) for attr in ["epoch", "zpad"]}
        return Dataset._create_from_reduce, (self.__class__, kwargs, state)

+    def set_shard_idx_and_num_shards(self, shard_index: int, num_shards: int):


Small inconsistency: idx vs index.

albertz · 2025-07-17T17:02:33Z

returnn/datasets/basic.py

+        assert dd_cfg in ["random_seed_offset", "shard"]
+        shard_index, num_shards = Dataset._get_sharding_rank_and_size(config) if dd_cfg == "shard" else 0, 1
+        set_or_remove("num_shards", num_shards)
+        set_or_remove("shard_index", shard_index)


I'm not sure whether it is a good idea to do this logic in kwargs_update_from_config. You don't want to apply this just for every dataset. E.g. for the dev/test/eval datasets, we would not want this logic, at least not right now.

I think this should come from the outside, not from within.

albertz · 2025-07-17T17:04:55Z

returnn/datasets/distrib_files.py

+            # RETURNN will set sharding info on the dataset if the global config is set.
+            # If it's not set, however, we need to respect the existing `distrib_shard_files` property
+            # for backwards compatibility and load the sharding info ourselves.
+            return CachedDataset2._get_sharding_rank_and_size(config)


Suggested change

return CachedDataset2._get_sharding_rank_and_size(config)

return cls._get_sharding_rank_and_size(config)

?
Or:

Suggested change

return CachedDataset2._get_sharding_rank_and_size(config)

return Dataset._get_sharding_rank_and_size(config)

?
But CachedDataset2 doesn't really make sense?

albertz · 2025-07-17T18:23:52Z

The _get_random_seed_for_epoch, shouldn't it also consider num_shards/shard_index? Or only in the case of dataset_distribution=="dataset_distribution"?

Or not because random_seed_offset already covers this part? (But I find it a bit inconsistent that epoch/partition_epoch is handled here but shard_index/num_shards elsewhere...)

albertz · 2025-07-17T19:06:26Z

(Summary) New global config option dataset_distribution, which can be either set to "random_seed_offset" (default) or "shard". This is for distributed training. "shard" will enable sharding for the dataset, so on N GPUs, processing one full epoch will only go through the data once, unlike with "random_seed_offset", where one full epoch sees all the data N times (each worker with different random seed).

albertz · 2025-07-18T06:50:42Z

returnn/torch/data/returnn_dataset_wrapper.py

+                )
+                self.dataset.set_shard_idx_and_num_shards(
+                    self.dataset.shard_index + worker_info.id, self.dataset.num_shards * worker_info.num_workers
+                )


Instead of having all this logic here, we maybe should better move it to ReturnnDatasetIterDataPipe.reset though?

albertz · 2025-07-20T19:14:57Z

returnn/datasets/distrib_files.py

+            log.print_deprecation_warning(
+                f"{self.__class__.__name__}' `distrib_shard_files` config option is set. "
+                "Use global config option `dataset_distribution` instead "
+                "for the same behavior across more types of datasets."
            )


I'm not sure this is really necessary to mark this as deprecated.

Suggested change

log.print_deprecation_warning(

f"{self.__class__.__name__}' `distrib_shard_files` config option is set. "

"Use global config option `dataset_distribution` instead "

"for the same behavior across more types of datasets."

)

albertz · 2025-07-20T19:15:39Z

returnn/datasets/distrib_files.py

+        :param distrib_shard_files: deprecated. Replaced by global config option ``dataset_distribution="shard"``.
+
+            Set to true to shard the data across worker processes in distributed training scenaria.


I don't think this is necessary to mark it as deprecated.

Suggested change

:param distrib_shard_files: deprecated. Replaced by global config option ``dataset_distribution="shard"``.

Set to true to shard the data across worker processes in distributed training scenaria.

:param distrib_shard_files: deprecated. set to true to shard the data across worker processes in

distributed training scenaria

NeoLegends added the enhancement label Jan 16, 2025

NeoLegends self-assigned this Jan 16, 2025

NeoLegends requested review from a team and albertz as code owners January 16, 2025 13:30

NeoLegends changed the title ~~Dataset: implement global sharding option~~ Dataset: implement global dataset_distribution option Jan 16, 2025

black

924045e

NeoLegends force-pushed the moritz-shard-mgpu branch from 9a10fef to 924045e Compare January 16, 2025 13:31

NeoLegends requested a review from Icemole January 16, 2025 13:33

This comment was marked as outdated.

Sign in to view

albertz marked this pull request as draft January 19, 2025 19:28

Merge branch 'master' into moritz-shard-mgpu

bb03813

NeoLegends mentioned this pull request Feb 4, 2025

DistributeFilesDataset _num_shards issue #1678

Closed

albertz mentioned this pull request Feb 4, 2025

DFDataset: do not pickle sharding info #1685

Closed

add sharding test

9410000

albertz reviewed Feb 6, 2025

View reviewed changes

NeoLegends added 9 commits February 7, 2025 14:42

Merge branch 'master' into moritz-shard-mgpu

3991edb

initialize num_shards and shard_index in kwargs_update_from_config

c0bce2b

sharding parameters cannot be None anymore

6261afc

we no longer need the @properties

f6a8942

black

c3ad46e

Merge branch 'master' into moritz-shard-mgpu

c3b2c04

Merge branch 'master' into moritz-shard-mgpu

485b6f4

Merge branch 'master' into moritz-shard-mgpu

653969c

take torch num_workers into account for sharding

7d753f5

NeoLegends force-pushed the moritz-shard-mgpu branch from 8dca377 to 7d753f5 Compare March 5, 2025 09:50

NeoLegends added 2 commits March 5, 2025 11:32

set sharding config in torch data pipe

2607de2

MultiProcDataset: support sharding on "sharding_method": "dedicated"

5a14a5e

NeoLegends marked this pull request as ready for review March 5, 2025 10:41

NeoLegends added 2 commits May 6, 2025 10:16

Merge branch 'master' into moritz-shard-mgpu

90353a9

black

1011ace

NeoLegends force-pushed the moritz-shard-mgpu branch from 6e5f6d4 to 1c96291 Compare May 13, 2025 09:22

Merge branch 'master' into moritz-shard-mgpu

3b28635

NeoLegends force-pushed the moritz-shard-mgpu branch from 1c96291 to 3b28635 Compare May 13, 2025 09:22

NeoLegends added 3 commits May 20, 2025 14:25

fix lints

a3e6ad5

Merge branch 'master' into moritz-shard-mgpu

713fde9

Merge branch 'master' into moritz-shard-mgpu

148191d

NeoLegends mentioned this pull request Jul 16, 2025

DistributeFilesDataset: distrib_shard_files=True leads to assertion error #1737

Closed

Icemole reviewed Jul 16, 2025

View reviewed changes

Icemole requested review from JackTemaki, curufinwe and michelwi July 16, 2025 10:24

This comment was marked as resolved.

Sign in to view

albertz reviewed Jul 17, 2025

View reviewed changes

albertz reviewed Jul 18, 2025

View reviewed changes

albertz reviewed Jul 20, 2025

View reviewed changes

	with create_ogg_zip_txt_only_dataset_mult_seqs_opts(num_seqs=10) as dataset_opts:
	with create_ogg_zip_txt_only_dataset_mult_seqs_opts(num_seqs=11) as dataset_opts:

	return CachedDataset2._get_sharding_rank_and_size(config)
	return cls._get_sharding_rank_and_size(config)

	return CachedDataset2._get_sharding_rank_and_size(config)
	return Dataset._get_sharding_rank_and_size(config)

		:param distrib_shard_files: deprecated. Replaced by global config option ``dataset_distribution="shard"``.

		Set to true to shard the data across worker processes in distributed training scenaria.

Dataset: implement global dataset_distribution option #1676

Are you sure you want to change the base?

Dataset: implement global dataset_distribution option #1676

Uh oh!

Conversation

NeoLegends commented Jan 16, 2025 • edited by albertz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

NeoLegends commented Feb 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NeoLegends Feb 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

albertz commented Feb 6, 2025

Uh oh!

Icemole left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

albertz commented Jul 17, 2025

Uh oh!

albertz commented Jul 17, 2025

Uh oh!

albertz commented Jul 17, 2025

Uh oh!

albertz commented Jul 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

albertz commented Jul 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

albertz commented Jul 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Dataset: implement global `dataset_distribution` option #1676

Dataset: implement global `dataset_distribution` option #1676

NeoLegends commented Jan 16, 2025 •

edited by albertz

Loading

NeoLegends Feb 10, 2025 •

edited

Loading

albertz commented Jul 17, 2025 •

edited

Loading

albertz commented Jul 17, 2025 •

edited

Loading