DistributeFilesDataset: distrib_shard_files=True leads to assertion error #1737

@Icemole

Hi, I wanted to train a model with a DistributeFilesDataset with distrib_shard_files=True on 8 GPUs, reproducing a training that succeeded in the past, but this time with the latest RETURNN version. To my surprise, I got the following assertion error:

  File "/nas/.../work/i6_core/tools/git/CloneGitRepositoryJob.NZ6DFanzMdKP/output/returnn/returnn/datasets/distrib_files.py", line 177, in DistributeFilesDataset.__init__
      self = <local> <DistributeFilesDataset 'train' epoch=None>
      self._num_shards = <local> 8
      self._shard_index = <local> 6
AssertionError: <DistributeFilesDataset 'train' epoch=None>: Cannot use both dataset-sharding via properties _num_shards and _shard index and DistributeFilesDataset's own sharding implementation based on the trainings rank and size.

The assertion being triggered is the one shown in the traceback above (distrib_files.py, line 177). As far as I know, I'm not modifying the number of shards or the shard index in any way, so perhaps some tool is setting them internally, either in our codebase or in RETURNN.

The last version in which I confirmed that the option works is dc1a941. I suspect 61705b1, which is the last commit that heavily reworked the sharding options in the DistributeFilesDataset (and the one that added the assertion). Perhaps the number of shards and the shard index had already been set before, and the assertion merely exposed this? A few lines after the assertion, the number of shards and the shard index are set to the size and rank respectively, which matches the values shown in the assertion error (8 and 6).
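
To make the suspected failure mode concrete, here is a minimal sketch of the situation I have in mind. It is not RETURNN's actual code: `ToyShardedDataset`, `try_build`, and the exact assertion condition are hypothetical, and I'm only assuming that the real check rejects shard values that were already set to something other than the defaults.

```python
from typing import Optional


class ToyShardedDataset:
    """Hypothetical stand-in, NOT RETURNN's DistributeFilesDataset."""

    def __init__(self, *, distrib_shard_files: bool, rank: int, size: int,
                 num_shards: int = 1, shard_index: int = 0):
        # Values that some outer tool may have set before our own sharding runs.
        self._num_shards = num_shards
        self._shard_index = shard_index
        if distrib_shard_files:
            # Assumed guard, similar in spirit to the assertion in the traceback.
            assert self._num_shards == 1 and self._shard_index == 0, (
                "Cannot use both externally set sharding and own sharding"
            )
            # Own sharding: derive the shard from the training rank and size.
            self._num_shards = size
            self._shard_index = rank


def try_build(**kwargs) -> Optional[ToyShardedDataset]:
    """Return the dataset, or None if the assertion fires."""
    try:
        return ToyShardedDataset(**kwargs)
    except AssertionError:
        return None


# Nothing set externally: the dataset shards itself by rank/size.
ok = try_build(distrib_shard_files=True, rank=6, size=8)

# Shard values already set to the same rank/size: the guard fires,
# even though the resulting sharding would have been identical.
clash = try_build(distrib_shard_files=True, rank=6, size=8,
                  num_shards=8, shard_index=6)
```

If this is what happens, the externally set values and the dataset's own values agree, so the training was presumably sharded correctly before and the assertion only made the double configuration visible.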

I'm currently testing a training with 61705b1 and with its parent commit (e5f0b87).
