DistributeFilesDataset _num_shards issue #1678

@Judyxujj

With the latest RETURNN, I get the following error when I use DistributeFilesDataset:

File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/basic.py", line 227, in Dataset._create_from_reduce                     
  line: ds = cls(**kwargs)                                                                                                                     
  locals:                                                                                                                              
   ds = <not found>                                                                                                                        
   cls = <local> <class 'returnn.datasets.distrib_files.DistributeFilesDataset'>                                                                                          
   kwargs = <local> {'files': ['/ssd/jxu/nas/data/speech/FR_FR/16kHz/EPPS/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-25/EPPS-batch.1.v1.hdf.15', '/ssd/jxu/nas/data/speech/EN_US/16kHz/NEWS.HQ/corpus/batch.2.NPR.v3/hdf-raw_wav.16kHz.split-261/NEWS.HQ-batch.2.NPR.v3.hdf.7', '/ssd/jxu/nas/data/speech/IT_IT/16kHz/IT.parli..., len = 25
 File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/distrib_files.py", line 171, in DistributeFilesDataset.__init__               
  line: assert self._num_shards == 1 and self._shard_index == 0, ( # ensure defaults are set                                                                                    
       f"{self}: Cannot use both dataset-sharding via properties _num_shards and _shard index "                                                                                
       f"and {self.__class__.__name__}'s own sharding implementation based on the trainings rank and size."                                                                          
     )                                                                                                                              
  locals:                                                                                                                              
   self = <local> <DistributeFilesDataset 'train' epoch=None>                                                                                                   
   self._num_shards = <local> 8                                                                                                                  
   self._shard_index = <local> 6 

DistributeFilesDataset inherits from CachedDataset2, which in turn inherits from Dataset, so _num_shards should be set to 1 in the __init__ function. I am not sure how self._num_shards gets changed to the number of GPUs in my case.
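To make the question concrete, here is a minimal sketch of one way the assertion could fire even though the default is 1. It uses toy stand-in classes, not the real RETURNN code: the class names and the `ds = cls(**kwargs)` line mirror the traceback, while the constructor signatures and the `injected` dict are assumptions made for illustration. The idea is that if the kwargs handed to `_create_from_reduce` already contain sharding values, the defaults never apply.

```python
# Toy stand-ins, NOT the real RETURNN classes; names mirror the traceback only.
class Dataset:
    def __init__(self, _num_shards=1, _shard_index=0, **kwargs):
        # Assumption: the base class accepts the sharding values as keyword arguments.
        self._num_shards = _num_shards
        self._shard_index = _shard_index

    @classmethod
    def _create_from_reduce(cls, kwargs):
        # Mirrors the `ds = cls(**kwargs)` line from the traceback.
        return cls(**kwargs)


class CachedDataset2(Dataset):
    pass


class DistributeFilesDataset(CachedDataset2):
    def __init__(self, files, **kwargs):
        super().__init__(**kwargs)
        self.files = files
        # Same condition as the assert in the traceback, with a shortened message.
        assert self._num_shards == 1 and self._shard_index == 0, "sharding set twice"


# Defaults apply when nothing is passed in, so this succeeds.
ok = DistributeFilesDataset(files=["a.hdf"])

# Hypothetical kwargs in which something upstream injected sharding values.
injected = {"files": ["a.hdf"], "_num_shards": 8, "_shard_index": 6}
try:
    DistributeFilesDataset._create_from_reduce(injected)
    failed = False
except AssertionError:
    failed = True
```

If this is what happens, printing the keys of kwargs inside _create_from_reduce should show whether _num_shards and _shard_index are being passed in; the traceback truncates the dict, so it neither confirms nor rules this out.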

(cc @NeoLegends, @michelwi)
