Description
System Info
From the PyTorch documentation on multi-process data loading (https://pytorch.org/docs/stable/data.html#multi-process-data-loading):

> For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using torch.utils.data.get_worker_info() and/or worker_init_fn, users may configure each replica independently. (See IterableDataset documentations for how to achieve this.) For similar reasons, in multi-process loading, the drop_last argument drops the last non-full batch of each worker's iterable-style dataset replica.
`IterableDatasetPreprocessingWrapper` in finetrainers currently does not implement this worker-sharding logic, so each DataLoader worker will emit the full stream and samples will be duplicated:
https://github.com/a-r-r-o-w/finetrainers/blob/694361f8f1fbba2e4b2428c1ec3958929bf70610/finetrainers/data/dataset.py#L677
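For reference, here is a minimal sketch of the worker-sharding pattern the PyTorch docs describe. The `ShardedIterableDataset` name and the stride-by-worker scheme are illustrative assumptions, not existing finetrainers code:

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedIterableDataset(IterableDataset):
    """Hypothetical wrapper: shards an iterable stream across DataLoader workers."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: emit every sample.
            yield from self.source
            return
        # Multi-process loading: each worker holds a full replica of the
        # dataset, so stride over the stream to keep the subsets disjoint.
        for i, sample in enumerate(self.source):
            if i % info.num_workers == info.id:
                yield sample

if __name__ == "__main__":
    # Without the modulo filter, num_workers=2 would yield every sample
    # twice, because each worker iterates its own full dataset replica.
    loader = DataLoader(ShardedIterableDataset(list(range(8))), num_workers=2)
    print(sorted(int(x) for batch in loader for x in batch))  # [0, 1, ..., 7]
```

The same effect can be achieved by configuring each replica from a worker_init_fn, as the quoted docs note; either way, the wrapper needs some worker-aware filtering before multi-process loading is safe.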
Information
- The official example scripts
- My own modified scripts
Reproduction
N/A
Expected behavior
N/A