Multiprocess dataloader for IterableDataset bug #394

@jzhang38

Description

System Info

From the PyTorch documentation on multi-process data loading:
https://pytorch.org/docs/stable/data.html#multi-process-data-loading

"For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using torch.utils.data.get_worker_info() and/or worker_init_fn, users may configure each replica independently. (See IterableDataset documentations for how to achieve this.) For similar reasons, in multi-process loading, the drop_last argument drops the last non-full batch of each worker's iterable-style dataset replica."

IterableDatasetPreprocessingWrapper in finetrainers currently does not implement this per-worker sharding logic, so with num_workers > 1 each worker yields the full stream and samples are duplicated:
https://github.com/a-r-r-o-w/finetrainers/blob/694361f8f1fbba2e4b2428c1ec3958929bf70610/finetrainers/data/dataset.py#L677
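For context, this is roughly the pattern the PyTorch docs describe. The ShardedIterableWrapper class below and its index-modulo sharding strategy are only an illustrative sketch (not finetrainers code): each DataLoader worker calls torch.utils.data.get_worker_info() and yields a disjoint subset of the stream instead of a full duplicate.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedIterableWrapper(IterableDataset):
    """Illustrative wrapper: each DataLoader worker yields a disjoint slice of the stream."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading: yield the full stream.
            yield from self.dataset
            return
        # Multi-process loading: round-robin by sample index so the
        # num_workers replicas do not emit duplicated data.
        for index, sample in enumerate(self.dataset):
            if index % worker_info.num_workers == worker_info.id:
                yield sample


if __name__ == "__main__":
    # Hypothetical usage: with num_workers=2, each of 0..9 is produced exactly once.
    loader = DataLoader(ShardedIterableWrapper(range(10)), batch_size=4, num_workers=2)
    print([batch.tolist() for batch in loader])
```

Alternatives to the index-modulo scheme (e.g. splitting the underlying file list or shard list across workers) would avoid iterating over skipped samples, but the key point is the same: the wrapper has to consult get_worker_info() and restrict what each replica emits.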

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

NA

Expected behavior

NA
