PostprocessingDataset with multi-processing

This could be an alternative to `MultiProcDataset`. In most cases (e.g. `OggZipDataset`), the data loading itself is not really the bottleneck. The postprocessing is the bottleneck, and that is the actual reason to use `MultiProcDataset`.

So, the idea is to have a single source dataset (e.g. `OggZipDataset`), and to do only the postprocessing with multi-processing. (To save memory, see e.g. #1498.)

For that, we can add such functionality to `PostprocessingDataset`, via some `num_workers: int` option. If it is not set, everything runs in the process itself; otherwise, that number of worker procs is spawned. This should be fairly straightforward, at least for the `map_seq` case. For `map_seq_stream` it is probably not, but we don't need that.
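To make the `map_seq` case concrete, here is a minimal sketch of the idea, not the actual `PostprocessingDataset` code. It assumes sequences are plain picklable dicts; `iter_postprocessed` and `toy_map_seq` are hypothetical names used only for illustration, and `toy_map_seq` stands in for something like speed perturbation.

```python
import multiprocessing as mp
from typing import Any, Callable, Dict, Iterable, Iterator

Seq = Dict[str, Any]  # assumption: one sequence is a plain, picklable dict


def toy_map_seq(seq: Seq) -> Seq:
    """Hypothetical stand-in for an expensive per-sequence function."""
    return {"data": [2 * x for x in seq["data"]]}


def iter_postprocessed(
    source: Iterable[Seq],
    map_seq: Callable[[Seq], Seq],
    num_workers: int = 0,
) -> Iterator[Seq]:
    """Apply map_seq to every sequence from a single source iterator.

    num_workers <= 0: run map_seq in the current process.
    num_workers > 0: only map_seq runs in worker processes;
    the source is still read once, in the main process.
    """
    if num_workers <= 0:
        for seq in source:
            yield map_seq(seq)
        return
    with mp.Pool(processes=num_workers) as pool:
        # imap (unlike imap_unordered) yields results in input order,
        # so the sequence order of the source dataset is preserved.
        yield from pool.imap(map_seq, source)
```

With `num_workers > 0`, `map_seq` must be picklable (e.g. a module-level function), and the calling code should sit under `if __name__ == "__main__":` on platforms that do not fork. Also note that `Pool.imap` may read ahead of the consumer, so a real implementation would probably want a bounded buffer between the source and the workers.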

My use case would be a standard `OggZipDataset`, but then doing the speed perturbation with `PostprocessingDataset` with multi-processing.
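As a rough sketch, the config for this use case might then look as follows. The `num_workers` key is the option proposed in this issue, not something that is assumed to exist already. The path and the `speed_perturb` function are placeholders, and further `OggZipDataset` options are left out.

```python
def speed_perturb(tensor_dict, *, rng, **kwargs):
    """Placeholder: a real version would resample the audio here."""
    return tensor_dict


train = {
    "class": "PostprocessingDataset",
    # Single source dataset, read in one process only.
    "dataset": {
        "class": "OggZipDataset",
        "path": "/path/to/train.ogg.zip",  # placeholder path
        # ... further OggZipDataset options omitted
    },
    "map_seq": speed_perturb,
    "num_workers": 4,  # proposed option from this issue (hypothetical)
}
```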

(cc @NeoLegends @dorian-K)


PostprocessingDataset with multi-processing #1701
