Let child processes retrieve the dataset directly from the main process instead of re-executing memory_mapped_arrow_table_from_file #7902

@HQF2017

Feature request

Child processes should be able to retrieve the dataset directly from the main process instead of each re-executing memory_mapped_arrow_table_from_file.

Motivation

My local disk is too small to hold the dataset, so I keep it on a remote Ceph server and process it with datasets.
I use the data-juicer framework (https://github.com/datajuicer/data-juicer) as an outer layer. It is built on datasets but does not support streaming datasets. The problem I ran into: on every load, map, and filter operation, I have to wait for a large number of child processes to each execute memory_mapped_arrow_table_from_file. Because the underlying files live on the remote Ceph server, this step is bound by network I/O.
I don't know whether this is a problem with my usage or simply how datasets is currently designed. However, I think that passing the instance returned by datasets.load_dataset directly to the child processes, instead of having each one re-execute memory_mapped_arrow_table_from_file, might solve my problem. Or does datasets already support this and I just don't know about it?
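
To make the pattern I am describing concrete, here is a minimal stdlib-only sketch. It is not the datasets implementation: MappedFile and roundtrip_demo are hypothetical names, and the pickle round trip only stands in for handing an object to a worker process. It shows an object that pickles just its file path and maps the file again when it is unpickled, which is the kind of per-process re-mapping I believe I am seeing.

```python
import mmap
import os
import pickle
import tempfile


class MappedFile:
    """Hypothetical stand-in for a memory-mapped table (not a datasets class)."""

    open_count = 0  # number of times a file has been mapped in this process

    def __init__(self, path):
        self.path = path
        self._map()

    def _map(self):
        # On a remote filesystem, reading through this mapping goes over the network.
        self._file = open(self.path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        MappedFile.open_count += 1

    def read(self):
        return self._mm[:]

    def close(self):
        self._mm.close()
        self._file.close()

    def __getstate__(self):
        # Only the path is pickled; the mapping itself is not transferred.
        return {"path": self.path}

    def __setstate__(self, state):
        # The receiving side maps the file again from its path.
        self.path = state["path"]
        self._map()


def roundtrip_demo():
    """Pickle and unpickle a mapped file, as a hand-off to a worker would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard.bin")
        with open(path, "wb") as f:
            f.write(b"example rows")
        MappedFile.open_count = 0
        parent = MappedFile(path)
        child = pickle.loads(pickle.dumps(parent))  # simulated hand-off
        result = (child.read(), MappedFile.open_count)
        child.close()
        parent.close()
        return result
```

Calling roundtrip_demo() returns the file contents together with a map count of 2: one mapping for the original object and one more for the unpickled copy. With many workers and files on Ceph, that second step is what I end up waiting on.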

Your contribution

...

Labels: enhancement (New feature or request)