Feature request
Allow child processes to obtain the dataset directly from the main process instead of each one executing memory_mapped_arrow_table_from_file again.
Motivation
My local disk is too small, so I have to store the dataset on a remote Ceph server and process it with datasets.
I use the data-juicer framework (https://github.com/datajuicer/data-juicer) as an outer layer. It is built on datasets but does not support streaming datasets. I then ran into a problem: for every load, map, and filter operation, I have to wait for a large number of child processes to each execute memory_mapped_arrow_table_from_file. Because the underlying files live on the remote Ceph server, this step is limited by network I/O.
I don't know whether this is a problem with my usage or whether this is how datasets is currently designed. However, I think that if the instances returned by datasets.load_dataset were passed directly to the child processes instead of re-executing memory_mapped_arrow_table_from_file in each one, it might solve my problem. Or does datasets already support this, and I just didn't know about it?
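To illustrate the behaviour I am asking about, here is a minimal sketch that uses only the Python standard library. It is not how datasets works internally: make_sample_file, worker, and sum_in_children are hypothetical names, the temporary file is a stand-in for an Arrow file on remote storage, and it assumes a POSIX system where the "fork" start method is available. The parent maps the file once, and the children read through the mapping they inherit instead of each opening the file again.

```python
import mmap
import multiprocessing as mp
import os
import tempfile


def make_sample_file(size=1024):
    # Hypothetical stand-in for an Arrow file stored on slow remote storage.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * (size // 256))
    return path


def worker(shared_map, start, stop, conn):
    # The child reads through the mapping inherited via fork;
    # it never reopens or re-maps the file itself.
    conn.send(sum(shared_map[start:stop]))
    conn.close()


def sum_in_children(path, num_proc=4):
    # Assumption: POSIX "fork". With "spawn" the mapping would not be inherited.
    ctx = mp.get_context("fork")
    with open(path, "rb") as f:
        # Mapped once, in the parent process only.
        shared_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = len(shared_map) // num_proc
    jobs = []
    for i in range(num_proc):
        start = i * chunk
        # The last child also takes any remainder bytes.
        stop = len(shared_map) if i == num_proc - 1 else start + chunk
        recv_conn, send_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=worker, args=(shared_map, start, stop, send_conn))
        proc.start()
        send_conn.close()  # the parent keeps only the receiving end
        jobs.append((proc, recv_conn))
    total = 0
    for proc, recv_conn in jobs:
        total += recv_conn.recv()
        proc.join()
    shared_map.close()
    return total
```

Whether datasets could do something similar for its memory-mapped Arrow tables is exactly what I am unsure about. The sketch only shows the general idea of mapping once in the parent and reusing the mapping in the children.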
Your contribution
...