Feature request
At the moment `.from_generator` can only create a dataset that lives in the cache. That cached dataset cannot be loaded with `load_from_disk`, because the cache folder is missing `state.json`, so the only way to convert it into a regular `Dataset` is `save_to_disk`, which has to create a full copy of the cached data. For large datasets this can waste a lot of space. In my case the saving operation failed, so I am stuck with a large cached dataset and no clear way to turn it into a `Dataset` I can use. The requested feature is a way to load a cached dataset using `.load_from_disk`. Alternatively, `.from_generator` could create the dataset at a specified location so that it can be loaded from there with `.load_from_disk`.
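To illustrate, a minimal sketch of the failure mode (the generator and all paths below are placeholders):

```python
from datasets import Dataset, load_from_disk

def gen():
    for i in range(3):
        yield {"id": i}

# The Arrow shards land somewhere under cache_dir
# (the exact subfolder layout varies).
ds = Dataset.from_generator(gen, cache_dir="/data/cache")

# This fails today: the folder holding the shards has no state.json,
# so load_from_disk refuses to open it as a Dataset.
load_from_disk("/data/cache/generator/...")  # placeholder path
```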
Motivation
I have the following workflow, which has exposed some awkwardness in how Datasets handles saving/caching.
- I created a cached dataset using `.from_generator`, which was cached in a folder. This dataset is rather large (~600GB) with many shards.
- I tried to save this dataset using `.save_to_disk` to another location so that I can use it later as a `Dataset`. This essentially creates another copy (for a total of 1.2TB!) of what is already in the cache... In my case the saving operation keeps dying for some reason, and I am stuck with a cached dataset and no copy.
- Now I am trying to "save" the existing cached dataset, but it is not clear how to access the cached files after `.from_generator` has finished, e.g. from a different process. I should not even be looking at the cache, but I really do not want to waste another 2 hours regenerating the set only for it to fail again (I have already done this a couple of times).
- I tried `.load_from_disk`, but it does not work with cached files and complains that this is not a `Dataset` (!).
- I looked at `.from_file`, which takes a single file, but the cache has many files (shards), so I am not sure how to make this work (see the sketch after this list).
- I tried `.load_dataset`, but this seems to either "download" a copy (of files that are already on the local file system!) which I would then need to save, or requires `streaming=True` to create an `IterableDataset` which I then need to convert (using the cache) to a `Dataset` so that I can save it. With both options I would end up with three copies of the same dataset, for a total of ~2TB! I am hoping there is another way to do this...
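Something along the following lines might work for the `.from_file` route, assuming each shard can be memory-mapped individually; the cache path and glob pattern are placeholders:

```python
from pathlib import Path
from datasets import Dataset, concatenate_datasets

# Placeholder: wherever .from_generator left its Arrow shards.
cache_dir = Path("/path/to/from_generator/cache")

# Memory-map each shard (no data is copied) and stitch them together.
shards = [Dataset.from_file(str(p)) for p in sorted(cache_dir.glob("*.arrow"))]
ds = concatenate_datasets(shards)
```

But even if this works, it leans on cache internals, which is exactly what I am trying to avoid.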
Maybe I am missing something here: I looked at the docs and forums but had no luck. I have a bunch of Arrow files cached by `Dataset.from_generator` and no clean way to make them into a `Dataset` that I can use.
This could all be so much easier if `load_from_disk` could recognize the cached files and produce a `Dataset`: once the cache is created I would not have to "save" it again, and I could just load it whenever I need it. At the moment `load_from_disk` needs `state.json`, which is missing from the cache folder. So perhaps `.from_generator` could "finalize" the dataset once it is done (e.g. create `state.json`) so that it can be loaded easily. Or `.from_generator` could take a `save_to_dir` parameter, in addition to `cache_dir`, that is used for the whole process, including creating the `state.json` at the end.
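A rough sketch of how the second option could look from the user side (the `save_to_dir` parameter is hypothetical and does not exist today):

```python
from datasets import Dataset, load_from_disk

def gen():
    for i in range(3):
        yield {"id": i}

# Hypothetical: .from_generator writes its shards plus state.json
# directly into save_to_dir, so no second copy is ever made.
ds = Dataset.from_generator(gen, save_to_dir="/data/my_dataset")

# Later, from any process:
ds = load_from_disk("/data/my_dataset")
```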
As a proof of concept, I created a `state.json` by hand and `load_from_disk` worked using the cache! So that seems to be the missing piece here.
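For anyone in the same situation, roughly what I did; a minimal sketch, assuming the field names match what `save_to_disk` writes in recent `datasets` versions (the cache path and fingerprint are placeholders, so compare against a `state.json` produced by `save_to_disk` on a small dataset first):

```python
import json
from pathlib import Path

# Placeholder: the folder holding the Arrow shards from .from_generator.
cache_dir = Path("/path/to/from_generator/cache")

# List the shards in a deterministic order.
shard_names = sorted(p.name for p in cache_dir.glob("*.arrow"))

# Mirror the layout of a state.json written by save_to_disk.
state = {
    "_data_files": [{"filename": name} for name in shard_names],
    "_fingerprint": "placeholder-fingerprint",  # any unique string
    "_format_columns": None,
    "_format_kwargs": {},
    "_format_type": None,
    "_output_all_columns": False,
    "_split": None,
}

with (cache_dir / "state.json").open("w", encoding="utf-8") as f:
    json.dump(state, f, indent=2)
```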
Your contribution
Time permitting, I can look into `.from_generator` to see whether adding `state.json` is feasible.