Skip to content

Dataset load_from_disk is too slowΒ #2547

@avacaondata

Description

@avacaondata

@lhoestq

Describe the bug

It's not normal that I have to wait 7-8 hours for a dataset to be loaded from disk, as there are no preprocessing steps, it's only loading it with load_from_disk. I have 96 cpus, however only 1 is used for this, which is inefficient. Moreover, its usage is at 1%... This is happening in the context of a language model training, therefore I'm wasting 100$ each time I have to load the dataset from disk again (because the spot instance was stopped by aws and I need to relaunch it for example).

Steps to reproduce the bug

Just get the oscar in spanish (around 150GGB) and try to first save in disk and then load the processed dataset. It's not dependent on the task you're doing, it just depends on the size of the text dataset.

Expected results

I expect the dataset to be loaded in a normal time, by using the whole machine for loading it, I mean if you store the dataset in multiple files (.arrow) and then load it from multiple files, you can use multiprocessing for that and therefore don't waste so much time.

Environment info

  • datasets version: 1.8.0
  • Platform: Ubuntu 18
  • Python version: 3.8

I've seen you're planning to include a streaming mode for load_dataset, but that only saves the downloading and processing time, that's not being a problem for me, you cannot save the pure loading from disk time, therefore that's not a solution for my use case or for anyone who wants to use your library for training a language model.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions