save_to_disk() freezes when saving on s3 bucket with multiprocessing #6936

@ycattan

Describe the bug

I'm trying to save a Dataset using the save_to_disk() function with:

  • num_proc > 1
  • dataset_path being an S3 bucket path, e.g. "s3://{bucket_name}/{dataset_folder}/"

The hf progress bar shows up, but the saving never seems to start.
With a single process (num_proc=1), everything works fine.
Saving the dataset to local disk (as opposed to an S3 bucket) with num_proc > 1 also works fine.

Thank you for your help! :)

Steps to reproduce the bug

I tried without any storage options:

from datasets import load_dataset

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
)

and with the specific s3fs storage options:

from datasets import load_dataset
from s3fs import S3FileSystem

def get_s3fs():
    return S3FileSystem()

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
    storage_options=get_s3fs().storage_options, # also tried: storage_options=S3FileSystem().storage_options
)

I suspect I might be using the storage_options parameter incorrectly, but I couldn't find anything online that made it work.

NB: Behavior is the same when trying to save the whole DatasetDict.
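
Since saving to local disk with num_proc > 1 reportedly works, one possible stopgap is to save locally first and then copy the result to the bucket. The following is only a sketch under that assumption, not a fix for the hang itself. The helper name save_locally_then_upload is hypothetical; it expects an object with a save_to_disk(path, num_proc=...) method and an fsspec-style filesystem with put(lpath, rpath, recursive=True). Whether the uploaded layout loads back cleanly with load_from_disk has not been verified here.

```python
import os
import tempfile


def save_locally_then_upload(dataset, remote_path, fs, num_proc=4):
    """Save `dataset` to a temporary local directory, then copy it to `remote_path`.

    Hypothetical helper: `dataset` needs `save_to_disk(path, num_proc=...)` and
    `fs` needs an fsspec-style `put(lpath, rpath, recursive=True)`.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Multiprocess save to local disk, which the report says works.
        dataset.save_to_disk(tmp_dir, num_proc=num_proc)
        # Trailing separator: copy the directory's contents, not the directory itself.
        fs.put(os.path.join(tmp_dir, ""), remote_path, recursive=True)
    return remote_path


# Hypothetical usage (the bucket name is a placeholder):
# from datasets import load_dataset
# from s3fs import S3FileSystem
# ds = load_dataset("openai_humaneval")["test"]
# save_locally_then_upload(ds, "s3://bucket-name/test_multiprocessing_saving/", S3FileSystem())
```

This needs enough free local disk space for a full copy of the dataset, and the upload step runs in a single process.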

Expected behavior

The progress bar fills up and the dataset is saved.
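
To narrow down where the save stalls, it may help to dump Python stack traces once the call has been running for longer than expected. Below is a minimal sketch using the standard-library faulthandler module; the context manager name dump_stacks_if_stuck and the 120-second default are arbitrary assumptions. It only reports the threads of the calling (parent) process, so it shows where the parent is waiting, not what the worker processes are doing.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_stacks_if_stuck(timeout_s=120.0, file=None):
    """Dump this process's thread stacks every `timeout_s` seconds while the block runs."""
    target = file if file is not None else sys.stderr
    # `target` must stay open and expose fileno() until the dump is cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the call from the report:
# with dump_stacks_if_stuck(timeout_s=120):
#     sandbox_ds["test"].save_to_disk("s3://bucket-name/test_multiprocessing_saving/", num_proc=4)
```

The resulting traceback can be attached to the report to show which call the main process is blocked in.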

Environment info

datasets==2.18.0
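
Only the datasets version is given above. Because remote saving goes through fsspec and s3fs, their versions may be relevant too; the package list in this sketch is an assumption about what matters, not something stated in the report. It uses the standard-library importlib.metadata, and the function name collect_versions is hypothetical.

```python
from importlib import metadata


def collect_versions(packages=("datasets", "fsspec", "s3fs", "pyarrow")):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```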
