Unable to stream dataset with many files #120

@JasonLo

Description
When streaming a large dataset with HuggingFace's datasets library via pelicanfs, if the dataset contains a large number of files (around 5,000), loading often raises a 'file not found' error. The specific file named in the error varies between runs.

I have ruled out S3 as the cause by running the same workload directly, without pelicanfs:

import os

from datasets import load_dataset

storage_options = {
    "key": os.getenv("S3_ACCESS_KEY_ID"),
    "secret": os.getenv("S3_SECRET_ACCESS_KEY"),
    "client_kwargs": {"endpoint_url": "https://web.s3.wisc.edu"},
}

ds = load_dataset(
    "parquet", data_files="s3://pelican-data-loader/data/datasets--bigcode--the-stack-dedup/**/*.parquet", storage_options=storage_options, streaming=True
)
for x in ds.with_format("torch")["train"].take(2):
    print(x)

# This runs fine in 10-15 s

But when I use pelicanfs to do the same thing:

from random import sample

from datasets import load_dataset
from pelicanfs.core import PelicanFileSystem

pelfs = PelicanFileSystem("pelican://uwdf-director.chtc.wisc.edu")
parquet_files = pelfs.glob("/wisc.edu/dsi/pytorch/data/datasets--bigcode--the-stack-dedup/**/*.parquet")
parquet_files = [f"pelican://uwdf-director.chtc.wisc.edu{path}" for path in parquet_files]  # append pelican prefix

def test_pelican(parquet_files: list[str], n: int | None = None) -> None:
    if n is None:
        data_files = parquet_files
    else:
        data_files = sample(parquet_files, n)

    ds = load_dataset("parquet", data_files=data_files, streaming=True)
    for x in ds.with_format("torch")["train"].take(2):
        print(x)


test_pelican(parquet_files=parquet_files, n=10)  # This always works
test_pelican(parquet_files=parquet_files, n=1000)  # This often fails

For full test details: https://github.com/UW-Madison-DSI/pelican-data-loader/blob/0bfdd1a0188cb5fb150d48b795a04c78215a1a8b/notebooks/speed_test.ipynb
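Since the failing file varies between runs, the misses look transient rather than tied to any particular object. A small retry wrapper (a sketch, not part of the original report; the `flaky_read` function below is a hypothetical stand-in for a `pelfs.open` call) could help confirm whether retrying masks the error:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(fn: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call fn, retrying on FileNotFoundError up to `attempts` times."""
    last_exc: FileNotFoundError | None = None
    for _ in range(attempts):
        try:
            return fn()
        except FileNotFoundError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc  # all attempts failed

# Demo with a function that fails once and then succeeds,
# mimicking a transient 'file not found' from the federation.
calls = {"n": 0}

def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] < 2:
        raise FileNotFoundError("transient miss")
    return "ok"

print(retry(flaky_read))  # ok
```

If the error disappears under retries, that would point at transient director/cache resolution failures under load rather than genuinely missing files.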
