Description
When streaming a large dataset with HuggingFace's `datasets` library via pelicanfs, if the dataset contains a large number of files (around 5,000), loading often raises a 'file not found' error. The specific file named in the error varies between runs.

I have ruled out the S3 side as the cause by running the same test without pelicanfs:
```python
import os

from datasets import load_dataset

storage_options = {
    "key": os.getenv("S3_ACCESS_KEY_ID"),
    "secret": os.getenv("S3_SECRET_ACCESS_KEY"),
    "client_kwargs": {"endpoint_url": "https://web.s3.wisc.edu"},
}
ds = load_dataset(
    "parquet",
    data_files="s3://pelican-data-loader/data/datasets--bigcode--the-stack-dedup/**/*.parquet",
    storage_options=storage_options,
    streaming=True,
)
for x in ds.with_format("torch")["train"].take(2):
    print(x)
# This runs fine in 10-15s.
```

But when I use pelicanfs to do the same thing:
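As an aside on why this baseline finishes in seconds despite the dataset's size: with `streaming=True`, `load_dataset` returns an iterable, and `.take(2)` pulls only the first two examples, so only the first shard is actually read. The take-N behavior can be sketched with plain Python iterators (a simplified stand-in for `IterableDataset.take`, not the actual `datasets` implementation):

```python
from itertools import islice

def take(iterable, n):
    """Mimic IterableDataset.take(n): yield only the first n examples."""
    return islice(iterable, n)

# Simulated stream of parquet records (dicts), as datasets would yield them.
stream = ({"content": f"row-{i}"} for i in range(1_000_000))

for x in take(stream, 2):
    print(x)  # only two records are ever pulled from the stream
```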
```python
from random import sample

from datasets import load_dataset
from pelicanfs.core import PelicanFileSystem

pelfs = PelicanFileSystem("pelican://uwdf-director.chtc.wisc.edu")
parquet_files = pelfs.glob("/wisc.edu/dsi/pytorch/data/datasets--bigcode--the-stack-dedup/**/*.parquet")
# Prepend the pelican:// prefix so load_dataset resolves the paths via pelicanfs.
parquet_files = [f"pelican://uwdf-director.chtc.wisc.edu{path}" for path in parquet_files]

def test_pelican(parquet_files: list[str], n: int | None = None) -> None:
    if n is None:
        data_files = parquet_files
    else:
        data_files = sample(parquet_files, n)
    ds = load_dataset("parquet", data_files=data_files, streaming=True)
    for x in ds.with_format("torch")["train"].take(2):
        print(x)

test_pelican(parquet_files=parquet_files, n=10)    # This always works.
test_pelican(parquet_files=parquet_files, n=1000)  # This often fails.
```

For full test details: https://github.com/UW-Madison-DSI/pelican-data-loader/blob/0bfdd1a0188cb5fb150d48b795a04c78215a1a8b/notebooks/speed_test.ipynb
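Since the failure is intermittent and the file named in the error changes between runs, it looks like a transient lookup or redirect problem rather than a genuinely missing object. Until the root cause is found, a generic retry wrapper around the flaky call is one possible mitigation. A minimal stdlib sketch (the `flaky_open` callable below is an illustrative stand-in for whatever pelicanfs operation fails, not a real pelicanfs API):

```python
import time

def with_retry(fn, attempts=3, delay=0.0, retry_on=(FileNotFoundError,)):
    """Call fn(), retrying up to `attempts` times on the given exception types."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Illustrative flaky callable: fails twice with FileNotFoundError, then succeeds.
calls = {"n": 0}
def flaky_open():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FileNotFoundError("transient miss")
    return "file-handle"

print(with_retry(flaky_open, attempts=5))  # → file-handle
```

In practice this could wrap the per-file open inside the loading loop (with a small `delay` and backoff), though it only papers over the underlying directory/cache behavior.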