fix(deps): update dependency datasets to v5 #14287

Open
renovate-bot wants to merge 1 commit into GoogleCloudPlatform:main from renovate-bot:renovate/datasets-5.x

Conversation

@renovate-bot
Contributor

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package    Change               Age       Confidence
datasets   ==4.0.0 → ==5.0.0    (badge)   (badge)

Release Notes

huggingface/datasets (datasets)

v5.0.0

Compare Source

Datasets Features

Agent traces
  • Parse Agent traces messages for SFT using teich by @​lhoestq in #​8232

    • Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
    • Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
    • Load the data:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
    >>> ds[0]["messages"]
    [{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...'
     ...]
    • Train on agent traces:
    trl sft --dataset-name lhoestq/agent-traces-example ...
Next-level shuffling in streaming mode
  • Use multiple input shards for shuffle buffer by @​lhoestq in #​8194

    ds = load_dataset(..., streaming=True)
    ds = ds.shuffle(seed=42)
    # or configure local buffer shuffling manually, default is:
    ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)

    [Images: before👎 and after✨ comparison]

    toy example comparison

    from datasets import IterableDataset
    
    ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
    ds = ds.shuffle(seed=42)
    
    print("Cold start ids:")
    print(list(ds.take(10)["i"]))
    print("Nominal regime ids:")
    print(list(ds.skip(10_000).take(10)["i"]))

    before👎:

    Cold start ids:
    [6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
    Nominal regime ids:
    [6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]
    

    after✨:

    Cold start ids:
    [7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
    Nominal regime ids:
    [9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]
    

    Note: ds.state_dict() and ds.load_state_dict() are still supported for this improved shuffling :) enabling dataset checkpointing

    Note 2: it uses threads to fetch the first examples in parallel from the input shards

    Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()
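
    As a rough illustration of why reading from several input shards helps the cold start, here is a pure-Python toy. It is not the datasets implementation: the function name buffer_shuffle and its argument max_input_shards are hypothetical stand-ins for the buffer_size and max_buffer_input_shards options described above, and the round-robin reading order is an assumption.

```python
import random
from itertools import islice


def buffer_shuffle(shards, buffer_size, max_input_shards, seed):
    """Yield every item of `shards` (a list of iterables) via a shuffle buffer.

    Up to `max_input_shards` shards are read round-robin at any time; when one
    is exhausted, the next unopened shard (if any) takes its place.
    """
    rng = random.Random(seed)
    pending = iter(shards)  # shards that have not been opened yet
    active = [iter(s) for s in islice(pending, max_input_shards)]

    def interleaved():
        while active:
            for it in list(active):  # iterate over a snapshot, `active` is mutated
                try:
                    yield next(it)
                except StopIteration:
                    active.remove(it)
                    replacement = next(pending, None)
                    if replacement is not None:
                        active.append(iter(replacement))

    buffer = []
    for item in interleaved():
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the incoming item in its slot
    rng.shuffle(buffer)
    yield from buffer            # flush what is left at the end


# 8 toy shards of 100 consecutive ids each
shards = [range(i * 100, (i + 1) * 100) for i in range(8)]
print(list(islice(buffer_shuffle(shards, 10, 1, seed=42), 10)))  # one input shard
print(list(islice(buffer_shuffle(shards, 10, 4, seed=42), 10)))  # four input shards
```

    With a single input shard, the first ids can only come from the first shard, which mirrors the "before" output above. With several input shards, the buffer is filled from all of them, so early ids can already span distant parts of the dataset.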

New batching features for robotics datasets
  • Add batch(by_column=...) by @​lhoestq in #​8172

    from datasets import Dataset
    
    ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
    # ds = ds.to_iterable_dataset()
    ds = ds.batch(by_column="episode")
    for x in ds:
        print(x)
    # {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
    # {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
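
    The grouping shown above can be pictured with a small standard-library sketch. This is only an illustration under the assumption that rows sharing a value are consecutive, as in the example; batch_by_column is a hypothetical helper, not the library's code.

```python
from itertools import groupby


def batch_by_column(rows, column):
    """Group consecutive rows that share the same value in `column`.

    Each batch is returned as a dict of lists, one list per column.
    """
    for _, group in groupby(rows, key=lambda row: row[column]):
        group = list(group)
        yield {name: [row[name] for row in group] for name in group[0]}


# Same toy data as above: two episodes of ten frames each
rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(10)]
batches = list(batch_by_column(rows, "episode"))
for batch in batches:
    print(batch)
```
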
New supported formats

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@4.8.5...5.0.0

v4.8.5

Compare Source

Main bug fixes
Other improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@4.8.4...4.8.5

v4.8.4

Compare Source

What's Changed

Full Changelog: huggingface/datasets@4.8.3...4.8.4

v4.8.3

Compare Source

What's Changed

Full Changelog: huggingface/datasets@4.8.2...4.8.3

v4.8.2

Compare Source

What's Changed

Full Changelog: huggingface/datasets@4.8.1...4.8.2

v4.8.1

Compare Source

What's Changed

Full Changelog: huggingface/datasets@4.8.0...4.8.1

v4.8.0

Compare Source

Dataset Features

  • Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @​lhoestq in #​8064

    from datasets import load_dataset
    # load raw data from a Storage Bucket on HF
    ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
    # or manually, using hf:// paths
    ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
    # process, filter
    ds = ds.map(...).filter(...)
    # publish the AI-ready dataset
    ds.push_to_hub("username/my-dataset-ready-for-training")

    This also fixes multiprocessed push_to_hub on macOS, which was causing a segfault (it now uses spawn instead of fork).
    It also bumps the dill and multiprocess versions to support Python 3.14

  • Datasets streaming iterable packaged improvements and fixes by @​Michael-RDev in #​8068

    • added max_shard_size to IterableDataset.push_to_hub (but this requires iterating over the dataset twice to know its full size - improvements are welcome)
    • more arrow-native iterable operations for IterableDataset
    • better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
    • fixes for to_pandas, videofolder, load_dataset_builder kwargs

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.7.0...4.8.0

v4.7.0

Compare Source

Datasets Features

  • Add Json() type by @​lhoestq in #​8027
    • JSON Lines files that contain arbitrary JSON objects, like tool calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type is used to store data that would normally not be supported in Arrow/Parquet
    • Use the Json() type in Features() for any dataset; it is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
    • Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()

Examples:

You can use on_mixed_types="use_json" or specify features= with a Json() type:

>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK

Another example with tool calling data and the on_mixed_types="use_json" argument (useful to not have to specify features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
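
To see why mixed-type values need special handling in a uniformly typed column, here is a standard-library sketch. It only illustrates the general idea of serializing each value to a JSON string and decoding it on read; it does not claim to show how Json() is implemented, and all three helper names are hypothetical.

```python
import json


def has_mixed_types(values):
    """Return True if the non-null values do not all share one Python type."""
    return len({type(v) for v in values if v is not None}) > 1


def encode_column(values):
    """Serialize every value to a JSON string so the column has a single type."""
    return [json.dumps(v) for v in values]


def decode_column(encoded):
    """Parse the JSON strings back into the original Python objects."""
    return [json.loads(s) for s in encoded]


column = [0, "foo", {"subfield": "bar"}]
encoded = encode_column(column)
print(has_mixed_types(column), encoded)
print(decode_column(encoded) == column)
```
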

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.6.1...4.7.0

v4.6.1

Compare Source

Bug fix

Full Changelog: huggingface/datasets@4.6.0...4.6.1

v4.6.0

Compare Source

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
  • Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @​lhoestq in #​7976

    • this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicate videos between Lance, WebDataset, Parquet files and plain video files and make downloads and uploads faster to Hugging Face
    • E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication

  • Add IterableDataset.reshard() by @​lhoestq in #​7992

    Reshard the dataset if possible, i.e. split the current shards further into more shards.
    This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
    Equality may happen if no shard can be split further.

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome !)
    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })
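
    The Parquet rule above (one shard per row group instead of per file) can be sketched in plain Python. The file names and the even split of 900 row groups per file are hypothetical, chosen only to reproduce the 4 and 3600 counts from the example; reshard_by_row_group is not a library function.

```python
def reshard_by_row_group(row_groups_per_file):
    """Turn one shard per file into one shard per (file, row group) pair."""
    return [
        (file_name, row_group)
        for file_name, n_row_groups in row_groups_per_file.items()
        for row_group in range(n_row_groups)
    ]


# Hypothetical layout: 4 Parquet files with 900 row groups each
files = {f"train-{i:05d}.parquet": 900 for i in range(4)}
shards = reshard_by_row_group(files)
print(len(files), "->", len(shards))
```

    A file with a single row group cannot be split further, which is the equality case mentioned above.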

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.5.0...4.6.0

v4.5.0

Compare Source

Dataset Features

  • Add lance format support by @​eddyxu in #​7913

    • Support for both Lance dataset (including metadata / manifests) and standalone .lance files
    • e.g. with lance-format/fineweb-edu
    from datasets import load_dataset
    
    ds = load_dataset("lance-format/fineweb-edu", streaming=True)
    for example in ds["train"]:
        ...

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.4.2...4.5.0

v4.4.2

Compare Source

Bug fixes

Minor additions

New Contributors

Full Changelog: huggingface/datasets@4.4.1...4.4.2

v4.4.1

Compare Source

Bug fixes and improvements

Full Changelog: huggingface/datasets@4.4.0...4.4.1

v4.4.0

Compare Source

Dataset Features

  • Add nifti support by @​CloseChoice in #​7815

    • Load medical imaging datasets from Hugging Face:
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
    • Load medical imaging datasets from your disk:
    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
  • Add num channels to audio by @​CloseChoice in #​7840

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.3.0...4.4.0

v4.3.0

Compare Source

Dataset Features

Enable large scale distributed dataset streaming:

These improvements require huggingface_hub>=1.1.0 to take full effect

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.2.0...4.3.0

v4.2.0

Compare Source

Dataset Features

  • Sample without replacement option when interleaving datasets by @​radulescupetru in #​7786

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @​lhoestq in #​7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  • Add parquet scan options and docs by @​lhoestq in #​7801

    • docs to select columns and filter data efficiently
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    • new argument to control buffering and caching when streaming
    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
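
    For the all_exhausted_without_replacement strategy listed first above, a toy sketch may help. It assumes the strategy means: keep alternating between sources until all are exhausted, and drop an exhausted source instead of restarting it, so each example appears once. The helper name is hypothetical and this is not the library's code.

```python
def interleave_without_replacement(datasets):
    """Alternate between sources; drop each source once it is exhausted."""
    iterators = [iter(d) for d in datasets]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                continue  # exhausted: do not restart it, just drop it
            still_active.append(it)
        iterators = still_active


result = list(interleave_without_replacement([[1, 2, 3], ["a"], [10, 20]]))
print(result)  # [1, 'a', 10, 2, 20, 3]
```
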

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.1.1...4.2.0

v4.1.1

Compare Source

What's Changed

New Contributors

Full Changelog: huggingface/datasets@4.1.0...4.1.1

v4.1.0

Compare Source

Dataset Features

  • feat: use content defined chunking by @​kszucs in #​7589

    • internally uses use_content_defined_chunking=True when writing Parquet files
    • this enables fast deduped uploads to Hugging Face !
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so that duplicate data can be detected easily.
    • with this change, the new default row group size for Parquet is set to 100MB
  • Concurrent push_to_hub by @​lhoestq in #​7708

  • Concurrent IterableDataset push_to_hub by @​lhoestq in #​7710

  • HDF5 support by @​klamike in #​7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
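
    The content defined chunking idea mentioned under #7589 can be illustrated with a generic byte-level toy. It is not how Parquet or Xet implement it (Parquet applies the idea to page boundaries): the window size, mask and blake2b hash are arbitrary choices for this sketch, and cdc_chunks is a hypothetical name.

```python
import hashlib
import random


def cdc_chunks(data, mask=0x3F, window=16):
    """Cut `data` wherever a hash of the last `window` bytes matches `mask`.

    Cut points depend only on nearby content, not on absolute offsets.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        digest = hashlib.blake2b(data[end - window:end], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = b"hello" + original  # insert a few bytes at the front

old_chunks = cdc_chunks(original)
new_chunks = cdc_chunks(edited)
unchanged = set(old_chunks) & set(new_chunks)
print(f"{len(unchanged)} of {len(set(old_chunks))} distinct chunks are reusable")
```

    Because the cut points follow the content, inserting bytes at the front only disturbs the first chunk; a fixed-size split would shift every boundary and leave nothing to deduplicate.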

Other improvements and bug fixes

Note

PR body was truncated to here.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate-bot requested review from a team as code owners June 5, 2026 16:18
@trusted-contributions-gcf (bot) added the kokoro:force-run and owlbot:run labels Jun 5, 2026
@product-auto-label (bot) added the samples and api: people-and-planet-ai labels Jun 5, 2026

@gemini-code-assist (bot) left a comment

Code Review

This pull request updates the datasets dependency version from 4.0.0 to 5.0.0 in the pyproject.toml file of the weather-model serving component. There are no review comments, and I have no feedback to provide.

@kokoro-team removed the kokoro:force-run label Jun 5, 2026

Labels

api: people-and-planet-ai, owlbot:run, samples

2 participants