fix(deps): update dependency datasets to v5 by renovate-bot · Pull Request #14287 · GoogleCloudPlatform/python-docs-samples

renovate-bot · 2026-06-05T16:18:12Z

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package	Change	Age	Confidence
datasets	`==4.0.0` → `==5.0.0`

Release Notes

huggingface/datasets (datasets)

`v5.0.0`

Compare Source

Datasets Features

Agent traces

Parse Agent traces messages for SFT using teich by @lhoestq in #8232

Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
Load the data:

>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
>>> ds[0]["messages"]
[{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...'
 ...]

Train on agent traces:

trl sft --dataset-name lhoestq/agent-traces-example ...

Find all the agent traces datasets on HF here: https://huggingface.co/datasets?format=format:agent-traces&sort=trending

Next-level shuffling in streaming mode

Use multiple input shards for shuffle buffer by @lhoestq in #8194

ds = load_dataset(..., streaming=True)
ds = ds.shuffle(seed=42)
# or configure local buffer shuffling manually, default is:
ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)


toy example comparison

from datasets import IterableDataset

ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
ds = ds.shuffle(seed=42)

print("Cold start ids:")
print(list(ds.take(10)["i"]))
print("Nominal regime ids:")
print(list(ds.skip(10_000).take(10)["i"]))

before👎:

Cold start ids:
[6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
Nominal regime ids:
[6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]

after✨:

Cold start ids:
[7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
Nominal regime ids:
[9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]

Note: ds.state_dict() and ds.load_state_dict() are still supported for this improved shuffling :) enabling dataset checkpointing

Note 2: it uses threads to fetch the first examples in parallel from the input shards

Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()
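
To picture why drawing from several input shards matters, here is a toy, pure-Python sketch of a shuffle buffer fed by interleaved shards. It is not the library's implementation: `interleave_shards`, `buffer_shuffle`, and `toy_shuffle` are hypothetical helpers, and only the parameter names `buffer_size` and `max_buffer_input_shards` are taken from the notes above.

```python
import random
from itertools import islice


def interleave_shards(shards, max_input_shards, rng):
    """Round-robin over up to `max_input_shards` shards at a time."""
    order = list(range(len(shards)))
    rng.shuffle(order)  # visit shards in a random order
    pending = [iter(shards[i]) for i in order]
    active, pending = pending[:max_input_shards], pending[max_input_shards:]
    while active:
        for it in list(active):
            try:
                item = next(it)
            except StopIteration:
                active.remove(it)  # shard exhausted: swap in the next one
                if pending:
                    active.append(pending.pop(0))
                continue
            yield item


def buffer_shuffle(examples, buffer_size, rng):
    """Classic shuffle buffer: emit a random buffered item for each new one."""
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = example
    rng.shuffle(buffer)  # flush what is left at the end of the stream
    yield from buffer


def toy_shuffle(shards, seed, buffer_size=1000, max_buffer_input_shards=10):
    rng = random.Random(seed)
    stream = interleave_shards(shards, max_buffer_input_shards, rng)
    return buffer_shuffle(stream, buffer_size, rng)


# 20 toy shards of 100 consecutive ids each (hypothetical data)
shards = [range(i * 100, (i + 1) * 100) for i in range(20)]
one_shard = list(islice(toy_shuffle(shards, 42, 50, max_buffer_input_shards=1), 10))
many_shards = list(islice(toy_shuffle(shards, 42, 50, max_buffer_input_shards=10), 10))
print(sorted({x // 100 for x in one_shard}))    # ids of the shards seen first
print(sorted({x // 100 for x in many_shards}))
```

With a single input shard, the first examples all come from one shard, which mirrors the clustered ids in the "before" output; with several input shards the buffer mixes ids from different parts of the dataset right away.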

New batching features for robotics datasets

Add batch(by_column=...) by @lhoestq in #8172

from datasets import Dataset

ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
# ds = ds.to_iterable_dataset()
ds = ds.batch(by_column="episode")
for x in ds:
    print(x)
# {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
# {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
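
As a rough mental model of the output above, grouping rows by a column can be sketched in plain Python with `itertools.groupby`. This is an illustration only: `batch_by_column` is a hypothetical helper, and it groups consecutive rows that share a value, which is an assumption about the behaviour rather than something the notes state.

```python
from itertools import groupby


def batch_by_column(rows, column):
    """Yield one column-oriented batch per run of rows sharing `row[column]`."""
    for _, run in groupby(rows, key=lambda row: row[column]):
        run = list(run)
        yield {name: [row[name] for row in run] for name in run[0]}


# Same toy data as above: two episodes of ten frames each
rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(10)]
batches = list(batch_by_column(rows, "episode"))
for batch in batches:
    print(batch)
```

Each batch then holds one full episode, which is the shape typically wanted when training on whole robot trajectories.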

New supported formats

Add Apache Iceberg format support by @frankliee in #8148
feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format by @JackieTien97 in #8160
feat: add 3D mesh support and MeshFolder builder by @Vinay-Umrethe in #8055
Add .conll / .conllu dataset format loader (CoNLL-2003 / 2000 / U) by @CrypticCortex in #8219

Other improvements and bug fixes

Pass library_name/version to HfApi in dataset push and delete paths by @davanstrien in #8161
Fix storage_options lookup for streaming Lance datasets by @ericjaebeom in #8166
add agent trace prompt, sent_at, count fields by @cfahlgren1 in #8163
fix: add num_proc argument to Dataset.to_sql by @EricSaikali in #7791
Support fsspec 2026.4.0 by @lhoestq in #8175
Fix Parquet streaming hangs at the end of script by @lhoestq in #8176
ClassLabel docs: Correct value for unknown labels by @l-uuz in #7645
fix parquet reshard by @lhoestq in #8193
Fix parquet columns arg by @lhoestq in #8210
update readme by @lhoestq in #8208
update single seg repos in ci by @lhoestq in #8213
Fix single lance file form pylance 7.0 by @lhoestq in #8225
fix(map): fix progress bar exceeding total when load_from_cache_file=False by @Nitin-Rajasekar in #8170
fix: embed_external_files=True for mesh support by @Vinay-Umrethe in #8224
Fix iterable skip over full Arrow blocks by @my17th2 in #8236
Keep None as a real null in Json() columns instead of the string "null" by @adityasingh2400 in #8231
Support composed splits in streaming datasets by @lanarkite99 in #8220

New Contributors

@ericjaebeom made their first contribution in #8166
@EricSaikali made their first contribution in #7791
@l-uuz made their first contribution in #7645
@CrypticCortex made their first contribution in #8219
@frankliee made their first contribution in #8148
@Vinay-Umrethe made their first contribution in #8055
@Nitin-Rajasekar made their first contribution in #8170
@JackieTien97 made their first contribution in #8160
@my17th2 made their first contribution in #8236
@adityasingh2400 made their first contribution in #8231
@lanarkite99 made their first contribution in #8220

Full Changelog: huggingface/datasets@4.8.5...5.0.0

`v4.8.5`

Compare Source

Main bug fixes

fix: decode Json() values before calling DataFrame.to_json() (#8116) by @Brianzhengca in #8122
Fix: decode JSON type before to_list or to_dict is called by @ItsTania in #8137
Fix batching for table-formatted datasets by @bluehyena in #8126
Fix iterable map resume state by @Brianzhengca in #8147
don't embed remote files in download_and_prepare to parquet by @lhoestq in #8150

Other improvements and bug fixes

Parse agent traces by @lhoestq in #8113
🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #8114
chore: bump doc-builder SHA for PR upload workflow by @rtrompier in #8134
Remove print statement in JSON processing by @lhoestq in #8136
Don't include files list DatasetInfo (and remove old stuff) by @lhoestq in #8128
update ci uer by @lhoestq in #8139
fix warning in ci by @lhoestq in #8140
fix mask in embed_storage for remote files by @lhoestq in #8151
fix original_files missing in ci json test by @lhoestq in #8152
Fix null in embed storage by @lhoestq in #8154
Fix base_path in integration tests by @lhoestq in #8155

New Contributors

@paulinebm made their first contribution in #8114
@Brianzhengca made their first contribution in #8122
@bluehyena made their first contribution in #8126
@rtrompier made their first contribution in #8134
@ItsTania made their first contribution in #8137

Full Changelog: huggingface/datasets@4.8.4...4.8.5

`v4.8.4`

Compare Source

What's Changed

Support latest torchvision by @lhoestq in #8087
fix regression when loading JSON with one file = one object by @lhoestq in #8086

Full Changelog: huggingface/datasets@4.8.3...4.8.4

`v4.8.3`

Compare Source

What's Changed

Fix split_dataset_by_node step by @lhoestq in #8081
Fix docstring of Json.cast_storage by @albertvillanova in #8080

Full Changelog: huggingface/datasets@4.8.2...4.8.3

`v4.8.2`

Compare Source

What's Changed

Json type for empty struct by @lhoestq in #8074

Full Changelog: huggingface/datasets@4.8.1...4.8.2

`v4.8.1`

Compare Source

What's Changed

Fix formatted iter arrow double yield by @HaukurPall in #8063

Full Changelog: huggingface/datasets@4.8.0...4.8.1

`v4.8.0`

Compare Source

Dataset Features

Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in #8064

from datasets import load_dataset
# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
# process, filter
ds = ds.map(...).filter(...)
# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")

This also fixes multiprocessed push_to_hub on macOS, which was causing a segfault (it now uses spawn instead of fork).
It also bumps the dill and multiprocess versions to support Python 3.14.

Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in #8068
- added max_shard_size to IterableDataset.push_to_hub (but requires iterating twice to know the full dataset size - improvements are welcome)
- more arrow-native iterable operations for IterableDataset
- better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
- fixes for to_pandas, videofolder, load_dataset_builder kwargs
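
The two-pass requirement for max_shard_size can be pictured with a small sketch: a stream has no known length, so one pass measures it and a second pass writes the shards. This is only a guess at the general idea, not the library's code; `plan_shards`, `assign_shards`, and the byte sizes are hypothetical.

```python
import math


def plan_shards(example_sizes, max_shard_size):
    """First pass: total the sizes to decide how many shards are needed."""
    total = sum(example_sizes)
    return max(1, math.ceil(total / max_shard_size))


def assign_shards(example_sizes, num_shards):
    """Second pass: spread examples over shards in contiguous runs."""
    n = len(example_sizes)
    return [i * num_shards // n for i in range(n)]


sizes = [40, 60, 50, 50, 30, 70]  # hypothetical per-example sizes in bytes
num_shards = plan_shards(sizes, max_shard_size=100)
assignment = assign_shards(sizes, num_shards)
print(num_shards, assignment)
```

Splitting by example count only keeps shards roughly under the limit when examples have similar sizes; a real writer would need to handle skewed sizes.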

What's Changed

fix reshard_data_sources by @lhoestq in #8061
Improve error message for invalid data_files pattern format by @kushalkkb in #8060
fix null filling in missing jsonl columns by @lhoestq in #8069

New Contributors

@kushalkkb made their first contribution in #8060
@Michael-RDev made their first contribution in #8068

Full Changelog: huggingface/datasets@4.7.0...4.8.0

`v4.7.0`

Compare Source

Datasets Features

Add Json() type by @lhoestq in #8027
- JSON Lines files that contain arbitrary JSON objects like tool calling datasets are now supported. When there is a field or subfield containing mixed types (e.g. mix of str/int/float/dict/list or dictionaries with arbitrary keys), the Json() type is used to store such data that would normally not be supported in Arrow/Parquet
- Use the Json() type in Features() for any dataset; it is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
- Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()

Examples:

You can use on_mixed_types="use_json" or specify features= with a [Json] type:

>>> from datasets import Dataset, Features, Json, List
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK

Another example with tool calling data and the on_mixed_types="use_json" argument (useful to not have to specify features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
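
One way to picture what a JSON-typed column buys you: values of different Python types cannot share a strictly typed column, but their JSON encodings are all strings. The sketch below shows that round trip with the standard `json` module. It is an analogy only; `encode_json_column` and `decode_json_column` are hypothetical helpers, and how Json() actually stores data is not described in these notes.

```python
import json


def encode_json_column(values):
    """Store every value as a JSON string so the column has one uniform type."""
    return [json.dumps(value) for value in values]


def decode_json_column(encoded):
    """Recover the original Python objects when reading the column back."""
    return [json.loads(text) for text in encoded]


mixed = [0, "foo", {"subfield": "bar"}]
stored = encode_json_column(mixed)
print(stored)
print(decode_json_column(stored) == mixed)
```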

What's Changed

Fix typos in iterable_dataset.py by @omkar-334 in #8049
Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in #8039
Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in #8041
Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in #8044
Don't extract bad files by @lhoestq in #8056
fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in #8053
fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in #8047
Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in #7982
Limit dataset listing to first 20 entries in readme by @lhoestq in #8057

New Contributors

@omkar-334 made their first contribution in #8049
@Nexround made their first contribution in #8039
@HaukurPall made their first contribution in #8041
@s-zx made their first contribution in #8053
@ain-soph made their first contribution in #8047
@KOKOSde made their first contribution in #7982

Full Changelog: huggingface/datasets@4.6.1...4.7.0

`v4.6.1`

Compare Source

Bug fix

Remove tmp file in push to hub by @lhoestq in #8030

Full Changelog: huggingface/datasets@4.6.0...4.6.1

`v4.6.0`

Compare Source

Dataset Features

Support Image, Video and Audio types in Lance datasets

Infer types from lance blobs by @lhoestq in #7966

>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(),
 'video_path': Value('string'),
 'caption': Value('string'),
 'aesthetic_score': Value('float64'),
 'motion_score': Value('float64'),
 'temporal_consistency_score': Value('float64'),
 'camera_motion': Value('string'),
 'frame': Value('int64'),
 'fps': Value('float64'),
 'seconds': Value('float64'),
 'embedding': List(Value('float32'), length=1024)}

Push to hub now supports Video types

push_to_hub() for videos by @lhoestq in #7971

>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")

Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in #7976
- this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicate videos between Lance, WebDataset, Parquet files and plain video files and make downloads and uploads faster to Hugging Face
- E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
- See more info here: https://huggingface.co/docs/hub/en/xet/deduplication

Add IterableDataset.reshard() by @lhoestq in #7992

Reshard the dataset if possible, i.e. split the current shards further into more shards.
This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
Equality may happen if no shard can be split further.

The resharding mechanism depends on the dataset file format:
- Parquet: shard per row group instead of per file
- Other: not implemented yet (contributions are welcome !)
```
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> ds.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```
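
The Parquet rule above (one shard per row group instead of one per file) can be sketched as a simple expansion. The file names and row-group counts are made up, and `reshard_by_row_group` is a hypothetical helper rather than anything from the library.

```python
def reshard_by_row_group(files):
    """Turn one-shard-per-file into one-shard-per-row-group."""
    return [
        (name, index)
        for name, num_row_groups in files
        for index in range(num_row_groups)
    ]


# (file name, number of row groups) pairs, purely illustrative
files = [("part-0.parquet", 3), ("part-1.parquet", 1), ("part-2.parquet", 2)]
shards = reshard_by_row_group(files)
print(len(files), "->", len(shards))
```

A file with a single row group cannot be split further, which is why the shard count can stay equal in the worst case.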

What's Changed

Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in #7919
Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in #7961
docs: fix grammar and add type hints in splits.py by @Edge-Explorer in #7960
Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in #7955
Add examples for Lance datasets by @prrao87 in #7950
Support null in json string cols by @lhoestq in #7963
handle blob lance by @lhoestq in #7964
Count examples in lance by @lhoestq in #7969
Use temp files in push_to_hub to save memory by @lhoestq in #7979
Drop python 3.9 by @lhoestq in #7980
Support pandas 3 by @lhoestq in #7981
Remove unused data files optims by @lhoestq in #7985
Remove pre-release workaround in CI for transformers v5 and huggingface_hub v1 by @hanouticelina in #7989
very basic support for more hf urls by @lhoestq in #8003
Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in #7995
Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in #8000
More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in #8009
Support empty shard in from_generator by @lhoestq in #8023
Allow import polars in map() by @lhoestq in #8024

New Contributors

@omarfarhoud made their first contribution in #7919
@Edge-Explorer made their first contribution in #7960
@prathamk-tw made their first contribution in #7955
@prrao87 made their first contribution in #7950
@hanouticelina made their first contribution in #7989
@jayzuccarelli made their first contribution in #7995
@AnkitAhlawat7742 made their first contribution in #8000

Full Changelog: huggingface/datasets@4.5.0...4.6.0

`v4.5.0`

Compare Source

Dataset Features

Add lance format support by @eddyxu in #7913
- Support for both Lance dataset (including metadata / manifests) and standalone .lance files
- e.g. with lance-format/fineweb-edu
```
from datasets import load_dataset

ds = load_dataset("lance-format/fineweb-edu", streaming=True)
for example in ds["train"]:
    ...
```

What's Changed

Raise early for invalid revision in load_dataset by @Scott-Simmons in #7929
fix low but large example indexerror by @CloseChoice in #7912
Fix method to retrieve attributes from file object by @lhoestq in #7938
add _OverridableIOWrapper by @lhoestq in #7942
Add _generate_shards by @lhoestq in #7943

New Contributors

@eddyxu made their first contribution in #7913
@Scott-Simmons made their first contribution in #7929

Full Changelog: huggingface/datasets@4.4.2...4.5.0

`v4.4.2`

Compare Source

Bug fixes

Fix embed storage nifti by @CloseChoice in #7853
ArXiv -> HF Papers by @qgallouedec in #7855
fix some broken links by @julien-c in #7859
Nifti visualization support by @CloseChoice in #7874
Replace papaya with niivue by @CloseChoice in #7878
Fix 7846: add_column and add_item erroneously(?) require new_fingerprint parameter by @sajmaru in #7884
fix(fingerprint): treat TMPDIR as strict API and fail (Issue #7877) by @ada-ggf25 in #7891
encode nifti correctly when uploading lazily by @CloseChoice in #7892
fix(nifti): enable lazy loading for Nifti1ImageWrapper by @The-Obstacle-Is-The-Way in #7887

Minor additions

Add type overloads to load_dataset for better static type inference by @Aditya2755 in #7888
Add inspect_ai eval logs support by @lhoestq in #7899
Save input shard lengths by @lhoestq in #7897
Don't save original_shard_lengths by default for backward compat by @lhoestq in #7906

New Contributors

@sajmaru made their first contribution in #7884
@Aditya2755 made their first contribution in #7888
@ada-ggf25 made their first contribution in #7891
@The-Obstacle-Is-The-Way made their first contribution in #7887

Full Changelog: huggingface/datasets@4.4.1...4.4.2

`v4.4.1`

Compare Source

Bug fixes and improvements

Better streaming retries (504 and 429) by @lhoestq in #7847
DOC: remove mode parameter in docstring of pdf and video feature by @CloseChoice in #7848

Full Changelog: huggingface/datasets@4.4.0...4.4.1

`v4.4.0`

Compare Source

Dataset Features

Add nifti support by @CloseChoice in #7815

Load medical imaging datasets from Hugging Face:

ds = load_dataset("username/my_nifti_dataset")
ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}

Load medical imaging datasets from your disk:

from datasets import Dataset, Nifti

files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}

Documentation: https://huggingface.co/docs/datasets/nifti_dataset

Add num channels to audio by @CloseChoice in #7840

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

Python 3.14 support by @lhoestq in #7836

What's Changed

Fix random seed on shuffle and interleave_datasets by @CloseChoice in #7823
fix ci compressionfs by @lhoestq in #7830
fix: better args passthrough for _batch_setitems() by @sghng in #7817
Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in #7833
resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in #7831
fix column with transform by @lhoestq in #7843
support fsspec 2025.10.0 by @lhoestq in #7844

New Contributors

@sghng made their first contribution in #7817
@art-test-stack made their first contribution in #7833

Full Changelog: huggingface/datasets@4.3.0...4.4.0

`v4.3.0`

Compare Source

Dataset Features

Enable large scale distributed dataset streaming:

Keep hffs cache in workers when streaming by @lhoestq in #7820
Retry open hf file by @lhoestq in #7822

These improvements require huggingface_hub>=1.1.0 to take full effect

What's Changed

fix conda deps by @lhoestq in #7810
Add pyarrow's binary view to features by @delta003 in #7795
Fix polars cast column image by @CloseChoice in #7800
Allow streaming hdf5 files by @lhoestq in #7814
Fix batch_size default description in to_polars docstrings by @albertvillanova in #7824
docs: document_dataset PDFs & OCR by @ethanknights in #7812
Add custom fingerprint support to from_generator by @simonreise in #7533
picklable batch_fn by @lhoestq in #7826

New Contributors

@delta003 made their first contribution in #7795
@CloseChoice made their first contribution in #7800
@ethanknights made their first contribution in #7812
@simonreise made their first contribution in #7533

Full Changelog: huggingface/datasets@4.2.0...4.3.0

`v4.2.0`

Compare Source

Dataset Features

Sample without replacement option when interleaving datasets by @radulescupetru in #7786

ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
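
A minimal sketch of what interleaving without replacement means, assuming the strategy alternates between sources and drops each one once it is exhausted, so every example appears exactly once. `interleave_without_replacement` is a hypothetical pure-Python helper, not the library function, and it ignores sampling probabilities.

```python
def interleave_without_replacement(sources):
    """Alternate between sources; drop each source once it runs out."""
    iterators = [iter(source) for source in sources]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                continue  # exhausted sources are not restarted
            still_active.append(it)
            yield item
        iterators = still_active


mixed = list(interleave_without_replacement([[1, 2, 3], [10], [20, 21]]))
print(mixed)  # [1, 10, 20, 2, 21, 3]
```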

Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806
```
ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
```

Add parquet scan options and docs by @lhoestq in #7801

docs to select columns and filter data efficiently

ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
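
To make the semantics of `columns` and `filters` concrete, here is a small in-memory sketch. It assumes each filter is a `(column, op, value)` tuple and that the tuples are combined with AND, as in PyArrow's filter convention; `select` and `OPS` are hypothetical names, and a real Parquet reader applies these at the file level instead of row by row.

```python
import operator

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def select(rows, columns=None, filters=None):
    """Keep rows matching every (column, op, value) filter, then project columns."""
    for row in rows:
        if all(OPS[op](row[col], value) for col, op, value in filters or []):
            yield {c: row[c] for c in columns} if columns else dict(row)


rows = [
    {"col_0": 0, "col_1": "a", "col_2": 1.5},
    {"col_0": 1, "col_1": "b", "col_2": 2.5},
]
kept = list(select(rows, columns=["col_0", "col_1"], filters=[("col_0", "==", 0)]))
print(kept)  # [{'col_0': 0, 'col_1': 'a'}]
```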

new argument to control buffering and caching when streaming

import pyarrow
import pyarrow.dataset
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

What's Changed

Document HDF5 support by @klamike in #7740
update tips in docs by @lhoestq in #7790
feat: avoid some copies in torch formatter by @drbh in #7787
Support huggingface_hub v0.x and v1.x by @Wauplin in #7783
Define CI future by @lhoestq in #7799
More Parquet streaming docs by @lhoestq in #7803
Less api calls when resolving data_files by @lhoestq in #7805
typo by @lhoestq in #7807

New Contributors

@drbh made their first contribution in #7787

Full Changelog: huggingface/datasets@4.1.1...4.2.0

`v4.1.1`

Compare Source

What's Changed

fix iterate nested field by @lhoestq in #7775
Add support for arrow iterable when concatenating or interleaving by @radulescupetru in #7771
fix empty dataset to_parquet by @lhoestq in #7779

New Contributors

@radulescupetru made their first contribution in #7771

Full Changelog: huggingface/datasets@4.1.0...4.1.1

`v4.1.0`

Compare Source

Dataset Features

feat: use content defined chunking by @kszucs in #7589
- internally uses use_content_defined_chunking=True when writing Parquet files
- this enables fast deduped uploads to Hugging Face !
```
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
```
- this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, which makes duplicate data easy to detect.
- with this change, the new default row group size for Parquet is set to 100MB
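
The general idea behind content defined chunking can be sketched in a few lines: cut a byte stream wherever a hash of the last few bytes matches a pattern, so that boundaries follow the content and not fixed offsets. This is a toy illustration, not how Parquet or Xet implement it; `content_defined_chunks`, the CRC-based fingerprint, and the window and mask values are all assumptions chosen for brevity.

```python
import random
import zlib


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut after any position where a hash of the last `window` bytes matches."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        fingerprint = zlib.crc32(data[max(0, end - window):end])
        if fingerprint & mask == 0:  # boundary depends only on nearby bytes
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"new header " + original  # insert bytes at the front

old_chunks = content_defined_chunks(original)
new_chunks = set(content_defined_chunks(edited))
reused = sum(chunk in new_chunks for chunk in old_chunks)
print(f"{reused} of {len(old_chunks)} chunks are unchanged")
```

With fixed-size chunks, the inserted header would shift every boundary and nothing could be deduplicated; here most chunks survive the edit. Production systems add minimum and maximum chunk sizes and a true rolling hash.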
Concurrent push_to_hub by @lhoestq in #7708
Concurrent IterableDataset push_to_hub by @lhoestq in #7710
HDF5 support by @klamike in #7690
- load HDF5 datasets in one line of code
```
ds = load_dataset("username/dataset-with-hdf5-files")
```
- each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows

Other improvements and bug fixes

Convert to string when needed + faster .zstd by @lhoestq in #7683
fix audio cast storage from array + sampling_rate by @lhoestq in #7684
Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
Update dataset_dict push_to_hub by @lhoestq in #7711
Retry intermediate commits too by @lhoestq in #7712
num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
fix num_proc=1 ci test by @lhoestq in #7714
Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
typo by @lhoestq in #7716
fix largelist repr by @lhoestq in #7735
Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
Fix type hint train_test_split by @qgallouedec in #7736
fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
Refactor HDF5 and preserve tree structure by @klamike in #7743
docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
Audio: use TorchCodec instead of Soundfile for encoding by [@lhoestq](https://r

✂ Note

PR body was truncated to here.

Configuration

📅 Schedule: (UTC)

Branch creation
- At any time (no schedule defined)
Automerge
- At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

gemini-code-assist

Code Review

This pull request updates the datasets dependency version from 4.0.0 to 5.0.0 in the pyproject.toml file of the weather-model serving component. There are no review comments, and I have no feedback to provide.

fix(deps): update dependency datasets to v5

af241cd

renovate-bot requested review from a team as code owners June 5, 2026 16:18

trusted-contributions-gcf Bot added the kokoro:force-run and owlbot:run labels Jun 5, 2026

product-auto-label Bot added the samples and api: people-and-planet-ai labels Jun 5, 2026

gemini-code-assist Bot reviewed Jun 5, 2026

kokoro-team removed the kokoro:force-run label Jun 5, 2026

renovate-bot wants to merge 1 commit into GoogleCloudPlatform:main from renovate-bot:renovate/datasets-5.x
