Fix chunk casting and schema unification in dataset by ArjunJagdale · Pull Request #7873 · huggingface/datasets

ArjunJagdale · 2025-11-19T18:43:47Z

Updated chunk handling to cast to expected schema when features are provided or to unify schemas when not. This ensures proper schema alignment for the yielded batches.

fixes #7872

This PR fixes a bug where an IterableDataset created from a generator with an explicit features parameter fails during Arrow operations (such as .to_pandas()) when the data contains missing or null values.

Problem

When an IterableDataset is created with explicit features but the generator yields data with missing values (e.g., empty lists), PyArrow would infer different schemas for different batches based on the actual data rather than using the provided schema. This caused ArrowInvalid errors when trying to concatenate batches with mismatched schemas.

Example error:

pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
a: int64
b: list<item: null>
vs
a: int64
b: list<item: struct<c: int64>>

Solution

Modified RebatchedArrowExamplesIterable._iter_arrow() to:

Cast chunks to the expected schema when explicit features are provided
Unify schemas across chunks when no explicit features are set
Gracefully handle cast failures by falling back to the original chunk

This ensures that the user-provided schema is respected throughout the iteration process.

Testing

Verified the fix with the following test case:

import datasets
from datasets import features


def test_to_pandas_works_with_explicit_schema():
    common_features = features.Features(
        {
            "a": features.Value("int64"),
            "b": features.List({"c": features.Value("int64")}),
        }
    )

    def row_generator():
        data = [{"a": 1, "b": []}, {"a": 1, "b": [{"c": 1}]}]
        for row in data:
            yield row

    d = datasets.IterableDataset.from_generator(row_generator, features=common_features)

    print("Iterating…")
    for _ in d.to_pandas():
        pass

test_to_pandas_works_with_explicit_schema()

Before Patch -

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Iterating…
Traceback (most recent call last):
  File "/workspaces/datasets/test_arjun.py", line 24, in <module>
    test_to_pandas_works_with_explicit_schema()
  File "/workspaces/datasets/test_arjun.py", line 21, in test_to_pandas_works_with_explicit_schema
    for _ in d.to_pandas():
  File "/workspaces/datasets/src/datasets/iterable_dataset.py", line 3736, in to_pandas
    table = pa.concat_tables(list(self.with_format("arrow").iter(batch_size=1000)))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/datasets/src/datasets/iterable_dataset.py", line 2596, in iter
    for key, pa_table in iterator:
  File "/workspaces/datasets/src/datasets/iterable_dataset.py", line 2111, in _iter_arrow
    for key, pa_table in self.ex_iterable._iter_arrow():
  File "/workspaces/datasets/src/datasets/iterable_dataset.py", line 632, in _iter_arrow
    yield new_key, pa.Table.from_batches(chunks_buffer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5039, in pyarrow.lib.Table.from_batches
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different: 
a: int64
b: list<item: null>
vs
a: int64
b: list<item: struct<c: int64>>

After Patch -

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Iterating…
@ArjunJagdale ➜ /workspaces/datasets (main) $

ArjunJagdale · 2025-11-22T19:51:30Z

@lhoestq would like to hear from you!

jonathanasdf · 2026-02-04T21:46:16Z

Hi @lhoestq can you please review this? Thanks!

HuggingFaceDocBuilderDev · 2026-02-05T17:54:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArjunJagdale added 2 commits November 20, 2025 00:11

Fix chunk casting and schema unification in dataset

fa8ba23

Updated chunk handling to cast to expected schema when features are provided or to unify schemas when not. This ensures proper schema alignment for the yielded batches.

Update iterable_dataset.py

a12c854

Merge branch 'main' into patch-21

42c11f5

ArjunJagdale wants to merge 3 commits into huggingface:main from ArjunJagdale:patch-21

4 participants
