IndexError when saving few large examples to disk #7911

@CloseChoice

Describe the bug

I ran into this issue while processing a file (900 MB) containing just one example; the snippet below is a simplified, faster reproducer. The problem is that, if num_shards is not set explicitly, it is computed from the dataset size in https://github.com/huggingface/datasets/blob/main/src/datasets/utils/py_utils.py#L96 using the default config.MAX_SHARD_SIZE of 500MB. If a single example is larger than that, the computed shard count exceeds the number of examples, and slicing out the individual shards fails with an IndexError.
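To make the arithmetic concrete, here is a rough sketch of that shard-count computation (the rounding below is an assumption for illustration; the actual formula lives behind the link above):

```python
# Illustrative only: assumed rounding for how save_to_disk derives num_shards
# when the caller does not pass it (see the linked py_utils.py for the real code).
dataset_nbytes = 2 * 1024 * 1024   # one ~2 MB example, as in the reproducer below
max_shard_size = 1 * 1024 * 1024   # corresponds to max_shard_size="1MB"
num_examples = 1

num_shards = dataset_nbytes // max_shard_size + 1  # -> 3, matching "1/3 shards" in the traceback

# Each shard is then materialized via dataset.shard(num_shards=3, index=i, contiguous=True);
# with a single row, shard index 1 already starts past the end of the dataset.
assert num_shards > num_examples
```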

An easy workaround is:

`dataset.save_to_disk(output_path, max_shard_size="1GB")` or `dataset.save_to_disk(output_path, num_shards=1)`.

I believe this should be fixed; it can surface in edge cases with image data, especially when testing on a single partition. The fix would be rather easy: clamp the computed value with num_shards = min(num_examples, <previously_calculated_num_shards>), as sketched below.
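A minimal sketch of that clamp, assuming the shard count is derived roughly as above (the helper name and signature here are hypothetical, not the actual internals of arrow_dataset.py):

```python
import math

def pick_num_shards(dataset_nbytes: int, max_shard_size: int, num_examples: int) -> int:
    """Hypothetical helper: derive a shard count from the target shard size,
    but never request more shards than there are examples, since every
    contiguous shard must contain at least one row."""
    by_size = max(1, math.ceil(dataset_nbytes / max_shard_size))
    return max(1, min(num_examples, by_size))

# With the reproducer's numbers (~2 MB of data, a 1 MB cap, 1 example) this
# yields a single shard instead of three, avoiding the IndexError.
assert pick_num_shards(2 * 1024 * 1024, 1 * 1024 * 1024, 1) == 1
```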

Steps to reproduce the bug

```python
from datasets import Dataset

target_size = 2 * 1024 * 1024  # 2 MB in bytes

base_text = (
    "This is a sample sentence that will be repeated many times to create a large dataset. "
    * 100
)
large_text = ""

# Grow a single string until it exceeds the target size.
while len(large_text.encode("utf-8")) < target_size:
    large_text += base_text

actual_size = len(large_text.encode("utf-8"))
size_mb = actual_size / (1024 * 1024)
print(f"Single example size: {size_mb:.2f} MB")

# A dataset with exactly one (oversized) example.
data = {"text": [large_text], "label": [0], "id": [1]}

dataset = Dataset.from_dict(data)

output_path = "./sample_dataset"
# the 1 MB cap makes save_to_disk compute more shards (3) than there are examples (1)
dataset.save_to_disk(output_path, max_shard_size="1MB")
```

This results in:
```bash
Saving the dataset (1/3 shards): 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 162.96 examples/s]
Traceback (most recent call last):
  File "/home/tpitters/programming/toy-mmu/create_dataset.py", line 27, in <module>
    dataset.save_to_disk(output_path, max_shard_size="1MB")
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 1640, in save_to_disk
    for kwargs in kwargs_per_job:
                  ^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 1617, in <genexpr>
    "shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
             ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 4987, in shard
    return self.select(
           ~~~~~~~~~~~^
        indices=indices,
        ^^^^^^^^^^^^^^^^
    ...<2 lines>...
        writer_batch_size=writer_batch_size,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/fingerprint.py", line 442, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 4104, in select
    return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/fingerprint.py", line 442, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 4164, in _select_contiguous
    _check_valid_indices_value(start, len(self))
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/tpitters/programming/toy-mmu/.venv/lib/python3.13/site-packages/datasets/arrow_dataset.py", line 624, in _check_valid_indices_value
    raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 1 out of range for dataset of size 1.
```
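For completeness, either workaround quoted above lets an equivalent single-oversized-example save go through (the output path here is just for illustration):

```python
from datasets import Dataset

# One ~2 MB example, as in the reproducer above.
dataset = Dataset.from_dict({"text": ["x" * (2 * 1024 * 1024)], "label": [0], "id": [1]})

# Raising the cap above the example size (or pinning num_shards) sidesteps the bug:
dataset.save_to_disk("./sample_dataset_ok", max_shard_size="1GB")
# dataset.save_to_disk("./sample_dataset_ok", num_shards=1)
```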

Expected behavior

The save should succeed without an IndexError.

Environment info

datasets==4.4.2
