
fix: force contiguous copy for sliced list arrays in embed_array_storage#8

Open
The-Obstacle-Is-The-Way wants to merge 8 commits into main from fix/embed-storage-crash

Conversation


@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way commented Dec 3, 2025

Summary

Fixes a SIGKILL crash when processing sliced/sharded datasets with nested types such as Sequence(Nifti()).

Root Cause:

  • ds.shard() or ds.select() creates a sliced view that carries internal offset metadata
  • array.values on a sliced ListArray still references the full, unsliced child buffer through that internal offset
  • When processing nested types, PyArrow's C++ layer can crash while materializing sliced nested structs (illustrated by the snippet below)
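
The offset behavior can be reproduced with plain pyarrow (a standalone illustration, not code from this PR):

import pyarrow as pa

# A list array, then a sliced view like the one ds.shard()/ds.select() produces.
arr = pa.array([[1, 2], [3, 4], [5, 6]])
sliced = arr.slice(1)  # logical rows: [[3, 4], [5, 6]]

print(len(sliced), sliced.offset)    # 2 1  -> the view carries a non-zero offset
print(len(sliced.values))            # 6    -> .values ignores the offset and spans the full child buffer
print(sliced.flatten().to_pylist())  # [3, 4, 5, 6] -> only flatten() respects the slice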

Fix:

  • Force a contiguous copy via pa.concat_arrays([array]) when the array has a non-zero offset
  • This breaks the internal offset references before further processing (a minimal sketch follows below)
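
Sketched as a standalone helper, the guard described above looks roughly like this (illustrative only; the names and exact placement inside src/datasets/table.py differ):

import pyarrow as pa

def _ensure_contiguous(array: pa.Array) -> pa.Array:
    # Only sliced (offset > 0) list/large_list arrays need the copy.
    if (pa.types.is_list(array.type) or pa.types.is_large_list(array.type)) and array.offset > 0:
        # Concatenating a single array materializes a fresh, zero-offset copy,
        # dropping the sliced view's internal offset references.
        return pa.concat_arrays([array])
    return array

For example, _ensure_contiguous(pa.array([[1, 2], [3, 4]]).slice(1)).offset evaluates to 0.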

Test plan

  • Added tests/features/test_embed_storage_sliced.py with 3 test cases (a sketch of the first appears below):
    • Sliced ListArray with Image
    • Sliced ListArray with Nifti (the original crash case)
    • Sliced LargeListArray with Image
  • All tests verify the result is contiguous (offset = 0)
  • ruff check passes
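
For orientation, the first case might look roughly like the sketch below (hypothetical: the actual test file's imports, fixtures, and setup may differ; it assumes List, Image, and embed_array_storage are importable as shown):

import pyarrow as pa
from PIL import Image as PILImage

from datasets import Image
from datasets.features import List
from datasets.table import embed_array_storage


def test_embed_array_storage_sliced_list_image(tmp_path):
    # Write a tiny PNG so embedding has real bytes to read.
    img_path = str(tmp_path / "img.png")
    PILImage.new("RGB", (2, 2)).save(img_path)

    # Build a list<struct<bytes, path>> array (the storage layout of List(Image()))
    # and slice it to mimic ds.shard()/ds.select().
    storage_type = pa.list_(pa.struct({"bytes": pa.binary(), "path": pa.string()}))
    arr = pa.array([[{"bytes": None, "path": img_path}]] * 3, type=storage_type)
    sliced = arr.slice(1)
    assert sliced.offset == 1  # precondition: this really is a sliced view

    embedded = embed_array_storage(sliced, List(Image()))

    assert embedded.offset == 0  # the fix forces a contiguous result
    assert embedded[0].as_py()[0]["bytes"] is not None  # image bytes were embedded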

Fixes #6

@coderabbitai review

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced stability for sliced arrays with nested list types by ensuring contiguous underlying storage during embedding operations.
  • Tests

    • Added regression tests validating correct processing of sliced arrays with nested data types, including image and scientific data formats.



coderabbitai bot commented Dec 3, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 30 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 00792fc and b46579b.

📒 Files selected for processing (12)
  • setup.py (1 hunks)
  • src/datasets/__init__.py (1 hunks)
  • src/datasets/builder.py (2 hunks)
  • src/datasets/config.py (1 hunks)
  • src/datasets/features/features.py (3 hunks)
  • src/datasets/features/nifti.py (3 hunks)
  • src/datasets/fingerprint.py (1 hunks)
  • src/datasets/table.py (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/features/test_embed_storage_sliced.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/test_fingerprint.py (1 hunks)

Walkthrough

The changes fix a crash in embed_table_storage when processing sharded datasets with nested Sequence types. The solution concatenates sliced list/large_list arrays that have non-zero offsets to ensure contiguous storage before embedding, preventing C++-level crashes. A new test module validates the fix across nested array scenarios.

Changes

  • Core fix (src/datasets/table.py): Modified embed_array_storage to detect and concatenate sliced list/large_list arrays (when offset > 0) before delegating to feature-specific embedding, ensuring contiguous underlying storage.
  • Regression tests (tests/features/test_embed_storage_sliced.py): New test module validating that embed_array_storage handles sliced/sharded Arrow arrays with nested types (Sequence(Image()), Sequence(Nifti()), large_list variants). Tests verify non-crashing behavior, contiguity after embedding, and content correctness.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Verify concatenation efficacy: Confirm the offset-detection and concatenation logic fully prevents the SIGKILL crash in C++.
  • Performance impact: Ensure the concatenation doesn't introduce unnecessary overhead for non-sliced or non-offset arrays.
  • Offset logic correctness: Review that offset > 0 detection properly identifies all problematic sliced views.

Poem

🐰 A slice caused crashes, deep and dire,
Yet offset checks light hope's bright fire,
Contiguous arrays now stand tall,
Nested sequences conquer all! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 18.52%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: forcing contiguous copies for sliced list arrays in embed_array_storage to fix the SIGKILL crash.
  • Linked Issues check: ✅ Passed. The PR fully addresses issue #6 by implementing the exact fix to detect non-zero offsets and force contiguous copies using pa.concat_arrays, plus comprehensive regression tests covering both ListArray and LargeListArray cases.
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to fixing the embed_array_storage crash: one targeted code fix in src/datasets/table.py and regression test coverage in tests/features/test_embed_storage_sliced.py; no extraneous modifications.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/features/test_embed_storage_sliced.py (1)

88-113: Consider adding consistency assertions for LargeList test.

For consistency with test_embed_array_storage_sliced_list_image, consider adding:

  1. Assertion that embedded.offset == 0 to verify contiguity
  2. Content verification (e.g., checking embedded bytes are not None)
         # This should NOT crash with SIGKILL
         embedded = embed_array_storage(sliced, LargeList(Image()))

+        # The fix should make the result contiguous (offset = 0)
+        assert embedded.offset == 0, "Result should be contiguous after fix"
         assert len(embedded) == 2
+        # Verify bytes were embedded for item with images
+        assert embedded[0].as_py()[0]["bytes"] is not None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 004a5bf and 37728d0.

📒 Files selected for processing (2)
  • src/datasets/table.py (1 hunks)
  • tests/features/test_embed_storage_sliced.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/features/test_embed_storage_sliced.py (2)
src/datasets/features/features.py (2)
  • List (1208-1232)
  • LargeList (1236-1253)
src/datasets/table.py (5)
  • embed_array_storage (2096-2161)
  • slice (441-455)
  • slice (793-808)
  • slice (1048-1065)
  • slice (1482-1511)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
src/datasets/table.py (1)

2129-2131: Fix logic is sound and addresses the root cause.

The approach of detecting sliced arrays via offset > 0 and forcing a contiguous copy with pa.concat_arrays([array]) correctly breaks the internal offset references that cause SIGKILL crashes in PyArrow's C++ layer.

A few notes:

  • Fixed-size lists are correctly excluded since they already handle offsets explicitly at lines 2154-2155
  • The @_wrap_for_chunked_arrays decorator ensures this works for ChunkedArray as well
tests/features/test_embed_storage_sliced.py (1)

18-52: Good regression test with thorough assertions.

The test correctly:

  • Creates a realistic sliced array scenario
  • Verifies the precondition (sliced.offset == 1)
  • Validates the fix produces contiguous output (embedded.offset == 0)
  • Checks content preservation

# When ds.shard() or ds.select() creates a sliced view, array.values returns
# values with internal offset references that can cause PyArrow's C++ layer
# to crash when processing nested types like Sequence(Nifti()).
# See: https://github.com/huggingface/datasets/issues/XXXX

⚠️ Potential issue | 🟡 Minor

Update placeholder issue number in comment.

The comment references issues/XXXX but should reference the actual issue number #6 as mentioned in the PR objectives.

-    # See: https://github.com/huggingface/datasets/issues/XXXX
+    # See: https://github.com/huggingface/datasets/issues/6
🤖 Prompt for AI Agents
In src/datasets/table.py around line 2128, update the placeholder issue
reference in the comment from "issues/XXXX" to the actual issue number "#6" by
replacing the URL/comment fragment so it points to the real issue (e.g.,
https://github.com/huggingface/datasets/issues/6 or include "(#6)") to
accurately document the reference.

Comment on lines 1 to 5
"""Tests for embed_array_storage with sliced/sharded arrays.

Regression tests for https://github.com/huggingface/datasets/issues/XXXX
(SIGKILL in embed_array_storage when processing sliced/sharded Arrow tables)
"""

⚠️ Potential issue | 🟡 Minor

Update placeholder issue number in docstring.

Same as in table.py, the placeholder XXXX should be replaced with the actual issue number.

-Regression tests for https://github.com/huggingface/datasets/issues/XXXX
+Regression tests for https://github.com/huggingface/datasets/issues/6
🤖 Prompt for AI Agents
In tests/features/test_embed_storage_sliced.py lines 1-5, the docstring contains
a placeholder issue number "XXXX"; replace it with the actual GitHub issue
number used in the repo (the same one referenced in table.py). If you don't know
it, search the codebase/history for the corresponding issue reference and update
the docstring to include the real issue number (e.g., "#12345") so the
regression test comment is accurate.

@The-Obstacle-Is-The-Way
Owner Author

@coderabbitai review

Addressed all feedback:

  1. Minor: Removed XXXX placeholder from table.py (upstream-ready)
  2. Minor: Removed XXXX placeholder from test file (upstream-ready)
  3. Nitpick: Added consistency assertions to LargeList test:
    • embedded.offset == 0 check
    • Content verification (bytes is not None)
  4. Also added offset == 0 check to Nifti test for consistency


coderabbitai bot commented Dec 3, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the modified table.py file to verify XXXX placeholder removal and check the implementation
cat src/datasets/table.py

Length of output: 94456


🏁 Script executed:

#!/bin/bash
# Read the test file to verify XXXX placeholder removal and check the assertions
cat tests/features/test_embed_storage_sliced.py

Length of output: 4685


@The-Obstacle-Is-The-Way Perfect! I've verified all your changes and can confirm:

All feedback addressed:

  1. Minor: No XXXX placeholders in table.py - code is upstream-ready
  2. Minor: No XXXX placeholders in test file - code is upstream-ready
  3. Nitpick: LargeList test now includes:
    • assert embedded.offset == 0, "Result should be contiguous after fix"
    • assert embedded[0].as_py()[0]["bytes"] is not None
  4. ✅ Nifti test also has the offset == 0 check for consistency

Code Quality:

  • The fix in embed_array_storage is elegant and well-documented with clear comments explaining the root cause
  • Tests comprehensively cover the regression scenarios (ListArray+Image, ListArray+Nifti, LargeListArray+Image)
  • All assertions verify both the problematic state (non-zero offset) before the fix and the correct state (offset=0, valid content) after

The PR is ready for merge! 🎉

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/datasets/naming.py (1)

74-84: Update tests and document the breaking change in single-shard filename format.

The logic correctly avoids generating shard-specific filenames for single shards (returning prefix.suffix instead of prefix-00000-of-00001.suffix). However, this introduces a breaking change:

  • Tests in test_arrow_dataset.py (lines 4731, 4740, 4764, 4775) explicitly expect the old format with shard numbering: assert (dataset_path / "data-00000-of-00001.arrow").exists()
  • Tests in test_upstream_hub.py (lines 69, 86, 117, etc.) also expect train-00000-of-00001.parquet for single shards
  • Existing datasets saved with the old naming convention would fail to load, as filenames_for_dataset_split would generate different filenames

Update these test assertions to expect the new naming format, or ensure backward compatibility by supporting both naming patterns during load operations.

src/datasets/info.py (1)

121-122: Docstring inconsistency with type annotation.

The docstring still documents splits as dict, but the type annotation has changed to SplitDict. Consider updating the docstring for consistency:

-        splits (`dict`, *optional*):
+        splits ([`SplitDict`], *optional*):
src/datasets/packaged_modules/text/text.py (1)

109-109: Critical: Missing Key wrapper in paragraph sampling tail-batch.

Line 109 yields a plain tuple (file_idx, batch_idx) instead of Key(file_idx, batch_idx), creating type inconsistency within the same function. Lines 104 and 86 correctly use the Key wrapper, but this tail-batch case was missed.

Apply this diff:

                     if batch:
                         pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-                        yield (file_idx, batch_idx), self._cast_table(pa_table)
+                        yield Key(file_idx, batch_idx), self._cast_table(pa_table)
src/datasets/iterable_dataset.py (1)

294-425: Bug in ShuffledDataSourcesArrowExamplesIterable.shift_rngs: wrong attribute used

ShuffledDataSourcesArrowExamplesIterable only stores generate_tables_fn, but shift_rngs currently reconstructs the instance with self.generate_examples_fn, which doesn’t exist. This will raise AttributeError when shift_ex_examples_rngs is applied to a shuffled arrow-backed iterable (e.g. in PyTorch worker processes).

Apply this diff to fix the bug:

 class ShuffledDataSourcesArrowExamplesIterable(ArrowExamplesIterable):
@@
     def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
         new_seed = self.generator.bit_generator.state["state"]["state"] + value
         return ShuffledDataSourcesArrowExamplesIterable(
-            self.generate_examples_fn,
+            self.generate_tables_fn,
             self.kwargs,
             np.random.default_rng(seed=new_seed),
         )
src/datasets/builder.py (1)

1339-1367: Update docstring to reflect the new Key-based signature.

The method signature now requires yielding tuple[Key, dict[str, Any]], but the docstring (lines 1353-1365) still describes yielding str or int keys and discusses deterministic hashing/shuffling that may no longer apply.

Update the docstring to reflect the new Key structure:

         Yields:
-            key: `str` or `int`, a unique deterministic example identification key.
-                * Unique: An error will be raised if two examples are yield with the
-                    same key.
-                * Deterministic: When generating the dataset twice, the same example
-                    should have the same key.
-                Good keys can be the image id, or line number if examples are extracted
-                from a text file.
-                The key will be hashed and sorted to shuffle examples deterministically,
-                such as generating the dataset multiple times keep examples in the
-                same order.
+            key: `Key`, a unique identification key containing:
+                * original_shard_id: Zero-based index of the source shard/file.
+                * item_or_batch_id: Zero-based index within that shard.
             example: `dict<str feature_name, feature_value>`, a feature dictionary
                 ready to be encoded and written to disk. The example will be
                 encoded with `self.info.features.encode_example({...})`.
♻️ Duplicate comments (1)
src/datasets/builder.py (1)

1829-1831: Same shard ID ordering assumption as GeneratorBasedBuilder.

This logic has the same potential IndexError issue as lines 1557-1578 if original_shard_id values are non-contiguous or out of order.

Also applies to: 1855-1858, 1867-1867

🧹 Nitpick comments (6)
src/datasets/arrow_writer.py (1)

587-593: Align write() docstring and key placeholder with current behavior

write() no longer accepts or uses a key, but the docstring still talks about an “(Example,Key) pair”, and you now append (example, "") where the second element is effectively ignored.

To avoid confusion for future readers, consider either:

  • Updating current_examples to store just examples (and simplifying the tuple structure), or
  • Keeping the tuple but updating the docstring and changing the annotation to list[tuple[dict[str, Any], Optional[str]]] and using a clearer sentinel such as None instead of "".
src/datasets/packaged_modules/xml/xml.py (1)

61-68: Consider yielding Key(file_idx, 0) instead of a raw tuple

Here _generate_tables now returns (file_idx, 0) as the key, while other loaders (e.g., CSV and SQL) yield Key(...) instances. If the rest of the pipeline is moving to the Key dataclass, it would be more consistent (and less surprising later) to do:

from datasets.builder import Key

yield Key(file_idx, 0), self._cast_table(pa_table)

This keeps XML aligned with the new key model and avoids having one module still emitting bare tuples.

src/datasets/packaged_modules/sql/sql.py (1)

10-10: Using Key for SQL chunks is consistent with the new key model

Importing Key and yielding Key(0, chunk_idx) brings this loader in line with the composite-key pattern used elsewhere. Since SQL doesn’t have a natural file index, using 0 as a synthetic original_shard_id is reasonable; consider adding a short comment to explain that 0 denotes the single SQL source to avoid confusion for future maintainers.

Also applies to: 120-120

tests/test_arrow_writer.py (1)

136-148: Test name is misleading after key removal.

The test function is named test_write_with_keys, but keys are no longer passed to writer.write(). The code changes are correct and align with the updated ArrowWriter API, but the test name should be updated to reflect the current behavior.

Consider renaming to test_write or removing this test if it's now redundant with the existing test_write function at line 99.

src/datasets/builder.py (2)

1320-1327: Add docstring to document the Key dataclass.

The Key dataclass is a new public API component but lacks documentation explaining its purpose, when it should be used, and the meaning of its fields.

Consider adding a docstring:

 @dataclass
 class Key:
+    """Unique key for identifying examples/batches in dataset generation.
+    
+    Attributes:
+        original_shard_id: Zero-based index of the original data shard/file.
+        item_or_batch_id: Zero-based index of the item or batch within that shard.
+    """
     original_shard_id: int
     item_or_batch_id: int

1557-1558: Consider logging when falling back to legacy key format.

When a generator yields non-Key objects, the code silently defaults original_shard_id to 0, which means all examples are tracked as coming from a single shard. Consider adding a one-time warning to help developers identify when custom builders need updating.

+                _legacy_key_warned = False
                 for key, record in generator:
-                    if isinstance(key, Key):  # old custom builders may not use Key
+                    if isinstance(key, Key):
                         original_shard_id = key.original_shard_id
+                    elif not _legacy_key_warned:
+                        logger.warning(
+                            f"Generator yielding non-Key objects (got {type(key).__name__}). "
+                            "Consider updating to yield Key instances for better shard tracking."
+                        )
+                        _legacy_key_warned = True
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37728d0 and 12117f2.

📒 Files selected for processing (30)
  • src/datasets/arrow_dataset.py (2 hunks)
  • src/datasets/arrow_writer.py (2 hunks)
  • src/datasets/builder.py (28 hunks)
  • src/datasets/info.py (1 hunks)
  • src/datasets/inspect.py (0 hunks)
  • src/datasets/iterable_dataset.py (5 hunks)
  • src/datasets/keyhash.py (0 hunks)
  • src/datasets/naming.py (1 hunks)
  • src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
  • src/datasets/packaged_modules/cache/cache.py (2 hunks)
  • src/datasets/packaged_modules/csv/csv.py (2 hunks)
  • src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
  • src/datasets/packaged_modules/generator/generator.py (2 hunks)
  • src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
  • src/datasets/packaged_modules/json/json.py (3 hunks)
  • src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
  • src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
  • src/datasets/packaged_modules/spark/spark.py (1 hunks)
  • src/datasets/packaged_modules/sql/sql.py (2 hunks)
  • src/datasets/packaged_modules/text/text.py (4 hunks)
  • src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
  • src/datasets/packaged_modules/xml/xml.py (1 hunks)
  • src/datasets/splits.py (3 hunks)
  • src/datasets/table.py (1 hunks)
  • src/datasets/utils/info_utils.py (1 hunks)
  • tests/features/test_embed_storage_sliced.py (1 hunks)
  • tests/packaged_modules/test_spark.py (2 hunks)
  • tests/test_arrow_dataset.py (1 hunks)
  • tests/test_arrow_writer.py (1 hunks)
  • tests/test_builder.py (14 hunks)
💤 Files with no reviewable changes (2)
  • src/datasets/inspect.py
  • src/datasets/keyhash.py
✅ Files skipped from review due to trivial changes (1)
  • src/datasets/utils/info_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/features/test_embed_storage_sliced.py
🧰 Additional context used
🧬 Code graph analysis (19)
src/datasets/info.py (1)
src/datasets/splits.py (1)
  • SplitDict (520-600)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/csv/csv.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
  • write (582-598)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/generator/generator.py (3)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/utils/sharding.py (2)
  • _number_of_shards_in_gen_kwargs (4-18)
  • _split_gen_kwargs (48-64)
src/datasets/iterable_dataset.py (16)
  • num_shards (231-232)
  • num_shards (290-291)
  • num_shards (405-406)
  • num_shards (619-620)
  • num_shards (664-665)
  • num_shards (710-711)
  • num_shards (861-862)
  • num_shards (954-955)
  • num_shards (1042-1043)
  • num_shards (1561-1562)
  • num_shards (1663-1664)
  • num_shards (1762-1763)
  • num_shards (1837-1838)
  • num_shards (1886-1887)
  • num_shards (1967-1968)
  • num_shards (2114-2115)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (4)
  • Dataset (704-6467)
  • add_column (6078-6129)
  • features (206-207)
  • features (781-785)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/xml/xml.py (7)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/json/json.py (1)
  • _cast_table (88-111)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/generator/generator.py (1)
  • Generator (25-38)
tests/test_builder.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/load.py (1)
  • configure_builder_class (115-152)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/builder.py (4)
src/datasets/packaged_modules/generator/generator.py (1)
  • _split_generators (31-32)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _generate_tables (63-80)
src/datasets/packaged_modules/cache/cache.py (1)
  • _generate_tables (179-192)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _generate_tables (62-66)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/text/text.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (29)
src/datasets/table.py (1)

2123-2131: Crash fix is well-implemented and correctly placed.

The contiguous copy for sliced list arrays directly addresses the SIGKILL crash reported in issue #6. The implementation is sound:

  • Defensive hasattr check before accessing offset
  • Correctly targets only sliced views (offset > 0)
  • pa.concat_arrays([array]) is the idiomatic PyArrow approach for creating contiguous copies
  • Placement before embed_storage ensures all downstream operations work on contiguous data
  • Recursive calls via _e(array.values, ...) will handle nested structures

The fix appropriately applies only to list and large_list types, as FixedSizeListArray handles offsets differently (via explicit slicing at lines 2153-2155).

src/datasets/arrow_writer.py (1)

19-19: Typing import adjustment is fine

Using only Any and Optional here matches the current usages in this file; no issues from this change.

tests/test_arrow_dataset.py (1)

4788-4795: New test_add_column effectively validates basic add_column behavior

This test covers the core contract of Dataset.add_column (feature presence and row content) with a minimal example and matches the updated public API; looks good as a regression/sanity check.

src/datasets/packaged_modules/csv/csv.py (1)

10-10: CSV generator’s move to Key(file_idx, batch_idx) looks correct

Switching from a raw (file_idx, batch_idx) tuple to Key(file_idx, batch_idx) together with the new import cleanly aligns CSV with the shared Key abstraction, without changing table data or batching behavior.

Also applies to: 196-196

tests/packaged_modules/test_spark.py (2)

28-28: LGTM!

The helper function correctly constructs tuple-based keys (part_id, row_idx) to match the updated SparkExamplesIterable behavior.


78-80: LGTM!

Test assertions correctly expect tuple keys (0, i) instead of the previous string format, aligning with the new key structure from _generate_iterable_examples.

src/datasets/info.py (1)

152-152: LGTM on the type annotation.

The change from Optional[dict] to Optional[SplitDict] correctly reflects the actual runtime type. The __post_init__ method at lines 178-179 ensures backward compatibility by converting plain dicts.

src/datasets/packaged_modules/cache/cache.py (2)

14-14: LGTM!

Import of Key from datasets.builder is consistent with the codebase-wide migration to structured composite keys.


189-189: LGTM!

Yielding Key(file_idx, batch_idx) instead of a string key aligns with the unified key structure across all packaged modules.

src/datasets/packaged_modules/parquet/parquet.py (2)

10-10: LGTM!

Import of Key is consistent with the PR's migration pattern.


182-182: LGTM!

The yield statement correctly uses Key(file_idx, batch_idx) matching the unified key structure across all Arrow-based builders.

src/datasets/packaged_modules/arrow/arrow.py (2)

8-8: LGTM!

Import follows the consistent pattern established across all packaged modules.


77-77: LGTM!

The Key(file_idx, batch_idx) yield is consistent with the other Arrow-based builders (cache, parquet), ensuring uniform key handling across the dataset generation pipeline.

src/datasets/packaged_modules/generator/generator.py (1)

34-38: LGTM! Shard-aware generation properly implemented.

The implementation correctly:

  • Computes the number of shards from gen_kwargs
  • Splits generation across shards
  • Yields structured Key objects with shard and sample indices

This aligns with the broader Key-based architecture for multi-shard dataset generation.

src/datasets/packaged_modules/hdf5/hdf5.py (1)

88-94: LGTM! Proper Key-based batch indexing.

The implementation correctly:

  • Introduces batch_idx via enumerate
  • Uses it for computing batch end indices
  • Yields structured Key(file_idx, batch_idx) objects

This aligns with the Key-based architecture used across other dataset builders.

src/datasets/packaged_modules/pandas/pandas.py (1)

62-66: LGTM! Correct Key usage for single-batch DataFrames.

The implementation correctly yields Key(i, 0) where i is the file index and 0 represents the single batch per pickled DataFrame. This aligns with the Key-based architecture used across dataset builders.

src/datasets/packaged_modules/webdataset/webdataset.py (1)

110-129: LGTM! Proper Key-based example indexing for WebDataset.

The implementation correctly yields Key(tar_idx, example_idx) to identify examples across TAR archives, replacing the previous string-based key format. This aligns with the standardized Key-based architecture.

src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)

357-411: LGTM! Comprehensive shard-aware Key implementation.

The implementation correctly introduces shard-aware indexing for both metadata-driven and file-driven generation paths:

  • Enumerates metadata files and data files to obtain shard_idx
  • Enumerates samples within each shard to obtain sample_idx
  • Consistently yields Key(shard_idx, sample_idx) in both paths

This properly aligns with the Key-based multi-shard architecture used across dataset builders.

src/datasets/packaged_modules/json/json.py (1)

113-193: Key-based yields in _generate_tables look correct

Wrapping (file_idx, 0 | batch_idx) in Key(...) at all yield sites is consistent with the new composite key model and keeps shard/batch identity intact for both the single-object and line-delimited JSON paths.

tests/test_builder.py (2)

146-212: Shard-aware dummy builders correctly emit Key instances

DummyArrowBasedBuilderWithShards and DummyGeneratorBasedBuilderWithShards now yield Key(shard_idx, i) for each table/example. This gives unambiguous (original_shard_id, item_id) pairs and is exactly what split/shard length aggregation logic expects.


1031-1180: original_shard_lengths assertions align with new split metadata semantics

The new checks on split_info.original_shard_lengths (either None for non–multi-proc paths or [100] * 4 when combining 4 original shards into 2 logical shards) are coherent with the added field on SplitInfo and will pin down regressions in shard bookkeeping.

src/datasets/splits.py (2)

31-57: SplitInfo.original_shard_lengths addition is safe and well-scoped

Adding original_shard_lengths: Optional[list[int]] = None cleanly extends split metadata without changing existing behavior; the field is optional, defaulted, and only consumed where explicitly checked (e.g. in tests). file_instructions continues to rely on dataset_name provided via SplitDict.add, so existing consumers remain intact.


520-596: SplitDict typing and YAML serialization updates are consistent

Typing SplitDict as dict[str, SplitInfo] plus the dataset_name attribute keeps lookups and make_file_instructions well-typed, while _to_yaml_list now stripping both shard_lengths and original_shard_lengths (and the deprecated dataset_name) avoids leaking internal shard metadata into YAML. This matches how SplitInfo is meant to be persisted vs. used at runtime.

src/datasets/iterable_dataset.py (1)

77-89: Key alias and type hints correctly capture new key shapes

Defining

from .builder import Key as BuilderKey  # TYPE_CHECKING
Key = Union[int, str, tuple[int, int], "BuilderKey"]

and updating the examples/arrow iterable constructors to accept Iterator[tuple[Key, ...]] matches actual usage: legacy int/str and (shard_idx, i) tuples still work, while the new builder.Key dataclass is also supported without imposing a runtime import.

src/datasets/builder.py (5)

27-27: Import changes align with API refactoring.

The addition of Iterator, pyarrow as pa, and removal of unused exception imports are consistent with the signature changes throughout this file.

Also applies to: 31-31, 35-35, 52-52


1427-1435: LGTM on extended return value handling.

The addition of original_shards_per_job and original_shard_lengths_per_job to the return values is properly threaded through the unpacking and aggregation logic.

Also applies to: 1599-1611


1517-1522: LGTM on original_shard_lengths aggregation.

The logic correctly flattens and stores original shard lengths from all jobs when multiple shards exist, consistent with the existing pattern for shard_lengths.

Also applies to: 1795-1800


1388-1398: Variable renaming improves clarity.

Renaming to num_original_shards clarifies that these counts refer to input data shards rather than output file shards, making the multiprocessing adjustment logic easier to understand.

Also applies to: 1667-1677


1557-1558: Cannot locate the referenced code in the current repository.

The code snippet referenced at lines 1557-1558 in src/datasets/builder.py, specifically the pattern isinstance(key, Key), original_shard_id, original_shard_lengths, and _prepare_split_single method, does not exist in the current codebase. A comprehensive search found no occurrences of these identifiers or patterns.

This review comment may be referencing:

  • Unreleased or unmerged code from a pull request
  • Incorrect line numbers or file location
  • Code from a different branch or version

Please confirm the correct file location, branch, or provide the actual code being reviewed.

Comment on lines 6079 to 6083
self,
name: str,
column: Union[list, np.ndarray],
new_fingerprint: Optional[str] = None,
feature: Optional[FeatureType] = None,

⚠️ Potential issue | 🟡 Minor

Docstring is out of sync with add_column signature (and can mislead positional callers).

The new signature add_column(self, name, column, new_fingerprint=None, feature=None) looks fine and matches how fingerprint_transform is used, but the docstring still:

  • Mentions only name, column, feature in Args: and
  • Implies feature is the third parameter while the function actually expects new_fingerprint in that position.

Someone following the docstring and calling add_column(name, col, feature) positionally will accidentally pass feature as new_fingerprint. Please update the docstring to include new_fingerprint and to match the parameter order.

For example:

-        Args:
-            name (`str`):
-                Column name.
-            column (`list` or `np.array`):
-                Column data to be added.
-            feature (`FeatureType` or `None`, defaults to `None`):
-                Column datatype.
+        Args:
+            name (`str`):
+                Column name.
+            column (`list` or `np.array`):
+                Column data to be added.
+            new_fingerprint (`str`, *optional*):
+                The new fingerprint of the dataset after transform. If `None`, it is computed automatically.
+            feature (`FeatureType` or `None`, defaults to `None`):
+                Column datatype.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around lines 6079 to 6083, the add_column
docstring is out of sync with the function signature: it omits new_fingerprint
and lists feature as the third parameter which causes positional callers to pass
feature into new_fingerprint; update the docstring to list parameters in the
exact order (name, column, new_fingerprint=None, feature=None), document
new_fingerprint (Optional[str]) and feature (Optional[FeatureType]) with brief
descriptions and default values, and ensure any examples or Args sections
reflect the new parameter order so positional calls behave correctly.

 @transmit_format
 @fingerprint_transform(inplace=False)
-def add_item(self, item: dict, new_fingerprint: str):
+def add_item(self, item: dict, new_fingerprint: Optional[str] = None):

⚠️ Potential issue | 🟡 Minor

add_item now accepts new_fingerprint, but the docstring doesn’t mention it.

The updated signature add_item(self, item: dict, new_fingerprint: Optional[str] = None) is consistent with other fingerprinted transforms, but the docstring only documents item and omits new_fingerprint. To keep the public API clear, add an entry for new_fingerprint:

-        Args:
-            item (`dict`):
-                Item data to be added.
+        Args:
+            item (`dict`):
+                Item data to be added.
+            new_fingerprint (`str`, *optional*):
+                The new fingerprint of the dataset after transform. If `None`, it is computed automatically.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around line 6340, the method signature
add_item(self, item: dict, new_fingerprint: Optional[str] = None) was updated
but the docstring still documents only item; update the docstring to add a new
parameter entry for new_fingerprint describing that it is an optional str
(default None) used to set/override the item's fingerprint for fingerprinted
transforms (or leave None to compute/retain the existing fingerprint), and
ensure the parameter list formatting and any examples in the docstring reflect
this new argument.

-                The key will be hashed and sorted to shuffle examples deterministically,
-                such as generating the dataset multiple times keep examples in the
-                same order.
+            key: tuple[int, int] original_shard_id and table_idx within that shard

⚠️ Potential issue | 🟡 Minor

Docstring should reference Key type.

The docstring describes the key as tuple[int, int], but the signature (line 1628) correctly specifies Key. Update for consistency.

         Yields:
-            key: tuple[int, int] original_shard_id and table_idx within that shard
+            key: Key containing original_shard_id and item_or_batch_id (table index) within that shard
             example: `pyarrow.Table`, a feature table
                 ready to be encoded and written to disk.
🤖 Prompt for AI Agents
In src/datasets/builder.py around line 1642, the docstring line currently
describes the key as "tuple[int, int]" but the function signature uses the alias
Key; update the docstring to reference Key (or the fully qualified alias if
preferred) to keep types consistent, and if Key is defined in a different module
ensure the docstring usage matches (e.g., "Key" or "datasets.builder.Key") so
the docstring and signature align.

         if state_dict:
             state_dict["partition_example_idx"] = row_id + 1
-        yield f"{part_id}_{row_id}", row_as_dict
+        yield (part_id, row_id), row_as_dict

⚠️ Potential issue | 🟠 Major

Inconsistent key type: Use Key object instead of tuple.

This module yields a plain tuple (part_id, row_id), while all other dataset builders in this PR (generator, text, hdf5, pandas, webdataset, folder_based_builder) consistently yield Key(shard_idx, item_idx) objects. This type inconsistency may break downstream code expecting uniform key types.

Apply this diff to align with the Key-based approach:

+from datasets.builder import Key

 def _generate_iterable_examples(
     df: "pyspark.sql.DataFrame",
     partition_order: list[int],
     state_dict: Optional[dict] = None,
 ):
     import pyspark
 
     df_with_partition_id = df.select("*", pyspark.sql.functions.spark_partition_id().alias("part_id"))
     partition_idx_start = state_dict["partition_idx"] if state_dict else 0
     partition_df = _reorder_dataframe_by_partition(df_with_partition_id, partition_order[partition_idx_start:])
     # pipeline next partition in parallel to hide latency
     rows = partition_df.toLocalIterator(prefetchPartitions=True)
     curr_partition = None
     row_id = state_dict["partition_example_idx"] if state_dict else 0
     for row in islice(rows, row_id, None):
         row_as_dict = row.asDict()
         part_id = row_as_dict["part_id"]
         row_as_dict.pop("part_id")
         if curr_partition != part_id:
             if state_dict and curr_partition is not None:
                 state_dict["partition_idx"] += 1
             curr_partition = part_id
             row_id = 0
         if state_dict:
             state_dict["partition_example_idx"] = row_id + 1
-        yield (part_id, row_id), row_as_dict
+        yield Key(part_id, row_id), row_as_dict
         row_id += 1
🤖 Prompt for AI Agents
In src/datasets/packaged_modules/spark/spark.py around line 74, the generator
yields a plain tuple (part_id, row_id) which is inconsistent; replace the tuple
with a Key object (Key(part_id, row_id)) so downstream code receives the same
key type as other builders, and add/import Key at the top of the file if it
isn't already imported; ensure any tests or usages nearby expect Key and update
local variable names/annotations if necessary.

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way force-pushed the fix/embed-storage-crash branch 2 times, most recently from 28f83ac to 00792fc, on December 9, 2025 at 15:37
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/datasets/table.py (1)

2121-2131: Contiguity guard for sliced list arrays fits the bug description and is well-scoped

The new offset-based pa.concat_arrays([array]) step is correctly limited to list/large_list arrays and only triggers for sliced views (offset > 0), preserving behavior for other types and unsliced data. Extension arrays are still unwrapped first, and the subsequent list-handling logic remains unchanged, so this should address the SIGKILL without broad side effects. The extra hasattr(array, "offset") check is slightly redundant here but harmless; you can keep it for defensiveness or drop it later as a minor cleanup.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12117f2 and 00792fc.

📒 Files selected for processing (2)
  • src/datasets/table.py (1 hunks)
  • tests/features/test_embed_storage_sliced.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/features/test_embed_storage_sliced.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation

ada-ggf25 and others added 8 commits December 16, 2025 15:20
…ace#7877) (huggingface#7891)

* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir

Enhanced the _TempCacheDir.__init__ method to properly respect and handle
the TMPDIR environment variable when creating temporary cache directories.

Changes:
- Add TMPDIR environment variable detection and validation
- Normalise paths to handle path resolution issues
- Auto-create TMPDIR directory if it doesn't exist to prevent silent
  fallback to default temporary directory
- Validate that TMPDIR is actually a directory before use
- Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even
  if tempfile.gettempdir() was already called and cached
- Add appropriate logging for directory creation and fallback scenarios

This ensures that when TMPDIR is set, the temporary cache files are
created in the specified directory rather than silently falling back to
the system default temporary directory.

* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir

Add test coverage for the improved TMPDIR environment variable handling
in the _TempCacheDir class. These tests verify the various scenarios
for TMPDIR usage and error handling.

Changes:
- Refactor test_fingerprint_in_multiprocessing to use Pool.map for
  cleaner test implementation
- Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR
  auto-creation when directory doesn't exist
- Add test_temp_cache_dir_with_tmpdir_existing to verify correct
  behaviour when TMPDIR exists and is valid
- Add test_temp_cache_dir_without_tmpdir to verify fallback to default
  temporary directory when TMPDIR is not set
- Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful
  error handling and fallback when TMPDIR creation fails

These tests ensure that the TMPDIR improvements work correctly across
all scenarios and edge cases, including proper logging and fallback
behaviour.

* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir

Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit
error conditions instead of fallback behaviour.

Changes:
- Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir
  directly and assert that an OSError is raised with a clear TMPDIR
  context when directory creation fails
- Introduce test_temp_cache_dir_tmpdir_not_directory to verify that
  pointing TMPDIR at a non-directory raises an OSError with an
  informative error message

These tests better match the intended contract of _TempCacheDir by
ensuring invalid TMPDIR configurations fail loudly with descriptive
messages rather than silently falling back.

* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly

Tighten TMPDIR handling in _TempCacheDir so that invalid configurations
raise clear errors instead of silently falling back to the default
temporary directory.

Changes:
- When TMPDIR points to a non-existent directory, raise an OSError with
  explicit guidance to create it manually or unset TMPDIR
- When TMPDIR points to a non-directory path, raise an OSError with
  guidance to point TMPDIR to a writable directory or unset it
- Remove previous warning-and-fallback behaviour to avoid masking
  configuration issues

This ensures that TMPDIR misconfigurations are surfaced early and
clearly, aligning runtime behaviour with the stricter expectations
codified in the new tests.

* Update src/datasets/fingerprint.py

* Update fingerprint.py

* Fix formatting of TMPDIR retrieval line

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily

* update docstring for nifti functionality
)

* fix(nifti): enable lazy loading for Nifti1ImageWrapper

- Change dataobj initialization from get_fdata() to .dataobj
- Preserves nibabel's ArrayProxy for lazy loading
- Prevents OOM errors on large 4D fMRI files
- Defers I/O errors to usage time for better error handling

Resolves #1

* chore: trigger CI
…gingface#7906)

don't save original_shard_lengths by default for backward compat
When ds.shard() or ds.select() creates a sliced view, array.values returns
values with internal offset references that can cause PyArrow's C++ layer
to crash with SIGKILL when processing nested types like Sequence(Nifti()).

The fix forces a contiguous copy via pa.concat_arrays([array]) when the
array has a non-zero offset, breaking the internal references before
further processing.

Fixes: #6
1. Remove fork-specific issue URL placeholders (upstream-ready)
2. Add consistency assertions to LargeList test:
   - offset == 0 check
   - content verification (bytes embedded)
3. Add offset == 0 check to Nifti test for consistency

Development

Successfully merging this pull request may close these issues.

Bug: embed_table_storage crashes (SIGKILL) on sharded datasets with Sequence(Nifti())
