fix: force contiguous copy for sliced list arrays in embed_array_storage #8
The-Obstacle-Is-The-Way wants to merge 8 commits into main
Conversation
Walkthrough

The changes fix a crash in `embed_array_storage` when processing sliced list arrays.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/features/test_embed_storage_sliced.py (1)
88-113: Consider adding consistency assertions for LargeList test.

For consistency with `test_embed_array_storage_sliced_list_image`, consider adding:

- Assertion that `embedded.offset == 0` to verify contiguity
- Content verification (e.g., checking embedded bytes are not None)

      # This should NOT crash with SIGKILL
      embedded = embed_array_storage(sliced, LargeList(Image()))
    + # The fix should make the result contiguous (offset = 0)
    + assert embedded.offset == 0, "Result should be contiguous after fix"
      assert len(embedded) == 2
    + # Verify bytes were embedded for item with images
    + assert embedded[0].as_py()[0]["bytes"] is not None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/datasets/table.py (1 hunks)
- tests/features/test_embed_storage_sliced.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/features/test_embed_storage_sliced.py (2)
src/datasets/features/features.py (2)
- List (1208-1232)
- LargeList (1236-1253)

src/datasets/table.py (5)
- embed_array_storage (2096-2161)
- slice (441-455)
- slice (793-808)
- slice (1048-1065)
- slice (1482-1511)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
src/datasets/table.py (1)
2129-2131: Fix logic is sound and addresses the root cause.

The approach of detecting sliced arrays via `offset > 0` and forcing a contiguous copy with `pa.concat_arrays([array])` correctly breaks the internal offset references that cause SIGKILL crashes in PyArrow's C++ layer.

A few notes:

- Fixed-size lists are correctly excluded since they already handle offsets explicitly at lines 2154-2155
- The `@_wrap_for_chunked_arrays` decorator ensures this works for `ChunkedArray` as well

tests/features/test_embed_storage_sliced.py (1)
18-52: Good regression test with thorough assertions.

The test correctly:

- Creates a realistic sliced array scenario
- Verifies the precondition (`sliced.offset == 1`)
- Validates the fix produces contiguous output (`embedded.offset == 0`)
- Checks content preservation
src/datasets/table.py
Outdated
    # When ds.shard() or ds.select() creates a sliced view, array.values returns
    # values with internal offset references that can cause PyArrow's C++ layer
    # to crash when processing nested types like Sequence(Nifti()).
    # See: https://github.com/huggingface/datasets/issues/XXXX
Update placeholder issue number in comment.
The comment references issues/XXXX but should reference the actual issue number #6 as mentioned in the PR objectives.
    - # See: https://github.com/huggingface/datasets/issues/XXXX
    + # See: https://github.com/huggingface/datasets/issues/6
🤖 Prompt for AI Agents
In src/datasets/table.py around line 2128, update the placeholder issue
reference in the comment from "issues/XXXX" to the actual issue number "#6" by
replacing the URL/comment fragment so it points to the real issue (e.g.,
https://github.com/huggingface/datasets/issues/6 or include "(#6)") to
accurately document the reference.
    """Tests for embed_array_storage with sliced/sharded arrays.

    Regression tests for https://github.com/huggingface/datasets/issues/XXXX
    (SIGKILL in embed_array_storage when processing sliced/sharded Arrow tables)
    """
Update placeholder issue number in docstring.
Same as in table.py, the placeholder XXXX should be replaced with the actual issue number.
    -Regression tests for https://github.com/huggingface/datasets/issues/XXXX
    +Regression tests for https://github.com/huggingface/datasets/issues/6
🤖 Prompt for AI Agents
In tests/features/test_embed_storage_sliced.py lines 1-5, the docstring contains
a placeholder issue number "XXXX"; replace it with the actual GitHub issue
number used in the repo (the same one referenced in table.py). If you don't know
it, search the codebase/history for the corresponding issue reference and update
the docstring to include the real issue number (e.g., "#12345") so the
regression test comment is accurate.
@coderabbitai review

Addressed all feedback:
@The-Obstacle-Is-The-Way Perfect! I've verified all your changes and can confirm:

✅ All feedback addressed:
Code Quality:
The PR is ready for merge! 🎉

✅ Actions performed

Review triggered.
Force-pushed from b7839ff to 12117f2
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/datasets/naming.py (1)
74-84: Update tests and document the breaking change in single-shard filename format.

The logic correctly avoids generating shard-specific filenames for single shards (returning `prefix.suffix` instead of `prefix-00000-of-00001.suffix`). However, this introduces a breaking change:

- Tests in `test_arrow_dataset.py` (lines 4731, 4740, 4764, 4775) explicitly expect the old format with shard numbering: `assert (dataset_path / "data-00000-of-00001.arrow").exists()`
- Tests in `test_upstream_hub.py` (lines 69, 86, 117, etc.) also expect `train-00000-of-00001.parquet` for single shards
- Existing datasets saved with the old naming convention would fail to load, as `filenames_for_dataset_split` would generate different filenames

Update these test assertions to expect the new naming format, or ensure backward compatibility by supporting both naming patterns during load operations.
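For illustration, here is a minimal, hypothetical sketch of the before/after naming pattern described above; it is not the actual `filenames_for_dataset_split` implementation, just the shape of the behavior change.

```python
# Hypothetical helper illustrating the single-shard naming change (assumption,
# not the real datasets.naming API).
def shard_filename(prefix: str, suffix: str, shard_id: int, num_shards: int) -> str:
    if num_shards == 1:
        # New behavior: single shards drop the "-00000-of-00001" infix.
        return f"{prefix}.{suffix}"
    return f"{prefix}-{shard_id:05d}-of-{num_shards:05d}.{suffix}"


print(shard_filename("data", "arrow", 0, 1))     # data.arrow (new single-shard format)
print(shard_filename("train", "parquet", 2, 4))  # train-00002-of-00004.parquet
```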
src/datasets/info.py (1)
121-122: Docstring inconsistency with type annotation.

The docstring still documents `splits` as `dict`, but the type annotation has changed to `SplitDict`. Consider updating the docstring for consistency:

    - splits (`dict`, *optional*):
    + splits ([`SplitDict`], *optional*):

src/datasets/packaged_modules/text/text.py (1)
109-109: Critical: Missing Key wrapper in paragraph sampling tail-batch.

Line 109 yields a plain tuple `(file_idx, batch_idx)` instead of `Key(file_idx, batch_idx)`, creating type inconsistency within the same function. Lines 104 and 86 correctly use the Key wrapper, but this tail-batch case was missed.

Apply this diff:
      if batch:
          pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
    -     yield (file_idx, batch_idx), self._cast_table(pa_table)
    +     yield Key(file_idx, batch_idx), self._cast_table(pa_table)

src/datasets/iterable_dataset.py (1)
294-425: Bug in `ShuffledDataSourcesArrowExamplesIterable.shift_rngs`: wrong attribute used.

`ShuffledDataSourcesArrowExamplesIterable` only stores `generate_tables_fn`, but `shift_rngs` currently reconstructs the instance with `self.generate_examples_fn`, which doesn't exist. This will raise `AttributeError` when `shift_ex_examples_rngs` is applied to a shuffled arrow-backed iterable (e.g. in PyTorch worker processes).

Apply this diff to fix the bug:

      class ShuffledDataSourcesArrowExamplesIterable(ArrowExamplesIterable):
          @@
          def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
              new_seed = self.generator.bit_generator.state["state"]["state"] + value
              return ShuffledDataSourcesArrowExamplesIterable(
    -             self.generate_examples_fn,
    +             self.generate_tables_fn,
                  self.kwargs,
                  np.random.default_rng(seed=new_seed),
              )

src/datasets/builder.py (1)
1339-1367: Update docstring to reflect the new Key-based signature.

The method signature now requires yielding `tuple[Key, dict[str, Any]]`, but the docstring (lines 1353-1365) still describes yielding `str` or `int` keys and discusses deterministic hashing/shuffling that may no longer apply.

Update the docstring to reflect the new Key structure:

        Yields:
    -       key: `str` or `int`, a unique deterministic example identification key.
    -           * Unique: An error will be raised if two examples are yield with the
    -             same key.
    -           * Deterministic: When generating the dataset twice, the same example
    -             should have the same key.
    -           Good keys can be the image id, or line number if examples are extracted
    -           from a text file.
    -           The key will be hashed and sorted to shuffle examples deterministically,
    -           such as generating the dataset multiple times keep examples in the
    -           same order.
    +       key: `Key`, a unique identification key containing:
    +           * original_shard_id: Zero-based index of the source shard/file.
    +           * item_or_batch_id: Zero-based index within that shard.
            example: `dict<str feature_name, feature_value>`, a feature dictionary
                ready to be encoded and written to disk.
                The example will be encoded with `self.info.features.encode_example({...})`.
♻️ Duplicate comments (1)
src/datasets/builder.py (1)
1829-1831: Same shard ID ordering assumption as GeneratorBasedBuilder.

This logic has the same potential `IndexError` issue as lines 1557-1578 if `original_shard_id` values are non-contiguous or out of order.

Also applies to: 1855-1858, 1867-1867
🧹 Nitpick comments (6)
src/datasets/arrow_writer.py (1)
587-593: Align `write()` docstring and key placeholder with current behavior.

`write()` no longer accepts or uses a key, but the docstring still talks about an "(Example, Key) pair", and you now append `(example, "")` where the second element is effectively ignored. To avoid confusion for future readers, consider either:

- Updating `current_examples` to store just examples (and simplifying the tuple structure), or
- Keeping the tuple but updating the docstring and changing the annotation to `list[tuple[dict[str, Any], Optional[str]]]`, using a clearer sentinel such as `None` instead of `""`.

src/datasets/packaged_modules/xml/xml.py (1)
61-68: Consider yielding `Key(file_idx, 0)` instead of a raw tuple.

Here `_generate_tables` now returns `(file_idx, 0)` as the key, while other loaders (e.g., CSV and SQL) yield `Key(...)` instances. If the rest of the pipeline is moving to the `Key` dataclass, it would be more consistent (and less surprising later) to do:

    from datasets.builder import Key

    yield Key(file_idx, 0), self._cast_table(pa_table)

This keeps XML aligned with the new key model and avoids having one module still emitting bare tuples.
src/datasets/packaged_modules/sql/sql.py (1)
10-10: Using `Key` for SQL chunks is consistent with the new key model.

Importing `Key` and yielding `Key(0, chunk_idx)` brings this loader in line with the composite-key pattern used elsewhere. Since SQL doesn't have a natural file index, using `0` as a synthetic `original_shard_id` is reasonable; consider adding a short comment to explain that `0` denotes the single SQL source to avoid confusion for future maintainers.

Also applies to: 120-120
tests/test_arrow_writer.py (1)
136-148: Test name is misleading after key removal.

The test function is named `test_write_with_keys`, but keys are no longer passed to `writer.write()`. The code changes are correct and align with the updated `ArrowWriter` API, but the test name should be updated to reflect the current behavior.

Consider renaming to `test_write`, or removing this test if it's now redundant with the existing `test_write` function at line 99.

src/datasets/builder.py (2)
1320-1327: Add docstring to document the Key dataclass.

The `Key` dataclass is a new public API component but lacks documentation explaining its purpose, when it should be used, and the meaning of its fields. Consider adding a docstring:

      @dataclass
      class Key:
    +     """Unique key for identifying examples/batches in dataset generation.
    +
    +     Attributes:
    +         original_shard_id: Zero-based index of the original data shard/file.
    +         item_or_batch_id: Zero-based index of the item or batch within that shard.
    +     """
          original_shard_id: int
          item_or_batch_id: int
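As a usage note, here is a minimal sketch of how a custom builder's example generator could yield such keys; the file-per-shard layout is hypothetical, and `Key` is redefined locally so the snippet stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Key:
    original_shard_id: int  # index of the source file/shard
    item_or_batch_id: int   # index of the example or batch within that shard


def generate_examples(files):
    # One original shard per input file, one item id per line within it.
    for shard_idx, path in enumerate(files):
        with open(path, encoding="utf-8") as f:
            for item_idx, line in enumerate(f):
                yield Key(shard_idx, item_idx), {"text": line.rstrip("\n")}
```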
1557-1558: Consider logging when falling back to legacy key format.

When a generator yields non-`Key` objects, the code silently defaults `original_shard_id` to 0, which means all examples are tracked as coming from a single shard. Consider adding a one-time warning to help developers identify when custom builders need updating.

    + _legacy_key_warned = False
      for key, record in generator:
    -     if isinstance(key, Key):  # old custom builders may not use Key
    +     if isinstance(key, Key):
    +     elif not _legacy_key_warned:
    +         logger.warning(
    +             f"Generator yielding non-Key objects (got {type(key).__name__}). "
    +             "Consider updating to yield Key instances for better shard tracking."
    +         )
    +         _legacy_key_warned = True
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
- src/datasets/arrow_dataset.py (2 hunks)
- src/datasets/arrow_writer.py (2 hunks)
- src/datasets/builder.py (28 hunks)
- src/datasets/info.py (1 hunks)
- src/datasets/inspect.py (0 hunks)
- src/datasets/iterable_dataset.py (5 hunks)
- src/datasets/keyhash.py (0 hunks)
- src/datasets/naming.py (1 hunks)
- src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
- src/datasets/packaged_modules/cache/cache.py (2 hunks)
- src/datasets/packaged_modules/csv/csv.py (2 hunks)
- src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
- src/datasets/packaged_modules/generator/generator.py (2 hunks)
- src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
- src/datasets/packaged_modules/json/json.py (3 hunks)
- src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
- src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
- src/datasets/packaged_modules/spark/spark.py (1 hunks)
- src/datasets/packaged_modules/sql/sql.py (2 hunks)
- src/datasets/packaged_modules/text/text.py (4 hunks)
- src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
- src/datasets/packaged_modules/xml/xml.py (1 hunks)
- src/datasets/splits.py (3 hunks)
- src/datasets/table.py (1 hunks)
- src/datasets/utils/info_utils.py (1 hunks)
- tests/features/test_embed_storage_sliced.py (1 hunks)
- tests/packaged_modules/test_spark.py (2 hunks)
- tests/test_arrow_dataset.py (1 hunks)
- tests/test_arrow_writer.py (1 hunks)
- tests/test_builder.py (14 hunks)
💤 Files with no reviewable changes (2)
- src/datasets/inspect.py
- src/datasets/keyhash.py
✅ Files skipped from review due to trivial changes (1)
- src/datasets/utils/info_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/features/test_embed_storage_sliced.py
🧰 Additional context used
🧬 Code graph analysis (19)
src/datasets/info.py (1)
src/datasets/splits.py (1)
SplitDict(520-600)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/csv/csv.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
write(582-598)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/generator/generator.py (3)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/utils/sharding.py (2)
_number_of_shards_in_gen_kwargs(4-18)_split_gen_kwargs(48-64)src/datasets/iterable_dataset.py (16)
num_shards(231-232)num_shards(290-291)num_shards(405-406)num_shards(619-620)num_shards(664-665)num_shards(710-711)num_shards(861-862)num_shards(954-955)num_shards(1042-1043)num_shards(1561-1562)num_shards(1663-1664)num_shards(1762-1763)num_shards(1837-1838)num_shards(1886-1887)num_shards(1967-1968)num_shards(2114-2115)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (4)
Dataset(704-6467)add_column(6078-6129)features(206-207)features(781-785)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/xml/xml.py (7)
src/datasets/packaged_modules/arrow/arrow.py (1)
_cast_table(56-61)src/datasets/packaged_modules/csv/csv.py (1)
_cast_table(166-175)src/datasets/packaged_modules/json/json.py (1)
_cast_table(88-111)src/datasets/packaged_modules/pandas/pandas.py (1)
_cast_table(55-60)src/datasets/packaged_modules/parquet/parquet.py (1)
_cast_table(144-149)src/datasets/packaged_modules/sql/sql.py (1)
_cast_table(101-110)src/datasets/packaged_modules/text/text.py (1)
_cast_table(53-64)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/packaged_modules/generator/generator.py (1)
Generator(25-38)
tests/test_builder.py (2)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/load.py (1)
configure_builder_class(115-152)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/builder.py (4)
src/datasets/packaged_modules/generator/generator.py (1)
_split_generators(31-32)src/datasets/packaged_modules/arrow/arrow.py (1)
_generate_tables(63-80)src/datasets/packaged_modules/cache/cache.py (1)
_generate_tables(179-192)src/datasets/packaged_modules/pandas/pandas.py (1)
_generate_tables(62-66)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/packaged_modules/arrow/arrow.py (1)
_cast_table(56-61)src/datasets/packaged_modules/csv/csv.py (1)
_cast_table(166-175)src/datasets/packaged_modules/pandas/pandas.py (1)
_cast_table(55-60)src/datasets/packaged_modules/parquet/parquet.py (1)
_cast_table(144-149)src/datasets/packaged_modules/sql/sql.py (1)
_cast_table(101-110)src/datasets/packaged_modules/text/text.py (1)
_cast_table(53-64)src/datasets/packaged_modules/xml/xml.py (1)
_cast_table(48-59)
src/datasets/packaged_modules/text/text.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (29)
src/datasets/table.py (1)
2123-2131: Crash fix is well-implemented and correctly placed.

The contiguous copy for sliced list arrays directly addresses the SIGKILL crash reported in issue #6. The implementation is sound:

- Defensive `hasattr` check before accessing `offset`
- Correctly targets only sliced views (`offset > 0`)
- `pa.concat_arrays([array])` is the idiomatic PyArrow approach for creating contiguous copies
- Placement before `embed_storage` ensures all downstream operations work on contiguous data
- Recursive calls via `_e(array.values, ...)` will handle nested structures

The fix appropriately applies only to `list` and `large_list` types, as `FixedSizeListArray` handles offsets differently (via explicit slicing at lines 2153-2155).

src/datasets/arrow_writer.py (1)
19-19: Typing import adjustment is fineUsing only
AnyandOptionalhere matches the current usages in this file; no issues from this change.tests/test_arrow_dataset.py (1)
4788-4795: Newtest_add_columneffectively validates basic add_column behaviorThis test covers the core contract of
Dataset.add_column(feature presence and row content) with a minimal example and matches the updated public API; looks good as a regression/sanity check.src/datasets/packaged_modules/csv/csv.py (1)
10-10: CSV generator’s move toKey(file_idx, batch_idx)looks correctSwitching from a raw
(file_idx, batch_idx)tuple toKey(file_idx, batch_idx)together with the new import cleanly aligns CSV with the sharedKeyabstraction, without changing table data or batching behavior.Also applies to: 196-196
tests/packaged_modules/test_spark.py (2)
28-28: LGTM!The helper function correctly constructs tuple-based keys
(part_id, row_idx)to match the updatedSparkExamplesIterablebehavior.
78-80: LGTM!Test assertions correctly expect tuple keys
(0, i)instead of the previous string format, aligning with the new key structure from_generate_iterable_examples.src/datasets/info.py (1)
152-152: LGTM on the type annotation.The change from
Optional[dict]toOptional[SplitDict]correctly reflects the actual runtime type. The__post_init__method at lines 178-179 ensures backward compatibility by converting plain dicts.src/datasets/packaged_modules/cache/cache.py (2)
14-14: LGTM!Import of
Keyfromdatasets.builderis consistent with the codebase-wide migration to structured composite keys.
189-189: LGTM!Yielding
Key(file_idx, batch_idx)instead of a string key aligns with the unified key structure across all packaged modules.src/datasets/packaged_modules/parquet/parquet.py (2)
10-10: LGTM!Import of
Keyis consistent with the PR's migration pattern.
182-182: LGTM!The yield statement correctly uses
Key(file_idx, batch_idx)matching the unified key structure across all Arrow-based builders.src/datasets/packaged_modules/arrow/arrow.py (2)
8-8: LGTM!Import follows the consistent pattern established across all packaged modules.
77-77: LGTM!The
Key(file_idx, batch_idx)yield is consistent with the other Arrow-based builders (cache, parquet), ensuring uniform key handling across the dataset generation pipeline.src/datasets/packaged_modules/generator/generator.py (1)
34-38: LGTM! Shard-aware generation properly implemented.

The implementation correctly:
- Computes the number of shards from gen_kwargs
- Splits generation across shards
- Yields structured Key objects with shard and sample indices
This aligns with the broader Key-based architecture for multi-shard dataset generation.
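A self-contained toy version of the gen_kwargs splitting idea this comment refers to; the real helpers live in `datasets.utils.sharding` and may differ in signature and detail.

```python
def split_gen_kwargs(gen_kwargs: dict, num_shards: int) -> list[dict]:
    """Distribute every list-valued kwarg across num_shards shard-specific kwargs."""
    return [
        {k: v[i::num_shards] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        for i in range(num_shards)
    ]


gen_kwargs = {"filepaths": ["a.txt", "b.txt", "c.txt"], "encoding": "utf-8"}
num_shards = len(gen_kwargs["filepaths"])  # here: one shard per file
shards = split_gen_kwargs(gen_kwargs, num_shards)
# shards[0] == {"filepaths": ["a.txt"], "encoding": "utf-8"}, and so on; the
# generator then yields Key(shard_idx, sample_idx) for each sample of a shard.
```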
src/datasets/packaged_modules/hdf5/hdf5.py (1)
88-94: LGTM! Proper Key-based batch indexing.The implementation correctly:
- Introduces
batch_idxvia enumerate- Uses it for computing batch end indices
- Yields structured
Key(file_idx, batch_idx)objectsThis aligns with the Key-based architecture used across other dataset builders.
src/datasets/packaged_modules/pandas/pandas.py (1)
62-66: LGTM! Correct Key usage for single-batch DataFrames.The implementation correctly yields
Key(i, 0)whereiis the file index and0represents the single batch per pickled DataFrame. This aligns with the Key-based architecture used across dataset builders.src/datasets/packaged_modules/webdataset/webdataset.py (1)
110-129: LGTM! Proper Key-based example indexing for WebDataset.The implementation correctly yields
Key(tar_idx, example_idx)to identify examples across TAR archives, replacing the previous string-based key format. This aligns with the standardized Key-based architecture.src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
357-411: LGTM! Comprehensive shard-aware Key implementation.The implementation correctly introduces shard-aware indexing for both metadata-driven and file-driven generation paths:
- Enumerates metadata files and data files to obtain
shard_idx- Enumerates samples within each shard to obtain
sample_idx- Consistently yields
Key(shard_idx, sample_idx)in both pathsThis properly aligns with the Key-based multi-shard architecture used across dataset builders.
src/datasets/packaged_modules/json/json.py (1)
113-193: Key-based yields in_generate_tableslook correctWrapping
(file_idx, 0 | batch_idx)inKey(...)at all yield sites is consistent with the new composite key model and keeps shard/batch identity intact for both the single-object and line-delimited JSON paths.tests/test_builder.py (2)
146-212: Shard-aware dummy builders correctly emitKeyinstances
DummyArrowBasedBuilderWithShardsandDummyGeneratorBasedBuilderWithShardsnow yieldKey(shard_idx, i)for each table/example. This gives unambiguous(original_shard_id, item_id)pairs and is exactly what split/shard length aggregation logic expects.
1031-1180:original_shard_lengthsassertions align with new split metadata semanticsThe new checks on
split_info.original_shard_lengths(eitherNonefor non–multi-proc paths or[100] * 4when combining 4 original shards into 2 logical shards) are coherent with the added field onSplitInfoand will pin down regressions in shard bookkeeping.src/datasets/splits.py (2)
31-57:SplitInfo.original_shard_lengthsaddition is safe and well-scopedAdding
original_shard_lengths: Optional[list[int]] = Nonecleanly extends split metadata without changing existing behavior; the field is optional, defaulted, and only consumed where explicitly checked (e.g. in tests).file_instructionscontinues to rely ondataset_nameprovided viaSplitDict.add, so existing consumers remain intact.
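Purely illustrative numbers for how the two fields relate in the scenario those tests exercise (assumed layout, not copied from the test file):

```python
# 4 original input shards of 100 examples each, written out as 2 logical shards.
split_metadata = {
    "num_examples": 400,
    "shard_lengths": [200, 200],                     # examples per written file
    "original_shard_lengths": [100, 100, 100, 100],  # examples per original input shard
}
assert sum(split_metadata["shard_lengths"]) == sum(split_metadata["original_shard_lengths"])
```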
520-596:SplitDicttyping and YAML serialization updates are consistentTyping
SplitDictasdict[str, SplitInfo]plus thedataset_nameattribute keeps lookups andmake_file_instructionswell-typed, while_to_yaml_listnow stripping bothshard_lengthsandoriginal_shard_lengths(and the deprecateddataset_name) avoids leaking internal shard metadata into YAML. This matches howSplitInfois meant to be persisted vs. used at runtime.src/datasets/iterable_dataset.py (1)
77-89: Key alias and type hints correctly capture new key shapesDefining
from .builder import Key as BuilderKey # TYPE_CHECKING Key = Union[int, str, tuple[int, int], "BuilderKey"]and updating the examples/arrow iterable constructors to accept
Iterator[tuple[Key, ...]]matches actual usage: legacy int/str and(shard_idx, i)tuples still work, while the newbuilder.Keydataclass is also supported without imposing a runtime import.src/datasets/builder.py (5)
27-27: Import changes align with API refactoring.The addition of
Iterator,pyarrow as pa, and removal of unused exception imports are consistent with the signature changes throughout this file.Also applies to: 31-31, 35-35, 52-52
1427-1435: LGTM on extended return value handling.The addition of
original_shards_per_jobandoriginal_shard_lengths_per_jobto the return values is properly threaded through the unpacking and aggregation logic.Also applies to: 1599-1611
1517-1522: LGTM on original_shard_lengths aggregation.The logic correctly flattens and stores original shard lengths from all jobs when multiple shards exist, consistent with the existing pattern for
shard_lengths.Also applies to: 1795-1800
1388-1398: Variable renaming improves clarity.Renaming to
num_original_shardsclarifies that these counts refer to input data shards rather than output file shards, making the multiprocessing adjustment logic easier to understand.Also applies to: 1667-1677
1557-1558: Cannot locate the referenced code in the current repository.The code snippet referenced at lines 1557-1558 in src/datasets/builder.py, specifically the pattern
isinstance(key, Key),original_shard_id,original_shard_lengths, and_prepare_split_singlemethod, does not exist in the current codebase. A comprehensive search found no occurrences of these identifiers or patterns.This review comment may be referencing:
- Unreleased or unmerged code from a pull request
- Incorrect line numbers or file location
- Code from a different branch or version
Please confirm the correct file location, branch, or provide the actual code being reviewed.
    self,
    name: str,
    column: Union[list, np.ndarray],
    new_fingerprint: Optional[str] = None,
    feature: Optional[FeatureType] = None,
Docstring is out of sync with add_column signature (and can mislead positional callers).
The new signature add_column(self, name, column, new_fingerprint=None, feature=None) looks fine and matches how fingerprint_transform is used, but the docstring still:
- Mentions only `name`, `column`, `feature` in `Args:`, and
- Implies `feature` is the third parameter while the function actually expects `new_fingerprint` in that position.
For example:
- Args:
- name (`str`):
- Column name.
- column (`list` or `np.array`):
- Column data to be added.
- feature (`FeatureType` or `None`, defaults to `None`):
- Column datatype.
+ Args:
+ name (`str`):
+ Column name.
+ column (`list` or `np.array`):
+ Column data to be added.
+ new_fingerprint (`str`, *optional*):
+ The new fingerprint of the dataset after transform. If `None`, it is computed automatically.
+ feature (`FeatureType` or `None`, defaults to `None`):
+ Column datatype.Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around lines 6079 to 6083, the add_column
docstring is out of sync with the function signature: it omits new_fingerprint
and lists feature as the third parameter which causes positional callers to pass
feature into new_fingerprint; update the docstring to list parameters in the
exact order (name, column, new_fingerprint=None, feature=None), document
new_fingerprint (Optional[str]) and feature (Optional[FeatureType]) with brief
descriptions and default values, and ensure any examples or Args sections
reflect the new parameter order so positional calls behave correctly.
      @transmit_format
      @fingerprint_transform(inplace=False)
    - def add_item(self, item: dict, new_fingerprint: str):
    + def add_item(self, item: dict, new_fingerprint: Optional[str] = None):
add_item now accepts new_fingerprint, but the docstring doesn’t mention it.
The updated signature add_item(self, item: dict, new_fingerprint: Optional[str] = None) is consistent with other fingerprinted transforms, but the docstring only documents item and omits new_fingerprint. To keep the public API clear, add an entry for new_fingerprint:
- Args:
- item (`dict`):
- Item data to be added.
+ Args:
+ item (`dict`):
+ Item data to be added.
+ new_fingerprint (`str`, *optional*):
+ The new fingerprint of the dataset after transform. If `None`, it is computed automatically.Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around line 6340, the method signature
add_item(self, item: dict, new_fingerprint: Optional[str] = None) was updated
but the docstring still documents only item; update the docstring to add a new
parameter entry for new_fingerprint describing that it is an optional str
(default None) used to set/override the item's fingerprint for fingerprinted
transforms (or leave None to compute/retain the existing fingerprint), and
ensure the parameter list formatting and any examples in the docstring reflect
this new argument.
    The key will be hashed and sorted to shuffle examples deterministically,
    such as generating the dataset multiple times keep examples in the
    same order.
    key: tuple[int, int] original_shard_id and table_idx within that shard
Docstring should reference Key type.
The docstring describes the key as tuple[int, int], but the signature (line 1628) correctly specifies Key. Update for consistency.
Yields:
- key: tuple[int, int] original_shard_id and table_idx within that shard
+ key: Key containing original_shard_id and item_or_batch_id (table index) within that shard
example: `pyarrow.Table`, a feature table
ready to be encoded and written to disk.
🤖 Prompt for AI Agents
In src/datasets/builder.py around line 1642, the docstring line currently
describes the key as "tuple[int, int]" but the function signature uses the alias
Key; update the docstring to reference Key (or the fully qualified alias if
preferred) to keep types consistent, and if Key is defined in a different module
ensure the docstring usage matches (e.g., "Key" or "datasets.builder.Key") so
the docstring and signature align.
      if state_dict:
          state_dict["partition_example_idx"] = row_id + 1
    - yield f"{part_id}_{row_id}", row_as_dict
    + yield (part_id, row_id), row_as_dict
Inconsistent key type: Use Key object instead of tuple.
This module yields a plain tuple (part_id, row_id), while all other dataset builders in this PR (generator, text, hdf5, pandas, webdataset, folder_based_builder) consistently yield Key(shard_idx, item_idx) objects. This type inconsistency may break downstream code expecting uniform key types.
Apply this diff to align with the Key-based approach:
    +from datasets.builder import Key
     def _generate_iterable_examples(
         df: "pyspark.sql.DataFrame",
         partition_order: list[int],
         state_dict: Optional[dict] = None,
     ):
         import pyspark
         df_with_partition_id = df.select("*", pyspark.sql.functions.spark_partition_id().alias("part_id"))
         partition_idx_start = state_dict["partition_idx"] if state_dict else 0
         partition_df = _reorder_dataframe_by_partition(df_with_partition_id, partition_order[partition_idx_start:])
         # pipeline next partition in parallel to hide latency
         rows = partition_df.toLocalIterator(prefetchPartitions=True)
         curr_partition = None
         row_id = state_dict["partition_example_idx"] if state_dict else 0
         for row in islice(rows, row_id, None):
             row_as_dict = row.asDict()
             part_id = row_as_dict["part_id"]
             row_as_dict.pop("part_id")
             if curr_partition != part_id:
                 if state_dict and curr_partition is not None:
                     state_dict["partition_idx"] += 1
                 curr_partition = part_id
                 row_id = 0
             if state_dict:
                 state_dict["partition_example_idx"] = row_id + 1
    -        yield (part_id, row_id), row_as_dict
    +        yield Key(part_id, row_id), row_as_dict
             row_id += 1
🤖 Prompt for AI Agents
In src/datasets/packaged_modules/spark/spark.py around line 74, the generator
yields a plain tuple (part_id, row_id) which is inconsistent; replace the tuple
with a Key object (Key(part_id, row_id)) so downstream code receives the same
key type as other builders, and add/import Key at the top of the file if it
isn't already imported; ensure any tests or usages nearby expect Key and update
local variable names/annotations if necessary.
Force-pushed from 28f83ac to 00792fc
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/datasets/table.py (1)
2121-2131: Contiguity guard for sliced list arrays fits the bug description and is well-scoped.

The new offset-based `pa.concat_arrays([array])` step is correctly limited to list/large_list arrays and only triggers for sliced views (`offset > 0`), preserving behavior for other types and unsliced data. Extension arrays are still unwrapped first, and the subsequent list-handling logic remains unchanged, so this should address the SIGKILL without broad side effects. The extra `hasattr(array, "offset")` check is slightly redundant here but harmless; you can keep it for defensiveness or drop it later as a minor cleanup.
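For reference, a minimal sketch of the guard this comment reviews (assumed shape only; the actual code in `table.py` may differ in naming and surrounding context):

```python
import pyarrow as pa


def make_contiguous_if_sliced(array: pa.Array) -> pa.Array:
    # Only list / large_list arrays are handled; a non-zero offset means the
    # array is a sliced view whose .values still reference the parent buffers.
    if pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
        if getattr(array, "offset", 0) > 0:
            # Concatenating a single chunk produces a zero-offset, contiguous copy.
            array = pa.concat_arrays([array])
    return array
```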
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/datasets/table.py (1 hunks)
- tests/features/test_embed_storage_sliced.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/features/test_embed_storage_sliced.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
…ace#7877) (huggingface#7891) * fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir Enhanced the _TempCacheDir.__init__ method to properly respect and handle the TMPDIR environment variable when creating temporary cache directories. Changes: - Add TMPDIR environment variable detection and validation - Normalise paths to handle path resolution issues - Auto-create TMPDIR directory if it doesn't exist to prevent silent fallback to default temporary directory - Validate that TMPDIR is actually a directory before use - Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even if tempfile.gettempdir() was already called and cached - Add appropriate logging for directory creation and fallback scenarios This ensures that when TMPDIR is set, the temporary cache files are created in the specified directory rather than silently falling back to the system default temporary directory. * test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir Add test coverage for the improved TMPDIR environment variable handling in the _TempCacheDir class. These tests verify the various scenarios for TMPDIR usage and error handling. Changes: - Refactor test_fingerprint_in_multiprocessing to use Pool.map for cleaner test implementation - Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR auto-creation when directory doesn't exist - Add test_temp_cache_dir_with_tmpdir_existing to verify correct behaviour when TMPDIR exists and is valid - Add test_temp_cache_dir_without_tmpdir to verify fallback to default temporary directory when TMPDIR is not set - Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful error handling and fallback when TMPDIR creation fails These tests ensure that the TMPDIR improvements work correctly across all scenarios and edge cases, including proper logging and fallback behaviour. * test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit error conditions instead of fallback behaviour. Changes: - Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir directly and assert that an OSError is raised with a clear TMPDIR context when directory creation fails - Introduce test_temp_cache_dir_tmpdir_not_directory to verify that pointing TMPDIR at a non-directory raises an OSError with an informative error message These tests better match the intended contract of _TempCacheDir by ensuring invalid TMPDIR configurations fail loudly with descriptive messages rather than silently falling back. * fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly Tighten TMPDIR handling in _TempCacheDir so that invalid configurations raise clear errors instead of silently falling back to the default temporary directory. Changes: - When TMPDIR points to a non-existent directory, raise an OSError with explicit guidance to create it manually or unset TMPDIR - When TMPDIR points to a non-directory path, raise an OSError with guidance to point TMPDIR to a writable directory or unset it - Remove previous warning-and-fallback behaviour to avoid masking configuration issues This ensures that TMPDIR misconfigurations are surfaced early and clearly, aligning runtime behaviour with the stricter expectations codified in the new tests. * Update src/datasets/fingerprint.py * Update fingerprint.py * Fix formatting of TMPDIR retrieval line --------- Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily * update docstring for nifti functionality
* fix(nifti): enable lazy loading for Nifti1ImageWrapper
  - Change dataobj initialization from get_fdata() to .dataobj
  - Preserves nibabel's ArrayProxy for lazy loading
  - Prevents OOM errors on large 4D fMRI files
  - Defers I/O errors to usage time for better error handling
  Resolves #1
* chore: trigger CI
…gingface#7906) don't save original_shard_lengths by default for backward compat
When ds.shard() or ds.select() creates a sliced view, array.values returns values with internal offset references that can cause PyArrow's C++ layer to crash with SIGKILL when processing nested types like Sequence(Nifti()). The fix forces a contiguous copy via pa.concat_arrays([array]) when the array has a non-zero offset, breaking the internal references before further processing. Fixes: #6
1. Remove fork-specific issue URL placeholders (upstream-ready)
2. Add consistency assertions to LargeList test:
   - offset == 0 check
   - content verification (bytes embedded)
3. Add offset == 0 check to Nifti test for consistency
Force-pushed from 00792fc to b46579b
Summary
Fixes SIGKILL crash when processing sliced/sharded datasets with nested types like `Sequence(Nifti())`.

Root Cause:

- `ds.shard()` or `ds.select()` creates a sliced view with internal offset metadata
- `array.values` on a sliced ListArray returns values with internal offset references

Fix:

- Force a contiguous copy via `pa.concat_arrays([array])` when the array has a non-zero offset

Test plan

- `tests/features/test_embed_storage_sliced.py` with 3 test cases

Fixes #6
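To illustrate the root cause and the fix on plain PyArrow data (a toy example, independent of the `Sequence(Nifti())` feature from the report):

```python
import pyarrow as pa

arr = pa.array([[1, 2], [3, 4], [5, 6]])  # list<int64> array
sliced = arr.slice(1)                     # the kind of view ds.shard()/ds.select() produce
print(sliced.offset)                      # 1 -> sliced view into the parent buffers

contiguous = pa.concat_arrays([sliced])   # the fix: force a contiguous copy
print(contiguous.offset)                  # 0 -> internal offset references are gone
print(contiguous.to_pylist())             # [[3, 4], [5, 6]]
```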
@coderabbitai review
Summary by CodeRabbit
Bug Fixes
Tests