fix(nifti): Enable lazy loading for Nifti1ImageWrapper #3
The-Obstacle-Is-The-Way wants to merge 7 commits into main
Conversation
Walkthrough
Changes
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
c706e73 to 0eab74c
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/datasets/packaged_modules/text/text.py (1)
104-109: Inconsistent key type at line 109.

Lines 86 and 104 correctly yield Key(file_idx, batch_idx), but line 109 still yields a plain tuple (file_idx, batch_idx). This inconsistency will cause type mismatches in downstream processing. Apply this diff to fix the inconsistency:

```diff
 if batch:
     pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-    yield (file_idx, batch_idx), self._cast_table(pa_table)
+    yield Key(file_idx, batch_idx), self._cast_table(pa_table)
```

src/datasets/iterable_dataset.py (1)
419-425: Fix undefined attribute in ShuffledDataSourcesArrowExamplesIterable.shift_rngs

shift_rngs currently references self.generate_examples_fn, which doesn't exist on ShuffledDataSourcesArrowExamplesIterable (it has generate_tables_fn via ArrowExamplesIterable). This will raise an AttributeError when shift_rngs is called. Suggest using generate_tables_fn instead:

```diff
 def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
     new_seed = self.generator.bit_generator.state["state"]["state"] + value
     return ShuffledDataSourcesArrowExamplesIterable(
-        self.generate_examples_fn,
-        self.kwargs,
-        np.random.default_rng(seed=new_seed),
-    )
+        self.generate_tables_fn,
+        self.kwargs,
+        np.random.default_rng(seed=new_seed),
+    )
```

src/datasets/splits.py (1)
558-567: from_split_dict can crash when called with dict[str, SplitInfo]

from_split_dict converts a dict to list(split_infos.values()), but still assumes the resulting items are dict by calling .get("dataset_name") on split_infos[0]. If the values are SplitInfo instances, this will raise AttributeError. Consider making dataset name extraction robust to both dict and SplitInfo inputs, e.g.:

```diff
-    if dataset_name is None:
-        dataset_name = split_infos[0].get("dataset_name") if split_infos else None
+    if dataset_name is None and split_infos:
+        first = split_infos[0]
+        if isinstance(first, dict):
+            dataset_name = first.get("dataset_name")
+        elif isinstance(first, SplitInfo):
+            dataset_name = first.dataset_name
```

src/datasets/builder.py (2)
1535-1588: original_shard_lengths accumulation raises IndexError for non-zero or non-contiguous Key.original_shard_id

The code at lines 1575–1578 (and identical logic at 1855–1858) assumes original_shard_id is always contiguous and starts at 0:

```python
if len(original_shard_lengths) == original_shard_id:
    original_shard_lengths.append(1)
else:
    original_shard_lengths[original_shard_id] += 1
```

If the first Key has original_shard_id > 0 or IDs arrive out of order (e.g., 3 then 0), the condition is false and original_shard_lengths[original_shard_id] raises IndexError. The Key class has no documented constraints enforcing contiguity or ordering. Recommend either:

- Document that original_shard_id must start at 0 and remain contiguous/non-decreasing per job, or
- Harden the code to handle arbitrary non-negative IDs using extend() to pre-allocate slots as needed.

Given this is new behavior that can crash custom builders adopting Key, hardening is strongly recommended.
1807-1868: Fix original_shard_lengths list bounds checking in both _prepare_split_single paths

The Arrow and generator-based paths both access original_shard_lengths[original_shard_id] without ensuring the list is large enough. If the first Key yielded has original_shard_id > 0, an IndexError occurs.

Lines 1855-1858 (Arrow path) and the equivalent in the generator path should grow the list to at least original_shard_id + 1 before incrementing, or enforce a documented contract that original_shard_id values are 0-based and contiguous within each job.
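A minimal sketch of the hardening option, written as a hypothetical standalone helper rather than the builder's actual code:

```python
def record_original_shard(original_shard_lengths: list[int], original_shard_id: int) -> None:
    """Count one yielded example/table against its original shard, tolerating gaps and out-of-order ids."""
    if original_shard_id >= len(original_shard_lengths):
        # Pre-allocate zeroed slots up to the requested id instead of assuming contiguous 0-based ids.
        original_shard_lengths.extend([0] * (original_shard_id + 1 - len(original_shard_lengths)))
    original_shard_lengths[original_shard_id] += 1


lengths: list[int] = []
record_original_shard(lengths, 3)  # lengths == [0, 0, 0, 1]
record_original_shard(lengths, 0)  # lengths == [1, 0, 0, 1]
```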
🧹 Nitpick comments (15)
src/datasets/packaged_modules/xml/xml.py (1)
61-68: Composite (file_idx, 0) key matches the (shard, batch) pattern; consider using Key for uniformity.

Yielding (file_idx, 0) gives XML tables a consistent two-part key (file, batch) like other loaders. For longer-term consistency with JSON and other packaged modules, you might eventually wrap this in the shared Key type as well, but that's an optional cleanup rather than a blocker.

tests/test_arrow_writer.py (1)
136-147: Misleading test name after key removal.

The test is named test_write_with_keys but no longer uses keys in the writer.write() calls (lines 142-143). This makes the test name misleading. Consider renaming the test to reflect its actual purpose or removing it if it duplicates test_write:

```diff
-def test_write_with_keys(writer_batch_size):
+def test_write_without_keys(writer_batch_size):
     output = pa.BufferOutputStream()
     with ArrowWriter(
         stream=output,
```

Alternatively, if this test is now redundant with test_write (lines 99-111), consider removing it entirely.

src/datasets/info.py (1)
121-123: Align splits docstring with the SplitDict type

The splits field is now correctly typed as Optional[SplitDict] and __post_init__ already normalizes dict-like inputs via SplitDict.from_split_dict, which is good.

The attribute docs still describe splits as a dict, though. Consider updating that to mention SplitDict (or a "SplitDict-like mapping") to avoid confusion for users relying on type hints/docstrings.

Also applies to: 150-153, 167-185
src/datasets/iterable_dataset.py (2)
84-89: Key alias looks good; update nearby docstrings to match the expanded key types

The new Key union and BuilderKey import align this module with the structured key class, but several docstrings still say keys are only (int/str, dict), e.g. in __iter__ of _BaseExamplesIterable and in _convert_to_arrow's Args section. To avoid confusion now that keys can be int | str | tuple[int, int] | BuilderKey, consider updating those docstrings accordingly.
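For reference, a sketch of the widened alias those docstrings would need to describe (mirroring the review's description, not the module's exact code):

```python
from typing import Union

from datasets.builder import Key as BuilderKey  # structured (original_shard_id, item_or_batch_id) key

# Example keys may now be plain ints/strs, legacy (shard, batch) tuples, or the structured BuilderKey.
Key = Union[int, str, tuple[int, int], BuilderKey]
```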
257-262: Iterator-based type hints for generators are consistent with usage

Changing the generate_examples_fn/generate_tables_fn annotations to return Iterator[tuple[Key, ...]] matches how these call sites consume them (via direct iteration and islice). No functional issues spotted; just ensure any externally typed generators are declared as returning Iterator[...] (or a subtype) so type-checkers stay happy.

Also applies to: 295-300, 343-347, 410-415
src/datasets/arrow_dataset.py (1)
6340-6340: add_item signature change looks correct; consider documenting new_fingerprint.

Appending new_fingerprint: Optional[str] = None is backward compatible for positional callers and aligns add_item with other @fingerprint_transform methods that accept an explicit fingerprint override. To keep the public API clear, you may want to add an Args: entry for new_fingerprint in the docstring similar to other transforms.

src/datasets/splits.py (1)
587-596: YAML export intentionally drops shard-length metadata

_to_yaml_list now removes both shard_lengths and original_shard_lengths before serializing. This is fine if these are considered derived/large metadata, but it means YAML round-trips won't preserve them.

If consumers rely on YAML as a full-fidelity representation of SplitInfo, consider documenting this lossy behavior or, if needed, gating the pops behind a flag so they can be retained when required.
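If the flag route is taken, a rough sketch of what that could look like, assuming SplitInfo stays a dataclass (keep_shard_lengths is a hypothetical parameter name):

```python
from dataclasses import asdict


def to_yaml_list(split_infos, keep_shard_lengths: bool = False) -> list[dict]:
    # Serialize each SplitInfo, dropping the potentially large shard-length
    # fields unless the caller explicitly asks to keep them.
    out = [asdict(info) for info in split_infos]
    if not keep_shard_lengths:
        for entry in out:
            entry.pop("shard_lengths", None)
            entry.pop("original_shard_lengths", None)
    return out
```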
src/datasets/builder.py (8)

1320-1340: Clarify Key usage and update _generate_examples documentation

You've introduced Key and updated _generate_examples to return Iterator[tuple[Key, dict[str, Any]]], but the docstring still documents key as str/int. Also, the implementation gracefully handles legacy keys (isinstance(key, Key) guard), so the runtime contract is broader than the type hint. Consider:

- Updating the docstring to mention Key and how original_shard_id/item_or_batch_id are expected to be used.
- Widening the return type hint to include legacy key types, e.g. Iterator[tuple[Union[Key, int, str], dict[str, Any]]], if you intend to support them.

This will make the new API surface easier to adopt for custom builders.
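For illustration, here is how a hypothetical custom builder could adopt the new key (the file layout and field names are made up for the sketch):

```python
from datasets.builder import Key


def _generate_examples(self, filepaths):
    # Each input file is one "original shard"; each line is one item within it.
    for shard_id, path in enumerate(filepaths):
        with open(path, encoding="utf-8") as f:
            for item_id, line in enumerate(f):
                yield Key(shard_id, item_id), {"text": line.rstrip("\n")}
```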
1387-1399: num_proc vs num_original_shards logic looks correct but relies on the _number_of_shards_in_gen_kwargs contract

The num_proc adjustment based on num_original_shards (warnings + clamping) is reasonable and prevents over-parallelizing when there are too few shards. The behavior depends entirely on _number_of_shards_in_gen_kwargs correctly reflecting the "original shard" count.

Just a note: if _number_of_shards_in_gen_kwargs ever returns 0 for some edge builder, this will silently force num_proc to 1 and skip the warning. If that's not expected, you may want an explicit assertion/log for num_original_shards == 0.
1418-1435: Single-process _prepare_split_single now returns extra metadata; type hints should reflect the 7-tuple

In the single-process branch you now wrap a 7-element result (..., shard_lengths, num_original_shards, original_shard_lengths) into lists:

```python
(
    examples_per_job,
    bytes_per_job,
    features_per_job,
    shards_per_job,
    shard_lengths_per_job,
    original_shards_per_job,
    original_shard_lengths_per_job,
) = ([item] for item in result)
```

But _prepare_split_single's return type is still Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]], which only accounts for six inner items and doesn't mention the list type of original_shard_lengths. Recommend updating the annotation to match reality (and mirroring the change on the Arrow path), e.g.:

```python
-> Iterator[tuple[int, bool, tuple[int, int, Features, int, list[int], int, list[int]]]]
```

(or a similar structure you prefer).
1477-1522: original_shard_lengths propagation looks good, but it is only set when there is more than one original shard

After aggregation:

```python
total_original_shards = sum(original_shards_per_job)
...
if total_original_shards > 1:
    split_generator.split_info.original_shard_lengths = [
        original_shard_length
        for original_shard_lengths in original_shard_lengths_per_job
        for original_shard_length in original_shard_lengths
    ]
```

This means single-shard splits (or builders that don't emit Key) will have original_shard_lengths = None, which seems intentional and keeps SplitInfo small. Just ensure downstream code treats None as "single-shard or unknown" rather than "no data."

If consumers are going to rely on original_shard_lengths being present whenever num_examples > 0, consider setting it even when total_original_shards == 1.
1627-1645: Align _generate_tables docstring and type with the new Key usage

_generate_tables is now typed as:

```python
def _generate_tables(self, **kwargs) -> Iterator[tuple[Key, pa.Table]]:
```

and _prepare_split_single checks isinstance(key, Key). However, the docstring still says:

```
key: tuple[int, int]
    original_shard_id and table_idx within that shard
```

and existing builders may still yield plain integer or tuple keys. To avoid confusion and ease migration:

- Update the docstring to reference Key and the recommended semantics.
- Either widen the type hint to include legacy key types (as with _generate_examples) or explicitly document that new implementations should return Key.

Runtime behavior is already backward compatible due to the isinstance guard, so this is mostly about clarity.
1667-1678: Same num_proc vs num_original_shards logic for the Arrow path; consider a shared helper

ArrowBasedBuilder._prepare_split duplicates the num_proc clamping based on num_original_shards. Behavior is consistent with the GeneratorBasedBuilder path, which is good. As this logic evolves (e.g., if _number_of_shards_in_gen_kwargs changes), you may want to factor it into a small helper to keep the two code paths in sync and avoid subtle drift.
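One possible shape for that helper, sketched with illustrative names and logging (not the existing builder code):

```python
import logging

logger = logging.getLogger(__name__)


def clamp_num_proc_to_original_shards(num_proc, num_original_shards):
    # Shared logic for both builder paths: never spawn more jobs than original shards.
    if num_original_shards == 0:
        logger.warning("gen_kwargs contain no shards; falling back to a single process.")
        return 1
    if num_proc is not None and num_proc > num_original_shards:
        logger.warning(
            f"Reducing num_proc from {num_proc} to {num_original_shards}, "
            f"the number of original shards available."
        )
        return num_original_shards
    return num_proc
```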
1705-1713: Arrow _prepare_split_single type hints diverge from the actual return shape

As with the generator-based path, the Arrow-based _prepare_split_single now returns 7 inner values, with original_shard_lengths being a list:

```python
(
    total_num_examples,
    total_num_bytes,
    writer._features,
    num_shards,
    shard_lengths,
    num_original_shards,
    original_shard_lengths,
)
```

but the annotation remains:

```python
-> Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]]
```

Recommend updating the type hint to match, mirroring whatever you choose on the GeneratorBasedBuilder side.
949-949: Wrapped OSError message still preserves the original error but drops the traceback

In _download_and_prepare, you now rewrap OSError from _prepare_split as:

```python
except OSError as e:
    raise OSError("Cannot find data file. " + "\nOriginal error:\n" + str(e)) from None
```

This improves the user-facing message but suppresses the original traceback (from None). If some failures are more complex than "cannot find data file", this may hide useful debugging context. If you want clearer messaging without losing the traceback, consider either:

- Re-raising with from e to preserve the chain, or
- Restricting this wrapping to known "missing file" patterns instead of all OSError.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
- src/datasets/arrow_dataset.py (2 hunks)
- src/datasets/arrow_writer.py (2 hunks)
- src/datasets/builder.py (28 hunks)
- src/datasets/features/nifti.py (1 hunks)
- src/datasets/info.py (1 hunks)
- src/datasets/inspect.py (0 hunks)
- src/datasets/iterable_dataset.py (5 hunks)
- src/datasets/keyhash.py (0 hunks)
- src/datasets/naming.py (1 hunks)
- src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
- src/datasets/packaged_modules/cache/cache.py (2 hunks)
- src/datasets/packaged_modules/csv/csv.py (2 hunks)
- src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
- src/datasets/packaged_modules/generator/generator.py (2 hunks)
- src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
- src/datasets/packaged_modules/json/json.py (3 hunks)
- src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
- src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
- src/datasets/packaged_modules/spark/spark.py (1 hunks)
- src/datasets/packaged_modules/sql/sql.py (2 hunks)
- src/datasets/packaged_modules/text/text.py (4 hunks)
- src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
- src/datasets/packaged_modules/xml/xml.py (1 hunks)
- src/datasets/splits.py (3 hunks)
- src/datasets/utils/info_utils.py (1 hunks)
- tests/features/test_nifti.py (1 hunks)
- tests/packaged_modules/test_spark.py (2 hunks)
- tests/test_arrow_dataset.py (1 hunks)
- tests/test_arrow_writer.py (1 hunks)
- tests/test_builder.py (14 hunks)
💤 Files with no reviewable changes (2)
- src/datasets/keyhash.py
- src/datasets/inspect.py
🧰 Additional context used
🧬 Code graph analysis (18)
- tests/test_builder.py (1)
  - src/datasets/builder.py (2): Key (1321-1326), cache_dir (598-599)
- tests/features/test_nifti.py (2)
  - tests/utils.py (1): require_nibabel (226-235)
  - src/datasets/features/nifti.py (3): Nifti (64-300), encode_example (110-148), decode_example (150-211)
- src/datasets/packaged_modules/generator/generator.py (2)
  - src/datasets/builder.py (1): Key (1321-1326)
  - src/datasets/utils/sharding.py (2): _number_of_shards_in_gen_kwargs (4-18), _split_gen_kwargs (48-64)
- src/datasets/packaged_modules/webdataset/webdataset.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- tests/test_arrow_dataset.py (1)
  - src/datasets/arrow_dataset.py (5): Dataset (704-6467), from_dict (974-1035), add_column (6078-6129), features (206-207), features (781-785)
- src/datasets/info.py (1)
  - src/datasets/splits.py (1): SplitDict (520-600)
- src/datasets/packaged_modules/cache/cache.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/packaged_modules/csv/csv.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- tests/test_arrow_writer.py (1)
  - src/datasets/arrow_writer.py (1): write (582-598)
- src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/packaged_modules/text/text.py (8)
  - src/datasets/builder.py (1): Key (1321-1326)
  - src/datasets/packaged_modules/arrow/arrow.py (1): _cast_table (56-61)
  - src/datasets/packaged_modules/csv/csv.py (1): _cast_table (166-175)
  - src/datasets/packaged_modules/json/json.py (1): _cast_table (88-111)
  - src/datasets/packaged_modules/pandas/pandas.py (1): _cast_table (55-60)
  - src/datasets/packaged_modules/parquet/parquet.py (1): _cast_table (144-149)
  - src/datasets/packaged_modules/sql/sql.py (1): _cast_table (101-110)
  - src/datasets/packaged_modules/xml/xml.py (1): _cast_table (48-59)
- src/datasets/packaged_modules/arrow/arrow.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/packaged_modules/json/json.py (8)
  - src/datasets/builder.py (1): Key (1321-1326)
  - src/datasets/packaged_modules/arrow/arrow.py (1): _cast_table (56-61)
  - src/datasets/packaged_modules/csv/csv.py (1): _cast_table (166-175)
  - src/datasets/packaged_modules/pandas/pandas.py (1): _cast_table (55-60)
  - src/datasets/packaged_modules/parquet/parquet.py (1): _cast_table (144-149)
  - src/datasets/packaged_modules/sql/sql.py (1): _cast_table (101-110)
  - src/datasets/packaged_modules/text/text.py (1): _cast_table (53-64)
  - src/datasets/packaged_modules/xml/xml.py (1): _cast_table (48-59)
- src/datasets/packaged_modules/pandas/pandas.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/packaged_modules/hdf5/hdf5.py (2)
  - src/datasets/builder.py (1): Key (1321-1326)
  - src/datasets/table.py (1): cast_table_to_features (2179-2198)
- src/datasets/packaged_modules/sql/sql.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/packaged_modules/parquet/parquet.py (1)
  - src/datasets/builder.py (1): Key (1321-1326)
- src/datasets/iterable_dataset.py (2)
  - src/datasets/builder.py (1): Key (1321-1326)
  - src/datasets/packaged_modules/generator/generator.py (1): Generator (25-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (30)
src/datasets/utils/info_utils.py (1)
32-32: Clarified ALL_CHECKS description looks accurate and concise

The updated description correctly communicates that ALL_CHECKS runs both split checks and downloaded-file validity checks (counts, checksums) in one line, and the table formatting remains intact. No further changes needed here.

src/datasets/features/nifti.py (1)
28-36: Lazy-loading change looks good; please verify remote xopen use cases.

Switching to dataobj=nifti_image.dataobj avoids eager reads and keeps affine/header/extra/file_map/dtype in sync, so local and in-memory paths should now be lazily loaded as intended. One thing to double-check is the remote path in decode_example that does `with xopen(path, "rb") as f: nifti = nib.load(f)`: previously the eager get_fdata() ran while f was still open, whereas now .get_fdata() may be called only after f is closed. It's worth adding or running an integration test that decodes a NIfTI via a remote xopen URL and calls get_fdata() to ensure nibabel's ArrayProxy still works in that scenario.
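A rough sketch of such an integration test, assuming the Nifti feature accepts the same {"bytes", "path"} storage dict as other media features; the URL is a placeholder:

```python
import pytest

from datasets.features.nifti import Nifti

REMOTE_NIFTI_URL = "https://example.com/sample.nii.gz"  # placeholder for a reachable NIfTI file


@pytest.mark.integration
def test_remote_nifti_get_fdata_after_decode():
    feature = Nifti()
    # decode_example opens the remote file and returns a nibabel image; reading the
    # data afterwards exercises the ArrayProxy once any file handle has been closed.
    image = feature.decode_example({"path": REMOTE_NIFTI_URL, "bytes": None})
    assert image.get_fdata().ndim >= 3
```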
src/datasets/packaged_modules/spark/spark.py (1)

49-76: Composite (part_id, row_id) keys are reasonable; confirm downstream expectations.

Emitting (part_id, row_id) instead of a formatted string keeps partition and row indices explicit and in sync with the state_dict logic. Since this changes the key type returned by the Spark examples iterable, ensure all consumers/tests that previously assumed string keys are updated or documented accordingly.

src/datasets/packaged_modules/json/json.py (1)
12-12: Adopting Key for JSON table identifiers is consistent with the new key model.

Importing Key and yielding Key(file_idx, batch_idx_or_0) for both single-object and line-based JSON files aligns this module with the repo-wide Key-based indexing. This should work seamlessly anywhere keys are only stringified, but be aware this is a behavioral change for callers that previously assumed plain tuples/ints and update any such usages/tests accordingly.

Also applies to: 125-125, 190-193
tests/features/test_nifti.py (1)
133-149: Lazy-loading test is well targeted and validates the new behavior.

This test nicely checks that decode_example preserves a proxy dataobj and that get_fdata() still yields the expected shape, so it should catch regressions where the wrapper accidentally materializes data eagerly again.

src/datasets/naming.py (1)
70-84: Single-shard filenames without -of-00001 are clearer.

Requiring len(shard_lengths) > 1 before adding the -SSSSS-of-NNNNN suffix makes single-shard splits use the plain base filename, which is more intuitive and avoids confusing "of 1" artifacts. This looks correct and backward-compatible aside from the intended naming change.

tests/test_arrow_dataset.py (1)
4787-4795: New test_add_column gives a clear smoke test for the public API.

Using from datasets import Dataset here ensures the top-level Dataset alias plus add_column behave as expected end-to-end, and the assertions on features and row contents are sufficient. Given the more detailed test_dataset_add_column above, this is a lightweight but useful complementary check.

src/datasets/packaged_modules/text/text.py (3)
9-9: LGTM: Key import added. The import is correctly added to support the Key-based refactoring.

86-86: LGTM: Key-based yield for line-based sampling. The change from tuple to Key object is consistent with the broader refactoring.

113-113: LGTM: Key-based yield for document-based sampling. The change from tuple to Key object is correct.

src/datasets/packaged_modules/cache/cache.py (2)

14-14: LGTM: Key import added. Import is correctly added for the Key-based refactoring.

189-189: LGTM: Migration from string to Key object. The change from f"{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) provides better structure and type safety.

src/datasets/packaged_modules/sql/sql.py (2)

10-10: LGTM: Key import added. Import correctly added for the Key-based refactoring.

120-120: LGTM: Key-based yield with appropriate shard_id. Using Key(0, chunk_idx) is correct for SQL where there's a single logical data source. The 0 represents the single shard, and chunk_idx represents the batch within that shard.

src/datasets/packaged_modules/arrow/arrow.py (2)

8-8: LGTM: Key import added. Import correctly added for the Key-based refactoring.

77-77: LGTM: Migration from string to Key object. The change from a string-based key to Key(file_idx, batch_idx) provides better type safety and structured identification.

src/datasets/packaged_modules/parquet/parquet.py (2)

10-10: LGTM: Key import added. Import correctly added for the Key-based refactoring.

182-182: LGTM: Migration from string to Key object. The change from "{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) aligns with the broader Key-based refactoring across packaged modules.

src/datasets/packaged_modules/pandas/pandas.py (2)

10-10: LGTM: Key import added. Import correctly added for the Key-based refactoring.

66-66: LGTM: Key-based yield with appropriate batch_id. Using Key(i, 0) is correct since each pandas pickle file is read as a single batch. The 0 batch_id indicates this is the only batch from that file.

src/datasets/packaged_modules/csv/csv.py (2)

10-10: LGTM: Key import added. Import correctly added for the Key-based refactoring.

196-196: LGTM: Migration from tuple to Key object. The change from the (file_idx, batch_idx) tuple to Key(file_idx, batch_idx) provides a structured key representation consistent with other packaged modules.

tests/packaged_modules/test_spark.py (1)

23-29: Tuple-based row identifiers look consistent with iterator behavior. Using (part_id, row_idx) in _get_expected_row_ids_and_row_dicts_for_partition_order and asserting row_key == (0, i) in test_spark_examples_iterable keeps test expectations aligned with the new composite key semantics. No functional issues spotted here.

Also applies to: 71-80
src/datasets/packaged_modules/hdf5/hdf5.py (1)
8-10: Key-based shard/batch identifiers for HDF5 tables look correct

Using Key(file_idx, batch_idx) in _generate_tables is consistent with the intended "(original_shard_id, item_or_batch_id)" semantics: file_idx identifies the source HDF5 file, and batch_idx identifies the batch within that file. The surrounding length checks and empty-file skips are unchanged, so behavior should stay correct while enabling richer split metadata.

Also applies to: 75-95
tests/test_builder.py (3)
19-26: Dummy builders updated to Key-based identifiers are consistent with the new sharding model

All dummy generator/arrow-based builders in this file now yield Key(shard_idx, item_idx) (or (0, i) where only a trivial shard dimension is needed). This matches the new composite key abstraction in datasets.builder.Key and lets the core builder logic compute shard and original-shard metadata correctly while leaving payloads unchanged.

No issues spotted with these test helpers; they look well-aligned with the library changes.
Also applies to: 64-86, 88-99, 107-119, 146-157, 159-171, 172-191, 193-212
766-773: Extended cache-dir tests for data_dir and configured builders look correct

The new expectations around cache_dir show that:

- data_dir is part of the cache-dir fingerprint even when config_name is set, and
- configured builders created via configure_builder_class produce stable cache dirs for identical hash/config combos.

These tests are valuable to prevent regressions in cache-dir computation; the logic they assert is coherent with how cache isolation is typically handled.
Also applies to: 775-781
1030-1039: Shard and original-shard length assertions give good coverage

The added assertions on:

- shard_lengths == [25] * 4 and original_shard_lengths is None for pure writer-based sharding,
- shard_lengths == [200, 200] and original_shard_lengths == [100] * 4 for generator-based multi-proc sharding, and
- analogous expectations for arrow-based builders,
provide solid regression coverage for the new split metadata behavior. They clearly distinguish “logical shards” from “original shards” and validate that multi-proc jobs are merged as intended.
No issues; these tests are a good addition.
Also applies to: 1081-1088, 1124-1132, 1173-1180
src/datasets/packaged_modules/webdataset/webdataset.py (1)
11-13: Key-based example identifiers for WebDataset are reasonable

Switching _generate_examples to yield Key(tar_idx, example_idx) instead of a string key keeps the observable key structure consistent with other packaged modules while preserving the existing example contents and feature handling. Using tar_idx as the shard id and example_idx as the per-tar index is a sensible choice.

No functional concerns here.
Also applies to: 110-130
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
16-18: Folder-based builder Key usage is coherent across metadata and non-metadata paths

In _generate_examples:

- When add_metadata is true, each metadata file becomes a shard (shard_idx) and each metadata row a sample_idx, yielding Key(shard_idx, sample_idx), sample after applying _nested_apply.
- Without metadata, each (original_file, downloaded_file_or_dir) pair is treated as a shard, and files within it are indexed by sample_idx, again yielding Key(shard_idx, sample_idx), sample.

This is internally consistent and should integrate well with the new shard/original-shard accounting, while leaving label inference and filtering behavior unchanged.
Also applies to: 357-391, 392-412
src/datasets/arrow_writer.py (1)
19-20: Typing imports are consistent with the updated hints

Importing Any and Optional matches the new type annotations added below; nothing to change here.
```python
    self,
    name: str,
    column: Union[list, np.ndarray],
    new_fingerprint: Optional[str] = None,
    feature: Optional[FeatureType] = None,
):
```
Avoid breaking positional callers by reordering new_fingerprint after feature.
Adding new_fingerprint is correct for fingerprint_transform, but inserting it before feature changes the positional parameter order. Existing code that calls add_column(name, column, feature) will now pass feature into new_fingerprint and leave feature as None, silently misbehaving.
To preserve backward compatibility, keep feature as the third positional argument and add new_fingerprint at the end:
```diff
-    def add_column(
-        self,
-        name: str,
-        column: Union[list, np.ndarray],
-        new_fingerprint: Optional[str] = None,
-        feature: Optional[FeatureType] = None,
-    ):
+    def add_column(
+        self,
+        name: str,
+        column: Union[list, np.ndarray],
+        feature: Optional[FeatureType] = None,
+        new_fingerprint: Optional[str] = None,
+    ):
```

It would also be good to extend the docstring's Args: section to document new_fingerprint for consistency with other fingerprinted transforms.
🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around lines 6079-6084, the new_fingerprint
parameter was inserted before feature which changes the positional-argument
order and breaks callers; restore backward compatibility by moving feature back
to be the third positional parameter and place new_fingerprint as an optional
fourth (last) parameter, update the function signature accordingly, and add
new_fingerprint to the method docstring Args: section to document its purpose
and type consistent with other fingerprinted transforms.
```python
def write(
    self,
    example: dict[str, Any],
    key: Optional[Union[str, int, bytes]] = None,
    writer_batch_size: Optional[int] = None,
):
    """Add a given (Example,Key) pair to the write-pool of examples which is written to file.

    Args:
        example: the Example to add.
        key: Optional, a unique identifier(str, int or bytes) associated with each example
    """
    # Utilize the keys and duplicate checking when `self._check_duplicates` is passed True
    if self._check_duplicates:
        # Create unique hash from key and store as (key, example) pairs
        hash = self._hasher.hash(key)
        self.current_examples.append((example, hash))
        # Maintain record of keys and their respective hashes for checking duplicates
        self.hkey_record.append((hash, key))
    else:
        # Store example as a tuple so as to keep the structure of `self.current_examples` uniform
        self.current_examples.append((example, ""))
    # Store example as a tuple so as to keep the structure of `self.current_examples` uniform
    self.current_examples.append((example, ""))

    if writer_batch_size is None:
        writer_batch_size = self.writer_batch_size
    if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
        if self._check_duplicates:
            self.check_duplicate_keys()
            # Re-initializing to empty list for next batch
            self.hkey_record = []

        self.write_examples_on_file()
```
Align ArrowWriter.write docstring and consider simplifying current_examples structure
The implementation now only accepts example (no key) and always appends (example, "") to current_examples, but the docstring still says “(Example,Key) pair”. That’s misleading.
You could:
- Update the docstring to describe just example being buffered, and
- (Optionally) in a follow-up, simplify current_examples to list[dict[str, Any]] and adjust write_examples_on_file to drop the unused second tuple element, since keys are no longer part of this API.
- """Add a given (Example,Key) pair to the write-pool of examples which is written to file.
+ """Add a given example to the write-pool of examples which is written to file.
@@
- example: the Example to add.
+ example: the example to add.🤖 Prompt for AI Agents
In src/datasets/arrow_writer.py around lines 582 to 599, the write() docstring
still states it accepts an (Example,Key) pair while the implementation only
accepts a single example and appends (example, "") to current_examples; update
the docstring to describe that write(example: dict[str, Any], writer_batch_size:
Optional[int] = None) buffers a single example (no key) and triggers a flush
when the batch size is reached, and optionally add a TODO note suggesting future
simplification of current_examples to a list of examples and corresponding
adjustments in write_examples_on_file to remove the unused tuple element.
```python
from datasets.builder import Key
from datasets.utils.sharding import _number_of_shards_in_gen_kwargs, _split_gen_kwargs
```
🧩 Analysis chain
🏁 Script executed: rg "_number_of_shards_in_gen_kwargs|_split_gen_kwargs" -n
Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 3491

🏁 Script executed: sed -n '1,50p' src/datasets/packaged_modules/generator/generator.py | cat -n
Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 1721

🏁 Script executed: rg "class GeneratorBasedBuilder" -A 20 -B 2
Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 1582

🏁 Script executed: sed -n '1380,1450p' src/datasets/builder.py | cat -n
Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 3528

🏁 Script executed: sed -n '1660,1730p' src/datasets/builder.py | cat -n
Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 3540
Confirm interaction with base sharding to avoid double-splitting gen_kwargs
This implementation cleanly derives a shard dimension from list-valued gen_kwargs and exposes (shard_idx, sample_idx) via Key, which fits the new composite-key model.
However, there is a real risk of unintended double-splitting: when num_proc > 1, DatasetBuilder._prepare_split splits gen_kwargs across jobs at lines 1440 and 1718, then passes each shard to _generate_examples in a worker. If the Generator subclass also applies _split_gen_kwargs in its _generate_examples (lines 35–36), the gen_kwargs are split twice—once at the job level by the base builder, and again within each worker. This cascade can distort shard metadata and make the composite Key semantics ambiguous.
Either clarify that Generator should only apply internal sharding when used single-process, or ensure _generate_examples detects and skips splitting if the base builder already split the inputs.
🤖 Prompt for AI Agents
In src/datasets/packaged_modules/generator/generator.py around lines 5–7 and the
_generate_examples region (lines ~35–36), avoid double-splitting gen_kwargs by
detecting when the base DatasetBuilder already sharded inputs: only call
_split_gen_kwargs when gen_kwargs clearly represents full (unsplit) list-valued
inputs. Concretely, add a guard that skips _split_gen_kwargs if either (a) any
gen_kwargs value is not a list (already per-worker), or (b) all list-valued
gen_kwargs have length == 1 (already reduced by the builder), or (c) an explicit
shard marker/key is present in gen_kwargs; otherwise perform the split as
before. Ensure the behavior is documented in a comment so Generator subclasses
know splitting is conditional.
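A sketch of such a guard, with an illustrative condition (assuming _split_gen_kwargs takes a max_num_jobs argument as in the sharding utilities referenced above):

```python
from datasets.utils.sharding import _number_of_shards_in_gen_kwargs, _split_gen_kwargs


def _iter_shard_kwargs(gen_kwargs):
    # Skip internal splitting when the base builder already handed us a single shard:
    # no list-valued kwargs remain, or every list has at most one element.
    already_split = all(
        not isinstance(value, list) or len(value) <= 1 for value in gen_kwargs.values()
    )
    if already_split:
        yield gen_kwargs
        return
    num_shards = _number_of_shards_in_gen_kwargs(gen_kwargs)
    yield from _split_gen_kwargs(gen_kwargs, max_num_jobs=num_shards)
```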
d2c1fce to d1d14bd
…ace#7877) (huggingface#7891)

* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir

  Enhanced the _TempCacheDir.__init__ method to properly respect and handle the TMPDIR environment variable when creating temporary cache directories.

  Changes:
  - Add TMPDIR environment variable detection and validation
  - Normalise paths to handle path resolution issues
  - Auto-create TMPDIR directory if it doesn't exist to prevent silent fallback to default temporary directory
  - Validate that TMPDIR is actually a directory before use
  - Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even if tempfile.gettempdir() was already called and cached
  - Add appropriate logging for directory creation and fallback scenarios

  This ensures that when TMPDIR is set, the temporary cache files are created in the specified directory rather than silently falling back to the system default temporary directory.

* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir

  Add test coverage for the improved TMPDIR environment variable handling in the _TempCacheDir class. These tests verify the various scenarios for TMPDIR usage and error handling.

  Changes:
  - Refactor test_fingerprint_in_multiprocessing to use Pool.map for cleaner test implementation
  - Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR auto-creation when directory doesn't exist
  - Add test_temp_cache_dir_with_tmpdir_existing to verify correct behaviour when TMPDIR exists and is valid
  - Add test_temp_cache_dir_without_tmpdir to verify fallback to default temporary directory when TMPDIR is not set
  - Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful error handling and fallback when TMPDIR creation fails

  These tests ensure that the TMPDIR improvements work correctly across all scenarios and edge cases, including proper logging and fallback behaviour.

* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir

  Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit error conditions instead of fallback behaviour.

  Changes:
  - Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir directly and assert that an OSError is raised with a clear TMPDIR context when directory creation fails
  - Introduce test_temp_cache_dir_tmpdir_not_directory to verify that pointing TMPDIR at a non-directory raises an OSError with an informative error message

  These tests better match the intended contract of _TempCacheDir by ensuring invalid TMPDIR configurations fail loudly with descriptive messages rather than silently falling back.

* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly

  Tighten TMPDIR handling in _TempCacheDir so that invalid configurations raise clear errors instead of silently falling back to the default temporary directory.

  Changes:
  - When TMPDIR points to a non-existent directory, raise an OSError with explicit guidance to create it manually or unset TMPDIR
  - When TMPDIR points to a non-directory path, raise an OSError with guidance to point TMPDIR to a writable directory or unset it
  - Remove previous warning-and-fallback behaviour to avoid masking configuration issues

  This ensures that TMPDIR misconfigurations are surfaced early and clearly, aligning runtime behaviour with the stricter expectations codified in the new tests.

* Update src/datasets/fingerprint.py
* Update fingerprint.py
* Fix formatting of TMPDIR retrieval line

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily
* update docstring for nifti functionality
* fix(nifti): enable lazy loading for Nifti1ImageWrapper
  - Change dataobj initialization from get_fdata() to .dataobj
  - Preserves nibabel's ArrayProxy for lazy loading
  - Prevents OOM errors on large 4D fMRI files
  - Defers I/O errors to usage time for better error handling

  Resolves #1

* chore: trigger CI
…gingface#7906) don't save original_shard_lengths by default for backward compat
d1d14bd to 4074164
Summary

- Change Nifti1ImageWrapper initialization to use .dataobj instead of .get_fdata()
- Preserve nibabel's ArrayProxy for lazy loading instead of eagerly loading data

Problem

The current implementation calls get_fdata() during wrapper initialization, which eagerly loads the full volume and can fail with EOFError or run out of memory on large files.

Solution

Initialize the wrapper from the image's .dataobj (the ArrayProxy) so no voxel data is read up front.

Benefits

- Data is only loaded when .get_fdata() is called

Test plan

- Added test_nifti_lazy_loading to verify the ArrayProxy is preserved

Resolves #1
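As a rough illustration of the change (not the wrapper's exact code), using nibabel's public API:

```python
import nibabel as nib


def make_lazy_copy(nifti_image: nib.Nifti1Image) -> nib.Nifti1Image:
    # Reuse the existing ArrayProxy (nifti_image.dataobj) instead of calling get_fdata(),
    # so no voxel data is read until the caller actually asks for it.
    return nib.Nifti1Image(
        nifti_image.dataobj,
        affine=nifti_image.affine,
        header=nifti_image.header,
    )
```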
Summary by CodeRabbit
Release Notes