
fix(nifti): Enable lazy loading for Nifti1ImageWrapper #3

Closed
The-Obstacle-Is-The-Way wants to merge 7 commits into main from feat/nifti-lazy-loading

Conversation

@The-Obstacle-Is-The-Way
Owner

The-Obstacle-Is-The-Way commented Nov 29, 2025

Summary

  • Change Nifti1ImageWrapper initialization to use .dataobj instead of .get_fdata()
  • Preserve nibabel's ArrayProxy for lazy loading instead of eagerly loading data

Problem

The current implementation calls get_fdata() during wrapper initialization, which causes:

  1. Memory issues: entire 4D fMRI files are loaded into memory immediately
  2. Poor error handling: corrupted files crash at access time with an opaque EOFError
  3. No recovery: iteration over the entire dataset fails on one bad file

Solution

# Before (eager - loads immediately)
dataobj=nifti_image.get_fdata()

# After (lazy - loads on demand)
dataobj=nifti_image.dataobj
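
For context, a minimal sketch of how the wrapper initialization might look after this change; the forwarded attributes (affine, header, extra, file_map) follow the review discussion and may differ from the actual Nifti1ImageWrapper code:

import nibabel as nib

class Nifti1ImageWrapper(nib.Nifti1Image):
    def __init__(self, nifti_image: nib.Nifti1Image):
        super().__init__(
            dataobj=nifti_image.dataobj,  # lazy ArrayProxy instead of a fully loaded ndarray
            affine=nifti_image.affine,
            header=nifti_image.header,
            extra=nifti_image.extra,
            file_map=nifti_image.file_map,
        )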

Benefits

  • Memory efficiency: large files are only loaded when .get_fdata() is called
  • Better errors: decode errors happen at usage time with clear context
  • Graceful handling: users can catch errors in processing loops, as sketched below
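
To illustrate the graceful-handling point, a processing loop can defer loading and catch decode failures per item; the dataset, column name, and process() helper here are hypothetical:

for example in dataset:
    img = example["nifti"]  # decoded image backed by a lazy ArrayProxy
    try:
        data = img.get_fdata()  # file I/O and decoding happen here, not at load time
    except (OSError, EOFError) as err:
        print(f"Skipping corrupted NIfTI file: {err}")
        continue
    process(data)  # hypothetical downstream processing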

Test plan

  • Added test_nifti_lazy_loading to verify ArrayProxy is preserved (a rough sketch follows below)
  • All 22 existing Nifti tests pass
  • Verified with real OpenNeuro BIDS data (ds000102)
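
A rough sketch of what such a test could look like; the real test_nifti_lazy_loading in tests/features/test_nifti.py may differ, and the nifti_file fixture and import path are assumptions:

from nibabel.arrayproxy import ArrayProxy

from datasets.features import Nifti  # assumed import path

def test_nifti_lazy_loading(nifti_file):
    feature = Nifti()
    decoded = feature.decode_example(feature.encode_example(nifti_file))
    assert isinstance(decoded.dataobj, ArrayProxy)  # data is still a lazy proxy
    data = decoded.get_fdata()  # loading happens only on demand
    assert data.shape == decoded.shape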

Resolves #1

Summary by CodeRabbit

Release Notes

  • New Features
    • NIfTI files now support lazy data loading for improved performance with large datasets
    • Original data types are preserved during file initialization


@coderabbitai

coderabbitai bot commented Nov 29, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 34 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between d1d14bd and 4074164.

📒 Files selected for processing (10)
  • setup.py (1 hunks)
  • src/datasets/__init__.py (1 hunks)
  • src/datasets/builder.py (2 hunks)
  • src/datasets/config.py (1 hunks)
  • src/datasets/features/features.py (3 hunks)
  • src/datasets/features/nifti.py (3 hunks)
  • src/datasets/fingerprint.py (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/test_fingerprint.py (1 hunks)

Walkthrough

Modified Nifti1ImageWrapper initialization to use dataobj attribute directly instead of calling get_fdata(), preserving lazy loading behavior. Added test verifying that decoded NIfTI files maintain ArrayProxy lazy-loading semantics and can be accessed via get_fdata().

Changes

Cohort / File(s) — Change Summary

  • NIfTI Lazy Loading — src/datasets/features/nifti.py: Changed Nifti1ImageWrapper.__init__ to initialize with nifti_image.dataobj instead of nifti_image.get_fdata(), preserving the underlying ArrayProxy for lazy loading
  • NIfTI Lazy Loading Test — tests/features/test_nifti.py: Added test_nifti_lazy_loading() to verify that decoded NIfTI dataobj remains an ArrayProxy (lazy) and can be accessed via get_fdata()

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Single-line logic change in initialization
  • Straightforward test addition following existing patterns
  • Low risk: minimal behavioral surface area, targeted to one wrapper class

Poem

🐰 Hoppy news, dear data friends!
No more eager loads that never end,
Lazy proxies, light and free,
ArrayProxy as nature meant to be!
Big files dance with memory cheer,

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 13.75%, which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: The title 'fix(nifti): Enable lazy loading for Nifti1ImageWrapper' accurately and concisely summarizes the main change: enabling lazy loading for the Nifti1ImageWrapper class by switching from eager get_fdata() to lazy dataobj initialization.
  • Linked Issues check — ✅ Passed: The pull request fully addresses all coding requirements from issue #1: it replaces get_fdata() with nifti_image.dataobj to enable lazy loading, preserves ArrayProxy, reduces memory usage, defers decoding errors to usage time, maintains all constructor arguments, and adds test verification.
  • Out of Scope Changes check — ✅ Passed: All changes are directly in scope: the nifti.py modification implements the exact solution proposed in issue #1, and the test addition verifies the lazy loading behavior without introducing unrelated functionality.


@The-Obstacle-Is-The-Way
Owner Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Nov 29, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

The-Obstacle-Is-The-Way deleted the feat/nifti-lazy-loading branch on November 29, 2025 at 14:56
The-Obstacle-Is-The-Way restored the feat/nifti-lazy-loading branch on November 29, 2025 at 14:58
coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/datasets/packaged_modules/text/text.py (1)

104-109: Inconsistent key type at line 109.

Lines 86 and 104 correctly yield Key(file_idx, batch_idx), but line 109 still yields a plain tuple (file_idx, batch_idx). This inconsistency will cause type mismatches in downstream processing.

Apply this diff to fix the inconsistency:

                 if batch:
                     pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-                    yield (file_idx, batch_idx), self._cast_table(pa_table)
+                    yield Key(file_idx, batch_idx), self._cast_table(pa_table)
src/datasets/iterable_dataset.py (1)

419-425: Fix undefined attribute in ShuffledDataSourcesArrowExamplesIterable.shift_rngs

shift_rngs currently references self.generate_examples_fn, which doesn’t exist on ShuffledDataSourcesArrowExamplesIterable (it has generate_tables_fn via ArrowExamplesIterable). This will raise an AttributeError when shift_rngs is called.

Suggest using generate_tables_fn instead:

     def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
         new_seed = self.generator.bit_generator.state["state"]["state"] + value
         return ShuffledDataSourcesArrowExamplesIterable(
-            self.generate_examples_fn,
-            self.kwargs,
-            np.random.default_rng(seed=new_seed),
-        )
+            self.generate_tables_fn,
+            self.kwargs,
+            np.random.default_rng(seed=new_seed),
+        )
src/datasets/splits.py (1)

558-567: from_split_dict can crash when called with dict[str, SplitInfo]

from_split_dict converts a dict to list(split_infos.values()), but still assumes the resulting items are dict by calling .get("dataset_name") on split_infos[0]. If the values are SplitInfo instances, this will raise AttributeError.

Consider making dataset name extraction robust to both dict and SplitInfo inputs, e.g.:

-        if dataset_name is None:
-            dataset_name = split_infos[0].get("dataset_name") if split_infos else None
+        if dataset_name is None and split_infos:
+            first = split_infos[0]
+            if isinstance(first, dict):
+                dataset_name = first.get("dataset_name")
+            elif isinstance(first, SplitInfo):
+                dataset_name = first.dataset_name
src/datasets/builder.py (2)

1535-1588: original_shard_lengths accumulation raises IndexError for non-zero or non-contiguous Key.original_shard_id

The code at lines 1575–1578 (and identical logic at 1855–1858) assumes original_shard_id is always contiguous and starts at 0:

if len(original_shard_lengths) == original_shard_id:
    original_shard_lengths.append(1)
else:
    original_shard_lengths[original_shard_id] += 1

If the first Key has original_shard_id > 0 or IDs arrive out of order (e.g., 3 then 0), the condition is false and original_shard_lengths[original_shard_id] raises IndexError. The Key class has no documented constraints enforcing contiguity or ordering.

Recommend either:

  • Document that original_shard_id must start at 0 and remain contiguous/non-decreasing per job, or
  • Harden the code to handle arbitrary non-negative IDs using extend() to pre-allocate slots as needed.

Given this is new behavior that can crash custom builders adopting Key, hardening is strongly recommended.
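
One possible shape for the hardening option, shown only as an illustration (not the committed code):

# Grow the list to cover any non-negative original_shard_id before counting.
if original_shard_id >= len(original_shard_lengths):
    original_shard_lengths.extend([0] * (original_shard_id + 1 - len(original_shard_lengths)))
original_shard_lengths[original_shard_id] += 1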


1807-1868: Fix original_shard_lengths list bounds checking in both _prepare_split_single paths

The Arrow and generator-based paths both access original_shard_lengths[original_shard_id] without ensuring the list is large enough. If the first Key yielded has original_shard_id > 0, an IndexError occurs.

Lines 1855-1858 (Arrow path) and the equivalent in the generator path should grow the list to at least original_shard_id + 1 before incrementing, or enforce a documented contract that original_shard_id values are 0-based and contiguous within each job.

🧹 Nitpick comments (15)
src/datasets/packaged_modules/xml/xml.py (1)

61-68: Composite (file_idx, 0) key matches the (shard, batch) pattern; consider using Key for uniformity.

Yielding (file_idx, 0) gives XML tables a consistent two-part key (file, batch) like other loaders. For longer-term consistency with JSON and other packaged modules, you might eventually wrap this in the shared Key type as well, but that’s an optional cleanup rather than a blocker.

tests/test_arrow_writer.py (1)

136-147: Misleading test name after key removal.

The test is named test_write_with_keys but no longer uses keys in the writer.write() calls (lines 142-143). This makes the test name misleading.

Consider renaming the test to reflect its actual purpose or removing it if it duplicates test_write:

-def test_write_with_keys(writer_batch_size):
+def test_write_without_keys(writer_batch_size):
     output = pa.BufferOutputStream()
     with ArrowWriter(
         stream=output,

Alternatively, if this test is now redundant with test_write (lines 99-111), consider removing it entirely.

src/datasets/info.py (1)

121-123: Align splits docstring with SplitDict type

The splits field is now correctly typed as Optional[SplitDict] and __post_init__ already normalizes dict-like inputs via SplitDict.from_split_dict, which is good.

The attribute docs still describe splits as a dict, though. Consider updating that to mention SplitDict (or “SplitDict-like mapping”) to avoid confusion for users relying on type hints/docstrings.

Also applies to: 150-153, 167-185

src/datasets/iterable_dataset.py (2)

84-89: Key alias looks good; update nearby docstrings to match expanded key types

The new Key union and BuilderKey import align this module with the structured key class, but several docstrings still say keys are only (int/str, dict), e.g. in __iter__ of _BaseExamplesIterable and in _convert_to_arrow’s Args section. To avoid confusion now that keys can be int | str | tuple[int, int] | BuilderKey, consider updating those docstrings accordingly.


257-262: Iterator-based type hints for generators are consistent with usage

Changing generate_examples_fn / generate_tables_fn annotations to return Iterator[tuple[Key, ...]] matches how these call sites consume them (via direct iteration and islice). No functional issues spotted; just ensure any externally typed generators are declared as returning Iterator[...] (or a subtype) so type-checkers stay happy.

Also applies to: 295-300, 343-347, 410-415
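
For example, an externally defined generator could be annotated like this (names are illustrative):

from collections.abc import Iterator

from datasets.builder import Key

def generate_examples(files: list[str]) -> Iterator[tuple[Key, dict]]:
    for file_idx, path in enumerate(files):
        yield Key(file_idx, 0), {"path": path}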

src/datasets/arrow_dataset.py (1)

6340-6340: add_item signature change looks correct; consider documenting new_fingerprint.

Appending new_fingerprint: Optional[str] = None is backward compatible for positional callers and aligns add_item with other @fingerprint_transform methods that accept an explicit fingerprint override. To keep the public API clear, you may want to add an Args: entry for new_fingerprint in the docstring similar to other transforms.

src/datasets/splits.py (1)

587-596: YAML export intentionally drops shard-length metadata

_to_yaml_list now removes both shard_lengths and original_shard_lengths before serializing. This is fine if these are considered derived/large metadata, but it means YAML round‑trips won’t preserve them.

If consumers rely on YAML as a full-fidelity representation of SplitInfo, consider documenting this lossy behavior or, if needed, gating the pops behind a flag so they can be retained when required.

src/datasets/builder.py (8)

1320-1340: Clarify Key usage and update _generate_examples documentation

You’ve introduced Key and updated _generate_examples to return Iterator[tuple[Key, dict[str, Any]]], but the docstring still documents key as str/int. Also, the implementation gracefully handles legacy keys (isinstance(key, Key) guard), so the runtime contract is broader than the type hint.

Consider:

  • Updating the docstring to mention Key and how original_shard_id/item_or_batch_id are expected to be used.
  • Widening the return type hint to include legacy key types, e.g. Iterator[tuple[Union[Key, int, str], dict[str, Any]]], if you intend to support them.

This will make the new API surface easier to adopt for custom builders.


1387-1399: num_proc vs num_original_shards logic looks correct but relies on _number_of_shards_in_gen_kwargs contract

The num_proc adjustment based on num_original_shards (warnings + clamping) is reasonable and prevents over‑parallelizing when there are too few shards. The behavior depends entirely on _number_of_shards_in_gen_kwargs correctly reflecting “original shard” count.

Just a note: if _number_of_shards_in_gen_kwargs ever returns 0 for some edge builder, this will silently force num_proc to 1 and skip the warning. If that’s not expected, you may want an explicit assertion/log for num_original_shards == 0.


1418-1435: Single-process _prepare_split_single now returns extra metadata; type hints should reflect 7-tuple

In the single-process branch you now wrap a 7‑element result (..., shard_lengths, num_original_shards, original_shard_lengths) into lists:

(
    examples_per_job,
    bytes_per_job,
    features_per_job,
    shards_per_job,
    shard_lengths_per_job,
    original_shards_per_job,
    original_shard_lengths_per_job,
) = ([item] for item in result)

But _prepare_split_single’s return type is still Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]], which only accounts for six inner items and doesn’t mention the list type of original_shard_lengths.

Recommend updating the annotation to match reality (and mirroring the change on the Arrow path), e.g.:

-> Iterator[
    tuple[int, bool, tuple[int, int, Features, int, list[int], int, list[int]]]
]

(or similar structure you prefer).


1477-1522: original_shard_lengths propagation looks good, but only set when more than one original shard

After aggregation:

total_original_shards = sum(original_shards_per_job)
...
if total_original_shards > 1:
    split_generator.split_info.original_shard_lengths = [
        original_shard_length
        for original_shard_lengths in original_shard_lengths_per_job
        for original_shard_length in original_shard_lengths
    ]

This means single‑shard splits (or builders that don’t emit Key) will have original_shard_lengths = None, which seems intentional and keeps SplitInfo small. Just ensure downstream code treats None as “single-shard or unknown” rather than “no data.”

If consumers are going to rely on original_shard_lengths being present whenever num_examples > 0, consider setting it even when total_original_shards == 1.


1627-1645: Align _generate_tables docstring and type with new Key usage

_generate_tables is now typed as:

def _generate_tables(self, **kwargs) -> Iterator[tuple[Key, pa.Table]]:

and _prepare_split_single checks isinstance(key, Key). However, the docstring still says:

key: tuple[int, int] original_shard_id and table_idx within that shard

and existing builders may still yield plain integer or tuple keys.

To avoid confusion and ease migration:

  • Update the docstring to reference Key and the recommended semantics.
  • Either widen the type hint to include legacy key types (as with _generate_examples) or explicitly document that new implementations should return Key.

Runtime behavior is already backward compatible due to the isinstance guard, so this is mostly about clarity.


1667-1678: Same num_proc vs num_original_shards logic for Arrow path; consider shared helper

ArrowBasedBuilder _prepare_split duplicates the num_proc clamping based on num_original_shards. Behavior is consistent with the GeneratorBasedBuilder path, which is good.

As this logic evolves (e.g., if _number_of_shards_in_gen_kwargs changes), you may want to factor it into a small helper to keep the two code paths in sync and avoid subtle drift.
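
A possible shape for such a helper, sketched from the behavior described above; the name, warning text, and use of the module-level logger are illustrative:

from typing import Optional

def _clamp_num_proc(num_proc: Optional[int], num_original_shards: int) -> Optional[int]:
    """Clamp num_proc to the number of original shards, warning when it shrinks."""
    if num_proc and num_original_shards and num_proc > num_original_shards:
        logger.warning(
            f"Reducing num_proc from {num_proc} to {num_original_shards}: "
            f"only {num_original_shards} original shards are available."
        )
        return num_original_shards
    return num_proc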


1705-1713: Arrow _prepare_split_single type hints diverge from actual return shape

As with the generator-based path, the Arrow-based _prepare_split_single now returns 7 inner values, with original_shard_lengths being a list:

(
    total_num_examples,
    total_num_bytes,
    writer._features,
    num_shards,
    shard_lengths,
    num_original_shards,
    original_shard_lengths,
)

but the annotation remains:

-> Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]]

Recommend updating the type hint to match, mirroring whatever you choose on the GeneratorBasedBuilder side.


949-949: Wrapped OSError message still preserves original error but drops traceback

In _download_and_prepare, you now rewrap OSError from _prepare_split as:

except OSError as e:
    raise OSError("Cannot find data file. " + "\nOriginal error:\n" + str(e)) from None

This improves the user-facing message but suppresses the original traceback (from None). If some failures are more complex than “cannot find data file”, this may hide useful debugging context.

If you want clearer messaging without losing traceback, consider either:

  • Re-raising with from e to preserve the chain, or
  • Restricting this wrapping to known “missing file” patterns instead of all OSError.
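
A sketch of the first option, keeping the existing message but preserving the exception chain (surrounding call shown for context only):

try:
    self._prepare_split(split_generator, **prepare_split_kwargs)
except OSError as e:
    # `from e` keeps the original traceback attached for debugging.
    raise OSError("Cannot find data file. " + "\nOriginal error:\n" + str(e)) from e
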
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 872490c and 0eab74c.

📒 Files selected for processing (30)
  • src/datasets/arrow_dataset.py (2 hunks)
  • src/datasets/arrow_writer.py (2 hunks)
  • src/datasets/builder.py (28 hunks)
  • src/datasets/features/nifti.py (1 hunks)
  • src/datasets/info.py (1 hunks)
  • src/datasets/inspect.py (0 hunks)
  • src/datasets/iterable_dataset.py (5 hunks)
  • src/datasets/keyhash.py (0 hunks)
  • src/datasets/naming.py (1 hunks)
  • src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
  • src/datasets/packaged_modules/cache/cache.py (2 hunks)
  • src/datasets/packaged_modules/csv/csv.py (2 hunks)
  • src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
  • src/datasets/packaged_modules/generator/generator.py (2 hunks)
  • src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
  • src/datasets/packaged_modules/json/json.py (3 hunks)
  • src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
  • src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
  • src/datasets/packaged_modules/spark/spark.py (1 hunks)
  • src/datasets/packaged_modules/sql/sql.py (2 hunks)
  • src/datasets/packaged_modules/text/text.py (4 hunks)
  • src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
  • src/datasets/packaged_modules/xml/xml.py (1 hunks)
  • src/datasets/splits.py (3 hunks)
  • src/datasets/utils/info_utils.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/packaged_modules/test_spark.py (2 hunks)
  • tests/test_arrow_dataset.py (1 hunks)
  • tests/test_arrow_writer.py (1 hunks)
  • tests/test_builder.py (14 hunks)
💤 Files with no reviewable changes (2)
  • src/datasets/keyhash.py
  • src/datasets/inspect.py
🧰 Additional context used
🧬 Code graph analysis (18)
tests/test_builder.py (1)
src/datasets/builder.py (2)
  • Key (1321-1326)
  • cache_dir (598-599)
tests/features/test_nifti.py (2)
tests/utils.py (1)
  • require_nibabel (226-235)
src/datasets/features/nifti.py (3)
  • Nifti (64-300)
  • encode_example (110-148)
  • decode_example (150-211)
src/datasets/packaged_modules/generator/generator.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/utils/sharding.py (2)
  • _number_of_shards_in_gen_kwargs (4-18)
  • _split_gen_kwargs (48-64)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (5)
  • Dataset (704-6467)
  • from_dict (974-1035)
  • add_column (6078-6129)
  • features (206-207)
  • features (781-785)
src/datasets/info.py (1)
src/datasets/splits.py (1)
  • SplitDict (520-600)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/csv/csv.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
  • write (582-598)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/text/text.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/json/json.py (1)
  • _cast_table (88-111)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/table.py (1)
  • cast_table_to_features (2179-2198)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/generator/generator.py (1)
  • Generator (25-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (30)
src/datasets/utils/info_utils.py (1)

32-32: Clarified ALL_CHECKS description looks accurate and concise

The updated description correctly communicates that ALL_CHECKS runs both split checks and downloaded-file validity checks (counts, checksums) in one line, and the table formatting remains intact. No further changes needed here.

src/datasets/features/nifti.py (1)

28-36: Lazy-loading change looks good; please verify remote xopen use cases.

Switching to dataobj=nifti_image.dataobj avoids eager reads and keeps affine/header/extra/file_map/dtype in sync, so local and in-memory paths should now be lazily loaded as intended. One thing to double-check is the remote path in decode_example that does with xopen(path, "rb") as f: nifti = nib.load(f): previously the eager get_fdata() ran while f was still open, whereas now .get_fdata() may be called only after f is closed. It’s worth adding/ran an integration test that decodes a NIfTI via a remote xopen URL and calls get_fdata() to ensure nibabel’s ArrayProxy still works in that scenario.

src/datasets/packaged_modules/spark/spark.py (1)

49-76: Composite (part_id, row_id) keys are reasonable; confirm downstream expectations.

Emitting (part_id, row_id) instead of a formatted string keeps partition and row indices explicit and in sync with the state_dict logic. Since this changes the key type returned by the Spark examples iterable, ensure all consumers/tests that previously assumed string keys are updated or documented accordingly.

src/datasets/packaged_modules/json/json.py (1)

12-12: Adopting Key for JSON table identifiers is consistent with the new key model.

Importing Key and yielding Key(file_idx, batch_idx_or_0) for both single-object and line-based JSON files aligns this module with the repo-wide Key-based indexing. This should work seamlessly anywhere keys are only stringified, but be aware this is a behavioral change for callers that previously assumed plain tuples/ints; update any such usages/tests accordingly.

Also applies to: 125-125, 190-193

tests/features/test_nifti.py (1)

133-149: Lazy-loading test is well targeted and validates the new behavior.

This test nicely checks that decode_example preserves a proxy dataobj and that get_fdata() still yields the expected shape, so it should catch regressions where the wrapper accidentally materializes data eagerly again.

src/datasets/naming.py (1)

70-84: Single-shard filenames without -of-00001 are clearer.

Requiring len(shard_lengths) > 1 before adding the -SSSSS-of-NNNNN suffix makes single-shard splits use the plain base filename, which is more intuitive and avoids confusing “of 1” artifacts. This looks correct and backward-compatible aside from the intended naming change.

tests/test_arrow_dataset.py (1)

4787-4795: New test_add_column gives a clear smoke test for the public API.

Using from datasets import Dataset here ensures the top-level Dataset alias plus add_column behave as expected end-to-end, and the assertions on features and row contents are sufficient. Given the more detailed test_dataset_add_column above, this is a lightweight but useful complementary check.

src/datasets/packaged_modules/text/text.py (3)

9-9: LGTM: Key import added.

The import is correctly added to support the Key-based refactoring.


86-86: LGTM: Key-based yield for line-based sampling.

The change from tuple to Key object is consistent with the broader refactoring.


113-113: LGTM: Key-based yield for document-based sampling.

The change from tuple to Key object is correct.

src/datasets/packaged_modules/cache/cache.py (2)

14-14: LGTM: Key import added.

Import is correctly added for the Key-based refactoring.


189-189: LGTM: Migration from string to Key object.

The change from f"{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) provides better structure and type safety.

src/datasets/packaged_modules/sql/sql.py (2)

10-10: LGTM: Key import added.

Import correctly added for Key-based refactoring.


120-120: LGTM: Key-based yield with appropriate shard_id.

Using Key(0, chunk_idx) is correct for SQL where there's a single logical data source. The 0 represents the single shard, and chunk_idx represents the batch within that shard.

src/datasets/packaged_modules/arrow/arrow.py (2)

8-8: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


77-77: LGTM: Migration from string to Key object.

The change from string-based key to Key(file_idx, batch_idx) provides better type safety and structured identification.

src/datasets/packaged_modules/parquet/parquet.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


182-182: LGTM: Migration from string to Key object.

The change from "{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) aligns with the broader Key-based refactoring across packaged modules.

src/datasets/packaged_modules/pandas/pandas.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


66-66: LGTM: Key-based yield with appropriate batch_id.

Using Key(i, 0) is correct since each pandas pickle file is read as a single batch. The 0 batch_id indicates this is the only batch from that file.

src/datasets/packaged_modules/csv/csv.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


196-196: LGTM: Migration from tuple to Key object.

The change from (file_idx, batch_idx) tuple to Key(file_idx, batch_idx) provides structured key representation consistent with other packaged modules.

tests/packaged_modules/test_spark.py (1)

23-29: Tuple-based row identifiers look consistent with iterator behavior

Using (part_id, row_idx) in _get_expected_row_ids_and_row_dicts_for_partition_order and asserting row_key == (0, i) in test_spark_examples_iterable keeps test expectations aligned with the new composite key semantics. No functional issues spotted here.

Also applies to: 71-80

src/datasets/packaged_modules/hdf5/hdf5.py (1)

8-10: Key-based shard/batch identifiers for HDF5 tables look correct

Using Key(file_idx, batch_idx) in _generate_tables is consistent with the intended “(original_shard_id, item_or_batch_id)” semantics: file_idx identifies the source HDF5 file, and batch_idx identifies the batch within that file. The surrounding length checks and empty-file skips are unchanged, so behavior should stay correct while enabling richer split metadata.

Also applies to: 75-95

tests/test_builder.py (3)

19-26: Dummy builders updated to Key-based identifiers are consistent with new sharding model

All dummy generator/arrow-based builders in this file now yield Key(shard_idx, item_idx) (or (0, i) where only a trivial shard dimension is needed). This matches the new composite key abstraction in datasets.builder.Key and lets the core builder logic compute shard and original-shard metadata correctly while leaving payloads unchanged.

No issues spotted with these test helpers; they look well-aligned with the library changes.

Also applies to: 64-86, 88-99, 107-119, 146-157, 159-171, 172-191, 193-212


766-773: Extended cache-dir tests for data_dir and configured builders look correct

The new expectations around cache_dir show that:

  • data_dir is part of the cache-dir fingerprint even when config_name is set, and
  • configured builders created via configure_builder_class produce stable cache dirs for identical hash/config combos.

These tests are valuable to prevent regressions in cache-dir computation; the logic they assert is coherent with how cache isolation is typically handled.

Also applies to: 775-781


1030-1039: Shard and original-shard length assertions give good coverage

The added assertions on:

  • shard_lengths == [25] * 4 and original_shard_lengths is None for pure writer-based sharding,
  • shard_lengths == [200, 200] and original_shard_lengths == [100] * 4 for generator-based multi-proc sharding, and
  • analogous expectations for arrow-based builders,

provide solid regression coverage for the new split metadata behavior. They clearly distinguish “logical shards” from “original shards” and validate that multi-proc jobs are merged as intended.

No issues; these tests are a good addition.

Also applies to: 1081-1088, 1124-1132, 1173-1180

src/datasets/packaged_modules/webdataset/webdataset.py (1)

11-13: Key-based example identifiers for WebDataset are reasonable

Switching _generate_examples to yield Key(tar_idx, example_idx) instead of a string key keeps the observable key structure consistent with other packaged modules while preserving the existing example contents and feature handling. Using tar_idx as the shard id and example_idx as the per-tar index is a sensible choice.

No functional concerns here.

Also applies to: 110-130

src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)

16-18: Folder-based builder Key usage is coherent across metadata and non-metadata paths

In _generate_examples:

  • When add_metadata is true, each metadata file becomes a shard (shard_idx) and each metadata row a sample_idx, yielding Key(shard_idx, sample_idx), sample after applying _nested_apply.
  • Without metadata, each (original_file, downloaded_file_or_dir) pair is treated as a shard, and files within it are indexed by sample_idx, again yielding Key(shard_idx, sample_idx), sample.

This is internally consistent and should integrate well with the new shard/original-shard accounting, while leaving label inference and filtering behavior unchanged.

Also applies to: 357-391, 392-412

src/datasets/arrow_writer.py (1)

19-20: Typing imports are consistent with updated hints

Importing Any and Optional matches the new type annotations added below; nothing to change here.

Comment on lines 6079 to 6084
self,
name: str,
column: Union[list, np.ndarray],
new_fingerprint: Optional[str] = None,
feature: Optional[FeatureType] = None,
):

⚠️ Potential issue | 🟠 Major

Avoid breaking positional callers by reordering new_fingerprint after feature.

Adding new_fingerprint is correct for fingerprint_transform, but inserting it before feature changes the positional parameter order. Existing code that calls add_column(name, column, feature) will now pass feature into new_fingerprint and leave feature as None, silently misbehaving.

To preserve backward compatibility, keep feature as the third positional argument and add new_fingerprint at the end:

-    def add_column(
-        self,
-        name: str,
-        column: Union[list, np.ndarray],
-        new_fingerprint: Optional[str] = None,
-        feature: Optional[FeatureType] = None,
-    ):
+    def add_column(
+        self,
+        name: str,
+        column: Union[list, np.ndarray],
+        feature: Optional[FeatureType] = None,
+        new_fingerprint: Optional[str] = None,
+    ):

It would also be good to extend the docstring’s Args: section to document new_fingerprint for consistency with other fingerprinted transforms.

🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around lines 6079-6084, the new_fingerprint
parameter was inserted before feature which changes the positional-argument
order and breaks callers; restore backward compatibility by moving feature back
to be the third positional parameter and place new_fingerprint as an optional
fourth (last) parameter, update the function signature accordingly, and add
new_fingerprint to the method docstring Args: section to document its purpose
and type consistent with other fingerprinted transforms.

Comment on lines 582 to 599
def write(
self,
example: dict[str, Any],
key: Optional[Union[str, int, bytes]] = None,
writer_batch_size: Optional[int] = None,
):
"""Add a given (Example,Key) pair to the write-pool of examples which is written to file.

Args:
example: the Example to add.
key: Optional, a unique identifier(str, int or bytes) associated with each example
"""
# Utilize the keys and duplicate checking when `self._check_duplicates` is passed True
if self._check_duplicates:
# Create unique hash from key and store as (key, example) pairs
hash = self._hasher.hash(key)
self.current_examples.append((example, hash))
# Maintain record of keys and their respective hashes for checking duplicates
self.hkey_record.append((hash, key))
else:
# Store example as a tuple so as to keep the structure of `self.current_examples` uniform
self.current_examples.append((example, ""))
# Store example as a tuple so as to keep the structure of `self.current_examples` uniform
self.current_examples.append((example, ""))

if writer_batch_size is None:
writer_batch_size = self.writer_batch_size
if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
if self._check_duplicates:
self.check_duplicate_keys()
# Re-initializing to empty list for next batch
self.hkey_record = []

self.write_examples_on_file()


⚠️ Potential issue | 🟡 Minor

Align ArrowWriter.write docstring and consider simplifying current_examples structure

The implementation now only accepts example (no key) and always appends (example, "") to current_examples, but the docstring still says “(Example,Key) pair”. That’s misleading.

You could:

  • Update the docstring to describe just example being buffered, and
  • (Optionally) in a follow-up, simplify current_examples to list[dict[str, Any]] and adjust write_examples_on_file to drop the unused second tuple element, since keys are no longer part of this API.
-        """Add a given (Example,Key) pair to the write-pool of examples which is written to file.
+        """Add a given example to the write-pool of examples which is written to file.
@@
-            example: the Example to add.
+            example: the example to add.
🤖 Prompt for AI Agents
In src/datasets/arrow_writer.py around lines 582 to 599, the write() docstring
still states it accepts an (Example,Key) pair while the implementation only
accepts a single example and appends (example, "") to current_examples; update
the docstring to describe that write(example: dict[str, Any], writer_batch_size:
Optional[int] = None) buffers a single example (no key) and triggers a flush
when the batch size is reached, and optionally add a TODO note suggesting future
simplification of current_examples to a list of examples and corresponding
adjustments in write_examples_on_file to remove the unused tuple element.

Comment on lines 5 to 7
from datasets.builder import Key
from datasets.utils.sharding import _number_of_shards_in_gen_kwargs, _split_gen_kwargs


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg "_number_of_shards_in_gen_kwargs|_split_gen_kwargs" -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3491


🏁 Script executed:

sed -n '1,50p' src/datasets/packaged_modules/generator/generator.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 1721


🏁 Script executed:

rg "class GeneratorBasedBuilder" -A 20 -B 2

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 1582


🏁 Script executed:

sed -n '1380,1450p' src/datasets/builder.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3528


🏁 Script executed:

sed -n '1660,1730p' src/datasets/builder.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3540


Confirm interaction with base sharding to avoid double-splitting gen_kwargs

This implementation cleanly derives a shard dimension from list-valued gen_kwargs and exposes (shard_idx, sample_idx) via Key, which fits the new composite-key model.

However, there is a real risk of unintended double-splitting: when num_proc > 1, DatasetBuilder._prepare_split splits gen_kwargs across jobs at lines 1440 and 1718, then passes each shard to _generate_examples in a worker. If the Generator subclass also applies _split_gen_kwargs in its _generate_examples (lines 35–36), the gen_kwargs are split twice—once at the job level by the base builder, and again within each worker. This cascade can distort shard metadata and make the composite Key semantics ambiguous.

Either clarify that Generator should only apply internal sharding when used single-process, or ensure _generate_examples detects and skips splitting if the base builder already split the inputs.

🤖 Prompt for AI Agents
In src/datasets/packaged_modules/generator/generator.py around lines 5–7 and the
_generate_examples region (lines ~35–36), avoid double-splitting gen_kwargs by
detecting when the base DatasetBuilder already sharded inputs: only call
_split_gen_kwargs when gen_kwargs clearly represents full (unsplit) list-valued
inputs. Concretely, add a guard that skips _split_gen_kwargs if either (a) any
gen_kwargs value is not a list (already per-worker), or (b) all list-valued
gen_kwargs have length == 1 (already reduced by the builder), or (c) an explicit
shard marker/key is present in gen_kwargs; otherwise perform the split as
before. Ensure the behavior is documented in a comment so Generator subclasses
know splitting is conditional.
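
A hypothetical version of guard (b) from the prompt above; the helper names come from datasets.utils.sharding, but the guard itself is illustrative rather than the committed fix:

# Skip internal splitting when the builder already reduced gen_kwargs for this worker.
list_lengths = [len(v) for v in gen_kwargs.values() if isinstance(v, list)]
already_split = all(length <= 1 for length in list_lengths)
if already_split:
    shard_kwargs_list = [gen_kwargs]
else:
    num_shards = _number_of_shards_in_gen_kwargs(gen_kwargs)
    shard_kwargs_list = _split_gen_kwargs(gen_kwargs, max_num_jobs=num_shards)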

The-Obstacle-Is-The-Way force-pushed the feat/nifti-lazy-loading branch 2 times, most recently from d2c1fce to d1d14bd on December 9, 2025 at 15:37
ada-ggf25 and others added 7 commits December 16, 2025 15:20
…ace#7877) (huggingface#7891)

* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir

Enhanced the _TempCacheDir.__init__ method to properly respect and handle
the TMPDIR environment variable when creating temporary cache directories.

Changes:
- Add TMPDIR environment variable detection and validation
- Normalise paths to handle path resolution issues
- Auto-create TMPDIR directory if it doesn't exist to prevent silent
  fallback to default temporary directory
- Validate that TMPDIR is actually a directory before use
- Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even
  if tempfile.gettempdir() was already called and cached
- Add appropriate logging for directory creation and fallback scenarios

This ensures that when TMPDIR is set, the temporary cache files are
created in the specified directory rather than silently falling back to
the system default temporary directory.

* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir

Add test coverage for the improved TMPDIR environment variable handling
in the _TempCacheDir class. These tests verify the various scenarios
for TMPDIR usage and error handling.

Changes:
- Refactor test_fingerprint_in_multiprocessing to use Pool.map for
  cleaner test implementation
- Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR
  auto-creation when directory doesn't exist
- Add test_temp_cache_dir_with_tmpdir_existing to verify correct
  behaviour when TMPDIR exists and is valid
- Add test_temp_cache_dir_without_tmpdir to verify fallback to default
  temporary directory when TMPDIR is not set
- Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful
  error handling and fallback when TMPDIR creation fails

These tests ensure that the TMPDIR improvements work correctly across
all scenarios and edge cases, including proper logging and fallback
behaviour.

* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir

Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit
error conditions instead of fallback behaviour.

Changes:
- Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir
  directly and assert that an OSError is raised with a clear TMPDIR
  context when directory creation fails
- Introduce test_temp_cache_dir_tmpdir_not_directory to verify that
  pointing TMPDIR at a non-directory raises an OSError with an
  informative error message

These tests better match the intended contract of _TempCacheDir by
ensuring invalid TMPDIR configurations fail loudly with descriptive
messages rather than silently falling back.

* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly

Tighten TMPDIR handling in _TempCacheDir so that invalid configurations
raise clear errors instead of silently falling back to the default
temporary directory.

Changes:
- When TMPDIR points to a non-existent directory, raise an OSError with
  explicit guidance to create it manually or unset TMPDIR
- When TMPDIR points to a non-directory path, raise an OSError with
  guidance to point TMPDIR to a writable directory or unset it
- Remove previous warning-and-fallback behaviour to avoid masking
  configuration issues

This ensures that TMPDIR misconfigurations are surfaced early and
clearly, aligning runtime behaviour with the stricter expectations
codified in the new tests.

* Update src/datasets/fingerprint.py

* Update fingerprint.py

* Fix formatting of TMPDIR retrieval line

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily

* update docstring for nifti functionality
)

* fix(nifti): enable lazy loading for Nifti1ImageWrapper

- Change dataobj initialization from get_fdata() to .dataobj
- Preserves nibabel's ArrayProxy for lazy loading
- Prevents OOM errors on large 4D fMRI files
- Defers I/O errors to usage time for better error handling

Resolves #1

* chore: trigger CI
…gingface#7906)

don't save original_shard_lengths by default for backward compat


Development

Successfully merging this pull request may close these issues.

Nifti feature eagerly loads data - limited error handling for corrupted files

4 participants