
fix(nifti): Enable lazy loading for Nifti1ImageWrapper #3

Closed
The-Obstacle-Is-The-Way wants to merge 7 commits into main from feat/nifti-lazy-loading

Conversation

@The-Obstacle-Is-The-Way
Owner

The-Obstacle-Is-The-Way commented Nov 29, 2025

Summary

  • Change Nifti1ImageWrapper initialization to use .dataobj instead of .get_fdata()
  • Preserve nibabel's ArrayProxy for lazy loading instead of eagerly loading data

Problem

The current implementation calls get_fdata() during wrapper initialization, which causes:

  1. Memory issues: entire 4D fMRI files are loaded into memory immediately
  2. Poor error handling: corrupted files crash at access time with an opaque EOFError
  3. No recovery: iteration over the entire dataset fails on one bad file

Solution

# Before (eager - loads immediately)
dataobj=nifti_image.get_fdata()

# After (lazy - loads on demand)
dataobj=nifti_image.dataobj
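
For context, a minimal sketch of how the wrapper initialization might look after this change; the forwarded attributes (affine, header, extra, file_map) follow the review discussion and may differ from the actual Nifti1ImageWrapper code:

import nibabel as nib

class Nifti1ImageWrapper(nib.Nifti1Image):
    def __init__(self, nifti_image: nib.Nifti1Image):
        super().__init__(
            dataobj=nifti_image.dataobj,  # lazy ArrayProxy instead of a fully loaded ndarray
            affine=nifti_image.affine,
            header=nifti_image.header,
            extra=nifti_image.extra,
            file_map=nifti_image.file_map,
        )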

Benefits

  • Memory efficiency: large files are only loaded when .get_fdata() is called
  • Better errors: decode errors happen at usage time with clear context
  • Graceful handling: users can catch errors in processing loops, as sketched below
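
To illustrate the graceful-handling point, a processing loop can defer loading and catch decode failures per item; the dataset, column name, and process() helper here are hypothetical:

for example in dataset:
    img = example["nifti"]  # decoded image backed by a lazy ArrayProxy
    try:
        data = img.get_fdata()  # file I/O and decoding happen here, not at load time
    except (OSError, EOFError) as err:
        print(f"Skipping corrupted NIfTI file: {err}")
        continue
    process(data)  # hypothetical downstream processing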

Test plan

  • Added test_nifti_lazy_loading to verify ArrayProxy is preserved (a rough sketch follows below)
  • All 22 existing Nifti tests pass
  • Verified with real OpenNeuro BIDS data (ds000102)
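
A rough sketch of what such a test could look like; the real test_nifti_lazy_loading in tests/features/test_nifti.py may differ, and the nifti_file fixture and import path are assumptions:

from nibabel.arrayproxy import ArrayProxy

from datasets.features import Nifti  # assumed import path

def test_nifti_lazy_loading(nifti_file):
    feature = Nifti()
    decoded = feature.decode_example(feature.encode_example(nifti_file))
    assert isinstance(decoded.dataobj, ArrayProxy)  # data is still a lazy proxy
    data = decoded.get_fdata()  # loading happens only on demand
    assert data.shape == decoded.shape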

Resolves #1

Summary by CodeRabbit

Release Notes

  • New Features
    • NIfTI files now support lazy data loading for improved performance with large datasets
    • Original data types are preserved during file initialization


@coderabbitai

coderabbitai bot commented Nov 29, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 34 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between d1d14bd and 4074164.

📒 Files selected for processing (10)
  • setup.py (1 hunks)
  • src/datasets/__init__.py (1 hunks)
  • src/datasets/builder.py (2 hunks)
  • src/datasets/config.py (1 hunks)
  • src/datasets/features/features.py (3 hunks)
  • src/datasets/features/nifti.py (3 hunks)
  • src/datasets/fingerprint.py (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/test_fingerprint.py (1 hunks)

Walkthrough

Modified Nifti1ImageWrapper initialization to use dataobj attribute directly instead of calling get_fdata(), preserving lazy loading behavior. Added test verifying that decoded NIfTI files maintain ArrayProxy lazy-loading semantics and can be accessed via get_fdata().

Changes

Cohort / File(s) — Change Summary

  • NIfTI Lazy Loading — src/datasets/features/nifti.py: Changed Nifti1ImageWrapper.__init__ to initialize with nifti_image.dataobj instead of nifti_image.get_fdata(), preserving the underlying ArrayProxy for lazy loading
  • NIfTI Lazy Loading Test — tests/features/test_nifti.py: Added test_nifti_lazy_loading() to verify that decoded NIfTI dataobj remains an ArrayProxy (lazy) and can be accessed via get_fdata()

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Single-line logic change in initialization
  • Straightforward test addition following existing patterns
  • Low risk: minimal behavioral surface area, targeted to one wrapper class

Poem

🐰 Hoppy news, dear data friends!
No more eager loads that never end,
Lazy proxies, light and free,
ArrayProxy as nature meant to be!
Big files dance with memory cheer,

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 13.75%, which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: The title 'fix(nifti): Enable lazy loading for Nifti1ImageWrapper' accurately and concisely summarizes the main change: enabling lazy loading for the Nifti1ImageWrapper class by switching from eager get_fdata() to lazy dataobj initialization.
  • Linked Issues check — ✅ Passed: The pull request fully addresses all coding requirements from issue #1: it replaces get_fdata() with nifti_image.dataobj to enable lazy loading, preserves ArrayProxy, reduces memory usage, defers decoding errors to usage time, maintains all constructor arguments, and adds test verification.
  • Out of Scope Changes check — ✅ Passed: All changes are directly in scope: the nifti.py modification implements the exact solution proposed in issue #1, and the test addition verifies the lazy loading behavior without introducing unrelated functionality.


@The-Obstacle-Is-The-Way
Owner Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Nov 29, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

The-Obstacle-Is-The-Way deleted the feat/nifti-lazy-loading branch on November 29, 2025 at 14:56
The-Obstacle-Is-The-Way restored the feat/nifti-lazy-loading branch on November 29, 2025 at 14:58
coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/datasets/packaged_modules/text/text.py (1)

104-109: Inconsistent key type at line 109.

Lines 86 and 104 correctly yield Key(file_idx, batch_idx), but line 109 still yields a plain tuple (file_idx, batch_idx). This inconsistency will cause type mismatches in downstream processing.

Apply this diff to fix the inconsistency:

                 if batch:
                     pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-                    yield (file_idx, batch_idx), self._cast_table(pa_table)
+                    yield Key(file_idx, batch_idx), self._cast_table(pa_table)
src/datasets/iterable_dataset.py (1)

419-425: Fix undefined attribute in ShuffledDataSourcesArrowExamplesIterable.shift_rngs

shift_rngs currently references self.generate_examples_fn, which doesn’t exist on ShuffledDataSourcesArrowExamplesIterable (it has generate_tables_fn via ArrowExamplesIterable). This will raise an AttributeError when shift_rngs is called.

Suggest using generate_tables_fn instead:

     def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
         new_seed = self.generator.bit_generator.state["state"]["state"] + value
         return ShuffledDataSourcesArrowExamplesIterable(
-            self.generate_examples_fn,
-            self.kwargs,
-            np.random.default_rng(seed=new_seed),
-        )
+            self.generate_tables_fn,
+            self.kwargs,
+            np.random.default_rng(seed=new_seed),
+        )
src/datasets/splits.py (1)

558-567: from_split_dict can crash when called with dict[str, SplitInfo]

from_split_dict converts a dict to list(split_infos.values()), but still assumes the resulting items are dict by calling .get("dataset_name") on split_infos[0]. If the values are SplitInfo instances, this will raise AttributeError.

Consider making dataset name extraction robust to both dict and SplitInfo inputs, e.g.:

-        if dataset_name is None:
-            dataset_name = split_infos[0].get("dataset_name") if split_infos else None
+        if dataset_name is None and split_infos:
+            first = split_infos[0]
+            if isinstance(first, dict):
+                dataset_name = first.get("dataset_name")
+            elif isinstance(first, SplitInfo):
+                dataset_name = first.dataset_name
src/datasets/builder.py (2)

1535-1588: original_shard_lengths accumulation raises IndexError for non-zero or non-contiguous Key.original_shard_id

The code at lines 1575–1578 (and identical logic at 1855–1858) assumes original_shard_id is always contiguous and starts at 0:

if len(original_shard_lengths) == original_shard_id:
    original_shard_lengths.append(1)
else:
    original_shard_lengths[original_shard_id] += 1

If the first Key has original_shard_id > 0 or IDs arrive out of order (e.g., 3 then 0), the condition is false and original_shard_lengths[original_shard_id] raises IndexError. The Key class has no documented constraints enforcing contiguity or ordering.

Recommend either:

  • Document that original_shard_id must start at 0 and remain contiguous/non-decreasing per job, or
  • Harden the code to handle arbitrary non-negative IDs using extend() to pre-allocate slots as needed.

Given this is new behavior that can crash custom builders adopting Key, hardening is strongly recommended.
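
One possible shape for the hardening option, shown only as an illustration (not the committed code):

# Grow the list to cover any non-negative original_shard_id before counting.
if original_shard_id >= len(original_shard_lengths):
    original_shard_lengths.extend([0] * (original_shard_id + 1 - len(original_shard_lengths)))
original_shard_lengths[original_shard_id] += 1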


1807-1868: Fix original_shard_lengths list bounds checking in both _prepare_split_single paths

The Arrow and generator-based paths both access original_shard_lengths[original_shard_id] without ensuring the list is large enough. If the first Key yielded has original_shard_id > 0, an IndexError occurs.

Lines 1855-1858 (Arrow path) and the equivalent in the generator path should grow the list to at least original_shard_id + 1 before incrementing, or enforce a documented contract that original_shard_id values are 0-based and contiguous within each job.

🧹 Nitpick comments (15)
src/datasets/packaged_modules/xml/xml.py (1)

61-68: Composite (file_idx, 0) key matches the (shard, batch) pattern; consider using Key for uniformity.

Yielding (file_idx, 0) gives XML tables a consistent two-part key (file, batch) like other loaders. For longer-term consistency with JSON and other packaged modules, you might eventually wrap this in the shared Key type as well, but that’s an optional cleanup rather than a blocker.

tests/test_arrow_writer.py (1)

136-147: Misleading test name after key removal.

The test is named test_write_with_keys but no longer uses keys in the writer.write() calls (lines 142-143). This makes the test name misleading.

Consider renaming the test to reflect its actual purpose or removing it if it duplicates test_write:

-def test_write_with_keys(writer_batch_size):
+def test_write_without_keys(writer_batch_size):
     output = pa.BufferOutputStream()
     with ArrowWriter(
         stream=output,

Alternatively, if this test is now redundant with test_write (lines 99-111), consider removing it entirely.

src/datasets/info.py (1)

121-123: Align splits docstring with SplitDict type

The splits field is now correctly typed as Optional[SplitDict] and __post_init__ already normalizes dict-like inputs via SplitDict.from_split_dict, which is good.

The attribute docs still describe splits as a dict, though. Consider updating that to mention SplitDict (or “SplitDict-like mapping”) to avoid confusion for users relying on type hints/docstrings.

Also applies to: 150-153, 167-185

src/datasets/iterable_dataset.py (2)

84-89: Key alias looks good; update nearby docstrings to match expanded key types

The new Key union and BuilderKey import align this module with the structured key class, but several docstrings still say keys are only (int/str, dict), e.g. in __iter__ of _BaseExamplesIterable and in _convert_to_arrow’s Args section. To avoid confusion now that keys can be int | str | tuple[int, int] | BuilderKey, consider updating those docstrings accordingly.


257-262: Iterator-based type hints for generators are consistent with usage

Changing generate_examples_fn / generate_tables_fn annotations to return Iterator[tuple[Key, ...]] matches how these call sites consume them (via direct iteration and islice). No functional issues spotted; just ensure any externally typed generators are declared as returning Iterator[...] (or a subtype) so type-checkers stay happy.

Also applies to: 295-300, 343-347, 410-415
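
For example, an externally defined generator could be annotated like this (names are illustrative):

from collections.abc import Iterator

from datasets.builder import Key

def generate_examples(files: list[str]) -> Iterator[tuple[Key, dict]]:
    for file_idx, path in enumerate(files):
        yield Key(file_idx, 0), {"path": path}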

src/datasets/arrow_dataset.py (1)

6340-6340: add_item signature change looks correct; consider documenting new_fingerprint.

Appending new_fingerprint: Optional[str] = None is backward compatible for positional callers and aligns add_item with other @fingerprint_transform methods that accept an explicit fingerprint override. To keep the public API clear, you may want to add an Args: entry for new_fingerprint in the docstring similar to other transforms.

src/datasets/splits.py (1)

587-596: YAML export intentionally drops shard-length metadata

_to_yaml_list now removes both shard_lengths and original_shard_lengths before serializing. This is fine if these are considered derived/large metadata, but it means YAML round‑trips won’t preserve them.

If consumers rely on YAML as a full-fidelity representation of SplitInfo, consider documenting this lossy behavior or, if needed, gating the pops behind a flag so they can be retained when required.

src/datasets/builder.py (8)

1320-1340: Clarify Key usage and update _generate_examples documentation

You’ve introduced Key and updated _generate_examples to return Iterator[tuple[Key, dict[str, Any]]], but the docstring still documents key as str/int. Also, the implementation gracefully handles legacy keys (isinstance(key, Key) guard), so the runtime contract is broader than the type hint.

Consider:

  • Updating the docstring to mention Key and how original_shard_id/item_or_batch_id are expected to be used.
  • Widening the return type hint to include legacy key types, e.g. Iterator[tuple[Union[Key, int, str], dict[str, Any]]], if you intend to support them.

This will make the new API surface easier to adopt for custom builders.


1387-1399: num_proc vs num_original_shards logic looks correct but relies on _number_of_shards_in_gen_kwargs contract

The num_proc adjustment based on num_original_shards (warnings + clamping) is reasonable and prevents over‑parallelizing when there are too few shards. The behavior depends entirely on _number_of_shards_in_gen_kwargs correctly reflecting “original shard” count.

Just a note: if _number_of_shards_in_gen_kwargs ever returns 0 for some edge builder, this will silently force num_proc to 1 and skip the warning. If that’s not expected, you may want an explicit assertion/log for num_original_shards == 0.


1418-1435: Single-process _prepare_split_single now returns extra metadata; type hints should reflect 7-tuple

In the single-process branch you now wrap a 7‑element result (..., shard_lengths, num_original_shards, original_shard_lengths) into lists:

(
    examples_per_job,
    bytes_per_job,
    features_per_job,
    shards_per_job,
    shard_lengths_per_job,
    original_shards_per_job,
    original_shard_lengths_per_job,
) = ([item] for item in result)

But _prepare_split_single’s return type is still Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]], which only accounts for six inner items and doesn’t mention the list type of original_shard_lengths.

Recommend updating the annotation to match reality (and mirroring the change on the Arrow path), e.g.:

-> Iterator[
    tuple[int, bool, tuple[int, int, Features, int, list[int], int, list[int]]]
]

(or similar structure you prefer).


1477-1522: original_shard_lengths propagation looks good, but only set when more than one original shard

After aggregation:

total_original_shards = sum(original_shards_per_job)
...
if total_original_shards > 1:
    split_generator.split_info.original_shard_lengths = [
        original_shard_length
        for original_shard_lengths in original_shard_lengths_per_job
        for original_shard_length in original_shard_lengths
    ]

This means single‑shard splits (or builders that don’t emit Key) will have original_shard_lengths = None, which seems intentional and keeps SplitInfo small. Just ensure downstream code treats None as “single-shard or unknown” rather than “no data.”

If consumers are going to rely on original_shard_lengths being present whenever num_examples > 0, consider setting it even when total_original_shards == 1.


1627-1645: Align _generate_tables docstring and type with new Key usage

_generate_tables is now typed as:

def _generate_tables(self, **kwargs) -> Iterator[tuple[Key, pa.Table]]:

and _prepare_split_single checks isinstance(key, Key). However, the docstring still says:

key: tuple[int, int] original_shard_id and table_idx within that shard

and existing builders may still yield plain integer or tuple keys.

To avoid confusion and ease migration:

  • Update the docstring to reference Key and the recommended semantics.
  • Either widen the type hint to include legacy key types (as with _generate_examples) or explicitly document that new implementations should return Key.

Runtime behavior is already backward compatible due to the isinstance guard, so this is mostly about clarity.


1667-1678: Same num_proc vs num_original_shards logic for Arrow path; consider shared helper

ArrowBasedBuilder _prepare_split duplicates the num_proc clamping based on num_original_shards. Behavior is consistent with the GeneratorBasedBuilder path, which is good.

As this logic evolves (e.g., if _number_of_shards_in_gen_kwargs changes), you may want to factor it into a small helper to keep the two code paths in sync and avoid subtle drift.
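
A possible shape for such a helper, sketched from the behavior described above; the name, warning text, and use of the module-level logger are illustrative:

from typing import Optional

def _clamp_num_proc(num_proc: Optional[int], num_original_shards: int) -> Optional[int]:
    """Clamp num_proc to the number of original shards, warning when it shrinks."""
    if num_proc and num_original_shards and num_proc > num_original_shards:
        logger.warning(
            f"Reducing num_proc from {num_proc} to {num_original_shards}: "
            f"only {num_original_shards} original shards are available."
        )
        return num_original_shards
    return num_proc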


1705-1713: Arrow _prepare_split_single type hints diverge from actual return shape

As with the generator-based path, the Arrow-based _prepare_split_single now returns 7 inner values, with original_shard_lengths being a list:

(
    total_num_examples,
    total_num_bytes,
    writer._features,
    num_shards,
    shard_lengths,
    num_original_shards,
    original_shard_lengths,
)

but the annotation remains:

-> Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]]

Recommend updating the type hint to match, mirroring whatever you choose on the GeneratorBasedBuilder side.


949-949: Wrapped OSError message still preserves original error but drops traceback

In _download_and_prepare, you now rewrap OSError from _prepare_split as:

except OSError as e:
    raise OSError("Cannot find data file. " + "\nOriginal error:\n" + str(e)) from None

This improves the user-facing message but suppresses the original traceback (from None). If some failures are more complex than “cannot find data file”, this may hide useful debugging context.

If you want clearer messaging without losing traceback, consider either:

  • Re-raising with from e to preserve the chain, or
  • Restricting this wrapping to known “missing file” patterns instead of all OSError.
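
A sketch of the first option, keeping the existing message but preserving the exception chain (surrounding call shown for context only):

try:
    self._prepare_split(split_generator, **prepare_split_kwargs)
except OSError as e:
    # `from e` keeps the original traceback attached for debugging.
    raise OSError("Cannot find data file. " + "\nOriginal error:\n" + str(e)) from e
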
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 872490c and 0eab74c.

📒 Files selected for processing (30)
  • src/datasets/arrow_dataset.py (2 hunks)
  • src/datasets/arrow_writer.py (2 hunks)
  • src/datasets/builder.py (28 hunks)
  • src/datasets/features/nifti.py (1 hunks)
  • src/datasets/info.py (1 hunks)
  • src/datasets/inspect.py (0 hunks)
  • src/datasets/iterable_dataset.py (5 hunks)
  • src/datasets/keyhash.py (0 hunks)
  • src/datasets/naming.py (1 hunks)
  • src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
  • src/datasets/packaged_modules/cache/cache.py (2 hunks)
  • src/datasets/packaged_modules/csv/csv.py (2 hunks)
  • src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
  • src/datasets/packaged_modules/generator/generator.py (2 hunks)
  • src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
  • src/datasets/packaged_modules/json/json.py (3 hunks)
  • src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
  • src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
  • src/datasets/packaged_modules/spark/spark.py (1 hunks)
  • src/datasets/packaged_modules/sql/sql.py (2 hunks)
  • src/datasets/packaged_modules/text/text.py (4 hunks)
  • src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
  • src/datasets/packaged_modules/xml/xml.py (1 hunks)
  • src/datasets/splits.py (3 hunks)
  • src/datasets/utils/info_utils.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/packaged_modules/test_spark.py (2 hunks)
  • tests/test_arrow_dataset.py (1 hunks)
  • tests/test_arrow_writer.py (1 hunks)
  • tests/test_builder.py (14 hunks)
💤 Files with no reviewable changes (2)
  • src/datasets/keyhash.py
  • src/datasets/inspect.py
🧰 Additional context used
🧬 Code graph analysis (18)
tests/test_builder.py (1)
src/datasets/builder.py (2)
  • Key (1321-1326)
  • cache_dir (598-599)
tests/features/test_nifti.py (2)
tests/utils.py (1)
  • require_nibabel (226-235)
src/datasets/features/nifti.py (3)
  • Nifti (64-300)
  • encode_example (110-148)
  • decode_example (150-211)
src/datasets/packaged_modules/generator/generator.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/utils/sharding.py (2)
  • _number_of_shards_in_gen_kwargs (4-18)
  • _split_gen_kwargs (48-64)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (5)
  • Dataset (704-6467)
  • from_dict (974-1035)
  • add_column (6078-6129)
  • features (206-207)
  • features (781-785)
src/datasets/info.py (1)
src/datasets/splits.py (1)
  • SplitDict (520-600)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/csv/csv.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
  • write (582-598)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/text/text.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/json/json.py (1)
  • _cast_table (88-111)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/table.py (1)
  • cast_table_to_features (2179-2198)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/generator/generator.py (1)
  • Generator (25-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (30)
src/datasets/utils/info_utils.py (1)

32-32: Clarified ALL_CHECKS description looks accurate and concise

The updated description correctly communicates that ALL_CHECKS runs both split checks and downloaded-file validity checks (counts, checksums) in one line, and the table formatting remains intact. No further changes needed here.

src/datasets/features/nifti.py (1)

28-36: Lazy-loading change looks good; please verify remote xopen use cases.

Switching to dataobj=nifti_image.dataobj avoids eager reads and keeps affine/header/extra/file_map/dtype in sync, so local and in-memory paths should now be lazily loaded as intended. One thing to double-check is the remote path in decode_example that does with xopen(path, "rb") as f: nifti = nib.load(f): previously the eager get_fdata() ran while f was still open, whereas now .get_fdata() may be called only after f is closed. It’s worth adding/ran an integration test that decodes a NIfTI via a remote xopen URL and calls get_fdata() to ensure nibabel’s ArrayProxy still works in that scenario.

src/datasets/packaged_modules/spark/spark.py (1)

49-76: Composite (part_id, row_id) keys are reasonable; confirm downstream expectations.

Emitting (part_id, row_id) instead of a formatted string keeps partition and row indices explicit and in sync with the state_dict logic. Since this changes the key type returned by the Spark examples iterable, ensure all consumers/tests that previously assumed string keys are updated or documented accordingly.

src/datasets/packaged_modules/json/json.py (1)

12-12: Adopting Key for JSON table identifiers is consistent with the new key model.

Importing Key and yielding Key(file_idx, batch_idx_or_0) for both single-object and line-based JSON files aligns this module with the repo-wide Key-based indexing. This should work seamlessly anywhere keys are only stringified, but be aware this is a behavioral change for callers that previously assumed plain tuples/ints; update any such usages/tests accordingly.

Also applies to: 125-125, 190-193

tests/features/test_nifti.py (1)

133-149: Lazy-loading test is well targeted and validates the new behavior.

This test nicely checks that decode_example preserves a proxy dataobj and that get_fdata() still yields the expected shape, so it should catch regressions where the wrapper accidentally materializes data eagerly again.

src/datasets/naming.py (1)

70-84: Single-shard filenames without -of-00001 are clearer.

Requiring len(shard_lengths) > 1 before adding the -SSSSS-of-NNNNN suffix makes single-shard splits use the plain base filename, which is more intuitive and avoids confusing “of 1” artifacts. This looks correct and backward-compatible aside from the intended naming change.

tests/test_arrow_dataset.py (1)

4787-4795: New test_add_column gives a clear smoke test for the public API.

Using from datasets import Dataset here ensures the top-level Dataset alias plus add_column behave as expected end-to-end, and the assertions on features and row contents are sufficient. Given the more detailed test_dataset_add_column above, this is a lightweight but useful complementary check.

src/datasets/packaged_modules/text/text.py (3)

9-9: LGTM: Key import added.

The import is correctly added to support the Key-based refactoring.


86-86: LGTM: Key-based yield for line-based sampling.

The change from tuple to Key object is consistent with the broader refactoring.


113-113: LGTM: Key-based yield for document-based sampling.

The change from tuple to Key object is correct.

src/datasets/packaged_modules/cache/cache.py (2)

14-14: LGTM: Key import added.

Import is correctly added for the Key-based refactoring.


189-189: LGTM: Migration from string to Key object.

The change from f"{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) provides better structure and type safety.

src/datasets/packaged_modules/sql/sql.py (2)

10-10: LGTM: Key import added.

Import correctly added for Key-based refactoring.


120-120: LGTM: Key-based yield with appropriate shard_id.

Using Key(0, chunk_idx) is correct for SQL where there's a single logical data source. The 0 represents the single shard, and chunk_idx represents the batch within that shard.

src/datasets/packaged_modules/arrow/arrow.py (2)

8-8: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


77-77: LGTM: Migration from string to Key object.

The change from string-based key to Key(file_idx, batch_idx) provides better type safety and structured identification.

src/datasets/packaged_modules/parquet/parquet.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


182-182: LGTM: Migration from string to Key object.

The change from "{file_idx}_{batch_idx}" to Key(file_idx, batch_idx) aligns with the broader Key-based refactoring across packaged modules.

src/datasets/packaged_modules/pandas/pandas.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


66-66: LGTM: Key-based yield with appropriate batch_id.

Using Key(i, 0) is correct since each pandas pickle file is read as a single batch. The 0 batch_id indicates this is the only batch from that file.

src/datasets/packaged_modules/csv/csv.py (2)

10-10: LGTM: Key import added.

Import correctly added for the Key-based refactoring.


196-196: LGTM: Migration from tuple to Key object.

The change from (file_idx, batch_idx) tuple to Key(file_idx, batch_idx) provides structured key representation consistent with other packaged modules.

tests/packaged_modules/test_spark.py (1)

23-29: Tuple-based row identifiers look consistent with iterator behavior

Using (part_id, row_idx) in _get_expected_row_ids_and_row_dicts_for_partition_order and asserting row_key == (0, i) in test_spark_examples_iterable keeps test expectations aligned with the new composite key semantics. No functional issues spotted here.

Also applies to: 71-80

src/datasets/packaged_modules/hdf5/hdf5.py (1)

8-10: Key-based shard/batch identifiers for HDF5 tables look correct

Using Key(file_idx, batch_idx) in _generate_tables is consistent with the intended “(original_shard_id, item_or_batch_id)” semantics: file_idx identifies the source HDF5 file, and batch_idx identifies the batch within that file. The surrounding length checks and empty-file skips are unchanged, so behavior should stay correct while enabling richer split metadata.

Also applies to: 75-95

tests/test_builder.py (3)

19-26: Dummy builders updated to Key-based identifiers are consistent with new sharding model

All dummy generator/arrow-based builders in this file now yield Key(shard_idx, item_idx) (or (0, i) where only a trivial shard dimension is needed). This matches the new composite key abstraction in datasets.builder.Key and lets the core builder logic compute shard and original-shard metadata correctly while leaving payloads unchanged.

No issues spotted with these test helpers; they look well-aligned with the library changes.

Also applies to: 64-86, 88-99, 107-119, 146-157, 159-171, 172-191, 193-212


766-773: Extended cache-dir tests for data_dir and configured builders look correct

The new expectations around cache_dir show that:

  • data_dir is part of the cache-dir fingerprint even when config_name is set, and
  • configured builders created via configure_builder_class produce stable cache dirs for identical hash/config combos.

These tests are valuable to prevent regressions in cache-dir computation; the logic they assert is coherent with how cache isolation is typically handled.

Also applies to: 775-781


1030-1039: Shard and original-shard length assertions give good coverage

The added assertions on:

  • shard_lengths == [25] * 4 and original_shard_lengths is None for pure writer-based sharding,
  • shard_lengths == [200, 200] and original_shard_lengths == [100] * 4 for generator-based multi-proc sharding, and
  • analogous expectations for arrow-based builders,

provide solid regression coverage for the new split metadata behavior. They clearly distinguish “logical shards” from “original shards” and validate that multi-proc jobs are merged as intended.

No issues; these tests are a good addition.

Also applies to: 1081-1088, 1124-1132, 1173-1180

src/datasets/packaged_modules/webdataset/webdataset.py (1)

11-13: Key-based example identifiers for WebDataset are reasonable

Switching _generate_examples to yield Key(tar_idx, example_idx) instead of a string key keeps the observable key structure consistent with other packaged modules while preserving the existing example contents and feature handling. Using tar_idx as the shard id and example_idx as the per-tar index is a sensible choice.

No functional concerns here.

Also applies to: 110-130

src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (1)

16-18: Folder-based builder Key usage is coherent across metadata and non-metadata paths

In _generate_examples:

  • When add_metadata is true, each metadata file becomes a shard (shard_idx) and each metadata row a sample_idx, yielding Key(shard_idx, sample_idx), sample after applying _nested_apply.
  • Without metadata, each (original_file, downloaded_file_or_dir) pair is treated as a shard, and files within it are indexed by sample_idx, again yielding Key(shard_idx, sample_idx), sample.

This is internally consistent and should integrate well with the new shard/original-shard accounting, while leaving label inference and filtering behavior unchanged.

Also applies to: 357-391, 392-412

src/datasets/arrow_writer.py (1)

19-20: Typing imports are consistent with updated hints

Importing Any and Optional matches the new type annotations added below; nothing to change here.

Comment on lines 6079 to 6084
self,
name: str,
column: Union[list, np.ndarray],
new_fingerprint: Optional[str] = None,
feature: Optional[FeatureType] = None,
):

⚠️ Potential issue | 🟠 Major

Avoid breaking positional callers by reordering new_fingerprint after feature.

Adding new_fingerprint is correct for fingerprint_transform, but inserting it before feature changes the positional parameter order. Existing code that calls add_column(name, column, feature) will now pass feature into new_fingerprint and leave feature as None, silently misbehaving.

To preserve backward compatibility, keep feature as the third positional argument and add new_fingerprint at the end:

-    def add_column(
-        self,
-        name: str,
-        column: Union[list, np.ndarray],
-        new_fingerprint: Optional[str] = None,
-        feature: Optional[FeatureType] = None,
-    ):
+    def add_column(
+        self,
+        name: str,
+        column: Union[list, np.ndarray],
+        feature: Optional[FeatureType] = None,
+        new_fingerprint: Optional[str] = None,
+    ):

It would also be good to extend the docstring’s Args: section to document new_fingerprint for consistency with other fingerprinted transforms.

🤖 Prompt for AI Agents
In src/datasets/arrow_dataset.py around lines 6079-6084, the new_fingerprint
parameter was inserted before feature which changes the positional-argument
order and breaks callers; restore backward compatibility by moving feature back
to be the third positional parameter and place new_fingerprint as an optional
fourth (last) parameter, update the function signature accordingly, and add
new_fingerprint to the method docstring Args: section to document its purpose
and type consistent with other fingerprinted transforms.

Comment on lines 582 to 599
def write(
self,
example: dict[str, Any],
key: Optional[Union[str, int, bytes]] = None,
writer_batch_size: Optional[int] = None,
):
"""Add a given (Example,Key) pair to the write-pool of examples which is written to file.

Args:
example: the Example to add.
key: Optional, a unique identifier(str, int or bytes) associated with each example
"""
# Utilize the keys and duplicate checking when `self._check_duplicates` is passed True
if self._check_duplicates:
# Create unique hash from key and store as (key, example) pairs
hash = self._hasher.hash(key)
self.current_examples.append((example, hash))
# Maintain record of keys and their respective hashes for checking duplicates
self.hkey_record.append((hash, key))
else:
# Store example as a tuple so as to keep the structure of `self.current_examples` uniform
self.current_examples.append((example, ""))
# Store example as a tuple so as to keep the structure of `self.current_examples` uniform
self.current_examples.append((example, ""))

if writer_batch_size is None:
writer_batch_size = self.writer_batch_size
if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
if self._check_duplicates:
self.check_duplicate_keys()
# Re-initializing to empty list for next batch
self.hkey_record = []

self.write_examples_on_file()


⚠️ Potential issue | 🟡 Minor

Align ArrowWriter.write docstring and consider simplifying current_examples structure

The implementation now only accepts example (no key) and always appends (example, "") to current_examples, but the docstring still says “(Example,Key) pair”. That’s misleading.

You could:

  • Update the docstring to describe just example being buffered, and
  • (Optionally) in a follow-up, simplify current_examples to list[dict[str, Any]] and adjust write_examples_on_file to drop the unused second tuple element, since keys are no longer part of this API.
-        """Add a given (Example,Key) pair to the write-pool of examples which is written to file.
+        """Add a given example to the write-pool of examples which is written to file.
@@
-            example: the Example to add.
+            example: the example to add.
🤖 Prompt for AI Agents
In src/datasets/arrow_writer.py around lines 582 to 599, the write() docstring
still states it accepts an (Example,Key) pair while the implementation only
accepts a single example and appends (example, "") to current_examples; update
the docstring to describe that write(example: dict[str, Any], writer_batch_size:
Optional[int] = None) buffers a single example (no key) and triggers a flush
when the batch size is reached, and optionally add a TODO note suggesting future
simplification of current_examples to a list of examples and corresponding
adjustments in write_examples_on_file to remove the unused tuple element.

Comment on lines 5 to 7
from datasets.builder import Key
from datasets.utils.sharding import _number_of_shards_in_gen_kwargs, _split_gen_kwargs


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg "_number_of_shards_in_gen_kwargs|_split_gen_kwargs" -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3491


🏁 Script executed:

sed -n '1,50p' src/datasets/packaged_modules/generator/generator.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 1721


🏁 Script executed:

rg "class GeneratorBasedBuilder" -A 20 -B 2

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 1582


🏁 Script executed:

sed -n '1380,1450p' src/datasets/builder.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3528


🏁 Script executed:

sed -n '1660,1730p' src/datasets/builder.py | cat -n

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 3540


Confirm interaction with base sharding to avoid double-splitting gen_kwargs

This implementation cleanly derives a shard dimension from list-valued gen_kwargs and exposes (shard_idx, sample_idx) via Key, which fits the new composite-key model.

However, there is a real risk of unintended double-splitting: when num_proc > 1, DatasetBuilder._prepare_split splits gen_kwargs across jobs at lines 1440 and 1718, then passes each shard to _generate_examples in a worker. If the Generator subclass also applies _split_gen_kwargs in its _generate_examples (lines 35–36), the gen_kwargs are split twice—once at the job level by the base builder, and again within each worker. This cascade can distort shard metadata and make the composite Key semantics ambiguous.

Either clarify that Generator should only apply internal sharding when used single-process, or ensure _generate_examples detects and skips splitting if the base builder already split the inputs.

🤖 Prompt for AI Agents
In src/datasets/packaged_modules/generator/generator.py around lines 5–7 and the
_generate_examples region (lines ~35–36), avoid double-splitting gen_kwargs by
detecting when the base DatasetBuilder already sharded inputs: only call
_split_gen_kwargs when gen_kwargs clearly represents full (unsplit) list-valued
inputs. Concretely, add a guard that skips _split_gen_kwargs if either (a) any
gen_kwargs value is not a list (already per-worker), or (b) all list-valued
gen_kwargs have length == 1 (already reduced by the builder), or (c) an explicit
shard marker/key is present in gen_kwargs; otherwise perform the split as
before. Ensure the behavior is documented in a comment so Generator subclasses
know splitting is conditional.
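
A hypothetical version of guard (b) from the prompt above; the helper names come from datasets.utils.sharding, but the guard itself is illustrative rather than the committed fix:

# Skip internal splitting when the builder already reduced gen_kwargs for this worker.
list_lengths = [len(v) for v in gen_kwargs.values() if isinstance(v, list)]
already_split = all(length <= 1 for length in list_lengths)
if already_split:
    shard_kwargs_list = [gen_kwargs]
else:
    num_shards = _number_of_shards_in_gen_kwargs(gen_kwargs)
    shard_kwargs_list = _split_gen_kwargs(gen_kwargs, max_num_jobs=num_shards)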

The-Obstacle-Is-The-Way force-pushed the feat/nifti-lazy-loading branch 2 times, most recently from d2c1fce to d1d14bd on December 9, 2025 at 15:37
ada-ggf25 and others added 7 commits December 16, 2025 15:20
…ace#7877) (huggingface#7891)

* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir

Enhanced the _TempCacheDir.__init__ method to properly respect and handle
the TMPDIR environment variable when creating temporary cache directories.

Changes:
- Add TMPDIR environment variable detection and validation
- Normalise paths to handle path resolution issues
- Auto-create TMPDIR directory if it doesn't exist to prevent silent
  fallback to default temporary directory
- Validate that TMPDIR is actually a directory before use
- Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even
  if tempfile.gettempdir() was already called and cached
- Add appropriate logging for directory creation and fallback scenarios

This ensures that when TMPDIR is set, the temporary cache files are
created in the specified directory rather than silently falling back to
the system default temporary directory.

* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir

Add test coverage for the improved TMPDIR environment variable handling
in the _TempCacheDir class. These tests verify the various scenarios
for TMPDIR usage and error handling.

Changes:
- Refactor test_fingerprint_in_multiprocessing to use Pool.map for
  cleaner test implementation
- Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR
  auto-creation when directory doesn't exist
- Add test_temp_cache_dir_with_tmpdir_existing to verify correct
  behaviour when TMPDIR exists and is valid
- Add test_temp_cache_dir_without_tmpdir to verify fallback to default
  temporary directory when TMPDIR is not set
- Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful
  error handling and fallback when TMPDIR creation fails

These tests ensure that the TMPDIR improvements work correctly across
all scenarios and edge cases, including proper logging and fallback
behaviour.

* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir

Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit
error conditions instead of fallback behaviour.

Changes:
- Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir
  directly and assert that an OSError is raised with a clear TMPDIR
  context when directory creation fails
- Introduce test_temp_cache_dir_tmpdir_not_directory to verify that
  pointing TMPDIR at a non-directory raises an OSError with an
  informative error message

These tests better match the intended contract of _TempCacheDir by
ensuring invalid TMPDIR configurations fail loudly with descriptive
messages rather than silently falling back.

* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly

Tighten TMPDIR handling in _TempCacheDir so that invalid configurations
raise clear errors instead of silently falling back to the default
temporary directory.

Changes:
- When TMPDIR points to a non-existent directory, raise an OSError with
  explicit guidance to create it manually or unset TMPDIR
- When TMPDIR points to a non-directory path, raise an OSError with
  guidance to point TMPDIR to a writable directory or unset it
- Remove previous warning-and-fallback behaviour to avoid masking
  configuration issues

This ensures that TMPDIR misconfigurations are surfaced early and
clearly, aligning runtime behaviour with the stricter expectations
codified in the new tests.

* Update src/datasets/fingerprint.py

* Update fingerprint.py

* Fix formatting of TMPDIR retrieval line

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily

* update docstring for nifti functionality
)

* fix(nifti): enable lazy loading for Nifti1ImageWrapper

- Change dataobj initialization from get_fdata() to .dataobj
- Preserves nibabel's ArrayProxy for lazy loading
- Prevents OOM errors on large 4D fMRI files
- Defers I/O errors to usage time for better error handling

Resolves #1

* chore: trigger CI
…gingface#7906)

don't save original_shard_lengths by default for backward compat


Development

Successfully merging this pull request may close these issues.

Nifti feature eagerly loads data - limited error handling for corrupted files

4 participants