Skip to content

Conversation

@The-Obstacle-Is-The-Way
Copy link
Owner

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way commented Dec 2, 2025

Summary

  • Fixes a critical silent failure where NIfTI files are uploaded as empty bytes (0 bytes) to HuggingFace Hub
  • The stable release of datasets (4.4.x) has a bug where Nifti.embed_storage was broken
  • The fix is only available in the dev branch (4.4.2.dev0+)

Changes

  • Add [tool.uv.sources] to override datasets to install from git
  • Bump datasets minimum version to >=4.4.0 (Nifti feature support)
  • Add huggingface-hub>=0.32.0 dependency (XET storage support)
  • Add Critical Dependency Requirement section to README with:
    • Explanation of the problem
    • Verification steps
    • Manual installation instructions for non-template users

Why This Matters

Without this fix:

  • push_to_hub(embed_external_files=True) uploads 0-byte files
  • Datasets appear to have data but all NIfTI files are empty
  • Silent failure - no error is raised, only visible when loading

References

Test Plan

  • All 43 existing tests pass
  • ruff check passes
  • mypy passes
  • Verified datasets==4.4.2.dev0 is installed from git

Note

This requirement will change once the fix is merged into a stable release. We should monitor HuggingFace datasets releases and update when a stable version includes the fix.

Summary by CodeRabbit

  • Documentation

    • Added critical dependency installation instructions and warnings in README
  • Chores

    • Updated datasets dependency to version 4.4.0
    • Added huggingface-hub>=0.32.0 as required dependency
    • Configured GitHub as alternative package source for datasets

✏️ Tip: You can customize this high-level summary in your review settings.

The stable release of HuggingFace datasets (4.4.x) has a critical bug
where NIfTI files are uploaded as empty bytes (0 bytes) to Hub because
Nifti.embed_storage was broken. The fix is only in the dev branch.

Changes:
- Add [tool.uv.sources] to override datasets to git version
- Bump datasets minimum to >=4.4.0 (Nifti support)
- Add huggingface-hub>=0.32.0 (XET storage support)
- Add Critical Dependency section to README with verification steps

See: huggingface/datasets#7815
@The-Obstacle-Is-The-Way
Copy link
Owner Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 2, 2025

Walkthrough

The PR updates dependency versions (datasets 3.4.0→4.4.0, adds huggingface-hub>=0.32.0) and documents a critical dependency requirement via README updates and pyproject.toml configuration changes using git-based source override.

Changes

Cohort / File(s) Summary
Documentation
README.md
Adds critical dependency warning at top; introduces new section detailing HuggingFace Datasets dependency issues with PyPI releases and GitHub-based fix; includes verification steps and manual installation instructions for uv, pip, requirements.txt, and direct commands.
Dependency Management
pyproject.toml
Updates datasets version from 3.4.0 to 4.4.0 with NIfTI support note; adds huggingface-hub>=0.32.0 dependency; introduces [tool.uv.sources] section with git-based datasets override targeting critical NIfTI fix.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Changes are configuration and documentation only with no functional code logic
  • Straightforward dependency version updates and instruction additions
  • Minimal complexity; verify dependency versions are correct and instructions are clear

Poem

🐰 The datasets dance, a version leap from three to four,
With HuggingFace helping files we did adore,
A warning shines atop the README's page so bright,
Git sources guide the way—our NIfTI files take flight! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: requiring datasets from git to fix the Nifti.embed_storage bug. It directly reflects the primary objective of the pull request.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch claude/fix-datasets-nifti-embed-storage

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bb192a and 2ecfa68.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • pyproject.toml (1 hunks)
🔇 Additional comments (4)
README.md (2)

5-5: Critical warning placement is effective for visibility.

The prominent banner immediately after the title correctly surfaces the dependency override requirement before users start setup. The anchor link to the detailed section below follows good documentation patterns.


48-102: Comprehensive critical-dependency documentation with strong consistency.

The new section provides clear problem statement, fix explanation, verification steps, and installation guidance for all supported package managers (uv, pip, direct command). Content is consistent with pyproject.toml dependency versions (datasets>=4.4.0, huggingface-hub>=0.32.0) and git source configuration. References to the original issue (huggingface/datasets#7815) and release notes are helpful. The note about future updates once the fix reaches stable release appropriately sets expectations.

One minor suggestion: The version string "4.4.2.dev0" in the verification code block (line 70) uses a specific dev version. Consider whether this should remain specific for accuracy or become "4.4.2.dev0+" to account for possible future dev iterations. This depends on how rigid your verification expectation should be.

pyproject.toml (2)

25-26: Dependency version bumps and new dependency are well-justified.

The bump to datasets>=4.4.0 (from >=3.4.0 per AI summary) correctly targets the Nifti support version. The new huggingface-hub>=0.32.0 dependency is explicit and aligns with XET storage support mentioned in PR objectives. Inline comments clearly reference the critical override and XET requirement, aiding future maintainers.


33-39: Git source override is well-documented and follows uv conventions.

The [tool.uv.sources] configuration correctly overrides datasets to use the git repository for the Nifti.embed_storage fix. The comment block thoroughly explains the issue, links to the original PR (huggingface/datasets#7815), and acknowledges this is temporary. Absence of an explicit branch/ref means uv defaults to main, which is correct for pulling the latest dev fixes.

One consideration: Without a pinned commit SHA, reproducibility relies on uv.lock to lock the resolved git commit. Verify that:

  1. uv.lock is committed to the repository (it should be for reproducible builds)
  2. CI/CD tests confirm the lock file is used during dependency resolution

This is a standard pattern with uv, but worth confirming for supply-chain security and reproducibility.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link

coderabbitai bot commented Dec 2, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way merged commit 2e3219e into main Dec 2, 2025
1 check passed
@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way deleted the claude/fix-datasets-nifti-embed-storage branch December 2, 2025 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants