fix: use temp files instead of BytesIO in push_to_hub to prevent OOM #7

Closed

The-Obstacle-Is-The-Way wants to merge 4 commits into main from fix/push-to-hub-oom

Conversation

@The-Obstacle-Is-The-Way (Owner) commented Dec 3, 2025

Summary

Fixes memory accumulation in _push_parquet_shards_to_hub_single that caused OOM when uploading large datasets.

Problem: The previous implementation held ALL parquet shard bytes in memory via the additions list. For large datasets (e.g., 902 shards × 300MB = 270GB), this caused OOM.

Solution: Write each parquet shard to a temp file, pass the file path to CommitOperationAdd, and delete the temp file after upload.

Changes

  • src/datasets/arrow_dataset.py:

    • Write parquet to temp file instead of BytesIO
    • Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
    • Delete temp file after preupload_lfs_files completes
    • Remove unused BytesIO import
  • tests/test_push_to_hub_memory.py:

    • Add regression tests for memory-safe upload
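
Taken together, these changes follow the pattern sketched below. This is a minimal sketch, not the actual _push_parquet_shards_to_hub_single code: the helper name, the repo_id parameter, and the shard path format are illustrative, and only the huggingface_hub calls (CommitOperationAdd, HfApi.preupload_lfs_files) are assumed to exist as shown.

import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def push_shards_via_temp_files(shards, api: HfApi, repo_id: str):
    """Yield one CommitOperationAdd per shard without holding shard bytes in memory."""
    for index, shard in enumerate(shards):
        # Write the shard to disk instead of an in-memory BytesIO buffer.
        with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
            temp_path = tmp.name
        shard.to_parquet(temp_path)

        # Pass the file path; huggingface_hub streams the content from disk.
        shard_addition = CommitOperationAdd(
            path_in_repo=f"data/shard-{index:05d}.parquet",
            path_or_fileobj=temp_path,
        )
        api.preupload_lfs_files(repo_id, additions=[shard_addition])

        # Delete the temp file once preupload_lfs_files has finished with it.
        os.unlink(temp_path)
        yield shard_addition

Because each CommitOperationAdd holds only a path and upload metadata rather than the shard bytes, memory stays at roughly one shard's worth at a time instead of growing with the number of shards.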

Test Plan

  • test_push_to_hub_uses_file_path_not_bytes_in_commit_operation - Verifies file paths are used
  • test_push_to_hub_cleans_up_temp_files - Verifies temp files are cleaned up
  • test_push_to_hub_uploaded_size_still_calculated - Verifies size calculation still works
  • Existing parquet tests pass

Related

Summary by CodeRabbit

  • Bug Fixes

    • Improved memory efficiency when uploading large datasets by optimizing shard handling, preventing out-of-memory errors during the push operation.
    • Enhanced temporary file cleanup to promptly free disk space after uploads.
  • Tests

    • Added comprehensive test suite validating memory-safe behavior for large dataset uploads.


When uploading large datasets with many shards, the previous implementation
held all parquet shard bytes in memory via the `additions` list. This caused
OOM for datasets with hundreds of shards (e.g., 902 shards × 300MB = 270GB).

Changes:
- Write parquet to temp file instead of BytesIO
- Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
- Delete temp file after preupload_lfs_files completes
- Remove unused BytesIO import

Memory usage now stays constant (~1-2 shard sizes) instead of growing
linearly with the number of shards.

Fixes: #5
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

coderabbitai bot commented Dec 3, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 7 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between efd1ef2 and 44936c9.

📒 Files selected for processing (2)
  • src/datasets/arrow_dataset.py (1 hunks)
  • tests/test_push_to_hub_memory.py (1 hunks)

Walkthrough

The pull request optimizes push_to_hub to eliminate out-of-memory issues when uploading large datasets. Parquet shards are now written to temporary disk files instead of being accumulated as in-memory bytes before upload, with proper cleanup afterward. No public API changes.

Changes

  • Memory-safe shard uploads (src/datasets/arrow_dataset.py): Replaced in-memory parquet buffering with temporary file writing. Shards now write to NamedTemporaryFile and reference file paths in CommitOperationAdd operations rather than holding byte strings in memory. Added cleanup to delete temporary files post-upload. Removed unused BytesIO import.
  • Memory safety test suite (tests/test_push_to_hub_memory.py): Added TestPushToHubMemorySafe with three test cases validating file-path references in commit operations, temporary file cleanup after shard uploads, and correctness of uploaded size calculations with file-based uploads.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Arrow dataset changes: Verify temp file creation/cleanup logic is bulletproof; check for race conditions or orphaned files in error paths; ensure CommitOperationAdd correctly uses file paths instead of bytes.
  • Test coverage: Confirm mocking strategy adequately simulates large dataset scenarios; verify all edge cases (cleanup on success/failure, concurrent shards) are covered.
  • Memory validation: Cross-reference implementation against the linked issue's root cause analysis to ensure the fix fully resolves the OOM accumulation pattern.

Poem

🐰 No more bytes in burrows deep,
Temp files now handle the heap,
Write to disk, upload with care,
Clean up after—memory to spare!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main fix: replacing BytesIO with temporary files in push_to_hub to prevent OOM errors.
  • Linked Issues Check: ✅ Passed. All objectives from issue #5 are met: parquet shards now write to temp files, use file paths in CommitOperationAdd, and temp files are cleaned up after upload.
  • Out of Scope Changes Check: ✅ Passed. All changes directly address the OOM issue: temp file writing, path-based commits, cleanup, removed BytesIO import, and regression tests.


coderabbitai bot commented Dec 3, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Wrap the fallible upload operations in try...finally to guarantee
temp file cleanup even if preupload_lfs_files() throws (network timeout,
500 error, KeyboardInterrupt).

Without this, a flaky upload of 900 shards could leave hundreds of
~300MB temp files in /tmp, potentially filling the disk.
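
A sketch of a single shard's upload with that wrapper (illustrative names, not the exact datasets code); the only point is where the cleanup sits relative to the fallible calls:

import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def preupload_one_shard(shard, api: HfApi, repo_id: str, path_in_repo: str) -> CommitOperationAdd:
    # temp_path is captured before the try block, so it is always defined in finally.
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
        temp_path = tmp.name
    try:
        shard.to_parquet(temp_path)
        shard_addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=temp_path)
        # Can raise on network timeouts, HTTP 500s, or KeyboardInterrupt.
        api.preupload_lfs_files(repo_id, additions=[shard_addition])
    finally:
        # Runs on success and on every failure path, so no temp file is leaked.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
    return shard_addition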
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

Added try...finally to ensure temp file cleanup even if upload fails (network timeout, 500 error, KeyboardInterrupt). This prevents disk exhaustion from leaked temp files during flaky uploads of large datasets.

coderabbitai bot commented Dec 3, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the file with the changes to review the try...finally implementation
cat src/datasets/arrow_dataset.py | grep -A 50 -B 10 "def _push_parquet_shards_to_hub_single"

Length of output: 2498


🏁 Script executed:

#!/bin/bash
# Also check if there are any recent commits that show the try...finally addition
git log --oneline -10 --all

Length of output: 482


🏁 Script executed:

#!/bin/bash
# Get more context around the temp file handling
cat src/datasets/arrow_dataset.py | grep -A 80 "def _push_parquet_shards_to_hub_single"

Length of output: 3166


@The-Obstacle-Is-The-Way, excellent implementation! The try...finally pattern correctly ensures temp file cleanup in all failure scenarios:

What's working well:

  1. Temp file scope: temp_path is captured before the try block (in the with statement), ensuring it's always defined when finally executes
  2. Comprehensive error handling: The finally block runs even on:
    • Network timeouts during preupload_lfs_files
    • HTTP 500 errors from the API
    • KeyboardInterrupt (Ctrl+C)
    • Any other exception during parquet write, size calculation, or upload
  3. Safe cleanup: os.path.exists check prevents errors if the file was somehow already deleted
  4. Per-shard isolation: Each shard has its own try...finally, so one failure doesn't leak other shards' temp files

The implementation properly prevents disk exhaustion during flaky uploads of large datasets. The pattern is exactly what's needed here.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_push_to_hub_memory.py (1)

14-153: Consider adding error handling tests.

The current tests validate the happy path well, but consider adding tests for error scenarios:

  • What happens if preupload_lfs_files raises an exception?
  • What if to_parquet fails during shard writing?
  • Are temporary files still cleaned up in error cases?

This would help ensure robustness and catch potential resource leaks. Note that the current implementation in src/datasets/arrow_dataset.py doesn't appear to have try-finally protection around os.unlink(temp_path), so errors could leave temp files behind—error-case tests would expose this.

Example test structure:

def test_push_to_hub_cleans_up_temp_files_on_error(self):
    """Temp files should be cleaned up even if upload fails."""
    ds = Dataset.from_dict({"x": list(range(100))})
    
    with patch("datasets.arrow_dataset.HfApi") as mock_api_class:
        mock_api = MagicMock()
        mock_api_class.return_value = mock_api
        # Simulate upload failure
        mock_api.preupload_lfs_files.side_effect = Exception("Upload failed")
        
        with pytest.raises(Exception):
            list(ds._push_parquet_shards_to_hub_single(...))
        
        # Verify temp files were still cleaned up
        # (would require tracking as in test_push_to_hub_cleans_up_temp_files)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 004a5bf and efd1ef2.

📒 Files selected for processing (2)
  • src/datasets/arrow_dataset.py (1 hunks)
  • tests/test_push_to_hub_memory.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_push_to_hub_memory.py (1)
src/datasets/arrow_dataset.py (4)
  • Dataset (703-6466)
  • from_dict (973-1034)
  • _push_parquet_shards_to_hub_single (5513-5572)
  • split (172-174)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
tests/test_push_to_hub_memory.py (2)

17-62: LGTM: Test correctly validates file-based commit operations.

This test effectively verifies the core fix—that CommitOperationAdd uses file paths instead of in-memory bytes, which enables streaming from disk and prevents OOM.


64-113: LGTM: Test correctly validates temp file cleanup.

The test properly verifies that temporary files are deleted after upload, which is essential to prevent disk exhaustion when processing large datasets with many shards.

@@ -0,0 +1,153 @@
"""Tests for memory-safe push_to_hub with large datasets.

Regression tests for https://github.com/huggingface/datasets/issues/XXXX

⚠️ Potential issue | 🟡 Minor

Replace the issue number placeholder.

The placeholder "XXXX" should be replaced with the actual issue number. Based on the PR objectives, this appears to be issue #5.

Apply this diff:

-Regression tests for https://github.com/huggingface/datasets/issues/XXXX
+Regression tests for https://github.com/huggingface/datasets/issues/5

Addresses CodeRabbit's CRITICAL review feedback:

1. Only delete temp file when _upload_mode == "lfs"
   - For LFS uploads, content is already on the Hub after preupload_lfs_files
   - For regular uploads, create_commit still needs to read from disk
   - Deleting temp files unconditionally broke non-LFS uploads

2. Replace XXXX placeholder with actual issue reference

3. Update tests to properly simulate huggingface_hub behavior:
   - Mock preupload_lfs_files to set _upload_mode attribute
   - Add test for regular (non-LFS) upload behavior
   - Mock upload_info to make size assertion meaningful
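
The conditional cleanup described in point 1 above, as a small sketch (the helper name is illustrative; the review comment further below quotes the actual condition from the diff):

import os

from huggingface_hub import CommitOperationAdd


def maybe_delete_temp_file(shard_addition: CommitOperationAdd, temp_path: str) -> None:
    # After preupload_lfs_files, LFS content already lives on the Hub, so the local
    # temp file can go. Regular (non-LFS) additions are read from disk again by
    # create_commit, so their temp files must be kept until the commit is made.
    if getattr(shard_addition, "_upload_mode", None) == "lfs":
        if os.path.exists(temp_path):
            os.unlink(temp_path)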
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

Addressed all 3 actionable comments:

  1. CRITICAL: Only delete temp files when _upload_mode == 'lfs' - regular uploads need the file until create_commit
  2. Minor: Replaced the XXXX placeholder with a reference to the actual issue: "Bug: push_to_hub accumulates all shard bytes in memory, causing OOM on large datasets" (#5)
  3. Major: Updated tests to properly mock preupload_lfs_files behavior (sets _upload_mode and upload_info)

Added new test test_push_to_hub_keeps_temp_files_for_regular_uploads to verify the fix.
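
A hypothetical sketch of the mocking approach mentioned in point 3 (the real tests in tests/test_push_to_hub_memory.py differ in detail, and the upload_info mocking is omitted here); it assumes huggingface_hub records a private _upload_mode attribute on each addition during preupload:

import os
import tempfile
from unittest.mock import MagicMock

from huggingface_hub import CommitOperationAdd


def fake_preupload_lfs_files(repo_id, additions, **kwargs):
    # Mimic the attribute the real preupload call sets on each addition.
    for addition in additions:
        addition._upload_mode = "lfs"


# Usage sketch: route preupload_lfs_files through the fake and check the marker.
api = MagicMock()
api.preupload_lfs_files.side_effect = fake_preupload_lfs_files

with tempfile.TemporaryDirectory() as tmp_dir:
    shard_path = os.path.join(tmp_dir, "shard-00000.parquet")
    with open(shard_path, "wb") as f:
        f.write(b"fake parquet bytes")  # content is irrelevant for this sketch
    op = CommitOperationAdd(path_in_repo="data/shard-00000.parquet", path_or_fileobj=shard_path)
    api.preupload_lfs_files("user/dataset", additions=[op])
    assert getattr(op, "_upload_mode", None) == "lfs"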

coderabbitai bot commented Dec 3, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current implementation of temp file cleanup logic in arrow_dataset.py
cat src/datasets/arrow_dataset.py | grep -A 100 "def _push_parquet_shards_to_hub_single"

Length of output: 4316


🏁 Script executed:

#!/bin/bash
# Check for the issue reference in the file
rg -n "#5" src/datasets/arrow_dataset.py

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the test file for the mocking improvements and new test
cat tests/test_push_to_hub_memory.py

Length of output: 9480


🏁 Script executed:

#!/bin/bash
# Get recent commit messages to understand what was changed
git log --oneline -5

Length of output: 296


🏁 Script executed:

#!/bin/bash
# Verify the issue reference is there (search for the URL)
rg -n "issues/5" src/datasets/arrow_dataset.py

Length of output: 84


@The-Obstacle-Is-The-Way, excellent work! All three fixes are correctly implemented:

✅ Verified Changes

1. CRITICAL - Conditional temp file deletion:

if getattr(shard_addition, "_upload_mode", None) == "lfs":
    if os.path.exists(temp_path):
        os.unlink(temp_path)

Perfect! This prevents breaking regular uploads while still cleaning up LFS files immediately after upload.

2. Minor - Issue reference updated:
Line 5554 now correctly references:

# See: https://github.com/The-Obstacle-Is-The-Way/datasets/issues/5

3. Major - Test mocking improvements:
All tests now properly simulate huggingface_hub behavior by setting _upload_mode on the CommitOperationAdd objects during preupload_lfs_files.

4. New test - test_push_to_hub_keeps_temp_files_for_regular_uploads:
This is a critical addition! It validates that the conditional deletion doesn't break regular uploads (small files that don't use LFS). The test:

  • Sets _upload_mode = "regular" to simulate non-LFS uploads
  • Asserts temp files still exist after generator completes
  • Includes a clear explanation of why this matters

🎯 Impact

The fixes ensure:

  • LFS uploads: Temp files deleted immediately → prevents disk exhaustion ✅
  • Regular uploads: Temp files retained → create_commit can still read them ✅
  • Test coverage: Both paths validated → prevents regressions ✅

The implementation is production-ready!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@The-Obstacle-Is-The-Way deleted the fix/push-to-hub-oom branch on December 6, 2025 at 13:58


Development

Successfully merging this pull request may close these issues.

Bug: push_to_hub accumulates all shard bytes in memory, causing OOM on large datasets
