fix: use temp files instead of BytesIO in push_to_hub to prevent OOM #7

Closed

The-Obstacle-Is-The-Way wants to merge 4 commits into main from fix/push-to-hub-oom

Conversation

@The-Obstacle-Is-The-Way (Owner) commented Dec 3, 2025

Summary

Fixes memory accumulation in _push_parquet_shards_to_hub_single that caused OOM when uploading large datasets.

Problem: The previous implementation held ALL parquet shard bytes in memory via the additions list. For large datasets (e.g., 902 shards × 300MB = 270GB), this caused OOM.

Solution: Write each parquet shard to a temp file, pass the file path to CommitOperationAdd, and delete the temp file after upload.

Changes

  • src/datasets/arrow_dataset.py:

    • Write parquet to temp file instead of BytesIO
    • Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
    • Delete temp file after preupload_lfs_files completes
    • Remove unused BytesIO import
  • tests/test_push_to_hub_memory.py:

    • Add regression tests for memory-safe upload
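
Taken together, these changes follow the pattern sketched below. This is a minimal sketch, not the actual _push_parquet_shards_to_hub_single code: the helper name, the repo_id parameter, and the shard path format are illustrative, and only the huggingface_hub calls (CommitOperationAdd, HfApi.preupload_lfs_files) are assumed to exist as shown.

import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def push_shards_via_temp_files(shards, api: HfApi, repo_id: str):
    """Yield one CommitOperationAdd per shard without holding shard bytes in memory."""
    for index, shard in enumerate(shards):
        # Write the shard to disk instead of an in-memory BytesIO buffer.
        with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
            temp_path = tmp.name
        shard.to_parquet(temp_path)

        # Pass the file path; huggingface_hub streams the content from disk.
        shard_addition = CommitOperationAdd(
            path_in_repo=f"data/shard-{index:05d}.parquet",
            path_or_fileobj=temp_path,
        )
        api.preupload_lfs_files(repo_id, additions=[shard_addition])

        # Delete the temp file once preupload_lfs_files has finished with it.
        os.unlink(temp_path)
        yield shard_addition

Because each CommitOperationAdd holds only a path and upload metadata rather than the shard bytes, memory stays at roughly one shard's worth at a time instead of growing with the number of shards.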

Test Plan

  • test_push_to_hub_uses_file_path_not_bytes_in_commit_operation - Verifies file paths are used
  • test_push_to_hub_cleans_up_temp_files - Verifies temp files are cleaned up
  • test_push_to_hub_uploaded_size_still_calculated - Verifies size calculation still works
  • Existing parquet tests pass

Related

Summary by CodeRabbit

  • Bug Fixes

    • Improved memory efficiency when uploading large datasets by optimizing shard handling, preventing out-of-memory errors during the push operation.
    • Enhanced temporary file cleanup to promptly free disk space after uploads.
  • Tests

    • Added comprehensive test suite validating memory-safe behavior for large dataset uploads.


When uploading large datasets with many shards, the previous implementation
held all parquet shard bytes in memory via the `additions` list. This caused
OOM for datasets with hundreds of shards (e.g., 902 shards × 300MB = 270GB).

Changes:
- Write parquet to temp file instead of BytesIO
- Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
- Delete temp file after preupload_lfs_files completes
- Remove unused BytesIO import

Memory usage now stays constant (~1-2 shard sizes) instead of growing
linearly with the number of shards.

Fixes: #5
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

coderabbitai bot commented Dec 3, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 7 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between efd1ef2 and 44936c9.

📒 Files selected for processing (2)
  • src/datasets/arrow_dataset.py (1 hunks)
  • tests/test_push_to_hub_memory.py (1 hunks)

Walkthrough

The pull request optimizes push_to_hub to eliminate out-of-memory issues when uploading large datasets. Parquet shards are now written to temporary disk files instead of being accumulated as in-memory bytes before upload, with proper cleanup afterward. No public API changes.

Changes

  • Memory-safe shard uploads (src/datasets/arrow_dataset.py): Replaced in-memory parquet buffering with temporary file writing. Shards now write to NamedTemporaryFile and reference file paths in CommitOperationAdd operations rather than holding byte strings in memory. Added cleanup to delete temporary files post-upload. Removed unused BytesIO import.
  • Memory safety test suite (tests/test_push_to_hub_memory.py): Added TestPushToHubMemorySafe with three test cases validating file-path references in commit operations, temporary file cleanup after shard uploads, and correctness of uploaded size calculations with file-based uploads.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Arrow dataset changes: Verify temp file creation/cleanup logic is bulletproof; check for race conditions or orphaned files in error paths; ensure CommitOperationAdd correctly uses file paths instead of bytes.
  • Test coverage: Confirm mocking strategy adequately simulates large dataset scenarios; verify all edge cases (cleanup on success/failure, concurrent shards) are covered.
  • Memory validation: Cross-reference implementation against the linked issue's root cause analysis to ensure the fix fully resolves the OOM accumulation pattern.

Poem

🐰 No more bytes in burrows deep,
Temp files now handle the heap,
Write to disk, upload with care,
Clean up after—memory to spare!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main fix: replacing BytesIO with temporary files in push_to_hub to prevent OOM errors.
  • Linked Issues Check: ✅ Passed. All objectives from issue #5 are met: parquet shards now write to temp files, use file paths in CommitOperationAdd, and temp files are cleaned up after upload.
  • Out of Scope Changes Check: ✅ Passed. All changes directly address the OOM issue: temp file writing, path-based commits, cleanup, removed BytesIO import, and regression tests.


coderabbitai bot commented Dec 3, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Wrap the fallible upload operations in try...finally to guarantee
temp file cleanup even if preupload_lfs_files() throws (network timeout,
500 error, KeyboardInterrupt).

Without this, a flaky upload of 900 shards could leave hundreds of
~300MB temp files in /tmp, potentially filling the disk.
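
A sketch of a single shard's upload with that wrapper (illustrative names, not the exact datasets code); the only point is where the cleanup sits relative to the fallible calls:

import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def preupload_one_shard(shard, api: HfApi, repo_id: str, path_in_repo: str) -> CommitOperationAdd:
    # temp_path is captured before the try block, so it is always defined in finally.
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
        temp_path = tmp.name
    try:
        shard.to_parquet(temp_path)
        shard_addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=temp_path)
        # Can raise on network timeouts, HTTP 500s, or KeyboardInterrupt.
        api.preupload_lfs_files(repo_id, additions=[shard_addition])
    finally:
        # Runs on success and on every failure path, so no temp file is leaked.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
    return shard_addition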
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

Added try...finally to ensure temp file cleanup even if upload fails (network timeout, 500 error, KeyboardInterrupt). This prevents disk exhaustion from leaked temp files during flaky uploads of large datasets.

coderabbitai bot commented Dec 3, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the file with the changes to review the try...finally implementation
cat src/datasets/arrow_dataset.py | grep -A 50 -B 10 "def _push_parquet_shards_to_hub_single"

Length of output: 2498


🏁 Script executed:

#!/bin/bash
# Also check if there are any recent commits that show the try...finally addition
git log --oneline -10 --all

Length of output: 482


🏁 Script executed:

#!/bin/bash
# Get more context around the temp file handling
cat src/datasets/arrow_dataset.py | grep -A 80 "def _push_parquet_shards_to_hub_single"

Length of output: 3166


@The-Obstacle-Is-The-Way, excellent implementation! The try...finally pattern correctly ensures temp file cleanup in all failure scenarios:

What's working well:

  1. Temp file scope: temp_path is captured before the try block (in the with statement), ensuring it's always defined when finally executes
  2. Comprehensive error handling: The finally block runs even on:
    • Network timeouts during preupload_lfs_files
    • HTTP 500 errors from the API
    • KeyboardInterrupt (Ctrl+C)
    • Any other exception during parquet write, size calculation, or upload
  3. Safe cleanup: os.path.exists check prevents errors if the file was somehow already deleted
  4. Per-shard isolation: Each shard has its own try...finally, so one failure doesn't leak other shards' temp files

The implementation properly prevents disk exhaustion during flaky uploads of large datasets. The pattern is exactly what's needed here.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_push_to_hub_memory.py (1)

14-153: Consider adding error handling tests.

The current tests validate the happy path well, but consider adding tests for error scenarios:

  • What happens if preupload_lfs_files raises an exception?
  • What if to_parquet fails during shard writing?
  • Are temporary files still cleaned up in error cases?

This would help ensure robustness and catch potential resource leaks. Note that the current implementation in src/datasets/arrow_dataset.py doesn't appear to have try-finally protection around os.unlink(temp_path), so errors could leave temp files behind—error-case tests would expose this.

Example test structure:

def test_push_to_hub_cleans_up_temp_files_on_error(self):
    """Temp files should be cleaned up even if upload fails."""
    ds = Dataset.from_dict({"x": list(range(100))})
    
    with patch("datasets.arrow_dataset.HfApi") as mock_api_class:
        mock_api = MagicMock()
        mock_api_class.return_value = mock_api
        # Simulate upload failure
        mock_api.preupload_lfs_files.side_effect = Exception("Upload failed")
        
        with pytest.raises(Exception):
            list(ds._push_parquet_shards_to_hub_single(...))
        
        # Verify temp files were still cleaned up
        # (would require tracking as in test_push_to_hub_cleans_up_temp_files)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 004a5bf and efd1ef2.

📒 Files selected for processing (2)
  • src/datasets/arrow_dataset.py (1 hunks)
  • tests/test_push_to_hub_memory.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_push_to_hub_memory.py (1)
src/datasets/arrow_dataset.py (4)
  • Dataset (703-6466)
  • from_dict (973-1034)
  • _push_parquet_shards_to_hub_single (5513-5572)
  • split (172-174)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
tests/test_push_to_hub_memory.py (2)

17-62: LGTM: Test correctly validates file-based commit operations.

This test effectively verifies the core fix—that CommitOperationAdd uses file paths instead of in-memory bytes, which enables streaming from disk and prevents OOM.


64-113: LGTM: Test correctly validates temp file cleanup.

The test properly verifies that temporary files are deleted after upload, which is essential to prevent disk exhaustion when processing large datasets with many shards.

@@ -0,0 +1,153 @@
"""Tests for memory-safe push_to_hub with large datasets.

Regression tests for https://github.com/huggingface/datasets/issues/XXXX

⚠️ Potential issue | 🟡 Minor

Replace the issue number placeholder.

The placeholder "XXXX" should be replaced with the actual issue number. Based on the PR objectives, this appears to be issue #5.

Apply this diff:

-Regression tests for https://github.com/huggingface/datasets/issues/XXXX
+Regression tests for https://github.com/huggingface/datasets/issues/5

Addresses CodeRabbit's CRITICAL review feedback:

1. Only delete temp file when _upload_mode == "lfs"
   - For LFS uploads, content is already on the Hub after preupload_lfs_files
   - For regular uploads, create_commit still needs to read from disk
   - Deleting temp files unconditionally broke non-LFS uploads

2. Replace XXXX placeholder with actual issue reference

3. Update tests to properly simulate huggingface_hub behavior:
   - Mock preupload_lfs_files to set _upload_mode attribute
   - Add test for regular (non-LFS) upload behavior
   - Mock upload_info to make size assertion meaningful
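
The conditional cleanup described in point 1 above, as a small sketch (the helper name is illustrative; the review comment further below quotes the actual condition from the diff):

import os

from huggingface_hub import CommitOperationAdd


def maybe_delete_temp_file(shard_addition: CommitOperationAdd, temp_path: str) -> None:
    # After preupload_lfs_files, LFS content already lives on the Hub, so the local
    # temp file can go. Regular (non-LFS) additions are read from disk again by
    # create_commit, so their temp files must be kept until the commit is made.
    if getattr(shard_addition, "_upload_mode", None) == "lfs":
        if os.path.exists(temp_path):
            os.unlink(temp_path)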
@The-Obstacle-Is-The-Way (Owner, Author) commented:

@coderabbitai review

Addressed all 3 actionable comments:

  1. CRITICAL: Only delete temp files when _upload_mode == 'lfs' - regular uploads need the file until create_commit
  2. Minor: Replaced the XXXX placeholder with a reference to the actual issue: "Bug: push_to_hub accumulates all shard bytes in memory, causing OOM on large datasets" (#5)
  3. Major: Updated tests to properly mock preupload_lfs_files behavior (sets _upload_mode and upload_info)

Added new test test_push_to_hub_keeps_temp_files_for_regular_uploads to verify the fix.
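
A hypothetical sketch of the mocking approach mentioned in point 3 (the real tests in tests/test_push_to_hub_memory.py differ in detail, and the upload_info mocking is omitted here); it assumes huggingface_hub records a private _upload_mode attribute on each addition during preupload:

import os
import tempfile
from unittest.mock import MagicMock

from huggingface_hub import CommitOperationAdd


def fake_preupload_lfs_files(repo_id, additions, **kwargs):
    # Mimic the attribute the real preupload call sets on each addition.
    for addition in additions:
        addition._upload_mode = "lfs"


# Usage sketch: route preupload_lfs_files through the fake and check the marker.
api = MagicMock()
api.preupload_lfs_files.side_effect = fake_preupload_lfs_files

with tempfile.TemporaryDirectory() as tmp_dir:
    shard_path = os.path.join(tmp_dir, "shard-00000.parquet")
    with open(shard_path, "wb") as f:
        f.write(b"fake parquet bytes")  # content is irrelevant for this sketch
    op = CommitOperationAdd(path_in_repo="data/shard-00000.parquet", path_or_fileobj=shard_path)
    api.preupload_lfs_files("user/dataset", additions=[op])
    assert getattr(op, "_upload_mode", None) == "lfs"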

coderabbitai bot commented Dec 3, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current implementation of temp file cleanup logic in arrow_dataset.py
cat src/datasets/arrow_dataset.py | grep -A 100 "def _push_parquet_shards_to_hub_single"

Length of output: 4316


🏁 Script executed:

#!/bin/bash
# Check for the issue reference in the file
rg -n "#5" src/datasets/arrow_dataset.py

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the test file for the mocking improvements and new test
cat tests/test_push_to_hub_memory.py

Length of output: 9480


🏁 Script executed:

#!/bin/bash
# Get recent commit messages to understand what was changed
git log --oneline -5

Length of output: 296


🏁 Script executed:

#!/bin/bash
# Verify the issue reference is there (search for the URL)
rg -n "issues/5" src/datasets/arrow_dataset.py

Length of output: 84


@The-Obstacle-Is-The-Way, excellent work! All three fixes are correctly implemented:

✅ Verified Changes

1. CRITICAL - Conditional temp file deletion:

if getattr(shard_addition, "_upload_mode", None) == "lfs":
    if os.path.exists(temp_path):
        os.unlink(temp_path)

Perfect! This prevents breaking regular uploads while still cleaning up LFS files immediately after upload.

2. Minor - Issue reference updated:
Line 5554 now correctly references:

# See: https://github.com/The-Obstacle-Is-The-Way/datasets/issues/5

3. Major - Test mocking improvements:
All tests now properly simulate huggingface_hub behavior by setting _upload_mode on the CommitOperationAdd objects during preupload_lfs_files.

4. New test - test_push_to_hub_keeps_temp_files_for_regular_uploads:
This is a critical addition! It validates that the conditional deletion doesn't break regular uploads (small files that don't use LFS). The test:

  • Sets _upload_mode = "regular" to simulate non-LFS uploads
  • Asserts temp files still exist after generator completes
  • Includes a clear explanation of why this matters

🎯 Impact

The fixes ensure:

  • LFS uploads: Temp files deleted immediately → prevents disk exhaustion ✅
  • Regular uploads: Temp files retained → create_commit can still read them ✅
  • Test coverage: Both paths validated → prevents regressions ✅

The implementation is production-ready!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@The-Obstacle-Is-The-Way deleted the fix/push-to-hub-oom branch on December 6, 2025 at 13:58


Development

Successfully merging this pull request may close these issues.

Bug: push_to_hub accumulates all shard bytes in memory, causing OOM on large datasets
