fix: use temp files instead of BytesIO in push_to_hub to prevent OOM #7
The-Obstacle-Is-The-Way wants to merge 4 commits into main.
Conversation
When uploading large datasets with many shards, the previous implementation held all parquet shard bytes in memory via the `additions` list. This caused OOM for datasets with hundreds of shards (e.g., 902 shards × 300MB ≈ 270GB).

Changes:
- Write parquet to a temp file instead of BytesIO
- Pass the file path to `CommitOperationAdd` (huggingface_hub streams from disk)
- Delete the temp file after `preupload_lfs_files` completes
- Remove the unused `BytesIO` import

Memory usage now stays constant (~1-2 shard sizes) instead of growing linearly with the number of shards.

Fixes: #5
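A minimal sketch of this approach, assuming `shard` is a `Dataset` shard produced by the sharding loop, `api` is a `huggingface_hub.HfApi` instance, and `repo_id` / `path_in_repo` are already known; this is illustrative only, not the exact code in the PR:

```python
import os
import tempfile

from huggingface_hub import CommitOperationAdd

# Write the shard to disk instead of an in-memory BytesIO buffer.
# (`shard`, `api`, `repo_id`, and `path_in_repo` are assumed to exist.)
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    temp_path = tmp.name
shard.to_parquet(temp_path)

# Passing a file path lets huggingface_hub stream the upload from disk.
addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=temp_path)
api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")

# After preupload, the LFS content is on the Hub, so the local copy can go.
os.unlink(temp_path)
```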
@coderabbitai review
Walkthrough

The pull request optimizes the parquet shard upload in `push_to_hub`: shards are written to temporary files and streamed from disk instead of being held in memory.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✅ Actions performed: Review triggered.
Wrap the fallible upload operations in try...finally to guarantee temp file cleanup even if preupload_lfs_files() throws (network timeout, 500 error, KeyboardInterrupt). Without this, a flaky upload of 900 shards could leave hundreds of ~300MB temp files in /tmp, potentially filling the disk.
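A minimal sketch of the suggested guard, with placeholder names (`api`, `addition`, `repo_id`, `temp_path`) standing in for the real variables:

```python
import os

try:
    api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
finally:
    # Runs even on network timeouts, HTTP 500s, or KeyboardInterrupt, so a
    # failed shard upload cannot leave its temp file behind in /tmp.
    if os.path.exists(temp_path):
        os.unlink(temp_path)
```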
@coderabbitai review

Added try...finally to ensure temp file cleanup even if upload fails (network timeout, 500 error, KeyboardInterrupt). This prevents disk exhaustion from leaked temp files during flaky uploads of large datasets.
@The-Obstacle-Is-The-Way, excellent implementation! The implementation properly prevents disk exhaustion during flaky uploads of large datasets. The pattern is exactly what's needed here.

✅ Actions performed: Review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/test_push_to_hub_memory.py (1)
14-153: Consider adding error handling tests.

The current tests validate the happy path well, but consider adding tests for error scenarios:
- What happens if `preupload_lfs_files` raises an exception?
- What if `to_parquet` fails during shard writing?
- Are temporary files still cleaned up in error cases?
This would help ensure robustness and catch potential resource leaks. Note that the current implementation in `src/datasets/arrow_dataset.py` doesn't appear to have try-finally protection around `os.unlink(temp_path)`, so errors could leave temp files behind; error-case tests would expose this.

Example test structure:
```python
def test_push_to_hub_cleans_up_temp_files_on_error(self):
    """Temp files should be cleaned up even if upload fails."""
    ds = Dataset.from_dict({"x": list(range(100))})

    with patch("datasets.arrow_dataset.HfApi") as mock_api_class:
        mock_api = MagicMock()
        mock_api_class.return_value = mock_api
        # Simulate upload failure
        mock_api.preupload_lfs_files.side_effect = Exception("Upload failed")

        with pytest.raises(Exception):
            list(ds._push_parquet_shards_to_hub_single(...))

        # Verify temp files were still cleaned up
        # (would require tracking as in test_push_to_hub_cleans_up_temp_files)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `src/datasets/arrow_dataset.py` (1 hunks)
- `tests/test_push_to_hub_memory.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_push_to_hub_memory.py (1)
src/datasets/arrow_dataset.py (4)
- `Dataset` (703-6466)
- `from_dict` (973-1034)
- `_push_parquet_shards_to_hub_single` (5513-5572)
- `split` (172-174)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
tests/test_push_to_hub_memory.py (2)
17-62: LGTM: Test correctly validates file-based commit operations.

This test effectively verifies the core fix: that `CommitOperationAdd` uses file paths instead of in-memory bytes, which enables streaming from disk and prevents OOM.
64-113: LGTM: Test correctly validates temp file cleanup.

The test properly verifies that temporary files are deleted after upload, which is essential to prevent disk exhaustion when processing large datasets with many shards.
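One way such a cleanup test can track temp files is sketched below, using assumed helper names (the actual test in this PR may be structured differently):

```python
import os
import tempfile
from unittest.mock import patch

created_paths = []
real_named_tempfile = tempfile.NamedTemporaryFile


def tracking_named_tempfile(*args, **kwargs):
    # Delegate to the real factory, but remember every temp file name created.
    tmp = real_named_tempfile(*args, **kwargs)
    created_paths.append(tmp.name)
    return tmp


with patch("tempfile.NamedTemporaryFile", side_effect=tracking_named_tempfile):
    ...  # run the push_to_hub code under test here

# After the upload, none of the recorded temp files should remain on disk.
assert created_paths and all(not os.path.exists(p) for p in created_paths)
```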
tests/test_push_to_hub_memory.py
Outdated
```
@@ -0,0 +1,153 @@
"""Tests for memory-safe push_to_hub with large datasets.

Regression tests for https://github.com/huggingface/datasets/issues/XXXX
```
Replace the issue number placeholder.
The placeholder "XXXX" should be replaced with the actual issue number. Based on the PR objectives, this appears to be issue #5.
Apply this diff:
```diff
-Regression tests for https://github.com/huggingface/datasets/issues/XXXX
+Regression tests for https://github.com/huggingface/datasets/issues/5
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
Regression tests for https://github.com/huggingface/datasets/issues/5
```
🤖 Prompt for AI Agents
In tests/test_push_to_hub_memory.py around line 3, replace the placeholder
"https://github.com/huggingface/datasets/issues/XXXX" with the actual issue URL
"https://github.com/huggingface/datasets/issues/5" so the regression test
references the correct issue number; update the line text accordingly to reflect
issue #5.
Addresses CodeRabbit's CRITICAL review feedback:

1. Only delete the temp file when `_upload_mode == "lfs"`
   - For LFS uploads, content is already on the Hub after `preupload_lfs_files`
   - For regular uploads, `create_commit` still needs to read from disk
   - Deleting temp files unconditionally broke non-LFS uploads
2. Replace the XXXX placeholder with the actual issue reference
3. Update tests to properly simulate huggingface_hub behavior (see the sketch below):
   - Mock `preupload_lfs_files` to set the `_upload_mode` attribute
   - Add a test for regular (non-LFS) upload behavior
   - Mock `upload_info` to make the size assertion meaningful
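A hedged sketch of the test-side mocking described in point 3 above, with illustrative names only (the real tests may differ):

```python
from unittest.mock import MagicMock


def fake_preupload_lfs_files(repo_id, additions, **kwargs):
    # huggingface_hub tags each CommitOperationAdd with an _upload_mode
    # ("lfs" or "regular") during preupload; the mock simulates that so the
    # conditional temp-file deletion in the code under test is exercised.
    for op in additions:
        op._upload_mode = "lfs"


mock_api = MagicMock()
mock_api.preupload_lfs_files.side_effect = fake_preupload_lfs_files
# Setting op._upload_mode = "regular" instead lets a second test check that
# the temp file is kept on disk until create_commit has read it.
```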
@coderabbitai review

Addressed all 3 actionable comments:
Added new test.
@The-Obstacle-Is-The-Way, excellent work! All three fixes are correctly implemented:

✅ Verified Changes

1. CRITICAL - Conditional temp file deletion:

```python
if getattr(shard_addition, "_upload_mode", None) == "lfs":
    if os.path.exists(temp_path):
        os.unlink(temp_path)
```

Perfect! This prevents breaking regular uploads while still cleaning up LFS files immediately after upload.

2. Minor - Issue reference updated:

```python
# See: https://github.com/The-Obstacle-Is-The-Way/datasets/issues/5
```

3. Major - Test mocking improvements

4. New test
🎯 Impact

The fixes ensure:
The implementation is production-ready!

✅ Actions performed: Review triggered.
Summary
Fixes memory accumulation in `_push_parquet_shards_to_hub_single` that caused OOM when uploading large datasets.

Problem: The previous implementation held ALL parquet shard bytes in memory via the `additions` list. For large datasets (e.g., 902 shards × 300MB = 270GB), this caused OOM.

Solution: Write parquet to a temp file, pass the file path to `CommitOperationAdd`, and delete the temp file after upload.

Changes
`src/datasets/arrow_dataset.py`:
- Write parquet shards to temp files instead of `BytesIO`
- Pass file paths to `CommitOperationAdd` (huggingface_hub streams from disk)
- Delete temp files after `preupload_lfs_files` completes
- Remove the unused `BytesIO` import

`tests/test_push_to_hub_memory.py`:
- Add regression tests (see Test Plan)

Test Plan

- `test_push_to_hub_uses_file_path_not_bytes_in_commit_operation` - Verifies file paths are used
- `test_push_to_hub_cleans_up_temp_files` - Verifies temp files are cleaned up
- `test_push_to_hub_uploaded_size_still_calculated` - Verifies size calculation still works

Related

- `burner_docs/BUGS/spec-push-to-hub-oom/SPEC.md`

Summary by CodeRabbit
Bug Fixes
Tests