Skip to content

Add filesystem zero copy during pause via reflink (XFS, Btrfs)#2249

Merged
dobrac merged 9 commits intomainfrom
zero-copy-fs
Apr 2, 2026
Merged

Add filesystem zero copy during pause via reflink (XFS, Btrfs)#2249
dobrac merged 9 commits intomainfrom
zero-copy-fs

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Mar 28, 2026

Check/add tests or code for the following things:

  • Ensure that on orchestrator start, the pause and overlay paths are always on the same disk
    • Testing on cluster with XFS file systems succeeds, implying that we are not copying across mountpoints.
    • Also added fallback logic. If CopyFileRange fails (due to copying across disks or because the system call is not supported), we fall back to normal copy
  • Check that this works even without XFS
  • Check that this works for copying bigger ranges than the max rw correctly
    • Tested with writing the whole disk and everything works fine.
  • Test that the changes to flushing are correct
  • Measure impact of the diff+mapping building and prepare for perf(block): sync.Map -> atomic bitmap for cache dirty tracking #2235

@cursor
Copy link
Copy Markdown

cursor bot commented Mar 28, 2026

PR Summary

Medium Risk
Touches low-level diff export and filesystem syscalls (SyncFileRange/CopyFileRange) and changes the diff writer interface to require *os.File, so regressions could corrupt diffs or leak resources on error paths. Fallback copy and added cache-close-on-timeout mitigate some failure cases but this still affects snapshot correctness/performance.

Overview
Switches rootfs diff export during sandbox pause from per-block mmap reads/writes to a range-based file copy path using unix.CopyFileRange (enabling reflink/zero-copy on supporting filesystems) with a normal io.Copy fallback when unsupported. The diff export pipeline is updated to pass an *os.File through DiffCreator/rootfs.Provider, builds dirty metadata without sorting, adds SyncFileRange as an early writeback hint, emits new telemetry timings, and ensures the cache mmap is closed on context timeout/export failure to avoid leaks.

Written by Cursor Bugbot for commit 0af2e06. This will update automatically on new commits. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb3920427b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ValentaTomas
Copy link
Copy Markdown
Member Author

ValentaTomas commented Mar 28, 2026

  • This falls back to normal copy on disks not supporting reflink
  • It removes the checks for empty blocks, but that is something that should not be very common + if we want empty checks we should do them either by implementing the NBD related functionality or by checking at the time of the write. Also, the followup compression should handle the empty parts too.

@ValentaTomas
Copy link
Copy Markdown
Member Author

I also added measurements of how long parts of the rootfs pause (flush, copy, etc.) take to the trace attributes.

Copy link
Copy Markdown
Contributor

@matthewlouisbrockman matthewlouisbrockman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran tests to make sure no degredation #2252 that all pass

@ValentaTomas ValentaTomas marked this pull request as ready for review March 31, 2026 20:39
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: No fallback when range copy unsupported
    • Added fallback to regular I/O (io.CopyN) when CopyFileRange fails, allowing diff export to work across different filesystems and mounts.
  • ✅ Fixed: Hard failure on unsupported sync syscall
    • Made SyncFileRange non-fatal by ignoring its errors since it's an optimization, not a correctness requirement.

Create PR

Or push these changes by commenting:

@cursor push d89a067423
Preview (d89a067423)
diff --git a/packages/orchestrator/pkg/sandbox/block/cache.go b/packages/orchestrator/pkg/sandbox/block/cache.go
--- a/packages/orchestrator/pkg/sandbox/block/cache.go
+++ b/packages/orchestrator/pkg/sandbox/block/cache.go
@@ -4,6 +4,7 @@
 	"context"
 	"errors"
 	"fmt"
+	"io"
 	"math"
 	"math/rand"
 	"os"
@@ -129,10 +130,8 @@
 
 	// Explicit mmap flush is not necessary, because the kernel will handle that as part of the copy_file_range syscall.
 	// Calling sync_file_range marks the range for writeback and starts it early.
-	err = unix.SyncFileRange(src, 0, c.size, unix.SYNC_FILE_RANGE_WRITE)
-	if err != nil {
-		return nil, fmt.Errorf("error syncing file: %w", err)
-	}
+	// This is an optimization and not required for correctness, so we ignore errors.
+	_ = unix.SyncFileRange(src, 0, c.size, unix.SYNC_FILE_RANGE_WRITE)
 
 	buildStart := time.Now()
 	builder := header.NewDiffMetadataBuilder(c.size, c.blockSize)
@@ -150,6 +149,7 @@
 	dst := int(out.Fd())
 	var writeOffset int64
 	var totalRanges int64
+	useFallback := false
 
 	copyStart := time.Now()
 	for r := range BitsetRanges(diffMetadata.Dirty, diffMetadata.BlockSize) {
@@ -157,27 +157,46 @@
 		remaining := int(r.Size)
 		readOffset := r.Start
 
-		// The kernel may return short writes (e.g. capped at MAX_RW_COUNT on non-reflink filesystems),
-		// so we loop until the full range is copied. The offset pointers are advanced by the kernel.
-		for remaining > 0 {
-			// On XFS this uses reflink automatically.
-			n, err := unix.CopyFileRange(
-				src,
-				&readOffset,
-				dst,
-				&writeOffset,
-				remaining,
-				0,
-			)
-			if err != nil {
-				return nil, fmt.Errorf("error copying file range: %w", err)
+		if !useFallback {
+			// The kernel may return short writes (e.g. capped at MAX_RW_COUNT on non-reflink filesystems),
+			// so we loop until the full range is copied. The offset pointers are advanced by the kernel.
+			for remaining > 0 {
+				// On XFS this uses reflink automatically.
+				n, err := unix.CopyFileRange(
+					src,
+					&readOffset,
+					dst,
+					&writeOffset,
+					remaining,
+					0,
+				)
+				if err != nil {
+					// Fall back to regular I/O if copy_file_range fails (e.g., cross-filesystem or unsupported).
+					useFallback = true
+					break
+				}
+
+				if n == 0 {
+					return nil, fmt.Errorf("copy_file_range returned 0 with %d bytes remaining", remaining)
+				}
+
+				remaining -= n
 			}
+		}
 
-			if n == 0 {
-				return nil, fmt.Errorf("copy_file_range returned 0 with %d bytes remaining", remaining)
+		if useFallback && remaining > 0 {
+			// Use regular I/O as fallback
+			if _, err := f.Seek(readOffset, io.SeekStart); err != nil {
+				return nil, fmt.Errorf("error seeking source file: %w", err)
 			}
-
-			remaining -= n
+			if _, err := out.Seek(writeOffset, io.SeekStart); err != nil {
+				return nil, fmt.Errorf("error seeking destination file: %w", err)
+			}
+			n, err := io.CopyN(out, f, int64(remaining))
+			if err != nil {
+				return nil, fmt.Errorf("error copying data: %w", err)
+			}
+			writeOffset += n
 		}
 	}

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

@bchalios bchalios force-pushed the zero-copy-fs branch 3 times, most recently from 0ca57c5 to d80183b Compare April 2, 2026 13:08
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

ValentaTomas and others added 9 commits April 2, 2026 15:40
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
In nbd.ExportDiff() we were returning early in case something was going
wrong downstream without calling cache.Close(). This causes a leak of
mmap'ed memory.

Make sure we call cache.Close() in those cases. cache.Close() itself,
might fail. In these cases, log a warning and return the initial error.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
SyncFileRange is an optimization to let the kernel start wrting back
data from the mmap to the disk on XFS. If that fails, we can rely on the
CopyFileRange to actually performs the operation.

Just log a warning on SyncFileRange failures and hard fail on
CopyFileRange.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
When copy_file_range fails with EXDEV, EOPNOTSUPP, or ENOSYS (e.g.
cross-device copies or filesystems that don't support the syscall),
fall back to a regular io.Copy for the remainder of the export.
Other errors still hard-fail.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
mmap.MapRegion expects the library's own constants (mmap.RDWR), not raw
unix PROT_* flags. Passing unix.PROT_READ|unix.PROT_WRITE (=3) happened
to work due to numerical coincidence, but is fragile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
@dobrac dobrac added the improvement Improvement for current functionality label Apr 2, 2026
@bchalios
Copy link
Copy Markdown
Contributor

bchalios commented Apr 2, 2026

I've rebased the tests from here: #2252 and let them re-run. Tests are passing.

@dobrac dobrac merged commit 98e0431 into main Apr 2, 2026
36 checks passed
@dobrac dobrac deleted the zero-copy-fs branch April 2, 2026 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement for current functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants