Conversation
PR Summary (Medium Risk)
Overview written by Cursor Bugbot for commit 10c5309. This will update automatically on new commits.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```go
var nfsOptions = strings.Join([]string{
	// wait for data to be sent to proxy server before returning.
	// async might cause issues if the sandbox is shut down suddenly.
	"sync",
```
NFS sync option not removed despite PR discussion agreement
Medium Severity
The sync NFS mount option appears to still be present despite the author agreeing in the PR discussion to remove it ("Despite the performance gains, I'll remove that one"). The sync option forces synchronous writes, which can significantly degrade NFS throughput (2-3x slower than the default async), partially negating the performance gains from the rsize/wsize buffer increases. This contradicts the PR's stated goal of improving performance by 5-10%.
lol.
- claude, locally: "we're missing async, that'll make it faster"
- claude, in pr: "async will cause data issues, use sync"
- cursor, in pr: "you still have sync, I thought you wanted to switch to async"
data corruption overrules speed, we're done here.
```diff
+var nfsOptions = strings.Join([]string{
+	// wait for data to be sent to proxy server before returning.
+	// async might cause issues if the sandbox is shut down suddenly.
+	"sync",
+	"rsize=1048576",  // 1 MB read buffer
+	"wsize=1048576",  // 1 MB write buffer
+	"mountproto=tcp", // nfs proxy only supports tcp
+	"mountport=2049", // nfs proxy only supports mounting on port 2049
+	"proto=tcp",      // nfs proxy only supports tcp
+	"port=2049",      // nfs proxy only supports mounting on port 2049
+	"nfsvers=3",      // nfs proxy is nfs version 3
+	"noacl",          // no reason for acl in the sandbox
+}, ",")
+
 func (a *API) setupNfs(ctx context.Context, nfsTarget, path string) {
 	commands := [][]string{
 		{"mkdir", "-p", path},
-		{"mount", "-v", "-t", "nfs", "-o", "fg,hard,mountproto=tcp,mountport=2049,proto=tcp,port=2049,nfsvers=3,noacl", nfsTarget, path},
+		{"mount", "-v", "-t", "nfs", "-o", "fg,hard," + nfsOptions, nfsTarget, path},
 	}

 	for _, command := range commands {
```
🔴 This PR introduces sync on a hard-mounted NFS volume — a dangerous combination where every write blocks until the NFS proxy acknowledges it and hard means there is no timeout. If the NFS proxy becomes unreachable after the mount is established (crash, network partition, restart), any sandbox process writing to the volume will hang indefinitely in uninterruptible sleep (D state) with no way to kill it. A safer approach that still preserves write durability is sync + soft + timeo=N + retrans=M, which allows the kernel to eventually return an error to the application instead of blocking forever.
Extended reasoning...
Bug: hard + sync NFS mount creates indefinite I/O hang risk
What the bug is and how it manifests
Before this PR, the mount command used fg,hard,mountproto=tcp,... with no explicit async/sync flag, defaulting to NFS async mode. With async, writes complete as soon as they are buffered in the client page cache; the kernel retries delivery to the server in the background. This PR introduces an explicit sync option, which changes write semantics entirely: every write() syscall now blocks until the NFS proxy sends an acknowledgement. Combined with the pre-existing hard mount option (which means the client retries indefinitely and never returns an error to the application), any process performing write I/O on the mount point will block indefinitely if the NFS proxy becomes unavailable.
The specific code path
In packages/envd/internal/api/init.go (lines 235-256), nfsOptions is constructed with "sync" as the first entry. The setupNfs function then passes "fg,hard," + nfsOptions to the mount command. The hard flag was present before this PR; the sync flag is new. Together they produce a mount with both hard and sync active.
Why existing code does not prevent it
There is no timeout, watchdog, or sandbox-level health check that would detect and recover from a process stuck in D state. The NFS hard option is explicitly designed to prevent the kernel from ever returning EIO or ETIMEDOUT to the application. Adding sync on top means the kernel cannot complete the write optimistically by buffering; it must wait for the server. There is no timeo or retrans option to bound the wait (and the intr option would not help either, since it has been a no-op since Linux 2.6.25).
Impact
If the NFS proxy crashes, is restarted, or a network partition occurs after a volume is mounted — all plausible in a sandbox environment — every sandbox process that subsequently attempts a write will enter uninterruptible sleep (D state). These processes cannot be killed with SIGKILL. The only recovery is a full VM termination. This turns a transient infrastructure hiccup into a permanent sandbox freeze.
Step-by-step proof of the hang
- Sandbox starts; setupNfs mounts the proxy volume with fg,hard,sync,...
- A sandbox user process opens a file on the mounted path and calls write().
- The Linux NFS client sends a WRITE RPC to the proxy and blocks the process, waiting for the NFS ACK (sync requires server acknowledgment before returning to userspace).
- At this moment, the NFS proxy crashes or becomes unreachable.
- The NFS client retries the WRITE RPC repeatedly per the hard semantics — indefinitely, with no timeout.
- The user process remains stuck in D state, unkillable. Any other process that subsequently writes to the same mount also gets stuck.
- The only recovery is to terminate the VM.
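As a side note, a stuck writer can at least be observed from outside the sandbox process. This is a minimal sketch (not from the PR) of scanning /proc on Linux for processes in D state, the condition described in the steps above:

```go
// Detect processes stuck in uninterruptible sleep (D state) by scanning
// /proc/<pid>/stat. A watchdog could use this to flag a hung NFS mount,
// since SIGKILL cannot reclaim such processes. Linux-only sketch.
package main

import (
	"fmt"
	"os"
	"strings"
)

// dStatePIDs returns the PIDs whose state field in /proc/<pid>/stat is "D".
func dStatePIDs() ([]string, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []string
	for _, e := range entries {
		pid := e.Name()
		if pid[0] < '0' || pid[0] > '9' {
			continue // not a process directory
		}
		data, err := os.ReadFile("/proc/" + pid + "/stat")
		if err != nil {
			continue // process exited while we were scanning
		}
		// The comm field is parenthesized and may itself contain spaces,
		// so the state is the first field after the LAST closing paren.
		s := string(data)
		i := strings.LastIndexByte(s, ')')
		if i < 0 || i+2 >= len(s) {
			continue
		}
		fields := strings.Fields(s[i+1:])
		if len(fields) > 0 && fields[0] == "D" {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := dStatePIDs()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("processes in D state: %d %v\n", len(pids), pids)
}
```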
How to fix it
Replace hard with soft and add explicit timeo and retrans values: e.g., sync,soft,timeo=100,retrans=3,rsize=1048576,wsize=1048576,.... With soft, after retrans failed retries the kernel returns EIO to the application instead of retrying forever. sync can be kept to preserve the data-durability intent. The soft + timeo + retrans combination is the most portable and operationally safe choice.


Seems to improve things by about 5-10%.