Implement sandbox exec through command router - Go#242
Conversation
Same as dfef157 but for Go.
modal-go/sandbox.go
Outdated
| } | ||
|
|
||
| // Close closes any resources associated with the Sandbox. | ||
| // This should be called when the Sandbox is no longer needed. |
There was a problem hiding this comment.
The sandbox could still be running after close:
| // This should be called when the Sandbox is no longer needed. | |
| // This should be called when the Sandbox is no longer needed | |
| // in the local client. The sandbox can still be running and | |
| // accessed in other clients. |
modal-go/sandbox.go
Outdated
|
|
||
| // Terminate stops the Sandbox. | ||
| func (sb *Sandbox) Terminate(ctx context.Context) error { | ||
| sb.Close() |
There was a problem hiding this comment.
Closing the command router here is different from Python, although I think it's okay to do.
There was a problem hiding this comment.
@freider Would you be okay with closing in sandbox.Terminate?
My intuition is that if you are terminating a sandbox, then you also want to clean up any open streams. If you only want to close the streams and continue to use the sandbox, you can run sandbox.Close and use the sandbox in other clients.
Note, in Python, we wanted to call this method sandbox.cleanup. In go, the naming of sb.Close, does not immediately give the impression of "closing streams". An alternative is sandbox.Disconnect, which is more explicit about what it is actually doing.
There was a problem hiding this comment.
I think termination and disconnection might need to be mutually distinct:
- you may need to disconnect from a sandbox that you want to keep running (resume in a later session etc)
- you may want to terminate a sandbox but still read buffered output streams from it. This is probably less common but I can see this being required in some cases (eg termination of the entrypoint triggers some output to be written that you want to read)
For Python I'm envisioning something like a context manager for "sandbox interaction" which doesn't terminate the sandbox but terminates all connections/resources when exiting, outside of which you can't read associated streams or do exec etc. For go I guess the same would be accomplished with a "detach" in a defer statement. For full control/granularity I suppose we should also have a detach for ContainerProcess but that feels less critical imo (we also lack terminate for those)
| "google.golang.org/grpc/status" | ||
| ) | ||
|
|
||
| // tlsCredsNoALPN is a TLS credential that skips ALPN enforcement. |
There was a problem hiding this comment.
Can we include a comment about what interface that tlsCredsNoALPN is implementing? I suspect it's https://pkg.go.dev/google.golang.org/grpc@v1.78.0/credentials#TransportCredentials
| payloadB64 += "=" | ||
| } | ||
|
|
||
| payloadJSON, err := base64.StdEncoding.DecodeString(payloadB64) |
There was a problem hiding this comment.
Should this match the python decoder, which uses urlsafe_b64decode?
| if err != nil { | ||
| st, ok := status.FromError(err) | ||
| if (ok && st.Code() == codes.DeadlineExceeded) || errors.Is(err, errDeadlineExceeded) { | ||
| return nil, fmt.Errorf("deadline exceeded while polling for exec %s", execID) |
There was a problem hiding this comment.
Should we introduce an error like Python's ExecTimeoutError here too?
| ) { | ||
| if deadline != nil { | ||
| var cancel context.CancelFunc | ||
| ctx, cancel = context.WithDeadline(ctx, *deadline) |
There was a problem hiding this comment.
Should the caller set context.WithDeadline such that deadline does not need to passed in here anymore?
c5d396f to
cf17046
Compare
|
Thanks @thomasjpfan ! Pushed fixes and added comments now. |
| var resp *pb.TaskExecStartResponse | ||
| _, err := callWithRetriesOnTransientErrors(ctx, func() (struct{}, error) { | ||
| callErr := c.callWithAuthRetry(ctx, func(authCtx context.Context) error { | ||
| var err error | ||
| resp, err = c.stub.TaskExecStart(authCtx, request) | ||
| return err | ||
| }) | ||
| return struct{}{}, callErr | ||
| }, defaultRetryOptions()) | ||
| return resp, err |
There was a problem hiding this comment.
This feels a little hard to follow because of how resp is mutated in the inner scope and then returned in the outer scope.
I opened a PR targeting your branch to showcase the approach. This way ExecStart can be written as:
// ExecStart starts a command execution.
func (c *TaskCommandRouterClient) ExecStart(ctx context.Context, request *pb.TaskExecStartRequest) (*pb.TaskExecStartResponse, error) {
return callWithRetriesOnTransientErrors(ctx, func() (*pb.TaskExecStartResponse, error) {
return callWithAuthRetry(ctx, c, func(authCtx context.Context) (*pb.TaskExecStartResponse, error) {
return c.stub.TaskExecStart(authCtx, request)
})
}, defaultRetryOptions())
}
modal-go/sandbox.go
Outdated
| defer sb.mu.Unlock() | ||
|
|
||
| if sb.commandRouterClient != nil { | ||
| _ = sb.commandRouterClient.Close() |
There was a problem hiding this comment.
Does calling commandRouterClient.Close also end all the goroutines spawned by ExecStdioRead?
There was a problem hiding this comment.
Pushed a change in 3769bc8 now that propagates cancellation.
modal-go/sandbox.go
Outdated
|
|
||
| // Terminate stops the Sandbox. | ||
| func (sb *Sandbox) Terminate(ctx context.Context) error { | ||
| sb.Close() |
There was a problem hiding this comment.
@freider Would you be okay with closing in sandbox.Terminate?
My intuition is that if you are terminating a sandbox, then you also want to clean up any open streams. If you only want to close the streams and continue to use the sandbox, you can run sandbox.Close and use the sandbox in other clients.
Note, in Python, we wanted to call this method sandbox.cleanup. In go, the naming of sb.Close, does not immediately give the impression of "closing streams". An alternative is sandbox.Disconnect, which is more explicit about what it is actually doing.
modal-go/sandbox.go
Outdated
| // This should be called when the Sandbox is no longer needed | ||
| // in the local client. The sandbox can still be running and | ||
| // accessed in other clients. | ||
| func (sb *Sandbox) Close() { |
There was a problem hiding this comment.
Should this also close the streams associated with sb.Stdout and sb.Stderr?
There was a problem hiding this comment.
I think yes, pushed a fix now!
e57855f to
e3b8e8e
Compare
e3b8e8e to
edbb15d
Compare
edbb15d to
ac41d27
Compare
modal-go/sandbox.go
Outdated
| var stdoutConfig pb.TaskExecStdoutConfig | ||
| switch stdout { | ||
| case Pipe: |
There was a problem hiding this comment.
I think we do not need to do the stdout == "" part with:
switch params.Stdout {
case Pipe, "":
modal-go/sandbox.go
Outdated
| } | ||
|
|
||
| // BuildTaskExecStartRequestProto builds a TaskExecStartRequest proto from command and options. | ||
| func BuildTaskExecStartRequestProto(taskID, execID string, command []string, params SandboxExecParams) (*pb.TaskExecStartRequest, error) { |
There was a problem hiding this comment.
Nit: Looking at this with fresh eyes, I kind of wished that SandboxExecParams.Stdout was a pointer, so we can use nil as the default.
There was a problem hiding this comment.
Yeah, good point... it breaks the public interface though, so perhaps best to push it to a separate PR?
There was a problem hiding this comment.
Yea, let's not break public API.
modal-go/sandbox.go
Outdated
| func (sb *Sandbox) Terminate(ctx context.Context) error { | ||
| sb.mu.Lock() | ||
| if sb.commandRouterClient != nil { | ||
| _ = sb.commandRouterClient.Close() |
There was a problem hiding this comment.
To match the Python SDK, I do not think we can close the commandRouterClient here?
There was a problem hiding this comment.
Yeah I'm for aligning it, but is there a point in having an active client if the SB is terminated?
modal-go/sandbox.go
Outdated
| } | ||
|
|
||
| if sb.Stdin != nil { | ||
| _ = sb.Stdin.Close() |
There was a problem hiding this comment.
If any of the close errors, should we return the error up from Detach?
modal-go/sandbox.go
Outdated
| mu sync.Mutex // protects offset | ||
| offset uint64 |
There was a problem hiding this comment.
Is this protecting the ExecStdinWrite and not just the offset? If it's just the offset, can we use an atomic uint64?
There was a problem hiding this comment.
Good point, updated the comment now
|
|
||
| jwt := resp.GetJwt() | ||
| c.jwt.Store(&jwt) | ||
| jwtExp := parseJwtExpiration(ctx, jwt, c.logger) |
There was a problem hiding this comment.
Currently, parseJwtExpiration does not return an error if jwt is malformed and return nil. I suspect that we'll no longer be able to communicate to the task command router.
Should we return an error if jwt is malformed?
There was a problem hiding this comment.
Yeah good point. This used to fall back to the server path, but we won't have that in Go. I'll think of something!
thomasjpfan
left a comment
There was a problem hiding this comment.
From the failing tests, I guess we can not terminate then detach.
Looking at the backend, we can not SandboxStdinWrite on a terminated sandbox, but we need SandboxStdinWrite to send the EOF.
|
Yeah, I just know this test passed on Dec. 26, so was meaning to figure out if/why this changed! Will look into it tomorrow. |
saltzm
left a comment
There was a problem hiding this comment.
Looks good! Just a few comments/questions
| if err := sb.commandRouterClient.Close(); err != nil { | ||
| errs = append(errs, fmt.Errorf("commandRouterClient.Close: %w", err)) | ||
| } | ||
| sb.commandRouterClient = nil |
There was a problem hiding this comment.
Seems like there's nothing to prevent someone from doing a new exec on the same sandbox object after calling Detach(), is that intended? (Note: I didn't follow the detach implementation in Python so don't have all the context) Seems like you'd want future exec calls to fail with some error
There was a problem hiding this comment.
Just saw there are explicit test cases for this. Do you know the context there?
There was a problem hiding this comment.
Changing this behavior so exec after detach is reported as an error instead, as discussed offline!
modal-go/sandbox.go
Outdated
| execID string | ||
| commandRouterClient *TaskCommandRouterClient | ||
|
|
||
| mu sync.Mutex // serializes writes to maintain offset consistency |
There was a problem hiding this comment.
Hmm... I don't think we would expect ContainerProcess APIs to be thread-safe would we?
There was a problem hiding this comment.
In other words, I wouldn't expect a user to be able to call stdin.write from multiple threads simultaneously - that'd be a weird thing to do I think. So I don't think there's a need for a mutex?
There was a problem hiding this comment.
Good point, removed it!
| @@ -0,0 +1,176 @@ | |||
| package modal | |||
There was a problem hiding this comment.
Could we add some tests for retries on auth errors?
| if isRetryableGrpc(err) && retries > 0 { | ||
| retries-- | ||
| continue | ||
| } |
There was a problem hiding this comment.
Missing FailedPrecondition handling in cpStdin.Close
Low Severity
The sbStdin.Close() method was updated with FailedPrecondition error handling to gracefully return nil when the sandbox has already terminated or stdin is closed. However, cpStdin.Close() does not have equivalent handling and will return errors when called after an exec has terminated. This inconsistency means closing exec stdin after process termination returns an error while closing sandbox stdin in the same scenario returns nil.
Additional Locations (1)
PR SummaryMigrates Sandbox exec to a dedicated Task Command Router with JWT auth, retry/backoff, and deadline-aware stdio streaming.
Written by Cursor Bugbot for commit 9918b08. This will update automatically on new commits. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
modal-go/sandbox.go
Outdated
| sb.mu.Unlock() | ||
| return nil, InvalidError{Exception: "cannot call Exec on a detached Sandbox"} | ||
| } | ||
| sb.mu.Unlock() |
There was a problem hiding this comment.
Race condition allows exec after sandbox detach
Medium Severity
The Exec method checks sb.detached under lock at lines 565-570, then releases the lock before calling getOrCreateCommandRouterClient. The getOrCreateCommandRouterClient function doesn't re-check the detached flag. If Detach is called concurrently between the check and getOrCreateCommandRouterClient, a new command router client will be created and the exec will proceed despite the sandbox being detached. This is a TOCTOU (time-of-check-time-of-use) race condition.
Additional Locations (1)
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| func (sb *Sandbox) Terminate(ctx context.Context) error { | ||
| if err := sb.closeTaskCommandRouterClient(); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Sandbox termination blocked by local client close failure
Medium Severity
In Terminate(), if closeTaskCommandRouterClient() returns an error (which can happen if the gRPC connection close fails), the function returns early without calling SandboxTerminate on the server. This means the sandbox won't actually be terminated, contradicting user expectations. The JS implementation doesn't have this issue because its close() method is void and always proceeds to call sandboxTerminate. The server-side termination should proceed regardless of whether closing the local client connection succeeds.
Same as #238 but for Go.
Note
Moves Sandbox exec I/O off the control plane to a dedicated Task Command Router with auth refresh, retries, and deadline-aware streaming.
TaskCommandRouterClient(task_command_router_client.go) with JWT parsing/refresh, retry/backoff, stdio streaming,ExecStart/Wait/Poll, mount/snapshot helpers, and custom TLS creds (optional insecure viaMODAL_TASK_COMMAND_ROUTER_INSECURE); applies 64 MiB gRPC window sizes and 100 MiB message limitsSandbox.Execrewritten to use the router (per-execexec_id, deadline propagation);ContainerProcessstdin/stdout/stderr andWaitnow use router APIsSandbox.Detach()and adjustsTerminate(); output streams refactored; prevents ARG_MAX issues viaValidateExecArgs; newExecTimeoutError; richerTaskExecStartrequest (stdout/stderr config, PTY, workdir, timeout)ProfilewithTaskCommandRouterInsecureand env var override; client sets initial window sizesWritten by Cursor Bugbot for commit bf4118f. This will update automatically on new commits. Configure here.