Implement sandbox exec through command router - Go#242

Open
ehdr wants to merge 39 commits into main from ehdr/sb-command-router-go

Conversation

@ehdr ehdr commented Dec 23, 2025

Same as #238 but for Go.


Note

Moves Sandbox exec I/O off the control plane to a dedicated Task Command Router with auth refresh, retries, and deadline-aware streaming.

  • New TaskCommandRouterClient (task_command_router_client.go) with JWT parsing/refresh, retry/backoff, stdio streaming, ExecStart/Wait/Poll, mount/snapshot helpers, and custom TLS creds (optional insecure via MODAL_TASK_COMMAND_ROUTER_INSECURE); applies 64 MiB gRPC window sizes and 100 MiB message limits
  • Sandbox.Exec rewritten to use the router (per-exec exec_id, deadline propagation); ContainerProcess stdin/stdout/stderr and Wait now use router APIs
  • Adds Sandbox.Detach() and adjusts Terminate(); output streams refactored; prevents ARG_MAX issues via ValidateExecArgs; new ExecTimeoutError; richer TaskExecStart request (stdout/stderr config, PTY, workdir, timeout)
  • Config expands Profile with TaskCommandRouterInsecure and env var override; client sets initial window sizes
  • Tests: extensive unit/integration coverage for exec proto, timeouts, stdin/stdout, detach/idempotency, snapshot flow; JS SDK parity updates (Exec timeout error, detach, retryable codes, flow-control options)

Written by Cursor Bugbot for commit bf4118f. This will update automatically on new commits.

@ehdr ehdr requested review from saltzm and thomasjpfan December 23, 2025 16:28
@thomasjpfan thomasjpfan left a comment

This is a first pass through the implementation.

}

// Close closes any resources associated with the Sandbox.
// This should be called when the Sandbox is no longer needed.

Contributor

The sandbox could still be running after close:

Suggested change
// This should be called when the Sandbox is no longer needed.
// This should be called when the Sandbox is no longer needed
// in the local client. The sandbox can still be running and
// accessed in other clients.


// Terminate stops the Sandbox.
func (sb *Sandbox) Terminate(ctx context.Context) error {
sb.Close()

Contributor

Closing the command router here is different from Python, although I think it's okay to do.

Contributor

@freider Would you be okay with closing in sandbox.Terminate?

My intuition is that if you are terminating a sandbox, then you also want to clean up any open streams. If you only want to close the streams and continue to use the sandbox, you can run sandbox.Close and use the sandbox in other clients.

Note that in Python, we wanted to call this method sandbox.cleanup. In Go, the name sb.Close does not immediately give the impression of "closing streams". An alternative is sandbox.Disconnect, which is more explicit about what it actually does.

Contributor

I think termination and disconnection might need to be distinct operations:

  • you may need to disconnect from a sandbox that you want to keep running (to resume in a later session, etc.)
  • you may want to terminate a sandbox but still read buffered output streams from it. This is probably less common, but I can see it being required in some cases (e.g., termination of the entrypoint triggers some output to be written that you want to read)

For Python I'm envisioning something like a context manager for "sandbox interaction" which doesn't terminate the sandbox but closes all connections/resources on exit, outside of which you can't read associated streams or do exec etc. For Go I guess the same would be accomplished with a "detach" in a defer statement. For full control/granularity I suppose we should also have a detach for ContainerProcess, but that feels less critical imo (we also lack terminate for those).

"google.golang.org/grpc/status"
)

// tlsCredsNoALPN is a TLS credential that skips ALPN enforcement.

Contributor

Can we include a comment about which interface tlsCredsNoALPN implements? I suspect it's https://pkg.go.dev/google.golang.org/grpc@v1.78.0/credentials#TransportCredentials

payloadB64 += "="
}

payloadJSON, err := base64.StdEncoding.DecodeString(payloadB64)

Contributor

Should this match the Python decoder, which uses urlsafe_b64decode?
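To illustrate the question above: JWT segments are encoded with the URL-safe base64 alphabet ('-' and '_') and normally carry no padding, so base64.StdEncoding can reject a valid token. A minimal sketch, assuming a hypothetical helper named decodeJWTPayload (not the PR's code), could look like this:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeJWTPayload is a hypothetical helper for illustration only.
// It returns the raw JSON bytes of the payload (second) segment of a JWT.
func decodeJWTPayload(token string) ([]byte, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("malformed JWT: expected 3 segments, got %d", len(parts))
	}
	// JWT segments use the URL-safe alphabet without padding.
	// Trimming '=' also tolerates tokens that were padded anyway.
	payload := strings.TrimRight(parts[1], "=")
	return base64.RawURLEncoding.DecodeString(payload)
}

func main() {
	// Build a token-shaped string from a known payload, then decode it again.
	payload := base64.RawURLEncoding.EncodeToString([]byte(`{"exp":1767225600}`))
	got, err := decodeJWTPayload("header." + payload + ".signature")
	fmt.Println(string(got), err)

	// "Pz8_" is the URL-safe encoding of "???"; the standard alphabet rejects '_'.
	_, stdErr := base64.StdEncoding.DecodeString("Pz8_")
	fmt.Println(stdErr != nil)
}
```

With RawURLEncoding the manual padding loop shown in the diff context would not be needed, though that is a design choice for the PR author rather than a requirement.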

if err != nil {
st, ok := status.FromError(err)
if (ok && st.Code() == codes.DeadlineExceeded) || errors.Is(err, errDeadlineExceeded) {
return nil, fmt.Errorf("deadline exceeded while polling for exec %s", execID)

Contributor

Should we introduce an error like Python's ExecTimeoutError here too?
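As a rough sketch of what a typed timeout error buys the caller: the shape of ExecTimeoutError below (a single ExecID field) is an assumption for illustration, as are the helpers wrapPollError, isExecTimeout, and simulateTimedOutPoll. The gRPC status-code check from the snippet above is left out so the example only needs the standard library.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ExecTimeoutError is a hypothetical shape; the real type may differ.
type ExecTimeoutError struct {
	ExecID string
}

func (e ExecTimeoutError) Error() string {
	return fmt.Sprintf("deadline exceeded while polling for exec %s", e.ExecID)
}

// wrapPollError converts a deadline error into ExecTimeoutError and
// passes every other error (including nil) through unchanged.
func wrapPollError(execID string, err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return ExecTimeoutError{ExecID: execID}
	}
	return err
}

// isExecTimeout reports whether err is, or wraps, an ExecTimeoutError.
func isExecTimeout(err error) bool {
	var te ExecTimeoutError
	return errors.As(err, &te)
}

// simulateTimedOutPoll stands in for a poll call that hit its deadline.
func simulateTimedOutPoll() error {
	return fmt.Errorf("poll: %w", context.DeadlineExceeded)
}

func main() {
	err := wrapPollError("exec-123", simulateTimedOutPoll())
	fmt.Println(isExecTimeout(err), err)
}
```

Compared with a bare fmt.Errorf string, callers can branch on the error type with errors.As instead of matching on message text.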

) {
if deadline != nil {
var cancel context.CancelFunc
ctx, cancel = context.WithDeadline(ctx, *deadline)

Contributor

Should the caller set context.WithDeadline so that the deadline no longer needs to be passed in here?
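A small sketch of the alternative raised here, where the deadline travels inside the context and the callee takes no separate deadline parameter. The names waitForExit and demo are hypothetical, and the millisecond arguments exist only to keep the example short:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForExit is a hypothetical callee. It has no deadline parameter:
// any deadline set by the caller arrives through ctx.
func waitForExit(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work): // simulated work finished in time
		return nil
	case <-ctx.Done(): // deadline or cancellation from the caller
		return ctx.Err()
	}
}

// demo plays the caller. A negative deadlineMs means "no deadline".
func demo(deadlineMs, workMs int) error {
	ctx := context.Background()
	if deadlineMs >= 0 {
		var cancel context.CancelFunc
		deadline := time.Now().Add(time.Duration(deadlineMs) * time.Millisecond)
		ctx, cancel = context.WithDeadline(ctx, deadline)
		defer cancel()
	}
	return waitForExit(ctx, time.Duration(workMs)*time.Millisecond)
}

func main() {
	fmt.Println(demo(20, 1000)) // work outlasts the deadline
	fmt.Println(demo(-1, 1))    // no deadline set
}
```

A callee that still needs the absolute time can recover it with ctx.Deadline(), so nothing is lost by dropping the extra parameter.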

@ehdr ehdr force-pushed the ehdr/sb-command-router-go branch from c5d396f to cf17046 Compare December 24, 2025 07:58
ehdr commented Dec 24, 2025

Thanks @thomasjpfan ! Pushed fixes and added comments now.

Comment on lines 356 to 365
var resp *pb.TaskExecStartResponse
_, err := callWithRetriesOnTransientErrors(ctx, func() (struct{}, error) {
	callErr := c.callWithAuthRetry(ctx, func(authCtx context.Context) error {
		var err error
		resp, err = c.stub.TaskExecStart(authCtx, request)
		return err
	})
	return struct{}{}, callErr
}, defaultRetryOptions())
return resp, err

Contributor

This feels a little hard to follow because of how resp is mutated in the inner scope and then returned in the outer scope.

I opened a PR targeting your branch to showcase the approach. This way ExecStart can be written as:

// ExecStart starts a command execution.
func (c *TaskCommandRouterClient) ExecStart(ctx context.Context, request *pb.TaskExecStartRequest) (*pb.TaskExecStartResponse, error) {
	return callWithRetriesOnTransientErrors(ctx, func() (*pb.TaskExecStartResponse, error) {
		return callWithAuthRetry(ctx, c, func(authCtx context.Context) (*pb.TaskExecStartResponse, error) {
			return c.stub.TaskExecStart(authCtx, request)
		})
	}, defaultRetryOptions())
}

Contributor Author

Very nice! Merged!
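For readers without the follow-up PR at hand, the idea in this thread can be sketched with a generic helper that returns the callback's result directly, so no variable has to be captured and mutated from the inner scope. This is a simplified stand-in, not the PR's callWithRetriesOnTransientErrors: the name callWithRetries, the errTransient sentinel, and the absence of backoff and context handling are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient marks errors worth retrying in this sketch.
var errTransient = errors.New("transient failure")

// callWithRetries calls fn until it succeeds, returns a non-transient
// error, or maxRetries retries have been used. The result type T flows
// straight through, so callers need no captured "resp" variable.
func callWithRetries[T any](maxRetries int, fn func() (T, error)) (T, error) {
	for attempt := 0; ; attempt++ {
		v, err := fn()
		if err == nil || !errors.Is(err, errTransient) || attempt >= maxRetries {
			return v, err
		}
	}
}

// flakyCall returns a function that fails `failures` times, then succeeds.
func flakyCall(failures int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errTransient
		}
		return "ok", nil
	}
}

func main() {
	v, err := callWithRetries(3, flakyCall(2))
	fmt.Println(v, err)
}
```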

defer sb.mu.Unlock()

if sb.commandRouterClient != nil {
_ = sb.commandRouterClient.Close()

Contributor

Does calling commandRouterClient.Close also end all the goroutines spawned by ExecStdioRead?

Contributor Author

Pushed a change in 3769bc8 now that propagates cancellation.


// This should be called when the Sandbox is no longer needed
// in the local client. The sandbox can still be running and
// accessed in other clients.
func (sb *Sandbox) Close() {

Contributor

Should this also close the streams associated with sb.Stdout and sb.Stderr?

Contributor Author

I think yes, pushed a fix now!

@ehdr ehdr force-pushed the ehdr/sb-command-router-go branch from e57855f to e3b8e8e Compare December 26, 2025 09:40
@ehdr ehdr force-pushed the ehdr/sb-command-router-go branch from e3b8e8e to edbb15d Compare December 26, 2025 09:46
@ehdr ehdr force-pushed the ehdr/sb-command-router-go branch from edbb15d to ac41d27 Compare December 26, 2025 12:53
@ehdr ehdr requested review from freider and thomasjpfan December 26, 2025 14:30
Comment on lines 521 to 523
var stdoutConfig pb.TaskExecStdoutConfig
switch stdout {
case Pipe:

Contributor

I think we do not need to do the stdout == "" part with:

switch params.Stdout {
    case Pipe, "":

}

// BuildTaskExecStartRequestProto builds a TaskExecStartRequest proto from command and options.
func BuildTaskExecStartRequestProto(taskID, execID string, command []string, params SandboxExecParams) (*pb.TaskExecStartRequest, error) {

Contributor

Nit: Looking at this with fresh eyes, I kind of wish that SandboxExecParams.Stdout were a pointer, so we could use nil as the default.

Contributor Author

Yeah, good point... it breaks the public interface though, so perhaps best to push it to a separate PR?

Contributor

Yea, let's not break public API.

func (sb *Sandbox) Terminate(ctx context.Context) error {
sb.mu.Lock()
if sb.commandRouterClient != nil {
_ = sb.commandRouterClient.Close()

Contributor

To match the Python SDK, I do not think we can close the commandRouterClient here?

Contributor Author

Yeah I'm for aligning it, but is there a point in having an active client if the SB is terminated?

}

if sb.Stdin != nil {
_ = sb.Stdin.Close()

Contributor

If any of the close errors, should we return the error up from Detach?
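One way to surface close errors without letting the first failure skip the remaining streams is to collect them and combine them with errors.Join (Go 1.20+). The sketch below is an assumption about how that could look; closeAll, namedCloser, and fakeStream are hypothetical names, not identifiers from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errBrokenPipe is a sample failure used by the demo stream below.
var errBrokenPipe = errors.New("broken pipe")

// namedCloser pairs a resource with a label for error messages.
type namedCloser struct {
	name string
	c    io.Closer
}

// closeAll closes every non-nil closer and joins the failures, so one
// failing Close does not prevent the others from running.
func closeAll(closers ...namedCloser) error {
	var errs []error
	for _, nc := range closers {
		if nc.c == nil {
			continue
		}
		if err := nc.c.Close(); err != nil {
			errs = append(errs, fmt.Errorf("%s.Close: %w", nc.name, err))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

// fakeStream is a test double that records whether Close was called.
type fakeStream struct {
	err    error
	closed bool
}

func (f *fakeStream) Close() error {
	f.closed = true
	return f.err
}

func main() {
	stdin := &fakeStream{err: errBrokenPipe}
	stdout := &fakeStream{}
	err := closeAll(
		namedCloser{name: "stdin", c: stdin},
		namedCloser{name: "stdout", c: stdout},
	)
	fmt.Println(err, stdout.closed)
}
```

Because the wrapped errors are preserved, callers can still test for a specific cause with errors.Is on the joined result.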

Comment on lines 1073 to 1074
mu sync.Mutex // protects offset
offset uint64

Contributor

Is this protecting the ExecStdinWrite and not just the offset? If it's just the offset, can we use an atomic uint64?

Contributor Author

Good point, updated the comment now
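The distinction in this thread can be made concrete. An atomic uint64 would make the counter itself race-free, but it would not keep "read offset, send bytes, advance offset" together as one step, which is what concurrent writers would need. The sketch below uses a hypothetical stdinWriter whose sent slice stands in for the ExecStdinWrite RPC; none of these names come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// stdinWriter is a hypothetical writer. The mutex serializes the whole
// write (send at current offset, then advance), not just the counter.
type stdinWriter struct {
	mu     sync.Mutex
	offset uint64
	sent   []uint64 // offsets in the order they were "sent"
}

func (w *stdinWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.sent = append(w.sent, w.offset) // stand-in for the RPC carrying the offset
	w.offset += uint64(len(p))
	return len(p), nil
}

// concurrentWrites issues n writes of chunk from n goroutines.
func concurrentWrites(n int, chunk []byte) *stdinWriter {
	w := &stdinWriter{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = w.Write(chunk)
		}()
	}
	wg.Wait()
	return w
}

func main() {
	w := concurrentWrites(8, []byte("abcd"))
	fmt.Println(w.offset, w.sent)
}
```

If writes are only ever issued from one goroutine, neither the mutex nor an atomic is needed, which is a separate question from which of the two to pick.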


jwt := resp.GetJwt()
c.jwt.Store(&jwt)
jwtExp := parseJwtExpiration(ctx, jwt, c.logger)

Contributor

Currently, parseJwtExpiration returns nil rather than an error if the JWT is malformed. I suspect that in that case we will no longer be able to communicate with the task command router.

Should we return an error if jwt is malformed?

Contributor Author

Yeah good point. This used to fall back to the server path, but we won't have that in Go. I'll think of something!
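One possible direction, sketched under assumptions: have the parser return an error alongside the expiry so the caller can decide whether a malformed token is fatal. The function parseExpClaim and the sentinel errNoExp are hypothetical, and the input here is the already-decoded payload JSON rather than the full token.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// errNoExp signals a payload that parsed but carries no exp claim.
var errNoExp = errors.New("JWT payload has no exp claim")

// parseExpClaim is a hypothetical helper. It takes the decoded payload
// JSON and returns the expiry, or an error if the payload is malformed.
func parseExpClaim(payloadJSON []byte) (time.Time, error) {
	var claims struct {
		// exp is a JSON number of seconds since the Unix epoch.
		Exp *float64 `json:"exp"`
	}
	if err := json.Unmarshal(payloadJSON, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parsing JWT payload: %w", err)
	}
	if claims.Exp == nil {
		return time.Time{}, errNoExp
	}
	return time.Unix(int64(*claims.Exp), 0), nil
}

func main() {
	exp, err := parseExpClaim([]byte(`{"exp":1767225600}`))
	fmt.Println(exp.UTC(), err)

	_, err = parseExpClaim([]byte(`{}`))
	fmt.Println(err)
}
```

Whether a missing exp should abort the refresh or only disable proactive refresh is left open here; the sketch just makes the two cases distinguishable.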

@thomasjpfan thomasjpfan left a comment

From the failing tests, I guess we cannot terminate and then detach.

Looking at the backend, we cannot call SandboxStdinWrite on a terminated sandbox, but we need SandboxStdinWrite to send the EOF.

ehdr commented Jan 8, 2026

Yeah, I just know this test passed on Dec. 26, so I was meaning to figure out whether and why this changed! Will look into it tomorrow.

@saltzm saltzm left a comment

Looks good! Just a few comments/questions

if err := sb.commandRouterClient.Close(); err != nil {
errs = append(errs, fmt.Errorf("commandRouterClient.Close: %w", err))
}
sb.commandRouterClient = nil

Seems like there's nothing to prevent someone from doing a new exec on the same sandbox object after calling Detach(), is that intended? (Note: I didn't follow the detach implementation in Python so don't have all the context) Seems like you'd want future exec calls to fail with some error

Just saw there are explicit test cases for this. Do you know the context there?

Contributor Author

Changing this behavior so exec after detach is reported as an error instead, as discussed offline!

execID string
commandRouterClient *TaskCommandRouterClient

mu sync.Mutex // serializes writes to maintain offset consistency

Hmm... I don't think we would expect ContainerProcess APIs to be thread-safe would we?

In other words, I wouldn't expect a user to be able to call stdin.write from multiple threads simultaneously - that'd be a weird thing to do I think. So I don't think there's a need for a mutex?

Contributor Author

Good point, removed it!

@@ -0,0 +1,176 @@
package modal

Could we add some tests for retries on auth errors?

if isRetryableGrpc(err) && retries > 0 {
retries--
continue
}

Missing FailedPrecondition handling in cpStdin.Close

Low Severity

The sbStdin.Close() method was updated with FailedPrecondition error handling to gracefully return nil when the sandbox has already terminated or stdin is closed. However, cpStdin.Close() does not have equivalent handling and will return errors when called after an exec has terminated. This inconsistency means closing exec stdin after process termination returns an error while closing sandbox stdin in the same scenario returns nil.

cursor bot commented Jan 24, 2026

PR Summary

Migrates Sandbox exec to a dedicated Task Command Router with JWT auth, retry/backoff, and deadline-aware stdio streaming.

  • Adds TaskCommandRouterClient (Go) with custom TLS (no-ALPN), token refresh, transient error retries, ExecStart/Wait/Poll/StdioRead, stdin offset writes, and directory mount/snapshot helpers
  • Refactors Sandbox.Exec/ContainerProcess to use router (per-exec exec_id, stdout/stderr pipes, stdin offsets, deadline handling) and closes router on Terminate; introduces ExecTimeoutError and ValidateExecArgs
  • Config: new task_command_router_insecure (and MODAL_TASK_COMMAND_ROUTER_INSECURE) to skip TLS verify; increases gRPC initial window sizes
  • JS parity: adds ExecTimeoutError, centralizes auth retry (callWithAuthRetry), shares retryable codes, sets flow-control window, and updates tests
  • Extensive unit/integration tests for request building, retries/auth refresh, stdio, timeouts, and filesystem snapshot via exec

Written by Cursor Bugbot for commit 9918b08. This will update automatically on new commits.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

sb.mu.Unlock()
return nil, InvalidError{Exception: "cannot call Exec on a detached Sandbox"}
}
sb.mu.Unlock()

Race condition allows exec after sandbox detach

Medium Severity

The Exec method checks sb.detached under lock at lines 565-570, then releases the lock before calling getOrCreateCommandRouterClient. The getOrCreateCommandRouterClient function doesn't re-check the detached flag. If Detach is called concurrently between the check and getOrCreateCommandRouterClient, a new command router client will be created and the exec will proceed despite the sandbox being detached. This is a TOCTOU (time-of-check-time-of-use) race condition.
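A minimal sketch of one way to close this window: perform the detached check and the client creation under the same lock acquisition, so Detach cannot slip in between them. The types and names below (sandbox, routerClient, getOrCreateClient, detach, errDetached) are simplified assumptions, not the PR's definitions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDetached = errors.New("cannot call Exec on a detached Sandbox")

// routerClient stands in for the command router client.
type routerClient struct{}

type sandbox struct {
	mu       sync.Mutex
	detached bool
	client   *routerClient
}

// getOrCreateClient checks the detached flag and creates the client
// while holding the lock once, so no detach can happen between the
// check and the creation.
func (sb *sandbox) getOrCreateClient() (*routerClient, error) {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	if sb.detached {
		return nil, errDetached
	}
	if sb.client == nil {
		sb.client = &routerClient{}
	}
	return sb.client, nil
}

// detach marks the sandbox detached and drops the client.
func (sb *sandbox) detach() {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	sb.detached = true
	sb.client = nil
}

func main() {
	sb := &sandbox{}
	c, err := sb.getOrCreateClient()
	fmt.Println(c != nil, err)

	sb.detach()
	c, err = sb.getOrCreateClient()
	fmt.Println(c != nil, err)
}
```

If creating the real client involves a network call, holding the lock across it would block Detach for that long; re-checking the flag after creation and discarding the new client is an alternative with the same outcome.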

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

func (sb *Sandbox) Terminate(ctx context.Context) error {
if err := sb.closeTaskCommandRouterClient(); err != nil {
return err
}

Sandbox termination blocked by local client close failure

Medium Severity

In Terminate(), if closeTaskCommandRouterClient() returns an error (which can happen if the gRPC connection close fails), the function returns early without calling SandboxTerminate on the server. This means the sandbox won't actually be terminated, contradicting user expectations. The JS implementation doesn't have this issue because its close() method is void and always proceeds to call sandboxTerminate. The server-side termination should proceed regardless of whether closing the local client connection succeeds.
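The ordering suggested in this report could be sketched as follows: attempt the local close, always attempt the server-side termination, and then report whatever failed. The function terminate and its two callback parameters are hypothetical stand-ins for closeTaskCommandRouterClient and the SandboxTerminate RPC.

```go
package main

import (
	"errors"
	"fmt"
)

// errClose is a sample local-close failure for the demo.
var errClose = errors.New("closing gRPC connection failed")

// terminate runs the local cleanup and the remote termination
// unconditionally, then reports both outcomes together.
func terminate(closeLocal, terminateRemote func() error) error {
	closeErr := closeLocal()
	// The server-side call proceeds even if local cleanup failed.
	termErr := terminateRemote()
	return errors.Join(termErr, closeErr) // nil when both are nil
}

func main() {
	remoteCalled := false
	err := terminate(
		func() error { return errClose },
		func() error { remoteCalled = true; return nil },
	)
	fmt.Println(remoteCalled, err)
}
```

Whether the local close error should be returned at all, or only logged as the void close() in the JS implementation implies, is a judgment call this sketch does not settle.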
