Implement sandbox exec through command router - Go by ehdr · Pull Request #242 · modal-labs/libmodal

ehdr · 2025-12-23T16:27:49Z

Same as #238 but for Go.

Note

Moves Sandbox exec I/O off the control plane to a dedicated Task Command Router with auth refresh, retries, and deadline-aware streaming.

New TaskCommandRouterClient (task_command_router_client.go) with JWT parsing/refresh, retry/backoff, stdio streaming, ExecStart/Wait/Poll, mount/snapshot helpers, and custom TLS creds (optional insecure via MODAL_TASK_COMMAND_ROUTER_INSECURE); applies 64 MiB gRPC window sizes and 100 MiB message limits
Sandbox.Exec rewritten to use the router (per-exec exec_id, deadline propagation); ContainerProcess stdin/stdout/stderr and Wait now use router APIs
Adds Sandbox.Detach() and adjusts Terminate(); output streams refactored; prevents ARG_MAX issues via ValidateExecArgs; new ExecTimeoutError; richer TaskExecStart request (stdout/stderr config, PTY, workdir, timeout)
Config expands Profile with TaskCommandRouterInsecure and env var override; client sets initial window sizes
Tests: extensive unit/integration coverage for exec proto, timeouts, stdin/stdout, detach/idempotency, snapshot flow; JS SDK parity updates (Exec timeout error, detach, retryable codes, flow-control options)

Written by Cursor Bugbot for commit bf4118f. This will update automatically on new commits.

Same as dfef157 but for Go.

thomasjpfan

This is a first pass through the implementation.

thomasjpfan · 2025-12-23T21:13:13Z

modal-go/sandbox.go

+}
+
+// Close closes any resources associated with the Sandbox.
+// This should be called when the Sandbox is no longer needed.


The sandbox could still be running after close:

Suggested change

-// This should be called when the Sandbox is no longer needed.
+// This should be called when the Sandbox is no longer needed
+// in the local client. The sandbox can still be running and
+// accessed in other clients.

thomasjpfan · 2025-12-23T21:13:50Z

modal-go/sandbox.go

+
 // Terminate stops the Sandbox.
 func (sb *Sandbox) Terminate(ctx context.Context) error {
+	sb.Close()


Closing the command router here is different from Python, although I think it's okay to do.

@freider Would you be okay with closing in sandbox.Terminate?

My intuition is that if you are terminating a sandbox, then you also want to clean up any open streams. If you only want to close the streams and continue to use the sandbox, you can run sandbox.Close and use the sandbox in other clients.

Note, in Python, we wanted to call this method sandbox.cleanup. In Go, the name sb.Close does not immediately give the impression of "closing streams". An alternative is sandbox.Disconnect, which is more explicit about what it actually does.

I think termination and disconnection might need to be mutually distinct:

you may need to disconnect from a sandbox that you want to keep running (resume in a later session etc)

you may want to terminate a sandbox but still read buffered output streams from it. This is probably less common but I can see this being required in some cases (eg termination of the entrypoint triggers some output to be written that you want to read)

For Python I'm envisioning something like a context manager for "sandbox interaction" which doesn't terminate the sandbox but terminates all connections/resources when exiting, outside of which you can't read associated streams or do exec etc. For Go I guess the same would be accomplished with a "detach" in a defer statement. For full control/granularity I suppose we should also have a detach for ContainerProcess, but that feels less critical imo (we also lack terminate for those).

thomasjpfan · 2025-12-23T21:25:04Z

modal-go/task_command_router_client.go

+	"google.golang.org/grpc/status"
+)
+
+// tlsCredsNoALPN is a TLS credential that skips ALPN enforcement.


Can we include a comment about which interface tlsCredsNoALPN implements? I suspect it's https://pkg.go.dev/google.golang.org/grpc@v1.78.0/credentials#TransportCredentials

thomasjpfan · 2025-12-23T21:28:27Z

modal-go/task_command_router_client.go

+		payloadB64 += "="
+	}
+
+	payloadJSON, err := base64.StdEncoding.DecodeString(payloadB64)


Should this match the Python decoder, which uses urlsafe_b64decode?

thomasjpfan · 2025-12-23T22:41:19Z

modal-go/task_command_router_client.go

+	if err != nil {
+		st, ok := status.FromError(err)
+		if (ok && st.Code() == codes.DeadlineExceeded) || errors.Is(err, errDeadlineExceeded) {
+			return nil, fmt.Errorf("deadline exceeded while polling for exec %s", execID)


Should we introduce an error like Python's ExecTimeoutError here too?

thomasjpfan · 2025-12-23T22:47:10Z

modal-go/task_command_router_client.go

+) {
+	if deadline != nil {
+		var cancel context.CancelFunc
+		ctx, cancel = context.WithDeadline(ctx, *deadline)


Should the caller set context.WithDeadline so that the deadline no longer needs to be passed in here?

ehdr · 2025-12-24T08:40:36Z

Thanks @thomasjpfan ! Pushed fixes and added comments now.

thomasjpfan · 2025-12-24T15:50:59Z

modal-go/task_command_router_client.go

+	var resp *pb.TaskExecStartResponse
+	_, err := callWithRetriesOnTransientErrors(ctx, func() (struct{}, error) {
+		callErr := c.callWithAuthRetry(ctx, func(authCtx context.Context) error {
+			var err error
+			resp, err = c.stub.TaskExecStart(authCtx, request)
+			return err
+		})
+		return struct{}{}, callErr
+	}, defaultRetryOptions())
+	return resp, err


This feels a little hard to follow because of how resp is mutated in the inner scope and then returned in the outer scope.

I opened a PR targeting your branch to showcase the approach. This way ExecStart can be written as:

// ExecStart starts a command execution.
func (c *TaskCommandRouterClient) ExecStart(ctx context.Context, request *pb.TaskExecStartRequest) (*pb.TaskExecStartResponse, error) {
	return callWithRetriesOnTransientErrors(ctx, func() (*pb.TaskExecStartResponse, error) {
		return callWithAuthRetry(ctx, c, func(authCtx context.Context) (*pb.TaskExecStartResponse, error) {
			return c.stub.TaskExecStart(authCtx, request)
		})
	}, defaultRetryOptions())
}

Very nice! Merged!

thomasjpfan · 2025-12-24T16:00:20Z

modal-go/sandbox.go

+	defer sb.mu.Unlock()
+
+	if sb.commandRouterClient != nil {
+		_ = sb.commandRouterClient.Close()


Does calling commandRouterClient.Close also end all the goroutines spawned by ExecStdioRead?

Pushed a change in 3769bc8 now that propagates cancellation.
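One common way to make Close stop reader goroutines is to give them a shared cancellable context and wait for them to exit. The sketch below shows that general technique under assumptions. The streamClient type and its methods are hypothetical and do not reproduce the change in 3769bc8.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// streamClient is a hypothetical stand-in for a client that owns reader goroutines.
type streamClient struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newStreamClient() *streamClient {
	ctx, cancel := context.WithCancel(context.Background())
	return &streamClient{ctx: ctx, cancel: cancel}
}

// startReader spawns a goroutine that forwards chunks until the
// source is closed or the client is closed.
func (c *streamClient) startReader(chunks <-chan string, out chan<- string) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for {
			select {
			case <-c.ctx.Done():
				return
			case s, ok := <-chunks:
				if !ok {
					return
				}
				select {
				case out <- s:
				case <-c.ctx.Done():
					return
				}
			}
		}
	}()
}

// Close cancels the shared context and waits for every reader to exit.
func (c *streamClient) Close() {
	c.cancel()
	c.wg.Wait()
}

func main() {
	c := newStreamClient()
	chunks := make(chan string) // never written to, so the reader blocks
	out := make(chan string)
	c.startReader(chunks, out)

	// Close returns only after the blocked reader has observed cancellation.
	c.Close()
	fmt.Println("all readers stopped")
}
```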

thomasjpfan · 2025-12-24T16:08:04Z

modal-go/sandbox.go

+// This should be called when the Sandbox is no longer needed
+// in the local client. The sandbox can still be running and
+// accessed in other clients.
+func (sb *Sandbox) Close() {


Should this also close the streams associated with sb.Stdout and sb.Stderr?

I think yes, pushed a fix now!

thomasjpfan · 2026-01-06T15:13:48Z

modal-go/sandbox.go

+	var stdoutConfig pb.TaskExecStdoutConfig
+	switch stdout {
+	case Pipe:


I think we can drop the stdout == "" check by writing:

switch params.Stdout {
case Pipe, "":
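A self-contained sketch of that idea: listing the empty string next to Pipe lets the zero value share the Pipe branch, so no separate default-filling step is needed. The StdioBehavior type, the constant values, and resolveStdio are assumptions that only mirror the names used in this review.

```go
package main

import "fmt"

// StdioBehavior and its constants are assumptions for this sketch.
type StdioBehavior string

const (
	Pipe   StdioBehavior = "pipe"
	Ignore StdioBehavior = "ignore"
)

// resolveStdio maps a possibly empty setting to a concrete behavior.
func resolveStdio(b StdioBehavior) (StdioBehavior, error) {
	switch b {
	case Pipe, "": // the zero value shares the Pipe branch
		return Pipe, nil
	case Ignore:
		return Ignore, nil
	default:
		return "", fmt.Errorf("unsupported stdio behavior: %q", b)
	}
}

func main() {
	for _, b := range []StdioBehavior{"", Pipe, Ignore, "bogus"} {
		got, err := resolveStdio(b)
		fmt.Printf("%q -> %q (err: %v)\n", b, got, err)
	}
}
```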

thomasjpfan · 2026-01-06T15:17:16Z

modal-go/sandbox.go

+}
+
+// BuildTaskExecStartRequestProto builds a TaskExecStartRequest proto from command and options.
+func BuildTaskExecStartRequestProto(taskID, execID string, command []string, params SandboxExecParams) (*pb.TaskExecStartRequest, error) {


Nit: Looking at this with fresh eyes, I kind of wish SandboxExecParams.Stdout were a pointer, so we could use nil as the default.

Yeah, good point... it breaks the public interface though, so perhaps best to push it to a separate PR?

Yea, let's not break public API.

thomasjpfan · 2026-01-06T16:05:35Z

modal-go/sandbox.go

 func (sb *Sandbox) Terminate(ctx context.Context) error {
+	sb.mu.Lock()
+	if sb.commandRouterClient != nil {
+		_ = sb.commandRouterClient.Close()


To match the Python SDK, I do not think we can close the commandRouterClient here?

Yeah I'm for aligning it, but is there a point in having an active client if the SB is terminated?

thomasjpfan · 2026-01-06T16:07:00Z

modal-go/sandbox.go

+	}
+
+	if sb.Stdin != nil {
+		_ = sb.Stdin.Close()


If any of the Close calls returns an error, should we return that error from Detach?

thomasjpfan · 2026-01-06T16:11:38Z

modal-go/sandbox.go

+	mu     sync.Mutex // protects offset
+	offset uint64


Is this protecting the ExecStdinWrite and not just the offset? If it's just the offset, can we use an atomic uint64?

Good point, updated the comment now
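For comparison, this is roughly what the atomic alternative looks like when only the counter needs protection. The offsetCounter type is hypothetical. Note the limit of the approach: an atomic counter hands out non-overlapping ranges but does not order the writes themselves, so a mutex is still needed if each write must reach the server in offset order.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// offsetCounter hands out non-overlapping byte ranges without a mutex.
type offsetCounter struct {
	next atomic.Uint64
}

// reserve returns the start offset for a write of n bytes and
// advances the counter past it in a single atomic step.
func (o *offsetCounter) reserve(n int) uint64 {
	return o.next.Add(uint64(n)) - uint64(n)
}

func main() {
	var oc offsetCounter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			oc.reserve(10)
		}()
	}
	wg.Wait()
	fmt.Println(oc.next.Load()) // prints 1000
}
```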

thomasjpfan · 2026-01-06T16:37:23Z

modal-go/task_command_router_client.go

+
+		jwt := resp.GetJwt()
+		c.jwt.Store(&jwt)
+		jwtExp := parseJwtExpiration(ctx, jwt, c.logger)


Currently, parseJwtExpiration does not return an error if the JWT is malformed; it returns nil instead. I suspect that in that case we will no longer be able to communicate with the task command router.

Should we return an error if jwt is malformed?

Yeah good point. This used to fall back to the server path, but we won't have that in Go. I'll think of something!

thomasjpfan

From the failing tests, I guess we cannot terminate and then detach.

Looking at the backend, we cannot SandboxStdinWrite on a terminated sandbox, but we need SandboxStdinWrite to send the EOF.

ehdr · 2026-01-08T21:36:11Z

Yeah, I just know this test passed on Dec. 26, so was meaning to figure out if/why this changed! Will look into it tomorrow.

saltzm

Looks good! Just a few comments/questions

saltzm · 2026-01-09T15:31:19Z

modal-go/sandbox.go

+		if err := sb.commandRouterClient.Close(); err != nil {
+			errs = append(errs, fmt.Errorf("commandRouterClient.Close: %w", err))
+		}
+		sb.commandRouterClient = nil


Seems like there's nothing to prevent someone from doing a new exec on the same sandbox object after calling Detach(). Is that intended? (Note: I didn't follow the detach implementation in Python, so I don't have all the context.) Seems like you'd want future exec calls to fail with some error.

Just saw there are explicit test cases for this. Do you know the context there?

Changing this behavior so exec after detach is reported as an error instead, as discussed offline!

saltzm · 2026-01-09T15:35:06Z

modal-go/sandbox.go

+	execID              string
+	commandRouterClient *TaskCommandRouterClient
+
+	mu     sync.Mutex // serializes writes to maintain offset consistency


Hmm... I don't think we would expect ContainerProcess APIs to be thread-safe would we?

In other words, I wouldn't expect a user to be able to call stdin.write from multiple threads simultaneously - that'd be a weird thing to do I think. So I don't think there's a need for a mutex?

Good point, removed it!

saltzm · 2026-01-09T16:00:44Z

modal-go/task_command_router_client_test.go

@@ -0,0 +1,176 @@
+package modal


Could we add some tests for retries on auth errors?

cursor · 2026-01-12T08:52:00Z

modal-go/sandbox.go

-				if isRetryableGrpc(err) && retries > 0 {
-					retries--
-					continue
-				}


Missing FailedPrecondition handling in cpStdin.Close

Low Severity

The sbStdin.Close() method was updated with FailedPrecondition error handling to gracefully return nil when the sandbox has already terminated or stdin is closed. However, cpStdin.Close() does not have equivalent handling and will return errors when called after an exec has terminated. This inconsistency means closing exec stdin after process termination returns an error while closing sandbox stdin in the same scenario returns nil.

Additional Locations (1)

modal-go/sandbox.go#L1064-L1067

cursor · 2026-01-24T11:45:18Z

PR Summary

Migrates Sandbox exec to a dedicated Task Command Router with JWT auth, retry/backoff, and deadline-aware stdio streaming.

Adds TaskCommandRouterClient (Go) with custom TLS (no-ALPN), token refresh, transient error retries, ExecStart/Wait/Poll/StdioRead, stdin offset writes, and directory mount/snapshot helpers
Refactors Sandbox.Exec/ContainerProcess to use router (per-exec exec_id, stdout/stderr pipes, stdin offsets, deadline handling) and closes router on Terminate; introduces ExecTimeoutError and ValidateExecArgs
Config: new task_command_router_insecure (and MODAL_TASK_COMMAND_ROUTER_INSECURE) to skip TLS verify; increases gRPC initial window sizes
JS parity: adds ExecTimeoutError, centralizes auth retry (callWithAuthRetry), shares retryable codes, sets flow-control window, and updates tests
Extensive unit/integration tests for request building, retries/auth refresh, stdio, timeouts, and filesystem snapshot via exec

Written by Cursor Bugbot for commit 9918b08. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-24T11:57:09Z

modal-go/sandbox.go

+		sb.mu.Unlock()
+		return nil, InvalidError{Exception: "cannot call Exec on a detached Sandbox"}
+	}
+	sb.mu.Unlock()


Race condition allows exec after sandbox detach

Medium Severity

The Exec method checks sb.detached under lock at lines 565-570, then releases the lock before calling getOrCreateCommandRouterClient. The getOrCreateCommandRouterClient function doesn't re-check the detached flag. If Detach is called concurrently between the check and getOrCreateCommandRouterClient, a new command router client will be created and the exec will proceed despite the sandbox being detached. This is a TOCTOU (time-of-check-time-of-use) race condition.

Additional Locations (1)

modal-go/sandbox.go#L695-L717

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-26T22:57:00Z

modal-go/sandbox.go

 func (sb *Sandbox) Terminate(ctx context.Context) error {
+	if err := sb.closeTaskCommandRouterClient(); err != nil {
+		return err
+	}


Sandbox termination blocked by local client close failure

Medium Severity

In Terminate(), if closeTaskCommandRouterClient() returns an error (which can happen if the gRPC connection close fails), the function returns early without calling SandboxTerminate on the server. This means the sandbox won't actually be terminated, contradicting user expectations. The JS implementation doesn't have this issue because its close() method is void and always proceeds to call sandboxTerminate. The server-side termination should proceed regardless of whether closing the local client connection succeeds.

ehdr added 2 commits December 22, 2025 14:12

Implement sandbox exec through command router - Go

8598b9f

Same as dfef157 but for Go.

ALPN work-around

317bd28

ehdr requested review from saltzm and thomasjpfan December 23, 2025 16:28

cursor bot reviewed Dec 23, 2025

ehdr added 5 commits December 23, 2025 22:21

Sandbox lock on crClient

d3053ac

Improve concurrenct handling for JWT refresh

4480e14

Protect offset in cpStdin with lock

8a29162

Pass context to parseJwtExpiration

1b7ec91

Don't defer cancel in loop

2fca8e9

cursor bot reviewed Dec 23, 2025

ehdr added 2 commits December 23, 2025 23:10

Properly check for deadline errors

fb11aaa

Deflake test

e2247db

thomasjpfan reviewed Dec 23, 2025

ehdr added 3 commits December 24, 2025 09:00

Clarify how Sandbox.Close should be used

dfda531

Add Sandbox.close to JS

4d7c894

Clarify that tlsCredsNoALPN implements TransportCredentials

d853530

cursor bot reviewed Dec 24, 2025

Use only urlsafe decode in parseJwtExpiration

7b1d778

ehdr force-pushed the ehdr/sb-command-router-go branch from c5d396f to cf17046 Compare December 24, 2025 07:58

cursor bot reviewed Dec 24, 2025

thomasjpfan reviewed Dec 24, 2025

cursor bot reviewed Dec 25, 2025

ehdr force-pushed the ehdr/sb-command-router-go branch from e57855f to e3b8e8e Compare December 26, 2025 09:40

ehdr added 4 commits December 26, 2025 11:46

Add ExecTimeoutError type

3bfc2ae

Use deadline from context in streamStdio

d2209e5

Handle MODAL_TASK_COMMAND_ROUTER_INSECURE

ce2e1c1

Align window size with Python client

9ff05f3

ehdr force-pushed the ehdr/sb-command-router-go branch from e3b8e8e to edbb15d Compare December 26, 2025 09:46

Refine sb.Terminate and .Detach

ac41d27

ehdr force-pushed the ehdr/sb-command-router-go branch from edbb15d to ac41d27 Compare December 26, 2025 12:53

ehdr requested review from freider and thomasjpfan December 26, 2025 14:30

thomasjpfan reviewed Jan 6, 2026

ehdr added 5 commits January 7, 2026 15:12

Make buildTaskExecStartRequestProto private

7187404

Make Detach return internal errors

7570bc7

More explicit error handling in parseJwtExpiration

78d1dbf

Simplify pipe init logic

7f60805

Clarify mutex scope

35f6232

thomasjpfan reviewed Jan 8, 2026

saltzm reviewed Jan 9, 2026

Handle stream close on already closed Sandboxes

bf4118f

cursor bot reviewed Jan 12, 2026

ehdr added 2 commits January 12, 2026 22:01

Exec after detach is an error

0fec347

Remove stdin write mutex since not thread-safe anyway

4ace8aa

cursor bot reviewed Jan 24, 2026

thomasjpfan added 8 commits January 26, 2026 11:00

Remove detached in go

ae9a841

Remove detached in js

4a5d3fe

Add test for auth retrying

5f0dbc8

Remove background refreshes

7187343

Add tests for modal-js and extend go tests

ff3d391

linter

3452700

Fix lint

49f391d

Revert changes

9918b08

thomasjpfan mentioned this pull request Jan 26, 2026

sandbox: dynamic directory mount & snapshot in go + adds detach #246

Open

cursor bot reviewed Jan 26, 2026
