Feat/k8s sandbox #3171

Draft
pedrofrxncx wants to merge 19 commits into main from
feat/k8s-sandbox

Conversation

@pedrofrxncx
Collaborator

@pedrofrxncx pedrofrxncx commented Apr 24, 2026

What is this contribution about?

Describe your changes and why they're needed.

Screenshots/Demonstration

Add screenshots or a Loom video if your changes affect the UI.

How to Test

Provide step-by-step instructions for reviewers to test your changes:

  1. Step one
  2. Step two
  3. Expected outcome

Migration Notes

If this PR requires database migrations, configuration changes, or other setup steps, document them here. Remove this section if not applicable.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

Summary by cubic

Adds an agent-sandbox runner (Kubernetes via kubernetes-sigs/agent-sandbox) and a preview subdomain reverse-proxy that routes *.preview.<domain> to each sandbox daemon. Also ships Helm support, local kind scripts, monitoring, UI/SDK updates, improved daemon proxying (WS + dynamic port discovery), logging, and a release workflow. Runner kind/env are now agent-sandbox and STUDIO_SANDBOX_RUNNER.

  • New Features

    • Agent-sandbox runner: @decocms/sandbox/runner/agent-sandbox (opt in with STUDIO_SANDBOX_RUNNER=agent-sandbox). Uses Bun TLS fetch with kubeconfig, per-tenant pod labels (org/user), per-claim DAEMON_TOKEN, readiness watch, and a single port-forward to the daemon that also carries preview traffic; optional public previews via previewUrlPattern. Rehydrates on daemon bootId and lazy-loads @kubernetes/client-node.
    • Preview networking: reverse-proxy routes <handle>.preview.<base-domain> to the matching sandbox daemon on port 9000 and upgrades WS; preview-admin paths are blocked. WS buffering caps pending frames.
    • Config: STUDIO_SANDBOX_PREVIEW_URL_PATTERN enables public preview URLs; wired through Helm (sandbox.agentSandbox.previewUrlPattern/previewGateway) and configMap.meshConfig.
    • Mesh/SDK/UI: widened unions to include "agent-sandbox"; mesh dynamically imports @decocms/sandbox/runner/agent-sandbox only when selected.
    • Helm: adds charts/agent-sandbox subchart (vendors operator + CRDs), shared SandboxTemplate, optional SandboxWarmPool, mesh RBAC, and a NetworkPolicy. New knobs: nodeSelector (default amd64, override for arm64), tolerations, hostUsers, readOnlyRootFilesystem, plus optional wildcard Gateway + Certificate for previews. Operator namespace gets PodSecurity labels. Vendored files marked generated via .gitattributes.
    • Local dev: deploy/k8s-sandbox/local/* kind scripts (up.sh/down.sh/reload-image.sh) and an end-to-end smoke.ts. Optional monitoring stack (kube-prometheus-stack + OTel collector) and a Grafana dashboard with updated CPU/network metrics.
    • Daemon: WebSocket reverse-proxy for HMR, dynamic descendant port discovery, smarter probe scoring, and SSE logs teed to stdout for kubectl logs.
    • CI/Deps: new workflow builds/pushes the mesh-sandbox image to ghcr.io; @opentelemetry/api and @kubernetes/client-node added to @decocms/sandbox.
  • Migration

    • Default remains Docker; no changes required.
    • To try agent-sandbox: set sandbox.agentSandbox.enabled=true in Helm. The chart sets STUDIO_SANDBOX_RUNNER=agent-sandbox if not provided. For public previews, set STUDIO_SANDBOX_PREVIEW_URL_PATTERN and configure sandbox.agentSandbox.previewGateway.*. For local kind, use deploy/k8s-sandbox/local/up.sh and reload-image.sh; remove with down.sh. On arm64, override the default nodeSelector.
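
As a rough illustration of the opt-in described in the summary, runner selection might look like the sketch below. The helper names `resolveRunnerKind` and `loadAgentSandboxRunner` are hypothetical, and rejecting unknown values with an error is an assumption; only the env var name, the runner kinds, and the `@decocms/sandbox/runner/agent-sandbox` subpath come from this PR.

```typescript
// Runner kinds mentioned in this PR; "docker" is the documented default.
type RunnerKind = "docker" | "freestyle" | "agent-sandbox";

const RUNNER_KINDS: readonly string[] = ["docker", "freestyle", "agent-sandbox"];

// Hypothetical helper: resolve the runner kind from an env-like record.
function resolveRunnerKind(env: Record<string, string | undefined>): RunnerKind {
  const raw = (env["STUDIO_SANDBOX_RUNNER"] ?? "").trim();
  if (raw === "") return "docker"; // unset or empty keeps the default
  if (RUNNER_KINDS.includes(raw)) return raw as RunnerKind;
  // Assumption: unknown values fail loudly instead of silently falling back.
  throw new Error(`Unknown STUDIO_SANDBOX_RUNNER value: ${raw}`);
}

// Hypothetical loader: the agent-sandbox module (and with it
// @kubernetes/client-node) is only imported when that runner is selected.
async function loadAgentSandboxRunner(
  kind: RunnerKind,
  importer: (specifier: string) => Promise<unknown>,
): Promise<unknown> {
  if (kind !== "agent-sandbox") return null;
  return importer("@decocms/sandbox/runner/agent-sandbox");
}
```

In a real caller the importer would presumably be `(s) => import(s)`; injecting it here only keeps the sketch easy to exercise without the package installed.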

Written for commit 34ada69. Summary will update on new commits.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

Release Options

Suggested: Patch (2.283.3) — default (no conventional commit prefix detected)

React with an emoji to override the release type:

| Reaction | Type | Next Version |
| --- | --- | --- |
| 👍 | Prerelease | 2.283.3-alpha.1 |
| 🎉 | Patch | 2.283.3 |
| ❤️ | Minor | 2.284.0 |
| 🚀 | Major | 3.0.0 |

Current version: 2.283.2

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).
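
The bump arithmetic in this comment can be sketched as follows. This is not the workflow's actual code: `nextVersion` and `pickBump` are hypothetical names, the ordering prerelease < patch < minor < major is an assumption read off the options listed above, and the `-alpha.1` suffix is hardcoded to match the one prerelease shown.

```typescript
type Bump = "prerelease" | "patch" | "minor" | "major";

// Assumed ordering from smallest to largest bump.
const BUMP_ORDER: readonly Bump[] = ["prerelease", "patch", "minor", "major"];

// Compute the next version from a stable "major.minor.patch" string.
function nextVersion(current: string, bump: Bump): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current);
  if (!match) throw new Error(`Not a stable semver version: ${current}`);
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  switch (bump) {
    case "prerelease":
      return `${major}.${minor}.${patch + 1}-alpha.1`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "major":
      return `${major + 1}.0.0`;
  }
}

// "Smallest bump wins"; with no reactions the suggested bump is used.
function pickBump(reactions: readonly Bump[], suggested: Bump = "patch"): Bump {
  return BUMP_ORDER.find((bump) => reactions.includes(bump)) ?? suggested;
}
```

For the current version 2.283.2, `nextVersion("2.283.2", pickBump(["major", "minor"]))` yields `2.284.0`, since minor is the smaller of the two bumps.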

@github-actions
Contributor

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

| Reaction | Action |
| --- | --- |
| 👍 | Run quick benchmark (10 & 128 tools) |

Benchmark will run on the next push after you react.

New Kubernetes runner lives at mesh-plugin-user-sandbox/runner/k8s and sits
behind its own subpath export so docker/freestyle deploys never pull in
@kubernetes/client-node. Opt in with MESH_SANDBOX_RUNNER=kubernetes; docker
stays the dev default.

Image side: bumps Bun to 1.3.11 so the daemon can read modern bun.lock
(configVersion: 1), and adds a per-boot BOOT_ID to /health so the runner can
detect container restarts (OOMKill, eviction) and re-bootstrap the workdir
instead of stranding a live pod with an empty /app.
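
A minimal sketch of the restart check this enables, assuming `/health` returns JSON with a `bootId` string field; the `HealthResponse` shape and the `checkBootId` helper are hypothetical and not taken from the runner's code.

```typescript
// Assumed shape of the daemon's /health payload; only the boot id matters here.
interface HealthResponse {
  bootId: string;
}

type BootCheck = "first-seen" | "unchanged" | "restarted";

// Compare the boot id remembered by the runner with the one the daemon reports.
// "restarted" means the container came back (OOMKill, eviction, ...) and the
// workdir should be bootstrapped again.
function checkBootId(lastSeen: string | undefined, health: HealthResponse): BootCheck {
  if (lastSeen === undefined) return "first-seen";
  return lastSeen === health.bootId ? "unchanged" : "restarted";
}
```

The runner would store the boot id after the first successful health check and re-run its bootstrap only when the result is `"restarted"`.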
- Bumped version of decocms to 2.274.0.
- Updated various dependencies including @ai-sdk/anthropic, @ai-sdk/gateway, @ai-sdk/google, @ai-sdk/openai, and @ai-sdk/provider-utils to their latest versions.
- Added new entries for @anthropic-ai/claude-agent-sdk and its platform-specific variants.
@tlgimenes
Contributor

Heads up — PR #3178 lands a unified daemon codebase at packages/sandbox/daemon/ shared by freestyle and docker, shipped as a single daemon/dist/daemon.js bundle. Once it merges, your K8s runner can COPY (or mount) the same bundle instead of duplicating image/daemon.mjs.

The unified daemon already exposes /health with bootId for restart detection, so your runner can drop that bit and consume it from the shared surface. Paths are /_decopilot_vm/* with base64-wrapped bodies everywhere.

Happy to pair on the rebase when #3178 is in — should be small for the k8s side since the daemon contract is the same across runners.

VM_START used to block until clone+install finished (~30s on medium
repos). The Terminal tab opens its SSE connection only after VM_START
returns, so users saw the entire setup output dumped at once via
`replayTo` instead of streaming live.

Run repo bootstrap in the background (Docker + K8s runners) and persist
the handle BEFORE bootstrap so /api/sandbox/<handle> resolves while clone
is still running. Bootstrap output streams through the daemon's log ring
under a `setup` source so the Terminal tab can subscribe via SSE.

Daemon side: split bash output on any CR/LF run so git's progress
reports surface as distinct log lines instead of accumulating until the
trailing newline. Set CI=1 in dev-process env so Vite's interactive
shortcut reader doesn't EOF and exit when stdio is ignored.
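
The CR/LF splitting described above could look roughly like this. `LineSplitter` is a hypothetical name and the buffering of a trailing partial line is an assumption; the daemon's real implementation may differ.

```typescript
// Splits streamed process output on any run of CR and/or LF characters, so
// progress updates terminated by a bare "\r" become separate log lines.
class LineSplitter {
  private tail = ""; // incomplete line carried over to the next chunk

  // Feed a chunk; returns every complete, non-empty line found so far.
  push(chunk: string): string[] {
    const parts = (this.tail + chunk).split(/[\r\n]+/);
    this.tail = parts.pop() ?? "";
    return parts.filter((line) => line.length > 0);
  }

  // Emit whatever is left when the stream ends.
  flush(): string[] {
    const rest = this.tail;
    this.tail = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

With this shape, a chunk such as `"Receiving objects: 10%\rReceiving objects: 50%\r"` produces two log lines immediately instead of waiting for the final newline.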

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pedrofrxncx and others added 16 commits April 27, 2026 21:46
Resolved conflicts to align with main's daemon-orchestrator model
(PR #3175): the Bun.serve daemon now owns clone + install + dev-server
boot, driven by env vars (DAEMON_TOKEN, DAEMON_BOOT_ID, CLONE_URL,
BRANCH, GIT_USER_*, RUNTIME, PACKAGE_MANAGER) on container/pod start.

Key resolutions:
- Dropped sandbox-daemon.ts route (deleted on main; daemon access now
  via internal vm-tools that call SandboxRunner.proxyDaemonRequest).
- Dropped image/*.mjs files (deleted on main; replaced by Bun.serve TS
  daemon at packages/sandbox/daemon/).
- Refactored k8s runner to mirror docker's env contract: pass full
  env (CLONE_URL, BRANCH, RUNTIME, etc.) through SandboxClaim.spec.env
  so the daemon orchestrates setup itself. Removed bootstrapAndStart,
  bootstrapPromise, repoAttached fields, startDevServer/stopDevServer
  calls, and bootstrapRepo helper — all subsumed by daemon-side
  resume-on-restart and dev-autostart.
- Reverted docker runner background-bootstrap fields/methods (main's
  approach via daemon orchestrator covers the same SSE-log streaming
  goal more thoroughly).
- Moved packages/mesh-plugin-user-sandbox/server/runner/k8s/* →
  packages/sandbox/server/runner/k8s/* (main renamed the package).
- Updated lifecycle.ts k8s import to @decocms/sandbox/runner/k8s.
- Regenerated bun.lock via bun install.
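
For illustration, the env contract passed through SandboxClaim.spec.env might be assembled as below. The `SandboxEnvInput` type and `buildClaimEnv` helper are hypothetical, the `{ name, value }` entries assume the usual Kubernetes EnvVar shape, and the GIT_USER_* variables are left out because their exact names are not spelled out here.

```typescript
// Kubernetes-style env entry (assumed to match what SandboxClaim.spec.env accepts).
interface EnvVar {
  name: string;
  value: string;
}

// Hypothetical input type; field names are illustrative only.
interface SandboxEnvInput {
  daemonToken: string;
  daemonBootId: string;
  cloneUrl?: string;
  branch?: string;
  runtime?: string;
  packageManager?: string;
}

// Build the env list the daemon reads on pod start; unset values are skipped
// so the daemon can apply its own defaults.
function buildClaimEnv(input: SandboxEnvInput): EnvVar[] {
  const pairs: Array<[string, string | undefined]> = [
    ["DAEMON_TOKEN", input.daemonToken],
    ["DAEMON_BOOT_ID", input.daemonBootId],
    ["CLONE_URL", input.cloneUrl],
    ["BRANCH", input.branch],
    ["RUNTIME", input.runtime],
    ["PACKAGE_MANAGER", input.packageManager],
  ];
  return pairs.flatMap(([name, value]) => (value ? [{ name, value }] : []));
}
```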

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… smoke.ts, and up.sh scripts

- Introduced .gitattributes to mark generated files for the Helm chart in agent-sandbox.
- Updated README to clarify the build process for the daemon bundle and image.
- Modified reload-image.sh and up.sh to build the daemon bundle before building the Docker image, ensuring the correct context is used.
- Adjusted import paths in smoke.ts to reflect the new package structure.
- Added stdout logging in the Broadcaster class to mirror SSE subscriber output for better visibility in `kubectl logs` and k9s.
- Refactored KubernetesSandboxRunner to streamline port-forwarding logic, removing the devForward property and ensuring that preview traffic is routed through the daemon port-forward.
- Updated comments for clarity on the daemon's role in handling traffic and the implications of the new structure.
- Introduced a WebSocket proxy to handle upgrades for Vite's HMR and other dev-server WebSocket connections, ensuring seamless communication through the daemon.
- Enhanced port discovery logic to dynamically identify listening ports of descendant processes, improving the accuracy of the dev server's operational context.
- Refactored the upstream probing mechanism to utilize candidate ports, allowing for more robust detection of the active development server.
- Updated the proxy handler to resolve the actual listening port dynamically, ensuring consistent routing of requests.
- Added comprehensive comments and documentation to clarify the new functionality and its implications for the daemon's operation.
- Added support for nodeSelector and tolerations in the sandbox values, allowing for better pod scheduling and resource management.
- Introduced a hostUsers option to enable user namespace remapping, which hardens the pod by ensuring container UIDs do not map to real node UIDs if a process escapes the container.
- Implemented readOnlyRootFilesystem configuration to improve security and stability, with provisions for necessary volume mounts.
- Updated agent-sandbox manifest to include PodSecurity admission labels, enforcing baseline security policies for the namespace.

These changes aim to improve the security and configurability of sandbox deployments.
- Introduced a new environment variable, MESH_SANDBOX_PREVIEW_URL_PATTERN, to allow the specification of a public URL pattern for sandbox previews.
- Updated Docker and Kubernetes sandbox runners to utilize the preview URL pattern, improving accessibility for users.
- Modified Helm chart values to include the new preview URL pattern configuration, ensuring proper deployment in production environments.

These changes aim to enhance the configurability and accessibility of sandbox previews in Kubernetes deployments.
- Introduced a sandbox preview reverse-proxy to route requests for `<handle>.preview.<base-domain>` to the corresponding sandbox daemon, improving accessibility for preview environments.
- Enhanced WebSocket handling to support upgrades and message processing for preview connections, ensuring seamless communication for development workflows.
- Updated the Helm chart to include configurations for the new preview URL pattern and related settings, facilitating deployment in Kubernetes environments.

These changes aim to improve the functionality and usability of sandbox previews in the development process.
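
A minimal sketch of the host-based routing step, assuming the handle is a single lowercase DNS label. `parsePreviewHandle` is a hypothetical helper and the handle format is an assumption; the actual proxy may validate differently.

```typescript
// Extract <handle> from "<handle>.preview.<base-domain>", or return null when
// the Host header does not belong to the preview subdomain.
function parsePreviewHandle(hostHeader: string, baseDomain: string): string | null {
  const host = hostHeader.toLowerCase().replace(/:\d+$/, ""); // drop an optional port
  const suffix = `.preview.${baseDomain.toLowerCase()}`;
  if (!host.endsWith(suffix)) return null;
  const handle = host.slice(0, -suffix.length);
  // Assumption: a handle is one DNS label (letters, digits, inner hyphens).
  return /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(handle) ? handle : null;
}
```

A request with a non-null handle would then be forwarded to that sandbox's daemon; anything else falls through to the regular routes.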
- Updated `values-kube-prometheus-stack.yaml` to use camelCase for `kubeStateMetrics` and added comments for clarity on subchart toggles.
- Modified `values-otel-collector.yaml` to change `serviceMonitor` to `podMonitor`, reflecting the new scraping strategy, and added comments regarding omitted metadata.
- Adjusted `sandbox-overview.json` to replace deprecated metrics expressions with updated ones for CPU and network utilization, ensuring accurate monitoring data.

These changes enhance the clarity and functionality of the monitoring setup in the Kubernetes sandbox environment.
- Bumped versions for `decocms` to 2.281.2 and `@decocms/runtime` to 1.6.0 in `bun.lock`.
- Added `@opentelemetry/api` as a new dependency in the sandbox package.
- Introduced new monitoring configurations for Kubernetes, including updated values for Prometheus and OpenTelemetry collector, and added a new dashboard for sandbox overview.
- Improved WebSocket handling in the sandbox to manage pending frames and prevent memory exhaustion.

These changes aim to enhance the observability and performance of the sandbox environment while ensuring accurate monitoring and dependency management.
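
The pending-frame cap mentioned above could be sketched like this. `PendingFrames` is a hypothetical class, and both the limits and the "reject when full" behaviour are assumptions; the change itself only states that pending frames are bounded to prevent memory exhaustion.

```typescript
type Frame = string | Uint8Array;

// Bounded queue for frames that arrive before the upstream socket is open.
class PendingFrames {
  private readonly maxFrames: number;
  private readonly maxBytes: number;
  private frames: Frame[] = [];
  private bytes = 0;

  constructor(maxFrames: number, maxBytes: number) {
    this.maxFrames = maxFrames;
    this.maxBytes = maxBytes;
  }

  // Returns false when accepting the frame would exceed a cap; the caller is
  // then expected to close the connection instead of buffering further.
  push(frame: Frame): boolean {
    // String length is used as an approximate size for text frames.
    const size = typeof frame === "string" ? frame.length : frame.byteLength;
    if (this.frames.length >= this.maxFrames || this.bytes + size > this.maxBytes) {
      return false;
    }
    this.frames.push(frame);
    this.bytes += size;
    return true;
  }

  // Hand every queued frame to `send` in arrival order and reset the queue.
  drain(send: (frame: Frame) => void): void {
    const queued = this.frames;
    this.frames = [];
    this.bytes = 0;
    for (const frame of queued) send(frame);
  }
}
```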
…ecture

- Configured `nodeSelector` in `values.yaml` to specify `kubernetes.io/arch: amd64`, ensuring compatibility with amd64 node groups.
- Added comments to guide users on overriding this setting for arm64 clusters, enhancing clarity for deployment configurations.

These changes improve the deployment flexibility for different architecture environments.
…anagement. This cleanup helps streamline the project structure.
- Introduced a GitHub Actions workflow to build and push the Studio Sandbox Docker image upon changes to the `packages/sandbox` directory.
- Updated environment variable references from `MESH_SANDBOX_PREVIEW_URL_PATTERN` to `STUDIO_SANDBOX_PREVIEW_URL_PATTERN` across multiple files to reflect the new naming convention.
- Adjusted related test cases and configurations to ensure consistency with the new sandbox naming scheme.

These changes enhance the deployment process and improve clarity in the codebase regarding the Studio Sandbox environment.
- Updated environment variable references and type definitions to replace `MESH_SANDBOX_RUNNER` and `kubernetes` with `STUDIO_SANDBOX_RUNNER` and `agent-sandbox` across multiple files.
- Adjusted comments and documentation to reflect the new naming convention for the agent-sandbox runner.
- Enhanced test cases and configurations to ensure consistency with the updated sandbox runner implementation.

These changes improve clarity and maintainability in the codebase regarding the sandbox environment.
2 participants