
feat(sandbox): add shared build cache PVC to sandbox-env chart #3221

Open
pedrofrxncx wants to merge 17 commits into main from feat/sandbox-build-cache

pedrofrxncx (Collaborator) commented Apr 29, 2026

Summary

  • Adds cache-pvc.yaml template to sandbox-env chart — creates a shared PersistentVolumeClaim (<name>-cache) in agent-sandbox-system
  • Wires the PVC into the SandboxTemplate with env vars that redirect every package manager's cache to /mnt/cache (npm, pnpm, yarn, bun, deno, XDG)
  • Off by default (cache.enabled: false); opt-in per environment via values
  • values-kind.yaml enables it with standard/ReadWriteOnce/10Gi for local dev (single-node kind, no EFS needed)

How it works

Package managers have a two-level storage model: a content-addressed global cache (tarballs, keyed by name+version+hash) and the per-project node_modules/. The PVC holds the global cache layer. On a cold start the first bun install populates it; every subsequent sandbox that shares a dep at the same version skips the registry download and links from the local cache instead.
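
To make the redirection concrete, here is a minimal sketch of how one cache root could be fanned out into per-tool environment variables. Only BUN_INSTALL_CACHE_DIR is named in this PR; the other variable names are the ones each tool is generally known to read, and the helper name `cacheEnv` and the subdirectory layout are assumptions for illustration, not the chart's actual template.

```typescript
import { posix } from "node:path";

// Hypothetical helper: derive per-tool cache locations from a single root.
// The subdirectory names ("npm", "pnpm", ...) are illustrative only.
function cacheEnv(cacheDir: string): Record<string, string> {
  return {
    npm_config_cache: posix.join(cacheDir, "npm"),       // npm tarball cache
    npm_config_store_dir: posix.join(cacheDir, "pnpm"),  // pnpm content-addressed store
    YARN_CACHE_FOLDER: posix.join(cacheDir, "yarn"),     // yarn cache
    BUN_INSTALL_CACHE_DIR: posix.join(cacheDir, "bun"),  // bun cache (named in the test plan)
    DENO_DIR: posix.join(cacheDir, "deno"),              // deno module cache
    XDG_CACHE_HOME: posix.join(cacheDir, "xdg"),         // fallback for XDG-aware tools
  };
}

// Usage: merge into the environment of the process that runs the install.
const installEnv = { ...process.env, ...cacheEnv("/mnt/cache") };
console.log(installEnv.BUN_INSTALL_CACHE_DIR); // "/mnt/cache/bun"
```

Keeping each tool in its own subdirectory avoids collisions between caches that use different on-disk formats.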

Production setup (EKS)

Requires the AWS EFS CSI driver and a provisioned EFS filesystem. Set:

cache:
  enabled: true
  storageClass: efs-sc   # EFS-backed StorageClass with ReadWriteMany
  accessMode: ReadWriteMany
  size: 50Gi
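
The review summary further down notes that the chart rejects ReadWriteMany when cache.storageClass is empty. The chart enforces this in Helm; the sketch below restates the same rule as a pre-flight check one might run before `helm upgrade`. The `CacheValues` type and `validateCacheValues` function are hypothetical names, and only the one rule stated in this PR is checked.

```typescript
type AccessMode = "ReadWriteOnce" | "ReadWriteMany";

// Hypothetical mirror of the chart's cache.* values.
interface CacheValues {
  enabled: boolean;
  storageClass?: string;
  accessMode: AccessMode;
  size: string;
}

// Returns a list of problems; an empty list means the values are consistent.
function validateCacheValues(v: CacheValues): string[] {
  if (!v.enabled) return []; // nothing is rendered when the cache is off
  const problems: string[] = [];
  if (v.accessMode === "ReadWriteMany" && !v.storageClass) {
    problems.push("cache.accessMode=ReadWriteMany requires cache.storageClass");
  }
  return problems;
}

// Usage with the EKS values shown above.
console.log(
  validateCacheValues({
    enabled: true,
    storageClass: "efs-sc",
    accessMode: "ReadWriteMany",
    size: "50Gi",
  }),
); // []
```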

Test plan

  • helm template with values-kind.yaml renders PVC + env vars + volumeMount in SandboxTemplate
  • helm template with default values renders no cache resources
  • Deploy to kind with cache.enabled: true — PVC binds on first sandbox pod, /mnt/cache is mounted, BUN_INSTALL_CACHE_DIR etc. are set
  • Second sandbox install reuses cached tarballs (faster cold start, no registry traffic)

🤖 Generated with Claude Code


Summary by cubic

Adds an opt‑in shared build and source cache to the sandbox-env chart. Speeds up clone, install, and Next.js cold starts while cutting registry and Git traffic. Adds safer fallbacks when the cache PVC is unavailable or full.

  • New Features

    • Shared <sandbox>-cache PVC in agent-sandbox-system, mounted at cache.mountPath (default /mnt/cache), kept across Helm upgrades.
    • Template passes CACHE_DIR; the daemon redirects npm/pnpm/yarn/bun/deno/XDG caches, adds lockfile‑keyed shared node_modules with flock + sentinel, and a per‑repo Next.js .next/cache via SANDBOX_CACHE_KEY. Installs use --prefer-offline/--frozen-lockfile to reduce network.
    • Git clones use shallow bare mirrors on the PVC wired as alternates; hourly TTL refresh touches HEAD. If the cache dir is unavailable, falls back to direct clone.
    • Nightly GC CronJob evicts stale node_modules slots and git mirrors; configure via cache.gc.* (enabled, schedule, ttlDays).
    • Helm validation blocks ReadWriteMany without cache.storageClass; README adds EFS setup guidance. Configure via cache.* (enabled, storageClass, accessMode, size, mountPath, gc.*). examples/values-kind.yaml enables local dev with standard/ReadWriteOnce/10Gi.
    • On install fallback, the daemon now removes the node_modules symlink to avoid writing into a bad PVC slot.
  • Migration

    • EKS: install EFS CSI, create an RWX StorageClass (e.g. efs-sc), then set cache.enabled: true, cache.storageClass: efs-sc, cache.accessMode: ReadWriteMany, and size as needed. Optional: tune cache.gc.schedule and cache.gc.ttlDays.
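
The lockfile-keyed node_modules slots with a sentinel, described in the feature list above, can be sketched as follows. This is an illustration under assumptions, not the daemon's code: the directory layout, the 16-character key, and the sentinel file name `.install-complete` are all hypothetical, and the flock-based locking mentioned in the summary is left out.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sentinel name; it marks a slot whose install finished.
const SENTINEL = ".install-complete";

// Hypothetical layout: <cacheDir>/node_modules/<hash of lockfile contents>/
function slotDir(cacheDir: string, lockfile: string): string {
  const key = createHash("sha256").update(lockfile).digest("hex").slice(0, 16);
  return join(cacheDir, "node_modules", key);
}

// A slot is reusable only if a previous install wrote the sentinel.
function isSlotReady(slot: string): boolean {
  return existsSync(join(slot, SENTINEL));
}

// Called after a successful install into the slot.
function markSlotReady(slot: string): void {
  mkdirSync(slot, { recursive: true });
  writeFileSync(join(slot, SENTINEL), new Date().toISOString());
}

// Usage against a throwaway directory standing in for the PVC mount.
const root = mkdtempSync(join(tmpdir(), "cache-"));
const slot = slotDir(root, "lockfile contents");
console.log(isSlotReady(slot)); // false: nothing installed yet
markSlotReady(slot);
console.log(isSlotReady(slot)); // true: later sandboxes can link to this slot
```

Writing the sentinel last means a sandbox that crashes mid-install leaves a slot that other sandboxes treat as not ready, rather than a half-populated node_modules.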

Written for commit dd07d10. Summary will update on new commits.

pedrofrxncx and others added 17 commits April 28, 2026 18:04
Helm chart, vendored agent-sandbox subchart, local kind dev tooling, and
monitoring stack for the agent-sandbox runner shipped in 2f293ba. The
runner remains opt-in (STUDIO_SANDBOX_RUNNER=agent-sandbox); installing
this chart with sandbox.agentSandbox.enabled=false is a no-op.

- deploy/helm/charts/agent-sandbox: vendored upstream subchart (CRDs +
  operator), refreshed via charts/agent-sandbox/vendor.sh. CRDs and
  templates are marked linguist-generated in .gitattributes so the diff
  collapses in review.
- deploy/helm/templates: SandboxTemplate, optional SandboxWarmPool, mesh
  RBAC, NetworkPolicy, and optional wildcard Gateway + Certificate for
  *.preview.<base-domain>. New values: nodeSelector (default amd64),
  tolerations, hostUsers, readOnlyRootFilesystem,
  sandbox.agentSandbox.previewUrlPattern / previewGateway. configMap
  surfaces STUDIO_SANDBOX_PREVIEW_URL_PATTERN.
- deploy/k8s-sandbox/local: kind scripts (up.sh / down.sh /
  reload-image.sh), shared sandbox-template, and an end-to-end smoke.ts.
- deploy/k8s-sandbox/monitoring: kube-prometheus-stack + OTel collector
  values and a Grafana dashboard.
- .github/workflows/release-studio-sandbox.yaml: builds and pushes the
  mesh-sandbox image to ghcr.io.
- Updated .gitignore to exclude cached .tgz files for agent-sandbox.
- Enhanced GitHub Actions workflow to skip republishing if the image version tag already exists, preventing unnecessary pushes.
- Added checks for existing version tags in the release workflow to avoid overwriting images.
- Updated helm chart values to pin the image version to "0.1.0" for consistency and reliability.
- Improved documentation in README.md regarding the security implications of preview handles.
- Refactored templates to use common label definitions for sandbox resources, ensuring consistency across the chart.
- Added checksum verification in vendor.sh to enhance security when fetching upstream assets.
- Added a new Helm chart for the agent-sandbox, including CRDs and templates for the Kubernetes operator.
- Implemented vendor.sh for fetching and managing upstream assets securely.
- Updated .gitattributes to mark generated files appropriately.
- Enhanced .gitignore to exclude unnecessary cached files.
- Created GitHub Actions workflow for packaging and releasing the Helm chart to GitHub Container Registry.
- Added example values for local development with kind.
- Improved documentation in README.md regarding the chart's structure and usage.
…ture

- Updated the README.md to reflect the standalone nature of the agent-sandbox Helm chart, including installation instructions and configuration details.
- Expanded documentation on the chart's components, such as CRDs, templates, and ArgoCD application setup.
- Clarified the role of the `vendor.sh` script in managing upstream assets and provided guidance on version bumping and CRD upgrades.
- Improved the layout section to better illustrate the chart's directory structure and included additional examples for local development.
- Refactored `values-kind.yaml` for the agent-sandbox to clarify its usage in local kind clusters.
- Removed outdated Postgres example manifest and values file to streamline the deployment process.
- Updated `Chart.lock` and `Chart.yaml` to reflect the latest dependencies and versioning.
- Enhanced `README.md` to provide clearer instructions on the agent-sandbox runner and its configuration.
- Adjusted PVC template to ensure default storage class is used when none is specified.
- Merged the controller Deployments from manifest.yaml and extensions.yaml into a single Deployment to ensure proper functionality of the agent-sandbox reconciler.
- Removed the duplicate Deployment definition from extensions.yaml and added the `--extensions` argument to the manifest.yaml Deployment.
- Implemented transformations in vendor.sh to automate these changes while maintaining checksum verification for upstream assets.
…ring stack

The standalone monitoring values install their own kube-prometheus-stack
and OTel collector — they don't reuse the cluster's existing observability
stack — so they're not wanted in the PR. The kind/local dev scripts go
along with them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine values.yaml

- Deleted the section on the agent-sandbox runner from README.md to eliminate redundancy and clarify the documentation.
- Cleaned up values.yaml by removing commented-out instructions related to the agent-sandbox chart, ensuring a more concise configuration file.
- Removed an extra blank line in the README.md to enhance readability and maintain a consistent formatting style.
…BOX_IMAGE → STUDIO_SANDBOX_IMAGE

Stale comments referenced deploy/k8s-sandbox/ paths that no longer
ship in this PR. Env var name was the last leftover from the
mesh-sandbox → studio-sandbox rename — aligning it with
STUDIO_SANDBOX_RUNNER and the studio-sandbox image/labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Introduced a GitHub Actions workflow for linting, rendering, and vendor drift checks on the Helm charts in the agent-sandbox directory.
- Added validation to ensure the chart is installed in the correct namespace (`agent-sandbox-system`).
- Updated .gitattributes to remove the binary designation for agent-sandbox .tgz files.
- Enhanced README.md with prerequisites and details on the preview gateway authentication model.
- Deprecated legacy network policy configurations in values.yaml and templates, with plans for future removal.
… charts

- Replaced the agent-sandbox chart with two new charts: sandbox-operator and sandbox-env, allowing for better separation of concerns and environment-specific configurations.
- Updated GitHub Actions workflows to support the new chart structure, including linting, rendering, and vendor drift checks.
- Enhanced .gitattributes and .gitignore to reflect the new chart paths and prevent unnecessary file tracking.
- Added comprehensive README.md documentation for both new charts, detailing installation instructions, prerequisites, and configuration options.
- Implemented environment-specific resource naming to avoid collisions in the shared namespace.
- Introduced validation checks to ensure proper installation and configuration of the new charts.
The committed agent-sandbox-manifest.yaml has hand-edited PodSecurity
admission labels on the agent-sandbox-system Namespace, with a
"LOCAL EDIT — preserve when re-running vendor.sh" comment in the
header. But vendor.sh wasn't actually preserving them — every refresh
stripped both the comment and the labels back to upstream's bare
Namespace, which is exactly what the helm-test workflow's vendor-drift
check just caught.

Two fixes:
- Add a third downstream patch (after the existing two — drop duplicate
  Deployment, add --extensions arg) that injects the six PodSecurity
  labels into the agent-sandbox-system Namespace doc. Post-patch grep
  asserts the injection happened.
- Update HEADER_TMPL to regenerate the LOCAL EDIT comment block.

If upstream ever ships its own `labels:` block on this Namespace, the awk
patch will produce a duplicate `labels:` key and helm lint will fail. That
failure is the canary for "merge into existing labels block", documented inline.
Adds an opt-in shared PVC that mounts at /mnt/cache inside every sandbox
pod and redirects all package manager cache dirs there via env vars
(npm, pnpm, yarn, bun, deno, XDG). Sandboxes that share deps at the
same version skip the registry download entirely on warm cache.

New values:
  cache.enabled       — off by default, opt-in per env
  cache.storageClass  — empty = cluster default (efs-sc for EKS/EFS)
  cache.accessMode    — ReadWriteMany for multi-node, ReadWriteOnce for kind
  cache.size          — 50Gi default
  cache.mountPath     — /mnt/cache

values-kind.yaml enables it with standard/ReadWriteOnce/10Gi for local dev.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Introduces a comprehensive shared build cache system for sandbox pods, allowing for efficient package management by redirecting cache directories to a mounted PVC at /mnt/cache. Key updates include:

- New configuration options for cache management:
  - `cache.enabled`: Enables the shared cache (default: false).
  - `cache.storageClass`: Specifies the storage class for the PVC (required for ReadWriteMany).
  - `cache.accessMode`: Configurable for multi-node (ReadWriteMany) or single-node (ReadWriteOnce) setups.
  - `cache.size`: Default size set to 50Gi.
  - `cache.gc.enabled`: Enables a CronJob for garbage collection of stale cache entries.

- Validation checks to ensure proper configuration of the cache settings.
- Integration of cache management into the daemon's setup process, allowing for symlinked node_modules and git mirrors to persist across sandbox instances.

This update significantly improves build efficiency by reducing redundant package downloads across sandboxes sharing dependencies.
Updated the logic for creating and managing shallow bare clones in the git mirror setup. Key changes include:

- Clarified comments regarding the use of shallow clones and their integration as alternates, ensuring compatibility with `git clone --reference`.
- Streamlined the command sequence for initializing the repository and managing alternates, enhancing clarity and efficiency.
- Ensured that TTL refresh logic maintains consistency with shallow clone behavior.

These adjustments enhance the robustness of the cloning process while maintaining performance across sandbox instances.
Enhanced the error handling in the cache and clone setup processes. Key changes include:

- Added logic to remove symlinks for node_modules when fallback installations occur, preventing potential issues with full PVC slots.
- Improved the mkdirSync error handling to gracefully skip mirror creation if the PVC is full or unavailable, allowing for a fallback to direct cloning.

These updates enhance the robustness of the sandbox setup by ensuring smoother fallback mechanisms during cache and clone operations.
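
A minimal sketch of the mkdirSync fallback described in this commit, assuming a hypothetical `ensureMirrorDir` helper and a hypothetical `<cacheDir>/git/<repoKey>` layout. It shows only the decision point: if the mirror directory cannot be created, the caller gets null and clones directly.

```typescript
import { mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns the mirror directory, or null when the cache volume cannot be
// written (full, read-only, or not mounted). Names and layout are hypothetical.
function ensureMirrorDir(cacheDir: string, repoKey: string): string | null {
  const dir = join(cacheDir, "git", repoKey);
  try {
    mkdirSync(dir, { recursive: true });
    return dir;
  } catch {
    return null; // any filesystem error: skip the mirror, clone directly
  }
}

// Usage: a temp directory stands in for the PVC mount.
const mirror = ensureMirrorDir(join(tmpdir(), "sandbox-cache-demo"), "example-repo");
console.log(mirror === null ? "clone directly" : `use mirror at ${mirror}`);
```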
Base automatically changed from feat/agent-sandbox-helm to main April 29, 2026 14:18