feat(sandbox): add shared build cache PVC to sandbox-env chart #3221
Open
pedrofrxncx wants to merge 17 commits into main from …
Helm chart, vendored agent-sandbox subchart, local kind dev tooling, and monitoring stack for the agent-sandbox runner shipped in 2f293ba. The runner remains opt-in (STUDIO_SANDBOX_RUNNER=agent-sandbox); installing this chart with sandbox.agentSandbox.enabled=false is a no-op. - deploy/helm/charts/agent-sandbox: vendored upstream subchart (CRDs + operator), refreshed via charts/agent-sandbox/vendor.sh. CRDs and templates are marked linguist-generated in .gitattributes so the diff collapses in review. - deploy/helm/templates: SandboxTemplate, optional SandboxWarmPool, mesh RBAC, NetworkPolicy, and optional wildcard Gateway + Certificate for *.preview.<base-domain>. New values: nodeSelector (default amd64), tolerations, hostUsers, readOnlyRootFilesystem, sandbox.agentSandbox.previewUrlPattern / previewGateway. configMap surfaces STUDIO_SANDBOX_PREVIEW_URL_PATTERN. - deploy/k8s-sandbox/local: kind scripts (up.sh / down.sh / reload-image.sh), shared sandbox-template, and an end-to-end smoke.ts. - deploy/k8s-sandbox/monitoring: kube-prometheus-stack + OTel collector values and a Grafana dashboard. - .github/workflows/release-studio-sandbox.yaml: builds and pushes the mesh-sandbox image to ghcr.io.
- Updated .gitignore to exclude cached .tgz files for agent-sandbox. - Enhanced GitHub Actions workflow to skip republishing if the image version tag already exists, preventing unnecessary pushes. - Added checks for existing version tags in the release workflow to avoid overwriting images. - Updated helm chart values to pin the image version to "0.1.0" for consistency and reliability. - Improved documentation in README.md regarding the security implications of preview handles. - Refactored templates to use common label definitions for sandbox resources, ensuring consistency across the chart. - Added checksum verification in vendor.sh to enhance security when fetching upstream assets.
- Added a new Helm chart for the agent-sandbox, including CRDs and templates for the Kubernetes operator. - Implemented vendor.sh for fetching and managing upstream assets securely. - Updated .gitattributes to mark generated files appropriately. - Enhanced .gitignore to exclude unnecessary cached files. - Created GitHub Actions workflow for packaging and releasing the Helm chart to GitHub Container Registry. - Added example values for local development with kind. - Improved documentation in README.md regarding the chart's structure and usage.
…ture - Updated the README.md to reflect the standalone nature of the agent-sandbox Helm chart, including installation instructions and configuration details. - Expanded documentation on the chart's components, such as CRDs, templates, and ArgoCD application setup. - Clarified the role of the `vendor.sh` script in managing upstream assets and provided guidance on version bumping and CRD upgrades. - Improved the layout section to better illustrate the chart's directory structure and included additional examples for local development.
- Refactored `values-kind.yaml` for the agent-sandbox to clarify its usage in local kind clusters. - Removed outdated Postgres example manifest and values file to streamline the deployment process. - Updated `Chart.lock` and `Chart.yaml` to reflect the latest dependencies and versioning. - Enhanced `README.md` to provide clearer instructions on the agent-sandbox runner and its configuration. - Adjusted PVC template to ensure default storage class is used when none is specified.
- Merged the controller Deployments from manifest.yaml and extensions.yaml into a single Deployment to ensure proper functionality of the agent-sandbox reconciler. - Removed the duplicate Deployment definition from extensions.yaml and added the `--extensions` argument to the manifest.yaml Deployment. - Implemented transformations in vendor.sh to automate these changes while maintaining checksum verification for upstream assets.
…ring stack The standalone monitoring values install their own kube-prometheus-stack and OTel collector — they don't reuse the cluster's existing observability stack — so they're not wanted in the PR. The kind/local dev scripts go along with them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine values.yaml - Deleted the section on the agent-sandbox runner from README.md to eliminate redundancy and clarify the documentation. - Cleaned up values.yaml by removing commented-out instructions related to the agent-sandbox chart, ensuring a more concise configuration file.
- Removed an extra blank line in the README.md to enhance readability and maintain a consistent formatting style.
…BOX_IMAGE → STUDIO_SANDBOX_IMAGE Stale comments referenced deploy/k8s-sandbox/ paths that no longer ship in this PR. Env var name was the last leftover from the mesh-sandbox → studio-sandbox rename — aligning it with STUDIO_SANDBOX_RUNNER and the studio-sandbox image/labels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Introduced a GitHub Actions workflow for linting, rendering, and vendor drift checks on the Helm charts in the agent-sandbox directory. - Added validation to ensure the chart is installed in the correct namespace (`agent-sandbox-system`). - Updated .gitattributes to remove the binary designation for agent-sandbox .tgz files. - Enhanced README.md with prerequisites and details on the preview gateway authentication model. - Deprecated legacy network policy configurations in values.yaml and templates, with plans for future removal.
… charts - Replaced the agent-sandbox chart with two new charts: sandbox-operator and sandbox-env, allowing for better separation of concerns and environment-specific configurations. - Updated GitHub Actions workflows to support the new chart structure, including linting, rendering, and vendor drift checks. - Enhanced .gitattributes and .gitignore to reflect the new chart paths and prevent unnecessary file tracking. - Added comprehensive README.md documentation for both new charts, detailing installation instructions, prerequisites, and configuration options. - Implemented environment-specific resource naming to avoid collisions in the shared namespace. - Introduced validation checks to ensure proper installation and configuration of the new charts.
The committed agent-sandbox-manifest.yaml has hand-edited PodSecurity admission labels on the agent-sandbox-system Namespace, with a "LOCAL EDIT — preserve when re-running vendor.sh" comment in the header. But vendor.sh wasn't actually preserving them: every refresh stripped both the comment and the labels back to upstream's bare Namespace, which is exactly what the helm-test workflow's vendor-drift check just caught.

Two fixes:
- Add a third downstream patch (after the existing two: drop duplicate Deployment, add --extensions arg) that injects the six PodSecurity labels into the agent-sandbox-system Namespace doc. Post-patch grep asserts the injection happened.
- Update HEADER_TMPL to regenerate the LOCAL EDIT comment block.

If upstream ever ships its own labels: block on this Namespace, the awk will produce a duplicate `labels:` key and helm lint will fail; the canary for "merge into existing labels block" is documented inline.
Adds an opt-in shared PVC that mounts at /mnt/cache inside every sandbox pod and redirects all package manager cache dirs there via env vars (npm, pnpm, yarn, bun, deno, XDG). Sandboxes that share deps at the same version skip the registry download entirely on warm cache.

New values:
- cache.enabled: off by default, opt-in per env
- cache.storageClass: empty = cluster default (efs-sc for EKS/EFS)
- cache.accessMode: ReadWriteMany for multi-node, ReadWriteOnce for kind
- cache.size: 50Gi default
- cache.mountPath: /mnt/cache

values-kind.yaml enables it with standard/ReadWriteOnce/10Gi for local dev.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
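As a rough illustration of the redirection described in this commit, the sketch below builds one env var per tool, each pointing under the mount path. The helper name `cacheEnv` and the subdirectory layout are hypothetical; `BUN_INSTALL_CACHE_DIR` appears in this PR's test plan, while the other variable names are the ones those tools are generally understood to read and may differ from what the chart actually sets (the pnpm one in particular is an assumption).

```typescript
// Hypothetical helper: maps each package manager's cache env var to a
// subdirectory of the shared cache mount. The layout is illustrative only.
type EnvMap = Record<string, string>;

function cacheEnv(mountPath: string): EnvMap {
  // Drop trailing slashes so the joined paths stay clean.
  const root = mountPath.replace(/\/+$/, "");
  return {
    npm_config_cache: `${root}/npm`,
    // Assumption: pnpm picks up its store dir from npm_config_* env vars.
    npm_config_store_dir: `${root}/pnpm-store`,
    YARN_CACHE_FOLDER: `${root}/yarn`,
    BUN_INSTALL_CACHE_DIR: `${root}/bun`,
    DENO_DIR: `${root}/deno`,
    XDG_CACHE_HOME: `${root}/xdg`,
  };
}

const env = cacheEnv("/mnt/cache/");
console.log(env.BUN_INSTALL_CACHE_DIR); // prints /mnt/cache/bun
```

Giving each tool its own subdirectory keeps their on-disk formats from colliding inside the single PVC.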
Introduces a comprehensive shared build cache system for sandbox pods, allowing for efficient package management by redirecting cache directories to a mounted PVC at /mnt/cache. Key updates include:
- New configuration options for cache management:
  - `cache.enabled`: Enables the shared cache (default: false).
  - `cache.storageClass`: Specifies the storage class for the PVC (required for ReadWriteMany).
  - `cache.accessMode`: Configurable for multi-node (ReadWriteMany) or single-node (ReadWriteOnce) setups.
  - `cache.size`: Default size set to 50Gi.
  - `cache.gc.enabled`: Enables a CronJob for garbage collection of stale cache entries.
- Validation checks to ensure proper configuration of the cache settings.
- Integration of cache management into the daemon's setup process, allowing for symlinked node_modules and git mirrors to persist across sandbox instances.

This update significantly improves build efficiency by reducing redundant package downloads across sandboxes sharing dependencies.
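The validation rule named in this commit (a storage class is required for ReadWriteMany) can be sketched as a small pure function. This is not the chart's template logic, which is not shown here; `CacheValues` and `validateCache` are hypothetical names, and the absolute-path check on `mountPath` is an extra assumption added for illustration.

```typescript
// Hypothetical mirror of the chart's cache.* values.
type AccessMode = "ReadWriteMany" | "ReadWriteOnce";

interface CacheValues {
  enabled: boolean;
  storageClass: string; // empty string means "not set"
  accessMode: AccessMode;
  size: string;
  mountPath: string;
}

// Returns a list of human-readable problems; an empty list means valid.
function validateCache(v: CacheValues): string[] {
  const errors: string[] = [];
  if (!v.enabled) return errors; // nothing is rendered when the cache is off

  // Rule from the commit message: ReadWriteMany needs an explicit class.
  if (v.accessMode === "ReadWriteMany" && v.storageClass.trim() === "") {
    errors.push("cache.storageClass is required when cache.accessMode is ReadWriteMany");
  }
  // Assumption (not from the PR): the mount path should be absolute.
  if (!v.mountPath.startsWith("/")) {
    errors.push("cache.mountPath must be an absolute path");
  }
  return errors;
}

const eks: CacheValues = {
  enabled: true,
  storageClass: "efs-sc",
  accessMode: "ReadWriteMany",
  size: "50Gi",
  mountPath: "/mnt/cache",
};
console.log(validateCache(eks).length); // prints 0
```

Returning every problem at once, rather than failing on the first, tends to make a misconfigured values file quicker to fix.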
Updated the logic for creating and managing shallow bare clones in the git mirror setup. Key changes include:
- Clarified comments regarding the use of shallow clones and their integration as alternates, ensuring compatibility with `git clone --reference`.
- Streamlined the command sequence for initializing the repository and managing alternates, enhancing clarity and efficiency.
- Ensured that TTL refresh logic maintains consistency with shallow clone behavior.

These adjustments enhance the robustness of the cloning process while maintaining performance across sandbox instances.
Enhanced the error handling in the cache and clone setup processes. Key changes include:
- Added logic to remove symlinks for node_modules when fallback installations occur, preventing potential issues with full PVC slots.
- Improved the mkdirSync error handling to gracefully skip mirror creation if the PVC is full or unavailable, allowing for a fallback to direct cloning.

These updates enhance the robustness of the sandbox setup by ensuring smoother fallback mechanisms during cache and clone operations.
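The mirror fallback described in this commit might look roughly like the sketch below. It is not the daemon's code: `chooseCloneStrategy`, the `git-mirrors` directory name, and the injected `mkdir` callback are all hypothetical, chosen so the decision can be exercised without touching a real filesystem.

```typescript
// Outcome of the decision: use a mirror under the cache, or clone directly.
type CloneStrategy =
  | { kind: "mirror"; mirrorDir: string }
  | { kind: "direct"; reason: string };

// `mkdir` is injected so callers can pass a real directory-creating function
// or a stub. It is expected to throw when the PVC is full or unavailable.
function chooseCloneStrategy(
  cacheDir: string,
  repoKey: string,
  mkdir: (path: string) => void,
): CloneStrategy {
  const mirrorDir = `${cacheDir}/git-mirrors/${repoKey}`; // hypothetical layout
  try {
    mkdir(mirrorDir);
    return { kind: "mirror", mirrorDir };
  } catch (err) {
    // Any failure to create the mirror directory means: skip the mirror.
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "direct", reason };
  }
}

// Simulate a full volume: mkdir throws, so the result is a direct clone.
const result = chooseCloneStrategy("/mnt/cache", "example-repo", () => {
  throw new Error("ENOSPC: no space left on device");
});
console.log(result.kind); // prints direct
```

In a Node daemon the callback could be something like `(p) => fs.mkdirSync(p, { recursive: true })`; treating every error the same way keeps a bad cache volume from blocking sandbox startup.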
Summary
- Adds `cache-pvc.yaml` template to `sandbox-env` chart: creates a shared `PersistentVolumeClaim` (`<name>-cache`) in `agent-sandbox-system`
- … `SandboxTemplate` with env vars that redirect every package manager's cache to `/mnt/cache` (npm, pnpm, yarn, bun, deno, XDG)
- Off by default (`cache.enabled: false`); opt-in per environment via values
- `values-kind.yaml` enables it with `standard` / `ReadWriteOnce` / `10Gi` for local dev (single-node kind, no EFS needed)

How it works
Package managers have a two-level storage model: a content-addressed global cache (tarballs, keyed by name+version+hash) and the per-project `node_modules/`. The PVC holds the global cache layer. On a cold start the first `bun install` populates it; every subsequent sandbox that shares a dep at the same version skips the registry download and links from the local cache instead.

Production setup (EKS)
Requires the AWS EFS CSI driver and a provisioned EFS filesystem. Set `cache.enabled: true`, `cache.storageClass: efs-sc`, and `cache.accessMode: ReadWriteMany`.

Test plan
- `helm template` with `values-kind.yaml` renders PVC + env vars + volumeMount in SandboxTemplate
- `helm template` with default values renders no cache resources
- … `cache.enabled: true`: PVC binds on first sandbox pod, `/mnt/cache` is mounted, `BUN_INSTALL_CACHE_DIR` etc. are set

🤖 Generated with Claude Code
Summary by cubic
Adds an opt-in shared build and source cache to the `sandbox-env` chart. Speeds up clone, install, and Next.js cold starts while cutting registry and Git traffic. Adds safer fallbacks when the cache PVC is unavailable or full.

New Features
- `<sandbox>-cache` PVC in `agent-sandbox-system`, mounted at `cache.mountPath` (default `/mnt/cache`), kept across Helm upgrades.
- … `CACHE_DIR`; the daemon redirects `npm`/`pnpm`/`yarn`/`bun`/`deno`/`XDG` caches, adds lockfile-keyed shared `node_modules` with flock + sentinel, and a per-repo Next.js `.next/cache` via `SANDBOX_CACHE_KEY`. Installs use `--prefer-offline` / `--frozen-lockfile` to reduce network.
- … `HEAD`. If the cache dir is unavailable, falls back to direct clone.
- … `node_modules` slots and git mirrors; configure via `cache.gc.*` (`enabled`, `schedule`, `ttlDays`).
- … `ReadWriteMany` without `cache.storageClass`; README adds EFS setup guidance. Configure via `cache.*` (`enabled`, `storageClass`, `accessMode`, `size`, `mountPath`, `gc.*`).
- `examples/values-kind.yaml` enables local dev with `standard` / `ReadWriteOnce` / `10Gi`.
- … `node_modules` symlink to avoid writing into a bad PVC slot.

Migration
- … `StorageClass` (e.g. `efs-sc`), then set `cache.enabled: true`, `cache.storageClass: efs-sc`, `cache.accessMode: ReadWriteMany`, and `size` as needed. Optional: tune `cache.gc.schedule` and `cache.gc.ttlDays`.

Written for commit dd07d10. Summary will update on new commits.
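The `cache.gc.ttlDays` setting mentioned in the summary implies a staleness cutoff. A minimal sketch of that selection step follows, assuming each cache entry carries a last-used timestamp; the `CacheEntry` shape and `staleEntries` name are hypothetical, and how the actual CronJob determines staleness is not shown in this PR page.

```typescript
// Hypothetical record for one cached item (a node_modules slot or a git mirror).
interface CacheEntry {
  path: string;
  lastUsedMs: number; // epoch milliseconds of the last use
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the paths whose last use is older than ttlDays before nowMs.
function staleEntries(entries: CacheEntry[], ttlDays: number, nowMs: number): string[] {
  const cutoff = nowMs - ttlDays * DAY_MS;
  return entries.filter((e) => e.lastUsedMs < cutoff).map((e) => e.path);
}

const now = Date.now();
const candidates = staleEntries(
  [
    { path: "/mnt/cache/node_modules/slot-a", lastUsedMs: now - 8 * DAY_MS },
    { path: "/mnt/cache/git-mirrors/repo-b", lastUsedMs: now - 2 * DAY_MS },
  ],
  7,
  now,
);
console.log(candidates); // only slot-a is older than the 7-day TTL
```

Passing `nowMs` in explicitly, instead of reading the clock inside the function, keeps the cutoff logic deterministic and easy to test.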