# RFC 001: Pre-seed BuildKit Layer Store in Builder VMs

**Status**: Proposal
**Author**: hiroTamada
**Date**: 2026-02-09

## Problem

Every code-change deployment takes ~27s of build time inside the builder VM, even though the base image hasn't changed. The breakdown:

| Step | Time | Notes |
|------|------|-------|
| Base image extract | ~10.6s | Decompress + unpack 16 gzipped tar layers (~100MB) |
| COPY + RUN setup.sh | ~5.4s | pnpm install + TypeScript + esbuild |
| Image export + push | ~7.3s | Push built image layers to registry |
| Cache export | ~0.1s | Write cache manifest |

The ~10.6s base image extraction is the single largest cost and is incurred on **every build where source code changes**, which is effectively every real deployment. This happens because builder VMs are ephemeral — BuildKit starts with an empty content store every time.

### Current architecture

```
Builder VM boots
  → BuildKit starts with empty content store
  → Imports cache manifest from registry (knows WHAT layers exist)
  → COPY step: cache miss (source files changed)
  → BuildKit needs filesystem to execute COPY
  → Downloads 16 compressed layers from local registry (~0.4s)
  → Decompresses + extracts each layer sequentially (~10.6s)  ← BOTTLENECK
  → Executes COPY, RUN, etc.
```

When all steps are cached (identical redeploy), BuildKit never needs the filesystem and the build completes in ~1s. But any cache miss — which happens on every code change — triggers the full extraction.

### Why the registry cache doesn't help

The registry stores **compressed tar archives** (gzipped blobs). BuildKit needs **unpacked filesystem trees** (actual directories and files on disk) to execute build steps. The registry cache tells BuildKit *what* the result of each step is (layer digests), but when a step needs to actually execute, BuildKit must reconstruct the filesystem from the compressed layers. The ~10s is the decompression + extraction cost.
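
To make the split concrete, the cost can be reproduced for a single layer outside BuildKit. This is a rough illustration only; the blob path is a placeholder for one of the 16 compressed base-image layers pulled from the registry.

```bash
# Rough illustration: fetching a blob is cheap, decompress + unpack is not.
# BLOB is a placeholder path for one compressed layer.
BLOB=/tmp/layer.tar.gz
mkdir -p /tmp/extracted-layer

time ( gzip -dc "$BLOB" | tar -x -C /tmp/extracted-layer )
```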

## Proposal

Pre-seed the builder VM's rootfs with BuildKit's content store already populated for known base images. When a build runs, BuildKit finds the base image layers already extracted locally and skips the download + extraction entirely.

### How BuildKit's content store works

BuildKit uses containerd's content store at `/home/builder/.local/share/buildkit/`:

```
/home/builder/.local/share/buildkit/
├── containerd/
│   ├── content/
│   │   └── blobs/sha256/   ← compressed layer blobs (content-addressable)
│   └── metadata.db         ← bolt database with metadata
└── snapshots/
    └── overlayfs/          ← extracted filesystem snapshots
```

When BuildKit processes `FROM onkernel/nodejs22-base:0.1.1`, it:

1. Checks if the layer blobs exist in `content/blobs/sha256/`
2. Checks if extracted snapshots exist in `snapshots/`
3. If both exist, skips download + extraction entirely
4. If not, downloads from registry and extracts

By pre-populating both the blobs and the snapshots, every build takes the fast path in step 3 and never falls through to the download + extraction in step 4.
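
A quick sanity check for whether the store is populated (useful when validating a pre-seeded builder image) is to look at both locations directly. This assumes buildkitd is running inside the VM; `buildctl du` reports BuildKit's snapshot and cache records.

```bash
# Step 1's check: which compressed blobs are already in the content store?
ls /home/builder/.local/share/buildkit/containerd/content/blobs/sha256/ | head

# Step 2's check: which snapshots / cache records does BuildKit report?
buildctl du -v
```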

## Options Considered

### Option A: Pre-seed at builder image build time (Recommended)

Warm the content store during the builder Docker image build, then bake the result into the image.

**Build process:**

```bash
# 1. Build the base builder image as usual
docker build -t builder:base -f lib/builds/images/generic/Dockerfile .

# 2. Warm the content store
docker run -d --privileged --name warmup builder:base sleep infinity

docker exec warmup sh -c '
  buildkitd &
  sleep 2

  mkdir /tmp/warmup
  echo "FROM onkernel/nodejs22-base:0.1.1" > /tmp/warmup/Dockerfile
  echo "RUN true" >> /tmp/warmup/Dockerfile

  buildctl build \
    --frontend dockerfile.v0 \
    --local context=/tmp/warmup \
    --local dockerfile=/tmp/warmup \
    --output type=oci,dest=/dev/null

  kill $! && wait   # stop buildkitd ($! is the backgrounded buildkitd PID)
'

# 3. Commit with warmed content store
docker commit warmup onkernel/builder-generic:latest

# 4. Push with OCI mediatypes
# (need to re-tag and push via buildx for OCI compliance)
docker rm -f warmup
```
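
The comment above suggests buildx for the re-push; an alternative sketch uses skopeo, assuming it is installed on the build host. The registry name is a placeholder.

```bash
# Push the committed image with an OCI manifest (registry name is a placeholder)
skopeo copy --format oci \
  docker-daemon:onkernel/builder-generic:latest \
  docker://registry.example.com/onkernel/builder-generic:latest
```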

**Could also be automated** via a `Makefile` target:

```makefile
build-builder-warmed:
	docker build -t builder:base -f lib/builds/images/generic/Dockerfile .
	./scripts/warm-builder-cache.sh builder:base onkernel/builder-generic:latest
```
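
`scripts/warm-builder-cache.sh` does not exist yet; a minimal sketch, assuming it simply wraps the manual steps above and takes the source and destination tags as arguments:

```bash
#!/usr/bin/env sh
# Sketch of scripts/warm-builder-cache.sh: warm-builder-cache.sh <base-tag> <output-tag>
set -eu

BASE_TAG="$1"   # e.g. builder:base
OUT_TAG="$2"    # e.g. onkernel/builder-generic:latest

docker run -d --privileged --name warmup "$BASE_TAG" sleep infinity

docker exec warmup sh -c '
  buildkitd &
  sleep 2
  mkdir -p /tmp/warmup
  printf "FROM onkernel/nodejs22-base:0.1.1\nRUN true\n" > /tmp/warmup/Dockerfile
  buildctl build \
    --frontend dockerfile.v0 \
    --local context=/tmp/warmup \
    --local dockerfile=/tmp/warmup \
    --output type=oci,dest=/dev/null
  kill $! && wait
'

docker commit warmup "$OUT_TAG"
docker rm -f warmup
```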

**Pros:**
- Every VM boots with the base image already extracted
- Zero per-tenant storage overhead
- No changes to the build pipeline — just a bigger builder image
- Eliminates ~10.6s extraction on every code-change build

**Cons:**
- Builder image grows by ~100-150MB (uncompressed base image layers)
- Must rebuild the builder image when base images change
- `docker commit` + re-push workflow is somewhat manual
- Only helps for base images known at builder image build time

### Option B: Persistent volume per tenant

Attach to each builder VM a persistent block device that survives across builds.

**Architecture:**

```
First build:
  VM boots with persistent volume mounted at /home/builder/.local/share/buildkit/
  → BuildKit extracts base image layers → written to persistent volume
  → Build completes
  → VM shuts down, volume persists

Second build:
  VM boots with same persistent volume
  → BuildKit finds layers already extracted
  → Skips download + extraction (0.0s)
```

**Implementation** (a rough sketch follows this list):
- Create a persistent ext4 volume per tenant (or per org)
- Mount it into the builder VM at BuildKit's content store path
- Manage lifecycle: create on first build, garbage collect after idle period
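
A rough sketch of that lifecycle on the host, assuming the VM is launched with Cloud Hypervisor; `TENANT_ID`, the volume directory, and the 2G size are placeholders.

```bash
# Per-tenant volume lifecycle (host side); paths and size are placeholders.
VOL="/var/lib/builder-volumes/${TENANT_ID}.ext4"

# Create the volume lazily on the tenant's first build
if [ ! -f "$VOL" ]; then
  truncate -s 2G "$VOL"   # sparse file; grows as layers accumulate
  mkfs.ext4 -q "$VOL"
fi

# Attach it when launching the VM, alongside the existing root disk, e.g.:
#   cloud-hypervisor ... --disk path=rootfs.img path="$VOL"
#
# Inside the guest, mount it over BuildKit's state dir before buildkitd starts:
#   mount /dev/vdb /home/builder/.local/share/buildkit
```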

**Pros:**
- Works for ANY base image (not just pre-known ones)
- Gets faster over time as more layers accumulate
- Tenant cache and layer store in one place

**Cons:**
- Per-tenant storage cost (~200-500MB per tenant, grows over time)
- Needs volume lifecycle management (creation, cleanup, GC)
- Potential stale data issues (old layers accumulating)
- More complex VM setup (attach volume before boot)
- Cloud Hypervisor needs block device attachment support

### Option C: Shared read-only layer cache

Maintain a shared, read-only layer cache volume that contains extracted layers for common base images. Mount it into every builder VM.

**Architecture:**

```
Periodic job:
  Extracts layers for known base images into a shared volume
  → onkernel/nodejs22-base:0.1.1
  → onkernel/python311-base:0.1.0
  → etc.

Every build:
  VM boots with shared volume mounted read-only
  → BuildKit finds common layers already extracted
  → Uses copy-on-write for any new layers
```
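
A minimal sketch of the copy-on-write piece, assuming the shared cache is attached to the VM read-only at `/mnt/layer-cache` (a placeholder path). Whether BuildKit's overlayfs snapshotter tolerates running on top of another overlay is exactly the configuration question the cons below call out.

```bash
# Overlay the read-only shared cache under a writable, per-build upper dir
mkdir -p /tmp/buildkit-upper /tmp/buildkit-work

mount -t overlay overlay \
  -o lowerdir=/mnt/layer-cache,upperdir=/tmp/buildkit-upper,workdir=/tmp/buildkit-work \
  /home/builder/.local/share/buildkit
```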

**Pros:**
- One volume serves all tenants
- Minimal storage overhead
- No per-tenant state to manage

**Cons:**
- Only helps for pre-known base images (same as Option A)
- Needs overlay/copy-on-write filesystem support
- Read-only mount needs BuildKit configuration changes
- More complex than Option A for similar benefit

## Recommendation

**Start with Option A** (pre-seed at build time). It's the simplest to implement, requires no infrastructure changes, and addresses the primary bottleneck. The only cost is a larger builder image (~100-150MB), which is negligible given the ~10s savings on every deploy.

### Expected impact

| Scenario | Current | With pre-seeded layers |
|----------|---------|----------------------|
| Code change deploy (first for tenant) | ~27s build | ~17s build (-37%) |
| Code change deploy (subsequent) | ~27s build | ~17s build (-37%) |
| No code change (cached) | ~1s build | ~1s build (unchanged) |
| Total deploy time (code change) | ~50s | ~40s |

The ~10s savings applies to every single code-change deployment across all tenants using `nodejs22-base`.

### Future work

If builder image size becomes a concern (multiple base images), consider:

1. **Option B** for tenants with high deploy frequency — persistent volumes amortize the extraction cost over many builds
2. **Lazy pulling** (eStargz/zstd:chunked) — BuildKit can pull and extract only the layers it actually needs, on demand. Requires base images published in eStargz format (see the sketch after this list).
3. **Dockerfile restructuring** — splitting `COPY` into dependency-only and source-only steps to maximize cache hits on the `RUN` step, reducing the impact of cache misses
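
For item 2, publishing a base image with eStargz layers is mostly an exporter flag; a hedged sketch, assuming the base image's build context is the current directory and using a placeholder registry name:

```bash
# Publish the base image with eStargz-compressed layers so BuildKit can pull lazily.
# Registry name is a placeholder; assumes buildkitd is running and authenticated.
buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/onkernel/nodejs22-base:0.1.1,push=true,oci-mediatypes=true,compression=estargz,force-compression=true
```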

## Open Questions

1. Should we pre-seed multiple base images (nodejs, python, etc.) or just the most common one?
2. What's the acceptable builder image size increase? Each base image adds ~100-150MB.
3. Should the warm-up script be part of CI/CD, or a manual step when base images change?
4. Does Cloud Hypervisor's block device support make Option B viable for later?