Commit 7af1231

hiroTamada and claude committed

rfc: pre-seed BuildKit layer store to reduce build times

Proposes pre-seeding the builder VM rootfs with extracted base image layers to eliminate the ~10.6s decompression bottleneck on every code-change deployment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent f21c072

File tree

1 file changed

+220
-0
lines changed

1 file changed

+220
-0
lines changed
Lines changed: 220 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,220 @@

# RFC 001: Pre-seed BuildKit Layer Store in Builder VMs

**Status**: Proposal
**Author**: hiroTamada
**Date**: 2026-02-09

## Problem

Every code-change deployment takes ~27s of build time inside the builder VM, even though the base image hasn't changed. The breakdown:

| Step | Time | Notes |
|------|------|-------|
| Base image extract | ~10.6s | Decompress + unpack 16 gzipped tar layers (~100MB) |
| COPY + RUN setup.sh | ~5.4s | pnpm install + TypeScript + esbuild |
| Image export + push | ~7.3s | Push built image layers to registry |
| Cache export | ~0.1s | Write cache manifest |

The ~10.6s base image extraction is the single largest cost and is incurred on **every build where source code changes**, which is effectively every real deployment. This happens because builder VMs are ephemeral — BuildKit starts with an empty content store every time.

### Current architecture

```
Builder VM boots
  → BuildKit starts with empty content store
  → Imports cache manifest from registry (knows WHAT layers exist)
  → COPY step: cache miss (source files changed)
  → BuildKit needs filesystem to execute COPY
  → Downloads 16 compressed layers from local registry (~0.4s)
  → Decompresses + extracts each layer sequentially (~10.6s) ← BOTTLENECK
  → Executes COPY, RUN, etc.
```

When all steps are cached (identical redeploy), BuildKit never needs the filesystem and the build completes in ~1s. But any cache miss — which happens on every code change — triggers the full extraction.

### Why the registry cache doesn't help

The registry stores **compressed tar archives** (gzipped blobs). BuildKit needs **unpacked filesystem trees** (actual directories and files on disk) to execute build steps. The registry cache tells BuildKit *what* the result of each step is (layer digests), but when a step needs to actually execute, BuildKit must reconstruct the filesystem from the compressed layers. The ~10s is the decompression + extraction cost.
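
A rough way to see this cost directly is to time the unpack of a single layer blob inside a builder VM after a build has run (paths follow the content-store layout in the next section; the digest is a placeholder and timings will vary):

```bash
# Decompress and unpack one gzipped layer blob to disk, and time it.
# <layer-digest> is illustrative; substitute any digest present in the store.
BLOB=/home/builder/.local/share/buildkit/containerd/content/blobs/sha256/<layer-digest>
mkdir -p /tmp/extract-test
time sh -c "gunzip -c '$BLOB' | tar -xf - -C /tmp/extract-test"
```

Summed over the 16 base-image layers, this should roughly reproduce the ~10.6s in the table above.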

## Proposal

Pre-seed the builder VM's rootfs with BuildKit's content store already populated for known base images. When a build runs, BuildKit finds the base image layers already extracted locally and skips the download + extraction entirely.

### How BuildKit's content store works

BuildKit uses containerd's content store at `/home/builder/.local/share/buildkit/`:

```
/home/builder/.local/share/buildkit/
├── containerd/
│   ├── content/
│   │   └── blobs/sha256/   ← compressed layer blobs (content-addressable)
│   └── metadata.db         ← bolt database with metadata
└── snapshots/
    └── overlayfs/          ← extracted filesystem snapshots
```

When BuildKit processes `FROM onkernel/nodejs22-base:0.1.1`, it:

1. Checks if the layer blobs exist in `content/blobs/sha256/`
2. Checks if extracted snapshots exist in `snapshots/`
3. If both exist, skips download + extraction entirely
4. If not, downloads from registry and extracts

By pre-populating both the blobs and snapshots, BuildKit takes the fast path in step 3 and never has to download or extract anything in step 4.
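
For completeness, a quick way to check whether a booted VM's store is actually warm (a minimal sketch; paths follow the layout above, and `buildctl du` reports cache usage as BuildKit sees it):

```bash
# A cold store has zero blobs and an essentially empty snapshots directory.
ls /home/builder/.local/share/buildkit/containerd/content/blobs/sha256/ | wc -l
du -sh /home/builder/.local/share/buildkit/snapshots/
buildctl du
```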

## Options Considered

### Option A: Pre-seed at builder image build time (Recommended)

Warm the content store during the builder Docker image build, then bake the result into the image.

**Build process:**

```bash
# 1. Build the base builder image as usual
docker build -t builder:base -f lib/builds/images/generic/Dockerfile .

# 2. Warm the content store
docker run -d --privileged --name warmup builder:base sleep infinity

docker exec warmup sh -c '
  buildkitd &
  sleep 2

  mkdir /tmp/warmup
  echo "FROM onkernel/nodejs22-base:0.1.1" > /tmp/warmup/Dockerfile
  echo "RUN true" >> /tmp/warmup/Dockerfile

  buildctl build \
    --frontend dockerfile.v0 \
    --local context=/tmp/warmup \
    --local dockerfile=/tmp/warmup \
    --output type=oci,dest=/dev/null

  kill $! && wait
'

# 3. Commit with warmed content store
docker commit warmup onkernel/builder-generic:latest

# 4. Push with OCI mediatypes
# (need to re-tag and push via buildx for OCI compliance)
docker rm -f warmup
```

**Could also be automated** via a `Makefile` target:

```makefile
build-builder-warmed:
	docker build -t builder:base -f lib/builds/images/generic/Dockerfile .
	./scripts/warm-builder-cache.sh builder:base onkernel/builder-generic:latest
```
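
The `warm-builder-cache.sh` script referenced above doesn't exist yet; a minimal sketch that wraps steps 2 and 3 of the build process (argument handling is illustrative, and the final OCI re-push from step 4 is left out):

```bash
#!/bin/sh
# Usage: warm-builder-cache.sh <source-image> <target-tag>
set -eu

SRC="$1"      # e.g. builder:base
TARGET="$2"   # e.g. onkernel/builder-generic:latest

docker rm -f warmup 2>/dev/null || true
docker run -d --privileged --name warmup "$SRC" sleep infinity

# Run a trivial build against the base image so BuildKit downloads and
# extracts its layers into the content store.
docker exec warmup sh -c '
  buildkitd &
  sleep 2
  mkdir -p /tmp/warmup
  printf "FROM onkernel/nodejs22-base:0.1.1\nRUN true\n" > /tmp/warmup/Dockerfile
  buildctl build \
    --frontend dockerfile.v0 \
    --local context=/tmp/warmup \
    --local dockerfile=/tmp/warmup \
    --output type=oci,dest=/dev/null
  kill $! && wait
'

# Bake the warmed store into the target image.
docker commit warmup "$TARGET"
docker rm -f warmup
```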

**Pros:**
- Every VM boots with the base image already extracted
- Zero per-tenant storage overhead
- No changes to the build pipeline — just a bigger builder image
- Eliminates ~10.6s extraction on every code-change build

**Cons:**
- Builder image grows by ~100-150MB (uncompressed base image layers)
- Must rebuild the builder image when base images change
- `docker commit` + re-push workflow is somewhat manual
- Only helps for base images known at builder image build time

### Option B: Persistent volume per tenant

Attach a persistent block device to each builder VM that survives across builds.

**Architecture:**

```
First build:
  VM boots with persistent volume mounted at /home/builder/.local/share/buildkit/
  → BuildKit extracts base image layers → written to persistent volume
  → Build completes
  → VM shuts down, volume persists

Second build:
  VM boots with same persistent volume
  → BuildKit finds layers already extracted
  → Skips download + extraction (0.0s)
```

**Implementation** (sketched below):
- Create a persistent ext4 volume per tenant (or per org)
- Mount it into the builder VM at BuildKit's content store path
- Manage lifecycle: create on first build, garbage collect after idle period
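
A host-side sketch of Option B, assuming Cloud Hypervisor attaches the volume as a second disk (paths, sizes, device names, and the GC policy are all illustrative, not an existing implementation):

```bash
# Create the tenant's volume on first build (sparse file, formatted ext4).
VOL=/var/lib/builder-volumes/tenant-1234.img   # illustrative path and naming
if [ ! -f "$VOL" ]; then
  truncate -s 2G "$VOL"    # size illustrative; the sparse file only grows as used
  mkfs.ext4 -q "$VOL"
fi

# Attach it to the builder VM as an extra disk, e.g. via Cloud Hypervisor's
# --disk flag (exact wiring depends on how builder VMs are launched today):
#   cloud-hypervisor ... --disk path="$VOL"

# Inside the guest, mount it over BuildKit's state dir before buildkitd starts:
#   mount /dev/vdb /home/builder/.local/share/buildkit

# Lifecycle: garbage-collect volumes untouched for 14 days (policy illustrative).
find /var/lib/builder-volumes -name '*.img' -mtime +14 -delete
```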

**Pros:**
- Works for ANY base image (not just pre-known ones)
- Gets faster over time as more layers accumulate
- Tenant cache and layer store in one place

**Cons:**
- Per-tenant storage cost (~200-500MB per tenant, grows over time)
- Needs volume lifecycle management (creation, cleanup, GC)
- Potential stale data issues (old layers accumulating)
- More complex VM setup (attach volume before boot)
- Cloud Hypervisor needs block device attachment support

### Option C: Shared read-only layer cache

Maintain a shared, read-only layer cache volume that contains extracted layers for common base images. Mount it into every builder VM.

**Architecture:**

```
Periodic job:
  Extracts layers for known base images into a shared volume
  → onkernel/nodejs22-base:0.1.1
  → onkernel/python311-base:0.1.0
  → etc.

Every build:
  VM boots with shared volume mounted read-only
  → BuildKit finds common layers already extracted
  → Uses copy-on-write for any new layers
```
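
A sketch of what the periodic job could look like, reusing the warm-up trick from Option A but pointing buildkitd's state root at the shared volume (device path and image list are illustrative; the read-only consumer side and any BuildKit config changes are out of scope here):

```bash
# Populate the shared volume by running a throwaway buildkitd whose state
# root lives on it, then building a trivial Dockerfile per known base image.
mount /dev/vdb /mnt/shared-layers          # writable only inside this job
buildkitd --root /mnt/shared-layers &
sleep 2
for IMG in onkernel/nodejs22-base:0.1.1 onkernel/python311-base:0.1.0; do
  DIR=$(mktemp -d)
  printf 'FROM %s\nRUN true\n' "$IMG" > "$DIR/Dockerfile"
  buildctl build \
    --frontend dockerfile.v0 \
    --local context="$DIR" --local dockerfile="$DIR" \
    --output type=oci,dest=/dev/null
done
kill $! && wait
```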

**Pros:**
- One volume serves all tenants
- Minimal storage overhead
- No per-tenant state to manage

**Cons:**
- Only helps for pre-known base images (same as Option A)
- Needs overlay/copy-on-write filesystem support
- Read-only mount needs BuildKit configuration changes
- More complex than Option A for similar benefit

## Recommendation

**Start with Option A** (pre-seed at build time). It's the simplest to implement, requires no infrastructure changes, and addresses the primary bottleneck. The only cost is a larger builder image (~100-150MB), which is negligible given the ~10s savings on every deploy.

### Expected impact

| Scenario | Current | With pre-seeded layers |
|----------|---------|------------------------|
| Code change deploy (first for tenant) | ~27s build | ~17s build (-37%) |
| Code change deploy (subsequent) | ~27s build | ~17s build (-37%) |
| No code change (cached) | ~1s build | ~1s build (unchanged) |
| Total deploy time (code change) | ~50s | ~40s |

The ~10s savings applies to every single code-change deployment across all tenants using `nodejs22-base`.

### Future work

If builder image size becomes a concern (multiple base images), consider:

1. **Option B** for tenants with high deploy frequency — persistent volumes amortize the extraction cost over many builds
2. **Lazy pulling** (eStargz/zstd:chunked) — BuildKit can pull and extract only the layers it actually needs, on demand. Requires base images published in eStargz format.
3. **Dockerfile restructuring** — splitting `COPY` into dependency-only and source-only steps to maximize cache hits on the `RUN` step, reducing the impact of cache misses (see the sketch below)
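
A sketch of the restructuring in item 3, assuming a pnpm project on the Node base image (file names and the final build command are illustrative, not the current Dockerfile):

```dockerfile
FROM onkernel/nodejs22-base:0.1.1
WORKDIR /app

# Dependency-only layer: changes rarely, so the expensive install step stays cached
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile

# Source-only layer: changes on every deploy, but only the cheap build step re-runs
COPY . .
RUN pnpm run build
```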

## Open Questions

1. Should we pre-seed multiple base images (nodejs, python, etc.) or just the most common one?
2. What's the acceptable builder image size increase? Each base image adds ~100-150MB.
3. Should the warm-up script be part of CI/CD, or a manual step when base images change?
4. Does Cloud Hypervisor's block device support make Option B viable for later?
