Skip to content

Conversation

@hiroTamada
Copy link
Contributor

@hiroTamada hiroTamada commented Jan 5, 2026

Note

Adds a secure source-to-image build system running in microVMs with API, queuing, storage, and registry integration.

  • New lib/builds package: queue, storage, cache keys, metrics, registry JWTs, vsock protocol, file-based secrets, and a guest builder_agent
  • API: adds BuildManager to ApiService, new endpoints (POST /builds, GET /builds, GET /builds/{id}, DELETE /builds/{id}, GET /builds/{id}/logs via SSE events) and server startup for build services
  • DI/wiring: provides BuildManager via wire and starts it in main
  • Config: build-related settings (MAX_CONCURRENT_SOURCE_BUILDS, BUILDER_IMAGE, REGISTRY_URL, BUILD_TIMEOUT, BUILD_SECRETS_DIR)
  • Registry/middleware: JWT-scoped repo tokens, Basic/Bearer handling for /v2/*, repo/scope validation, and internal VM fallback; tests updated to include full host in image refs
  • OCI: improve unpack/metadata to support Docker v2 → OCI via go-containerregistry
  • Tooling: Makefile targets for building the builder image and an E2E build test
  • Tests/docs: extensive unit tests and READMEs for the build system and builder image

Written by Cursor Bugbot for commit 528cf8d. This will update automatically on new commits. Configure here.

hiroTamada and others added 12 commits December 18, 2025 16:49
Proof of concept for a secure build system that runs rootless BuildKit
inside ephemeral Cloud Hypervisor microVMs for multi-tenant isolation.

Components:
- lib/builds/: Core build system (queue, storage, manager, cache)
- lib/builds/builder_agent/: Guest binary for running BuildKit
- lib/builds/templates/: Dockerfile generation for Node.js/Python
- lib/builds/images/: Builder image Dockerfiles

API endpoints:
- POST /v1/builds: Submit build job
- GET /v1/builds: List builds
- GET /v1/builds/{id}: Get build details
- DELETE /v1/builds/{id}: Cancel build
- GET /v1/builds/{id}/logs: Stream logs (SSE)
- Start vsock handler when build manager starts for builder VM communication
- Create config volume with build.json mounted at /config in builder VMs
- Mount source volume as read-write for generated Dockerfile writes
- Fix builder image Dockerfile: copy buildkit-runc as /usr/bin/runc
- Mount cgroups (v2 with v1 fallback) in microVM init script for runc
- Configure insecure registry flag in builder agent for HTTP registry push
- Add auth bypass for internal VM network (10.102.x.x) registry pushes
- Update README with comprehensive E2E testing guide and troubleshooting
Comprehensive documentation for creating, building, and testing
builder images including required components, OCI format build
process, and troubleshooting common issues.
Includes planned phases for cache optimization, security hardening,
additional runtimes, and observability. Documents threat model and
open design questions.
- Update builder_agent to LISTEN on vsock port 5001 instead of dialing out
- Update manager to connect TO builder VM's vsock socket with CH handshake
- Simplify vsock_handler to only contain message types

Cloud Hypervisor's vsock implementation requires host to dial guest,
not the other way around. This matches the pattern used by exec_agent.

Build results (digest, provenance, logs) now properly returned via vsock.
- Fix API port to 8083 (was 8080)
- Update builder image to hirokernel/builder-nodejs20:latest
- Add explicit Dockerfile to test source
- Fix log functions to output to stderr (avoid mixing with return values)
- Add environment variables documentation
- Remove deprecated runtime parameter from build submission
- Require Dockerfile in source tarball (fail early if missing)
- Update builder image reference to hypeman/builder:latest
- Update comments to reflect generic builder approach
- Replace runtime-specific builders (nodejs20, python312) with generic builder
- Users now provide their own Dockerfile instead of auto-generation
- Add JWT-based registry token authentication for builder VMs
  - Tokens scoped to specific build and cache repositories
  - 30-minute expiry for security
  - Support both Bearer and Basic auth (for BuildKit compatibility)
- Update builder agent to configure registry auth from token
- Fix auth middleware to handle Basic auth for registry paths
- Update API to make 'runtime' optional (deprecated)
- Add comprehensive documentation for building OCI-format images
- Delete deprecated: templates/, base/, nodejs20/, python312/ Dockerfiles

Breaking changes:
- Dockerfile is now required (in source tarball or as API parameter)
- Builder image must be built with 'docker buildx --output type=registry,oci-mediatypes=true'
- Update architecture diagram to show JWT token auth instead of insecure
- Add Registry Token System section documenting registry_token.go
- Add Metrics section documenting metrics.go
- Update cache example to remove outdated runtime references
- Add Registry Authentication section explaining token-based auth
- Update Security Model to include registry auth
- Fix Build and Push section to use docker buildx with OCI format
- Update E2E test example to use generic builder image
- Update troubleshooting for 401 errors with token-based auth
- Update config.json example to show registry_token and dockerfile fields
- Remove references to deleted templates package
Conflicts resolved:
- cmd/api/api/api.go: Added both BuildManager and ResourceManager
- cmd/api/config/config.go: Added both build system and hypervisor/oversubscription config
- cmd/api/wire.go, wire_gen.go: Added both providers
- lib/oapi/oapi.go: Regenerated from merged openapi.yaml
- openapi.yaml: Added both Build schemas and Resource schemas
- lib/system/init_script.go: Deleted (replaced by init/ directory in main)
@hiroTamada hiroTamada changed the title (WIP) hypeman build hypeman build Jan 7, 2026
@hiroTamada hiroTamada marked this pull request as ready for review January 7, 2026 15:07
@github-actions
Copy link

github-actions bot commented Jan 7, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: hypeman build
⚠️ hypeman-typescript studio · code

There was a regression in your SDK.
generate ⚠️build ✅lint ✅test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/39cdbb6af9289523d01964333ce46fea9c292b24/dist.tar.gz
⚠️ hypeman-go studio · code

There was a regression in your SDK.
generate ⚠️lint ✅test ✅

go get github.com/stainless-sdks/hypeman-go@85c8f8e0d7a60b563478edfde5977c61bba402ae
⚠️ hypeman-cli studio · conflict

There was a regression in your SDK.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-09 19:18:12 UTC

### Prerequisites

1. **Docker Buildx** with a container builder:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: we need a script or tooling for creating a new builder image

Use go-containerregistry instead of umoci's casext for manifest parsing,
which handles both Docker v2 and OCI v1 formats automatically.

Changes:
- extractOCIMetadata: Use layout.Path and Image.ConfigFile() from
  go-containerregistry which abstracts away manifest format differences
- unpackLayers: Get manifest via go-containerregistry, then convert to
  OCI v1 format for umoci's layer unpacker
- Add imageByAnnotation() helper to find images in OCI layout by tag
- Add convertToOCIManifest() to convert go-containerregistry manifest
  to OCI v1.Manifest for umoci compatibility
- Update documentation to remove OCI format requirement

This allows users to build images with standard 'docker build' without
needing 'docker buildx --output type=registry,oci-mediatypes=true'.
- Use context.Background() for volume deletion in defers instead of the
  build timeout context, matching the pattern used for instance cleanup
- Fix error variable shadowing in setup config volume error path (copyErr)
- Improve multipart form field parsing with proper error handling using
  io.ReadAll instead of bytes.Buffer with ignored errors
- Reject registry tokens from API authentication (defense-in-depth)
  - Check for repos, scope, build_id claims
  - Reject tokens with builder- subject prefix
  - Add comprehensive test coverage

- Fix indefinite blocking in waitForResult
  - Use goroutine + select for context-aware cancellation
  - Close connection to unblock decoder on timeout
  - Prevents resource leaks from unresponsive builders

- Fix Makefile builder targets
  - Replace non-existent nodejs20/python312 targets
  - Single build-builder target for generic builder
  - build-builders as backwards-compatible alias

# Run locally to test
docker run --rm hypeman/builder:local --help
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious how you think hypeman deployment process should load this builder image... should it be part of the deploy process or should hypeman itself perform this build on startup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use the same process as kernel-images, and runtime images where you manually publish images to public docker hub.


func (p *NoOpSecretProvider) GetSecrets(ctx context.Context, secretIDs []string) (map[string]string, error) {
return make(map[string]string), nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious how you're thinking about modeling / implementing build secrets...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually implemented this

Build System Cleanup:
- Remove deprecated RuntimeNodeJS20/RuntimePython312 constants
- Remove Runtime field from Build, CreateBuildRequest, BuildConfig
- Remove ToolchainVersion from BuildProvenance
- Update OpenAPI spec: remove runtime field, rename /builds/{id}/logs to /builds/{id}/events
- Add typed BuildEvent schema for SSE streaming
- Remove unused deref function from builds.go
- Update documentation (PLAN.md, README.md) to reflect generic builder

Security Fixes:
- Fix IP spoofing vulnerability: isInternalVMRequest now only trusts r.RemoteAddr
- Add registry token rejection to OapiAuthenticationFunc for defense-in-depth

Testing:
- Add comprehensive build manager unit tests with mocked dependencies
- Enhance E2E test script to run VM with built image after successful build
- Add --skip-run flag to E2E test for build-only testing
- Fix test race conditions by testing storage/queue directly

Documentation:
- Add lib/builds/TODO.md tracking remaining issues
- Mark completed security fixes and improvements
- Add BuildSecretsDir to Config struct
- Load from BUILD_SECRETS_DIR environment variable
- Update ProvideBuildManager to use FileSecretProvider when configured
- Log when build secrets are enabled
- Fix protocol deadlock: agent now proactively sends build_result when complete
  instead of waiting for host to request it (was causing builds to hang forever)
- Add secrets field to /builds POST API endpoint
- Add INFO logging for vsock communication debugging
- Regenerate oapi.go with secrets field support

The build agent would receive host_ready, handle secrets, and then loop waiting
for more messages. But it never sent anything back, and the host was also
waiting. Now the agent spawns a goroutine after host_ready to wait for build
completion and send the result automatically.
The secrets API flow is fully implemented and working:
- Host receives secrets from API
- Host sends secrets to builder agent via vsock
- Agent writes secrets to /run/secrets/
- BuildKit receives --secret flags

However, BuildKit's runc requires cgroup mounts when --secret
flags are present, which the current microVM doesn't have.
This is an infrastructure issue to be fixed separately.
Documents the root cause (missing /sys/fs/cgroup mount in VM init),
two proposed solutions (Option A: all VMs, Option B: builder-only),
and security analysis for team discussion.
- Update builder Dockerfile to build and include guest-agent binary
- Only copy proto files (not client.go) to avoid host-side dependencies
- Start guest-agent in builder-agent main() before build starts
- Guest-agent listens on vsock port 2222 for exec requests

Note: Testing blocked by cgroup issue (builds fail before we can exec).
Once cgroups are enabled, exec into builder VMs will work.
- Convert state to lowercase for comparison (API returns 'Running' not 'running')
- Use build ID matching for imported images (API normalizes registry names)

E2E test now passes for full build + VM run flow.
…conversion

Previously, when the builder pushed to 10.102.0.1:8083/builds/xxx, the registry
would extract only the path (/builds/xxx) and normalize it to docker.io/builds/xxx.

This was confusing because:
- docker.io implies Docker Hub, but these are local builds
- Could conflict with real Docker Hub images
- Lost the original registry URL

The fix includes the request's Host header when building the full repository path,
so images are now stored as 10.102.0.1:8083/builds/xxx as expected.
Removed completed items:
- IP spoofing vulnerability fix
- Registry token scope leakage fix
- Vsock read deadline handling
- SSE streaming implementation
- Build secrets via vsock
- E2E test enhancement
- Build manager unit tests
- Guest agent on builder VMs
- Runtime/toolchain cleanup

Remaining tasks:
- Enable cgroups for BuildKit secrets (blocked on team discussion)
- Builder image tooling
- Keep failed builders for debugging
if m.metrics != nil {
m.metrics.RecordBuild(ctx, "failed", duration)
}
return
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancelled build status overwritten by concurrent runBuild goroutine

Medium Severity

When CancelBuild is called on a running build, it sets the status to "cancelled" and deletes the builder instance. However, the runBuild goroutine continues executing and eventually fails (due to the deleted instance), then calls updateBuildComplete with StatusFailed. Since neither runBuild nor updateBuildComplete checks if the build was already cancelled, the status is overwritten from "cancelled" to "failed". This causes incorrect user-visible behavior where a successfully cancelled build shows as "failed".

Additional Locations (1)

Fix in Cursor Fix in Web

After the registry fix to preserve host in image names, tests need to
include the serverHost when looking up images.

Also fixes TestRegistryPushAndConvert timeout issue.

Note: Some tests may still fail due to Docker Hub rate limiting.
Mount cgroup2 filesystem at /sys/fs/cgroup during VM init and bind-mount
it to the new root. This enables runc (used by BuildKit) to work properly
for builds with secrets.

Security notes:
- cgroup v2 has no release_agent escape vector (unlike v1)
- VMs are already isolated by Cloud Hypervisor (hardware boundary)
- This is non-fatal if the kernel doesn't support cgroup2

To activate, rebuild initrd: make init && make initrd
credentials := strings.SplitN(string(decoded), ":", 2)
if len(credentials) == 0 {
return "", "", fmt.Errorf("invalid basic auth format")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code check allows empty tokens through

Low Severity

The condition len(credentials) == 0 in extractTokenFromAuth is dead code that can never be true. The strings.SplitN function with n=2 always returns at least one element, even for empty input strings (e.g., strings.SplitN("", ":", 2) returns [""]). This means the check never triggers, and an empty token from malformed Basic auth (e.g., base64 of ":") would be returned instead of an error. While the empty token would fail JWT validation downstream, the check represents a misunderstanding of the API and doesn't provide the intended validation.

Fix in Cursor Fix in Web

if err := encoder.Encode(VsockMessage{Type: "build_result", Result: result}); err != nil {
log.Printf("Failed to send build result: %v", err)
}
}()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goroutine writes to potentially closed connection

Medium Severity

The goroutine spawned on host_ready captures encoder which writes to the connection, but the parent function has defer conn.Close(). If the host disconnects before the build completes (e.g., timeout), handleHostConnection returns and closes the connection via defer. When the build eventually completes, the goroutine wakes up and tries to write the result to the already-closed connection, causing the result to be lost. This can cause successful builds to be incorrectly marked as failed on the host side.

Additional Locations (1)

Fix in Cursor Fix in Web

The BuildEvent type and event type constants were accidentally removed
from types.go, causing build failures in lib/providers.
}

meter := otel.GetMeterProvider().Meter("hypeman")
return builds.NewManager(p, buildConfig, instanceManager, volumeManager, secretProvider, log, meter)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nil pointer dereference when secret provider is unconfigured

High Severity

When BuildSecretsDir is empty (the default configuration), secretProvider is initialized as nil and passed to NewManager. Later in waitForResult, when a builder agent sends a get_secrets message, line 503 calls m.secretProvider.GetSecrets() without a nil check, causing a panic. A NoOpSecretProvider exists in vsock_handler.go but isn't used as a fallback. Any build request that includes secrets will crash the server when the default configuration is used.

Additional Locations (1)

Fix in Cursor Fix in Web

When SKIP_DOCKER_HUB_TESTS=1 is set, tests that require pulling images
from Docker Hub are skipped. This allows CI to run without being blocked
by Docker Hub rate limiting.

Affected tests:
- lib/images: TestCreateImage*, TestListImages, TestDeleteImage, TestLayerCaching
- lib/instances: TestBasicEndToEnd, TestStandbyAndRestore, TestExecConcurrent,
  TestCreateInstanceWithNetwork, TestQEMU*, TestVolume*, TestOverlayDisk*,
  TestAggregateLimits_EnforcedAtRuntime
- lib/system: TestEnsureSystemFiles
- cmd/api/api: TestCreateImage*, TestCreateInstance*, TestInstanceLifecycle,
  TestRegistry*, TestCp*, TestExec*
- integration: TestSystemdMode
- Fix closure capturing loop variable in recoverPendingBuilds (add meta := meta shadow)
- Fix premature loop exit in StreamBuildEvents when receiving non-terminal status events
- Fix race condition where cancelled builds could be overwritten to 'failed' status
- Fix nil panic when secretProvider is not configured (use NoOpSecretProvider fallback)
The second CreateImage call can return either 'pending' (still processing)
or 'ready' (already completed) depending on CI speed and caching.
The key idempotency invariant is that the digest is the same, not the status.
select {
case <-ctx.Done():
cmd.Process.Kill()
return
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing Wait() after Kill() causes zombie process leak

Medium Severity

In StreamBuildEvents, the tail process is killed via cmd.Process.Kill() in multiple return paths (context cancellation, build completion) without calling cmd.Wait() to reap the process. Only line 874 calls Wait(). When a process is killed but not waited on, it becomes a zombie process that persists until the parent process terminates. For a long-running server handling many SSE log streams, this accumulates zombie processes over time, potentially exhausting process table entries.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on the build system! This is a well-designed, comprehensive feature that adds source-to-image builds in isolated microVMs with proper security boundaries.

Highlights:

  • Clean architecture with BuildKit running rootless inside ephemeral VMs
  • Good tenant isolation via scoped registry tokens and cache keys
  • Solid test coverage across core components
  • Thorough documentation in the README

I've left some inline comments—mostly minor suggestions and a few questions about edge cases. Nothing blocking.

🚀 Ship it!

timestamp:
type: string
format: date-time
description: Build completion timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but maybe completed_at would be a better name? Actually maybe this is just redundant with Build.completed_at and should be removed? Also I was somewhat expecting the provenance object to contain the started-at time ("provenance" chat tells me, comes from the French provenir, meaning "to come from" or "to originate"). But then it also seems redundant with the timestamp for started at in the main build object.

build-builders: build-builder

# Run E2E build system test (requires server running: make dev)
e2e-build-test:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build-builders alias seems premature if there's only one builder image—consider removing it unless there's a concrete plan for multiple builders soon.

MaxConcurrentSourceBuilds int // Max concurrent source-to-image builds
BuilderImage string // OCI image for builder VMs
RegistryURL string // URL of registry for built images
BuildTimeout int // Default build timeout in seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be the hypeman registry URL, right? Since it relies on custom auth specific to hypeman, maybe it should not be configurable.

switch part.FormName() {
case "source":
sourceData, err = io.ReadAll(part)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multipart parsing loop has no size limit on sourceData—could be a DoS vector if someone uploads a huge file. Should there be a max size check?

Message: "build not found",
}, nil
}
log.ErrorContext(ctx, "failed to stream build events", "error", err, "id", request.Id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If json.Marshal(event) fails here, it silently continues. Might be worth logging this error.

for _, ch := range m.statusSubscribers[buildID] {
// Non-blocking send - drop if channel is full
select {
case ch <- event:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cancelling a building/pushing job, this deletes the instance but doesn't wait for confirmation. The deferred cleanup in executeBuild might race with this—is that okay?

vsockPort = 5001 // Build agent port (different from exec agent)
)

// BuildConfig matches the BuildConfig type from lib/builds/types.go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here noting that these types must be kept in sync with lib/builds/types.go and are only duplicated since this is a separate binary.

case <-ctx.Done():
setResult(BuildResult{
Success: false,
Error: "build timeout while waiting for secrets",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think secrets timeout would be a hard fail? If secrets are required for the build, proceeding without them will just cause a more confusing error later.


const userIDKey contextKey = "user_id"

// registryPathPattern matches /v2/{repository}/... paths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment that this type needs to stay in sync with lib/builds/registry_token.go and lib/builds/builder_agent/main.go.

}
log.DebugContext(r.Context(), "registry token validation failed", "error", err)
} else {
log.DebugContext(r.Context(), "failed to extract token", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This IP fallback is marked as deprecated—is there a plan/timeline to remove it?

@hiroTamada hiroTamada mentioned this pull request Jan 9, 2026
18 tasks
@hiroTamada hiroTamada merged commit d353357 into main Jan 9, 2026
4 checks passed
@hiroTamada hiroTamada deleted the feat/build-system branch January 9, 2026 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants