This document records the key architectural and technical decisions made during GoShip's development. It is intended for contributors who want to understand what was decided and why, without digging through blog posts or git history.
Each entry follows a lightweight ADR (Architecture Decision Record) format: a one-line decision, the context that prompted it, alternatives that were considered, and the rationale for the choice made.
- Entries are grouped by implementation step and ordered chronologically.
- Status: Accepted means the decision is current.
- Status: Superseded means a later decision replaced it (linked via "Superseded by").
- Cross-references point to related ADRs where decisions interact.
- Decision: Use libvirt (via the `libvirt.org/go/libvirt` Go bindings) as the primary VM lifecycle manager instead of shelling out to QEMU directly.
- Context: GoShip needs to create, start, stop, and destroy VMs programmatically. QEMU can be invoked directly, but managing the full lifecycle (domain definitions, network setup, device hotplug) manually is error-prone.
- Alternatives considered: Direct QEMU invocation via `exec.Command`; Kata Containers; Firecracker.
- Rationale: Libvirt provides a stable API for domain XML, network management, storage pools, and device configuration. It handles details like PCI address assignment, MAC generation, and process management. The `ProjectRuntime` interface allows swapping backends later (Kata, Firecracker) without changing the control plane. See `docs/DESIGN.md` section 11 for the runtime evolution roadmap.
- Status: Accepted
- Decision: Connect to the system-wide libvirtd daemon via `qemu:///system`, not the per-user `qemu:///session`.
- Context: Libvirt has two daemon modes. The session daemon is unprivileged but only supports SLIRP (user-mode) networking with no bridging or virtual network support.
- Alternatives considered: `qemu:///session` (no root/group membership needed, but no real networking).
- Rationale: GoShip needs `<interface type='network'>` for host-VM communication over libvirt's NAT bridge. SLIRP networking cannot support this. The trade-off is requiring membership in the `libvirt` group, which is acceptable for a development tool.
- Status: Accepted
- Decision: Use the `q35` machine type in domain XML instead of the older `i440fx` (`pc`).
- Context: QEMU supports two main x86 machine types. The `i440fx` is the legacy default; `q35` is the modern chipset with PCIe, AHCI, and better device topology.
- Alternatives considered: `i440fx` (`pc`) — wider compatibility but aging architecture.
- Rationale: `q35` supports both BIOS and UEFI boot, has native PCIe (important for future device passthrough and confidential computing), and is the recommended machine type for new VMs. It doesn't lock us out of any future feature.
- Status: Accepted
- Decision: Use BIOS firmware (SeaBIOS, the QEMU default) instead of UEFI (OVMF) for Phase 0.
- Context: The domain XML `<os>` section can specify either firmware. UEFI requires the `ovmf` package on the host and additional XML elements (`<loader>`, `<nvram>`).
- Alternatives considered: UEFI (OVMF) — needed for Secure Boot and Confidential Computing (SEV-SNP, TDX).
- Rationale: BIOS produces a simpler domain XML (three lines in `<os>`, no `<loader>` or `<nvram>` paths), requires no extra host packages, and is sufficient for learning VM lifecycle management. The `q35` machine type supports both, so switching to UEFI later doesn't require a machine type change. See also ADR-009.
- Status: Accepted
- Decision: Create per-VM disk images as QCOW2 copy-on-write overlays backed by a shared base image, rather than copying the full base image for each VM.
- Context: Each VM needs its own writable disk. Copying a 200MB+ base image for every VM wastes disk space and time.
- Alternatives considered: Full copy of base image per VM; thin-provisioned raw images.
- Rationale: CoW overlays start near-zero size and only grow as the VM writes data. Creation is instant (`qemu-img create -f qcow2 -F qcow2 -b <base>`). The base image stays immutable and reusable. Cleanup is trivial: `os.RemoveAll` on the VM directory removes everything.
- Status: Accepted
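The overlay creation step can be sketched in Go. This is a minimal illustration, not GoShip's actual code — the `overlayArgs` helper and paths are hypothetical:

```go
// Sketch: create a per-VM QCOW2 overlay backed by a shared base image.
package main

import (
	"fmt"
	"os/exec"
)

// overlayArgs builds the qemu-img invocation for a CoW overlay.
// -F declares the backing file's format explicitly, which newer
// qemu-img versions require.
func overlayArgs(base, overlay string) []string {
	return []string{"create", "-f", "qcow2", "-F", "qcow2", "-b", base, overlay}
}

func main() {
	args := overlayArgs("/var/lib/goship/base.qcow2", "/tmp/vm1/disk.qcow2")
	cmd := exec.Command("qemu-img", args...)
	fmt.Println(cmd.String()) // print rather than run in this sketch
}
```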
- Decision: Add `<seclabel type='none'/>` to domain XML to disable libvirt's security driver confinement.
- Context: With `qemu:///system`, QEMU runs as the `libvirt-qemu` user. Disk images in `~/.goship/` are inaccessible due to home directory permissions (0750) and AppArmor confinement.
- Alternatives considered: Move images to `/var/lib/libvirt/images/` (standard path, already allowed by AppArmor); configure custom AppArmor profiles; use DAC labels to run QEMU as the current user. See ADR-034.
- Rationale: `<seclabel type='none'/>` is the simplest path for a single-user development tool. It disables all security driver confinement for GoShip VMs. This is explicitly a Phase 0 shortcut — the DAC label approach (ADR-034) is the planned hardening path.
- Status: Accepted (superseded in practice by ADR-034 for file access; both are set in Phase 0)
- Decision: Use `~/.goship/` as the default data directory for VM disk images, cloud-init ISOs, sockets, and state.
- Context: GoShip needs a persistent location for VM artifacts. System paths like `/var/lib/goship/` require `sudo` for initial setup.
- Alternatives considered: `/var/lib/goship/` (follows FHS, avoids permission issues with `qemu:///system`); `/var/lib/libvirt/images/goship/` (standard libvirt path, AppArmor-friendly).
- Rationale: `~/.goship/` provides a zero-setup experience: clone, build, run. No `sudo`, no system directory creation. This mirrors the `~/.docker/` and `~/.kube/` conventions. The trade-off is needing security label workarounds (ADR-006, ADR-034) for QEMU file access. A system path is the better long-term answer.
- Status: Accepted
- Decision: Use cloud-init with the NoCloud data source (ISO attached as CDROM) for first-boot VM provisioning.
- Context: After creating a VM, it boots into a generic Alpine image with no SSH key, no hostname, no identity. Manual console configuration is not automation.
- Alternatives considered: Manual configuration via serial console; Packer-based custom image builds; pre-configured images with baked-in credentials.
- Rationale: Cloud-init is the industry standard for first-boot provisioning (AWS, GCP, Azure, OpenStack all use it). NoCloud reads configuration from a local CDROM ISO, requiring no metadata server. The base image stays generic; per-VM identity travels as a sidecar ISO. The ISO is read-only, disposable, and cleaned up with the VM directory.
- Status: Accepted
- Decision: Use `nocloud_alpine-*-x86_64-bios-cloudinit-*.qcow2` as the base VM image.
- Context: Alpine publishes multiple variants: `bios` vs `uefi` disk layout, `nocloud` vs other data sources.
- Alternatives considered: UEFI variant (GPT-partitioned with EFI System Partition); Ubuntu or Debian cloud images.
- Rationale: The `bios` variant matches the SeaBIOS firmware choice (ADR-004) — mismatching firmware and disk layout causes boot failures. Alpine is minimal (~200MB), boots fast, and includes cloud-init. The `nocloud` data source matches our ISO-based provisioning approach (ADR-008).
- Status: Accepted
- Decision: Use virtio-serial (paravirtualized serial port over Unix socket) as the host-to-VM communication channel.
- Context: The control plane needs to send commands to the in-VM agent (deploy, stop, status). The channel must work without network dependency, with low latency, and be expressible in libvirt domain XML.
- Alternatives considered:
- SSH — Requires key management, network stack, sshd running in VM. Too heavy for a control channel.
- QEMU Guest Agent — Opinionated, limited to predefined commands (file ops, freeze/thaw). Not extensible for custom actions.
- Virtio-vsock (AF_VSOCK) — Modern and promising, but requires the `VHOST_VSOCK` kernel module and CID management. Less mature tooling on Alpine.
- 9P/virtiofs — Shared filesystem, awkward for request/response (would need polling or inotify).
- Rationale: Virtio-serial provides a Unix domain socket on the host and a character device (`/dev/virtio-ports/goship.0`) in the VM. It's full-duplex, low overhead, debuggable with `socat`, and has first-class libvirt support. The protocol is entirely under our control.
- Status: Accepted
- Decision: Use newline-delimited JSON as the wire protocol over the virtio-serial channel.
- Context: Need a protocol for request/response messages between host and VM agent.
- Alternatives considered: Binary protocol (protobuf, msgpack); custom text protocol.
- Rationale: JSON is human-readable (debuggable with `socat` or `cat`), supported by Go's stdlib (`encoding/json`), and appropriate for control plane traffic (small command messages, not data streams). Parsing overhead is irrelevant for 50-byte messages. A binary protocol would add complexity for zero benefit at this scale.
- Status: Accepted
- Decision: Use `mode='bind'` (not `mode='connect'`) for the virtio-serial Unix socket.
- Context: Libvirt supports two socket modes: `bind` (QEMU creates and listens on the socket) and `connect` (QEMU dials an existing socket you created).
- Alternatives considered: Connect mode — host code creates the socket, QEMU dials it.
- Rationale: In bind mode, the socket lifecycle matches the VM lifecycle automatically — QEMU creates it on VM start and removes it on VM stop. No orphan socket cleanup needed. Connect mode would require the host to manage socket creation and lifecycle independently.
- Status: Accepted
- Decision: Place the virtio-serial Unix socket (`goship.sock`) in the same directory as the VM's disk image (`~/.goship/vms/{name}/`).
- Context: The socket file needs a predictable location. Options include a central socket directory or per-VM placement.
- Alternatives considered: Central socket directory (e.g., `~/.goship/sockets/`); `/var/run/goship/`.
- Rationale: Co-location means cleanup is trivial — `os.RemoveAll` on the VM directory removes disk, cloud-init ISO, and socket together. No separate socket registry, no orphan cleanup on crash. The trade-off (needing the VM name to find the socket) is acceptable since the state store already tracks this.
- Status: Accepted
- Decision: Build a custom in-guest agent (`goship-init`) that runs as PID 1 inside VMs, handling filesystem mounts, networking, and command dispatch.
- Context: The VM needs an agent to receive commands over the virtio-serial channel and execute them (deploy apps, stop processes, report status).
- Alternatives considered: Use Alpine's init system (OpenRC) as PID 1 and run goship-init as a regular service; use systemd inside the VM.
- Rationale: Running as PID 1 gives the agent full control over the VM's lifecycle, including mounting `/proc`, `/sys`, and `/dev`, and bringing up networking. When running under OpenRC (which is also supported), the PID 1 duties are skipped via an `os.Getpid() == 1` check. This dual-mode design supports both bare-init and managed-service scenarios.
- Status: Accepted
- Decision: Build `goship-init` as a fully static binary with `CGO_ENABLED=0`.
- Context: The binary runs inside minimal VM images where specific shared libraries may not exist.
- Alternatives considered: Dynamic binary with bundled shared libraries; Alpine-compatible musl-linked binary.
- Rationale: `CGO_ENABLED=0` forces pure Go compilation, producing a self-contained binary with no libc dependency. This guarantees it runs on any Linux VM regardless of the installed C library. Verified with `file bin/goship-init` showing "statically linked".
- Status: Accepted
- Decision: Provision `goship-init` into each VM's CoW overlay disk during `goshipctl project create`, rather than baking it into the shared base image.
- Context: The original approach (baking `goship-init` into the base image via `virt-customize` during `image build`) was fragile: GitHub release 404s, `apk add` failures during offline image customization, and one bad base image contaminating every VM.
- Alternatives considered: Keep baking into the base image (original approach, ADR-014's initial implementation); download the binary inside the VM via cloud-init `runcmd`.
- Rationale: Per-VM provisioning keeps the base image immutable and plain. Failures are contained to a single VM creation. Iterating on `goship-init` no longer requires rebuilding the base image. The `virt-customize` tool operates on the CoW overlay, so the base image is never mutated. This was a rework of Step 6 after repeated failures with the original approach.
- Status: Accepted (supersedes the original "bake into base image" approach)
- Decision: Install Docker packages via the cloud-init `packages` directive (at first boot, with network access) instead of via `virt-customize` (offline image modification).
- Context: `virt-customize` runs offline — no network access inside the image, so `apk add docker` only works if packages are cached. Cloud-init runs at first boot, when the VM has network access via DHCP.
- Alternatives considered: Pre-install Docker in the base image; download the Docker binary manually via cloud-init `runcmd`.
- Rationale: Cloud-init's `packages` directive is the standard way to install packages on first boot. The VM has network access (DHCP from libvirt's default network), so package installation works reliably. This also removed Docker-related code from `guest_provision.go`, simplifying the provisioning pipeline.
- Status: Accepted
- Decision: Use a single `sync.Mutex` to serialize access to the virtio-serial socket in `VMCommunicator`, rather than a connection pool.
- Context: The communicator sends commands to the VM agent over a single Unix socket connection. Concurrent access could corrupt the NDJSON framing.
- Alternatives considered: Connection pooling (multiple concurrent connections to the socket); per-command connection (dial, send, read, close).
- Rationale: In Phase 0, commands are sequential — deploy, then status, then stop. There's no concurrent command stream. A connection pool adds complexity for zero benefit. Per-command connections would work but add latency for the connect/disconnect overhead on every operation.
- Status: Accepted
- Decision: Default to a 30-second timeout for virtio-serial commands, overridable via context deadline.
- Context: The timeout must accommodate the worst case: VM just booted, cloud-init still running, GoShip Init not yet listening.
- Alternatives considered: Shorter default (5 seconds) with caller-side retry; no default timeout (rely on context).
- Rationale: 30 seconds covers the first-boot window where cloud-init is still running and services haven't started yet. Callers can set shorter deadlines via context for operations like ping where fast feedback matters. No retry logic in the communicator — retry policy belongs to the caller who knows the operational context.
- Status: Accepted
- Decision: Use a JSON file (`~/.goship/state.json`) with `sync.RWMutex` for state persistence instead of a database.
- Context: GoShip needs to persist project and app definitions across CLI invocations.
- Alternatives considered: SQLite; BoltDB/bbolt; etcd.
- Rationale: A JSON file is the simplest persistence mechanism that works. It's human-readable (debuggable with `cat`), requires no dependencies, and every write persists to disk immediately. The `sync.RWMutex` provides thread safety for concurrent reads. This is explicitly a Phase 0 choice — the state store is concrete (no interface) and will be replaced with SQLite or similar in later phases.
- Status: Accepted
- Decision: Split the app lifecycle into `app create` (define in state store) and `app deploy` (send to VM), rather than a single command that does both.
- Context: Users need to define app specifications (image, ports, env vars, resources) and then deploy them to VMs.
- Alternatives considered: A single `app deploy` that creates and deploys in one step.
- Rationale: Separation lets users define apps, review them with `app info`, edit with `app edit`, and deploy when ready. It mirrors the declarative model where desired state is defined first, then applied. The state store holds the app definition; the VM holds the running instance. `app delete` does best-effort cleanup on the VM, then always removes from state (ADR-024).
- Status: Accepted
- Decision: Support two app execution modes inside VMs: `container` (Docker/Podman) and `process` (direct binary with supervisor).
- Context: Some workloads are naturally containerized; others are single Go binaries or custom daemons that don't need container overhead.
- Alternatives considered: Container-only (simpler, but forces containerization of everything); process-only (misses the container ecosystem).
- Rationale: The `ExecutorManager` routes by mode and aggregates status from both engines. Docker is tried first for stop/remove (cheaper name lookup), with the process manager as fallback. Both implement the same `AppExecutor` interface. This matches the design doc's supported execution modes.
- Status: Accepted
- Decision: Stream cloud-init logs to the terminal during `project create` so users can see what the VM is doing.
- Context: VM creation takes 30-60 seconds (cloud-init installs packages, starts services). Without feedback, users see a blank terminal and don't know if it's working.
- Alternatives considered: Simple spinner with no detail; silent creation with a post-check.
- Rationale: The `ProgressWriter` polls the VM's cloud-init log (`/var/log/cloud-init-output.log`) via the virtio-serial channel and prints new lines as they appear. This gives real-time visibility into Docker installation, service startup, and any errors. The offset tracker ensures only new lines are printed on each poll.
- Status: Accepted
- Decision: When deleting an app, attempt to stop and remove it from the VM (best-effort), then always remove it from the state store regardless of VM operation success.
- Context: The VM might be stopped, unreachable, or in an error state when the user wants to delete an app definition.
- Alternatives considered: Require VM to be running for delete; leave state store entry if VM operation fails.
- Rationale: The state store is the source of truth for desired state. If the VM is unreachable, the user still needs to clean up app definitions. The app will be re-removed from the VM on next reconciliation or won't exist if the VM is destroyed. Blocking on VM reachability would create a poor user experience.
- Status: Accepted
- Decision: Transfer binaries from host to VM using a three-phase protocol (begin/data/finish) over the NDJSON virtio-serial channel, with 512KB base64-encoded chunks and SHA256 integrity verification.
- Context: Process-mode apps reference host-side binaries that don't exist inside the VM. The virtio-serial channel speaks NDJSON, so large binaries can't fit in a single message.
- Alternatives considered: SCP/SSH-based transfer (requires network + sshd); shared filesystem (9p/virtiofs); single large JSON message.
- Rationale: The chunked protocol reuses the existing virtio-serial channel with no new infrastructure. 512KB chunks balance transfer speed against JSON message size (~700KB after base64 expansion). SHA256 verification catches corruption. Atomic rename (a `.new` suffix during transfer) prevents partial uploads from corrupting existing binaries. Path sanitization (`filepath.Base`) prevents directory traversal attacks.
- Status: Accepted
- Decision: Implement auto-restart for managed processes with exponential backoff (1s, 2s, 4s, 8s, 16s, capped at 30s, max 5 attempts) and a `stopping` flag to prevent restart during an explicit stop/remove.
- Context: Process-mode apps that crash should be restarted automatically (if the restart policy says so), but restart loops waste CPU and fill logs.
- Alternatives considered: Immediate restart with no backoff; fixed-interval restart; delegate to systemd.
- Rationale: Three restart policies (`always`, `on-failure`, `never`) cover common use cases. Exponential backoff prevents tight restart loops. The `stopping` flag prevents a race where `Stop()` sends SIGTERM, the process exits, and the monitor tries to restart it. The mutex is released during the backoff sleep to avoid blocking other operations.
- Status: Accepted
- Decision: Run managed processes in their own process groups via `cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}`.
- Context: When stopping a process, SIGTERM should go to the managed process and its children, not to the GoShip Init agent.
- Alternatives considered: Send signals directly to the PID (misses child processes); use cgroups for process tracking.
- Rationale: `Setpgid: true` isolates the managed process's process group from the agent. The stop sequence sends SIGTERM, waits up to 10 seconds on the `done` channel, then falls back to SIGKILL. The `done` channel (not a second `cmd.Wait()` call) coordinates between the monitor goroutine and the stop logic to avoid double-wait panics.
- Status: Accepted
- Decision: Check `/proc/<pid>/status` for kernel-level process state instead of relying solely on in-memory state tracking.
- Context: The in-memory state might say "running" when the process was killed by the OOM killer or became a zombie.
- Alternatives considered: Rely only on in-memory state from the monitor goroutine; use `kill(pid, 0)` for an existence check only.
- Rationale: `/proc/<pid>/status` provides the actual kernel process state: `S (sleeping)`, `R (running)`, `Z (zombie)`, `T (stopped)`. If the file doesn't exist, the process has exited. This enriches the CLI output from just "running" to `running (pid 1234, S (sleeping))`, which is much more useful for debugging.
- Status: Accepted
- Decision: Redirect each managed process's stdout/stderr to its own log file at `/var/log/goship-<appname>.log` in `O_APPEND` mode.
- Context: Previously, all managed processes shared the Init agent's stdout, making logs from different apps indistinguishable.
- Alternatives considered: Structured logging to a central file with app name prefixes; in-memory ring buffer.
- Rationale: Per-process files integrate with the existing `ActionLogs` handler (which reads any file under `/var/log/`) with no new protocol needed. `O_APPEND` ensures logs survive process restarts. The log file handle is stored in `managedProcess` and cleaned up on Remove (but the file persists on disk for post-mortem analysis).
- Status: Accepted
- Decision: Build `goshipctl` with `-tags libvirt_dlopen` so the binary loads libvirt at runtime via `dlopen()` instead of linking at compile time.
- Context: The default Go libvirt bindings use `pkg-config: libvirt`, which creates a hard version dependency. A binary built on libvirt 9.7.0 won't start on a machine with libvirt 9.0.0 due to ELF symbol version mismatches.
- Alternatives considered: Static linking against libvirt (complex, not well-supported); building on the oldest supported version; shipping per-distro binaries.
- Rationale: With `libvirt_dlopen`, the binary has no compile-time libvirt dependency. At runtime, it calls `dlopen("libvirt.so.0", ...)` and resolves symbols lazily. Functions that exist work normally; missing functions fail gracefully at the call site. CGO is still required (the dlopen calls themselves are C code), but `libvirt-dev` is not needed at build time.
- Status: Accepted
- Decision: Detect KVM availability at runtime and fall back to QEMU TCG (software emulation) when `/dev/kvm` is not present.
- Context: GoShip was originally KVM-only. On cloud VMs without nested virtualization, in containers without `/dev/kvm`, or on laptops with VT-x disabled, VM creation would fail.
- Alternatives considered: Require KVM (simpler, but limits where GoShip runs); always use TCG (works everywhere, but wastes hardware acceleration).
- Rationale: The domain XML `type` attribute switches between `kvm` and `qemu` based on a single `EnableKVM` boolean from capabilities discovery. TCG is 10-100x slower for CPU-bound work, but I/O (virtio) performance is identical. For development and testing, TCG is perfectly usable. A warning is printed when falling back to TCG.
- Status: Accepted
- Decision: Use `host-passthrough` CPU mode when KVM is available and `host-model` when falling back to TCG.
- Context: The CPU mode was originally hardcoded to `host-passthrough`, which requires KVM (it passes the real host CPU through to the guest). Without KVM, libvirt rejects the domain or QEMU crashes.
- Alternatives considered: Always use `host-model` (works everywhere but masks CPU features under KVM); use a specific CPU model like `Skylake-Server`.
- Rationale: `host-passthrough` gives the best performance and full CPU feature exposure under KVM. `host-model` is libvirt's recommended mode for maximum compatibility — it approximates the host CPU using QEMU's built-in definitions and works with both KVM and TCG. Both conditionals use the same `EnableKVM` field, keeping the logic simple. See ADR-031.
- Status: Accepted
- Decision: Automatically ensure the libvirt `default` network is active before VM creation, including defining it from scratch if missing.
- Context: After a fresh install or reboot, the `default` network is often inactive. VM creation fails with `network 'default' is not active`. Users shouldn't need to remember `virsh net-start default`.
- Alternatives considered: Require manual network setup (documented in the README); use bridge networking instead of libvirt networks.
- Rationale: The `EnsureNetwork` function handles three cases: network active (no-op); network inactive (start + enable autostart); network missing and name is "default" (define from standard XML + start + enable autostart). For non-default networks that don't exist, an error with instructions is returned. Setting autostart makes this a one-time fix.
- Status: Accepted
- Decision: Set a DAC (Discretionary Access Control) security label in the domain XML to run QEMU as the current user's UID:GID, using numeric IDs with the `+` prefix.
- Context: With `qemu:///system`, QEMU runs as `libvirt-qemu` by default and cannot access files in `~/.goship/`. ADR-006 disables all security drivers as a workaround, but that's the nuclear option.
- Alternatives considered: Move images to `/var/lib/libvirt/images/` (eliminates the problem entirely — this is the best long-term answer); keep `seclabel type='none'` only; configure AppArmor profiles.
- Rationale: The DAC label `+<uid>:+<gid>` (via `os.Getuid()`/`os.Getgid()`) tells libvirt to run QEMU as the current user. Numeric IDs (with the `+` prefix) are more portable than usernames — they work in containers and minimal environments. `relabel='no'` prevents libvirt from `chown`ing disk images. Currently, `SecurityNone=true` takes priority in the XML template, but the DAC label is populated and ready for when security is tightened in later phases.
- Status: Accepted
- Decision: When a compose service has `build:` but no `image:`, run `docker build` on the host and push the resulting image into the VM using the existing `pushLocalImage` flow, rather than skipping the service.
- Context: Docker Compose files commonly use `build: .` for application services that are built from local source. GoShip's compose parser previously skipped these services with a warning ("build is not supported"), which meant users had to manually build and tag images before deploying.
- Alternatives considered:
- Keep skipping — Simple, but breaks the most common compose workflow, where the app service uses `build:`.
- Build inside the VM — Send the build context into the VM and run `docker build` there. Avoids the host Docker dependency but requires transferring potentially large source trees over virtio-serial and having Docker build tooling ready inside the VM.
- Require explicit `image:` alongside `build:` — Force users to add an image name. Less ergonomic and non-standard compared to Docker Compose behavior.
- Rationale: Building on the host matches the Docker Compose workflow: `docker compose build` runs on the host, then images are used by services. GoShip already has the `pushLocalImage` mechanism (`docker save`, gzip, virtio-serial transfer, `docker load`) used by `app deploy --local-image` and `app push-image`. Reusing this for compose builds requires no new protocol or infrastructure. A deterministic image name (`goship-<service>:latest`) is generated for services without an explicit `image:` field. The `--build` flag (default: `true`) lets users skip builds when images are already pushed.
- Status: Accepted
- Decision: Resize VM resources (CPU, memory, disk) by stopping the VM, redefining the domain XML via `conn.DomainDefineXML()`, optionally growing the disk via `qemu-img resize`, and starting the VM again.
- Context: Users need to change VM resource allocation after project creation without destroying the project and losing deployed apps.
- Alternatives considered:
- Live resize (hotplug) — Libvirt supports live vCPU and memory hotplug, but it requires complex guest cooperation (balloon driver, CPU online/offline), and the domain XML maximum must be set at creation time. Not all combinations work reliably.
- Destroy and recreate — Simpler but loses all deployed apps, VM state, and requires re-provisioning.
- `virsh setvcpus` / `setmem` — CLI-only, doesn't update the persistent definition, and requires a running VM for some modes.
- Rationale: `DomainDefineXML` on a shut-off domain is the safest, most reliable approach. It updates the persistent definition atomically, and no guest cooperation is needed. Disk resize via `qemu-img resize` on a stopped VM is also safe and well-tested. The constraint of requiring a stopped VM is acceptable — it's explicit, easy to understand, and avoids the complexity of live migration. The `project edit` CLI mirrors the existing `app edit` pattern.
- Status: Accepted
- Decision: Add project-level environment variables (`Project.Env`) that are inherited by all apps at deploy time, with optional AES-256-GCM encryption for sensitive values via the `--secret` flag.
- Context: Applications need configuration values (database URLs, API keys, ports) that should be set once at the project level and inherited by all apps, rather than duplicated per app. Sensitive values must not be stored in plaintext in the JSON state file.
- Alternatives considered:
- App-level only — Requires duplicating shared config across every app. No single place to change a database URL.
- External secret manager (HashiCorp Vault, SOPS) — Too heavy for a single-user CLI tool in Phase 0.
- Environment files (`.env`) — Familiar, but doesn't integrate with the state store or support encryption.
- Rationale: Project-level env vars provide a single source of shared configuration. The merge semantics (project env is the base, app env overrides) match industry conventions (Heroku, Fly.io, Railway). AES-256-GCM with a local master key (`~/.goship/master.key`, mode `0o600`) is proportionate security for a single-user tool — secrets are encrypted at rest in the state file and decrypted only at deploy time or with `--show-values`. The `encrypted:` prefix makes it visible which values are secrets without needing a separate metadata field.
- Status: Accepted
- Decision: Inject environment variables into apps at deploy time (`app deploy` / `compose up`), not push them to running VMs when `env set` is called.
- Context: After `env set`, should the change take effect immediately on running apps, or only on the next deploy?
- Alternatives considered: Live push — send the env update over virtio-serial and restart affected containers/processes immediately.
- Rationale: Deploy-time injection is simpler, predictable, and matches the declarative model: desired state is defined, then applied explicitly. Live push would require tracking which apps use which env vars, coordinating restarts, and handling partial failures. The trade-off (requiring a redeploy) is acceptable and explicit — users know exactly when config changes take effect.
- Status: Accepted
- Decision: Add a REST API server (
goshipd) and an HTTP client ingoshipctlthat activates whenGOSHIP_API_URLis set, with backward-compatible fallback to direct libvirt when unset. - Context: Phase 0's direct-libvirt CLI works well on the VM host, but remote management requires a server. Users need a way to manage projects and apps from machines without libvirt installed.
- Alternatives considered:
- gRPC API — More efficient but adds protobuf dependency and is harder to debug (no
curl). - Separate CLI binary — A dedicated
goshipclient that only speaks HTTP. Doubles the tooling surface. - Always require API server — Breaking change for existing direct-mode users.
- Rationale: The dual-mode pattern (an `if apiClient != nil` guard at the top of each handler) keeps existing direct-mode code untouched while adding HTTP support. The client imports `apiserver` types directly (no type duplication). Commands without API equivalents (console, project logs, edit, update-init, push-image, env) return clear errors in API mode. The `--api-url` flag and the `GOSHIP_API_URL` env var follow the existing `ardanlabs/conf` config pattern.
- Status: Accepted
- Decision: Commands that require direct VM access (console, project logs, edit, update-init, push-image, env) return `"not available in API mode"` errors instead of silently degrading.
- Context: Some CLI operations require virtio-serial access, local binary paths, or `virsh` — none of which work over HTTP.
- Alternatives considered: Implement server-side equivalents for all commands; silently skip unavailable features.
- Rationale: Explicit errors are better than silent failure. Users know immediately which operations require direct access. Server-side equivalents can be added incrementally in future steps (e.g., server-side logs endpoint already exists for app logs). The error message tells the user exactly what's happening.
- Status: Accepted
- Decision: Projects own one or more domain names (e.g., `["myapp.local"]`). Apps specify a hostname (default: the app name). Routes are computed as `{hostname}.{domain}` → `{VM_IP}:{port}`.
- Context: Apps inside VMs need human-friendly URLs. The routing model needs to decide where domain ownership lives — at the project level or the app level.
- Alternatives considered: Per-app domains (each app gets its own domain); flat domain-to-app mapping (single domain per app, no hierarchy).
- Rationale: Project-level domains match the isolation model — a project owns a VM, and all apps in that project share the same VM IP. Multiple apps naturally share a domain namespace (e.g., `web.myapp.local`, `api.myapp.local`). Per-app domains would duplicate configuration and break the project-as-boundary model. The hostname field lets apps customize their subdomain without affecting other apps.
- Status: Accepted
- Decision: The reverse proxy listens on `:8081` (configurable via `GOSHIP_PROXY_ADDR`), separate from the API server on `:8080`.
- Context: GoShip needs to serve both API requests (project/app management) and proxied application traffic. These could share a port with Host-based routing or use separate ports.
- Alternatives considered: Single port with Host-based routing (API on `api.goship.local`, proxy on everything else); path-prefix routing (`/api/` for management, everything else proxied).
- Rationale: Clean separation — API traffic never competes with proxied traffic. The proxy can be fronted by its own load balancer or firewall rules. It's simpler to reason about: port 8080 is always management, port 8081 is always application traffic. There is no risk of Host header conflicts between API and proxy routing.
- Status: Accepted
- Decision: Routes are not persisted separately. The route table (`sync.RWMutex` + `map[string]string`) is rebuilt from the state store on `goshipd` startup via `RebuildRoutes()`.
- Context: The proxy needs a fast lookup table mapping domain names to backend addresses. This data could be persisted separately or derived from existing state.
- Alternatives considered: Persist routes to a separate file; add a `routes` field to `state.json`; use a shared cache like Redis.
- Rationale: The state store already contains all the data needed to compute routes: project domains, app hostnames, instance IPs, and port mappings. Deriving routes avoids double-write consistency issues — there is no risk of the route table and the state store disagreeing. Rebuild is O(projects × apps), which is negligible at Phase 0 scale. The `sync.RWMutex` provides thread safety for concurrent reads from the proxy and writes from API handlers.
- Status: Accepted
- Decision: `AppSpec.Available` is a `*bool` field. `nil` means available (default); `false` prevents route registration in the proxy.
- Context: Some apps are internal services (databases, caches) that should not be exposed via the reverse proxy. A mechanism is needed to control which apps get external routes.
- Alternatives considered: Separate `external` boolean field; an enum with `available`/`internal`/`disabled` states.
- Rationale: `*bool` with nil-means-true is the simplest representation. Apps are external by default because that's the most common case — users deploy web services they want to access. Internal-only services (databases, background workers) opt out explicitly with `"available": false`. The `IsAvailable()` method encapsulates the nil-means-true logic.
- Status: Accepted
- Decision: Updating a project's domains or an app's hostname/available flag immediately reconciles proxy routes via `reconcileRoutesOnDomainChange()` and `reconcileAppRoutes()`.
- Context: When a user changes a project's domain list or an app's hostname, the proxy route table must reflect the change. Routes could be reconciled immediately or deferred until the next deploy.
- Alternatives considered: Require an explicit `rebuild-routes` command; only reconcile on deploy; a periodic reconciliation timer.
- Rationale: Immediate reconciliation prevents stale routes. If a user changes a project's domain from `myapp.local` to `myapp.dev`, they expect the new domain to work immediately — requiring a redeploy would be confusing. The reconciliation is simple: remove the old routes, add the new ones. The two-phase approach (remove all old, then re-register all new) handles additions, removals, and changes in a single pass.
- Status: Accepted
| Original | Superseded by | What changed |
|---|---|---|
| Bake goship-init into base image (original Step 6) | ADR-016 | Moved to per-VM provisioning via CoW overlay |
| Install Docker via virt-customize | ADR-017 | Moved to cloud-init packages (needs network) |
| `seclabel type='none'` as sole access fix (ADR-006) | ADR-034 | DAC labels provide targeted access without disabling all security |