Commit f8efa48

update vm docs
Signed-off-by: Ashraf Fouda <ashraf.m.fouda@gmail.com>
1 parent 1e3ae5d commit f8efa48

1 file changed: +158 -25 lines changed

docs/internals/vmd/readme.md

Lines changed: 158 additions & 25 deletions
@@ -2,28 +2,33 @@
 
 ## ZBus
 
-Storage module is available on zbus over the following channel
+VMD module is available on zbus over the following channel
 
 | module | object | version |
 |--------|--------|---------|
-| vmd|[vmd](#interface)| 0.0.1|
+| vmd | [vmd](#interface) | 0.0.1 |
 
 ## Home Directory
 
-contd keeps some data in the following locations
-| directory | path|
-|----|---|
-| root| `/var/cache/modules/containerd`|
+vmd keeps data in the following locations:
+
+| directory | path |
+|-----------|------|
+| root | `/var/cache/modules/vmd` |
+| config | `{root}/config/` — one JSON file per VM |
+| logs | `{root}/logs/` — stdout/stderr per VM |
+| cloud-init | `{root}/cloud-init/` — fat32 images per VM |
+| sockets | `/var/run/cloud-hypervisor/` — unix API socket per VM |
 
 ## Introduction
 
-The vmd module, manages all virtual machines processes, it provide the interface to, create, inspect, and delete virtual machines. It also monitor the vms to make sure they are re-spawned if crashed. Internally it uses `cloud-hypervisor` to start the Vm processes.
+The vmd module manages all virtual machine processes. It provides the interface to create, inspect, pause, resume, and delete virtual machines. It monitors VMs and re-spawns them if they crash. Internally it uses [cloud-hypervisor](https://www.cloudhypervisor.org/) to run VM processes.
 
-It also provide the interface to configure VM logs streamers.
+It also provides the interface to configure VM log streamers via zinit-managed `tailstream` services.
 
 ### zinit unit
 
-`contd` must run after containerd is running, and the node boot process is complete. Since it doesn't keep state, no dependency on `stroaged` is needed
+`vmd` must run after the boot process and networking are ready. Since it doesn't keep state on disk (config is regenerated by the provision engine on boot), no dependency on `storaged` is needed.
 
 ```yaml
 exec: vmd --broker unix:///var/run/redis.sock
@@ -32,25 +37,153 @@ after:
 - networkd
 ```
 
+## Architecture
+
+```
+VMModule interface (pkg/vm.go)
+    |
+    v
+Module (pkg/vm/manager.go)
+    |
+    +-- Run()
+    |   +-- cloudinit.CreateImage() → fat32 disk image
+    |   +-- Machine.Save() → JSON config
+    |   +-- Machine.Run() → cloud-hypervisor process
+    |   +-- startFs() × N → virtiofsd-rs daemons (virtio-fs shares)
+    |   +-- exec cloud-hypervisor via busybox setsid
+    |   +-- waitAndAdjOom() → OOM protection (-200)
+    |   +-- startCloudConsole → cloud-console process (serial PTY)
+    |
+    +-- Monitor() goroutine
+    |   +-- health check every 10s → restart crashed VMs (up to 4 times)
+    |   +-- log rotation every 10m → 8 MB max, tail 4 MB
+    |   +-- cloud-init cleanup every 10m
+    |
+    +-- Delete() → graceful shutdown → SIGTERM → SIGKILL
+    +-- Inspect() → cloud-hypervisor REST API (unix socket)
+    +-- Lock() → pause/resume via CH API
+    +-- Metrics() → /sys/class/net/.../statistics/
+    +-- StreamCreate/StreamDelete() → zinit service + tailstream
+```
+
+## VM Types
+
+### Container VM vs Full VM
+
+The module supports two boot modes determined by the flist content:
+
+- **Container VM** (flist without `/image.raw`): The flist is mounted as a read-write overlay using a btrfs subvolume. A cloud-container kernel + initrd are injected. The root filesystem is shared via virtio-fs with tag `vroot`. Kernel args are set to `root=vroot rootfstype=virtiofs`.
+
+- **Full VM** (flist with `/image.raw`): The disk image is written to the first ZMount. The VM boots directly from disk using `hypervisor-fw` firmware. No virtio-fs root is needed.
+
+### Networking
+
+Network interfaces are attached as tap devices:
+
+| Tap prefix | Traffic type | Examples |
+|------------|-------------|---------|
+| `t-` | Private (wireguard, mycelium) | Private network, planetary, mycelium |
+| `p-` | Public | Public IPv4/IPv6 |
+
+Each interface is configured via cloud-init with static IP addresses, routes, and gateways. A `cloud-console` process is launched for the private network interface, providing serial console access over the network.
+
+### Storage
+
+Disks are attached via virtio block devices (`--disk` flag):
+- Boot disk (full VM mode): first disk, read-write
+- Additional zmount disks: sequential virtio devices (`/dev/vda`, `/dev/vdb`, ...)
+- Cloud-init disk: last disk, read-only
+
+Shared directories use virtio-fs (`--fs` flag). Each share runs a dedicated `virtiofsd-rs` daemon. In container mode, disks and shared dirs are mounted via cloud-init fstab entries.
+
+### GPU Passthrough
+
+PCI devices can be passed through to VMs via VFIO (`--device` flag). The module checks device exclusivity before launch — no two VMs can share the same PCI device.
+
+## VM Lifecycle
+
+### Creation (`Run`)
+
+1. Validate config (name, CPU 1-max, memory >= 250 MB)
+2. Check for duplicate VM name
+3. Build cloud-init config (metadata, network, users, mounts, entrypoint)
+4. Check PCI device exclusivity
+5. Build disk list and virtio-fs shares
+6. Resolve kernel args (user args merged with defaults)
+7. Generate fat32 cloud-init disk image (2 MB)
+8. Save machine config as JSON
+9. Launch virtiofsd-rs daemons for each shared directory
+10. Launch cloud-hypervisor process (via `busybox setsid`)
+11. Wait for API socket to be ready, set OOM score to -200
+12. Launch cloud-console for serial access
+13. Return console URL
+
+### Monitoring
+
+A background goroutine runs three periodic tasks:
+
+| Task | Interval | Description |
+|------|----------|-------------|
+| Health check | 10 seconds | Detect crashed VMs, restart up to 4 times, then decommission |
+| Log rotation | 10 minutes | Rotate logs > 8 MB, keep tail 4 MB |
+| Cloud-init cleanup | 10 minutes | Remove orphaned cloud-init images |
+
+On crash detection:
+- If the VM has `NoKeepAlive` set, it is not restarted
+- If the VM has crashed fewer than 4 times within 2 minutes, it is restarted
+- After 4 crashes, the VM is decommissioned via `ProvisionStub.DecommissionCached()`
+- VMs whose workload is deleted or errored on the chain are killed and cleaned up
+
+### Deletion (`Delete`)
+
+Escalating shutdown sequence:
+1. Set permanent marker to prevent monitor from restarting
+2. Attempt graceful shutdown via cloud-hypervisor API (5 second timeout)
+3. Send `SIGTERM` after 5 seconds
+4. Send `SIGKILL` after 10 seconds
+5. Clean up: remove JSON config, cloud-init image, log file
+
+### Pause/Resume (`Lock`)
+
+Uses the cloud-hypervisor REST API:
+- Pause: `PUT /api/v1/vm.pause`
+- Resume: `PUT /api/v1/vm.resume`
+
+## Cloud-Init
+
+VM configuration is injected via a fat32 disk image mounted as the last virtio disk:
+
+| File | Content |
+|------|---------|
+| `/meta-data` | Instance ID, hostname |
+| `/network-config` | Netplan v2 — static IPs, routes, gateways, nameservers |
+| `/user-data` | SSH keys, fstab mounts (disks + shared dirs) |
+| `/zosrc` | Shell script: environment variables + entrypoint command |
+
+## Metrics
+
+Network metrics are read from `/sys/class/net/{tap}/statistics/` for each tap device. Traffic is segregated into private (`t-*` taps) and public (`p-*` taps) categories, reporting rx/tx bytes and packets per VM.
+
+## Legacy Support
+
+The module includes a legacy monitor for old Firecracker-based VMs. It scans `/proc` for `firecracker` processes and cleans up their bind-mounts and directories when they exit. This runs in the background until no Firecracker processes or directories remain.
+
 ## Interface
 
 ```go
-
-// VMModule defines the virtual machine module interface
 type VMModule interface {
-    Run(vm VM) error
-    Inspect(name string) (VMInfo, error)
-    Delete(name string) error
-    Exists(name string) bool
-    Logs(name string) (string, error)
-    List() ([]string, error)
-    Metrics() (MachineMetrics, error)
-
-    // VM Log streams
-
-    // StreamCreate creates a stream for vm `name`
-    StreamCreate(name string, stream Stream) error
-    // delete stream by stream id.
-    StreamDelete(id string) error
+    Run(vm VM) (MachineInfo, error)
+    Inspect(name string) (VMInfo, error)
+    Delete(name string) error
+    Exists(name string) bool
+    Logs(name string) (string, error)
+    LogsFull(name string) (string, error)
+    List() ([]string, error)
+    Metrics() (MachineMetrics, error)
+    Lock(name string, lock bool) error
+
+    // VM log streams
+    StreamCreate(name string, stream Stream) error
+    StreamDelete(id string) error
 }
 ```
