Commit 91354f2

provsion docs (#106)

* provsion docs

Signed-off-by: Ashraf Fouda <ashraf.m.fouda@gmail.com>

* fixing linter for workflows

Signed-off-by: Ashraf Fouda <ashraf.m.fouda@gmail.com>

---------

Signed-off-by: Ashraf Fouda <ashraf.m.fouda@gmail.com>
1 parent 9b8c3ef commit 91354f2

4 files changed: +269 -22 lines changed

docs/internals/provision/readme.md

Lines changed: 262 additions & 15 deletions
@@ -1,25 +1,272 @@
 # Provision Module

 ## ZBus
-This module is autonomous module and is not reachable over `zbus`.
+
+The provision module exposes the `Provision` interface over zbus:
+
+| module | object | version |
+|--------|--------|---------|
+| provision | [provision](#interface) | 0.0.1 |

 ## Introduction

-This module is responsible to provision/decommission workload on the node.
+This module is responsible for provisioning and decommissioning workloads on the node. It accepts new deployments over RMB (Reliable Message Bus), validates them against the TFChain contract, and brings them to reality by dispatching to per-type workload managers via zbus.
+
+`provisiond` knows about all available daemons and contacts them over zbus to ask for the needed services. It pulls everything together and updates the deployment with the workload state.
+
+If the node is restarted, `provisiond` re-provisions all active workloads to restore them to their original state.
+
+## Supported Workloads
+
+0-OS supports 13 workload types (see the [primitives package](../primitives/readme.md) for details):
+
+- `network` — WireGuard private network
+- `network-light` — Mycelium-only network
+- `zmachine` — virtual machine (full networking)
+- `zmachine-light` — virtual machine (mycelium only)
+- `zmount` — virtual disk for a zmachine
+- `volume` — btrfs subvolume (shared directory for a zmachine)
+- `public-ip` / `public-ipv4` — public IPv4/IPv6 for a zmachine
+- [`zdb`](https://github.com/threefoldtech/0-DB) — 0-db namespace
+- [`qsfs`](https://github.com/threefoldtech/quantum-storage) — quantum safe filesystem
+- `zlogs` — log stream from a zmachine to an external endpoint
+- `gateway-name-proxy` — reverse proxy with grid-assigned subdomain
+- `gateway-fqdn-proxy` — reverse proxy with user-owned FQDN
+
+## Architecture
+
+```
+User
+| (RMB: zos.deployment.deploy / update)
+v
+ZOS API (pkg/zos_api)
+| (zbus)
+v
+NativeEngine (pkg/provision)
+| 1. Validate & persist to BoltDB
+| 2. Enqueue job to disk queue
+|
+v (run loop, single-threaded)
+mapProvisioner (pkg/provision)
+| (dispatches by workload type)
+v
+Manager (pkg/primitives/*)
+| (zbus)
+v
+System daemons (storaged, networkd, vmd, ...)
+```
+
+## Engine
+
+The `NativeEngine` is the central orchestrator. It implements both the `Engine` interface (for scheduling) and the `pkg.Provision` interface (exposed over zbus).
+
+### Job Queue
+
+All operations go through a durable disk-backed queue (`dque`). Enqueue returns immediately after persisting to storage — execution happens asynchronously in the run loop.
+
+| Operation | Description |
+|-----------|-------------|
+| `opProvision` | Install a new deployment (validates against chain) |
+| `opDeprovision` | Uninstall a deployment |
+| `opUpdate` | Diff current vs new, apply add/remove/update ops |
+| `opProvisionNoValidation` | Re-install on boot (skips chain hash validation) |
+| `opPause` | Pause all workloads in a deployment |
+| `opResume` | Resume all paused workloads |
+
+The run loop is **single-threaded**: one job at a time, FIFO order. A job is only dequeued after it completes, so if the node crashes mid-job, it will be retried on the next boot.
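
Reviewer sketch (not part of the commit): the peek-then-dequeue behaviour described above can be illustrated with a small in-memory queue. The `job`, `queue`, and `run` names are hypothetical, and the slice stands in for the disk-backed `dque`; this is an illustration under those assumptions, not the engine's code.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a persisted engine job.
type job struct {
	Op     string
	Target uint64
}

// queue is an in-memory stand-in for the disk-backed queue.
type queue struct{ items []job }

// Enqueue appends a job at the tail.
func (q *queue) Enqueue(j job) { q.items = append(q.items, j) }

// Peek returns the oldest job without removing it.
func (q *queue) Peek() (job, bool) {
	if len(q.items) == 0 {
		return job{}, false
	}
	return q.items[0], true
}

// Dequeue drops the oldest job; call it only after the job has been handled.
func (q *queue) Dequeue() {
	if len(q.items) > 0 {
		q.items = q.items[1:]
	}
}

// run processes jobs in FIFO order until the queue is empty or handle
// reports an interruption (standing in for a node crash). The head job is
// removed only after handle finishes, so an interrupted job is still first
// in line on the next run.
func run(q *queue, handle func(job) (finished bool)) {
	for {
		j, ok := q.Peek()
		if !ok || !handle(j) {
			return
		}
		q.Dequeue()
	}
}

func main() {
	q := &queue{}
	q.Enqueue(job{Op: "provision", Target: 1})
	q.Enqueue(job{Op: "deprovision", Target: 2})

	// First run: interrupted while handling the first job.
	run(q, func(job) bool { return false })
	fmt.Println("pending after interruption:", len(q.items))

	// Second run: both jobs complete, in their original order.
	run(q, func(j job) bool { fmt.Println("handled", j.Op, j.Target); return true })
	fmt.Println("pending after retry:", len(q.items))
}
```

Removing the job only after the handler returns is what makes the retry-on-boot guarantee possible, at the cost of requiring handlers to tolerate being run twice.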
+
+### Deployment Lifecycle
+
+```
+1. RMB message arrives at ZOS API
+2. CreateOrUpdate validates:
+   - Structural validity (no duplicate names, valid versions)
+   - Ownership (deployment.TwinID == sender twin)
+   - KYC verification (via env.KycURL)
+   - Signature (ed25519 using twin's on-chain public key)
+3. Engine.Provision persists to BoltDB and enqueues opProvision
+4. Run loop picks up job:
+   a. Chain validation:
+      - Fetches NodeContract from substrate
+      - Verifies contract is for this node
+      - Compares deployment ChallengeHash with contract DeploymentHash
+      - Checks node rent status
+   b. Installs workloads in type order via provisioner
+   c. Dequeues job, fires callback
+```
+
+### Startup Order
+
+Workloads are installed in a deterministic type order (networks before VMs, storage before VMs) and uninstalled in **reverse** order. Within the same type, ZMount and Volume workloads are sorted largest-first.
+
+Pause uses reverse order, resume uses forward order.
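
Reviewer sketch (not part of the commit): one way to express this ordering is a stable sort keyed on a type rank, with size as the tie-breaker. The `workload` struct and the ranks in `typeOrder` are assumptions for illustration; the actual order is defined by the engine.

```go
package main

import (
	"fmt"
	"sort"
)

// workload is a hypothetical, simplified workload record.
type workload struct {
	Name string
	Type string
	Size uint64 // left at zero for types without a size in this sketch
}

// typeOrder is an assumed ranking used only for illustration.
var typeOrder = map[string]int{"network": 0, "zmount": 1, "volume": 2, "zmachine": 3}

// installOrder returns a copy sorted by type rank, then largest-first.
func installOrder(wls []workload) []workload {
	out := append([]workload(nil), wls...)
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if typeOrder[a.Type] != typeOrder[b.Type] {
			return typeOrder[a.Type] < typeOrder[b.Type]
		}
		return a.Size > b.Size
	})
	return out
}

// uninstallOrder is the install order reversed.
func uninstallOrder(wls []workload) []workload {
	out := installOrder(wls)
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	wls := []workload{
		{Name: "vm", Type: "zmachine"},
		{Name: "small", Type: "zmount", Size: 10},
		{Name: "net", Type: "network"},
		{Name: "big", Type: "zmount", Size: 100},
	}
	for _, w := range installOrder(wls) {
		fmt.Println("install", w.Name)
	}
	for _, w := range uninstallOrder(wls) {
		fmt.Println("uninstall", w.Name)
	}
}
```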
+
+### Boot Recovery
+
+When `rerunAll` is enabled, the engine on start:
+
+1. Scans all persisted deployments
+2. Re-enqueues active ones as `opProvisionNoValidation` (skips chain hash check since the deployment is already validated)
+3. The run loop processes them normally, restoring all workloads
+
+### Upgrade / Update
+
+When a deployment update arrives:
+
+1. The engine computes a diff: which workloads are added, removed, or updated
+2. Operations are sorted: removes first (reverse type order), then adds/updates (forward type order)
+3. Each operation dispatches to the provisioner accordingly
+4. Workload type changes are not allowed; only managers implementing the `Updater` interface accept updates
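
Reviewer sketch (not part of the commit): the diff-and-sort steps above can be approximated as follows. The `workload` and `op` types, the use of a version bump to detect an update, and the ranks in `typeOrder` are all assumptions for illustration, not the engine's actual rules.

```go
package main

import (
	"fmt"
	"sort"
)

// workload is a hypothetical record; a changed Version marks an update here.
type workload struct {
	Type    string
	Version int
}

// op is one planned operation: Kind is "remove", "add", or "update".
type op struct {
	Kind, Name, Type string
}

// typeOrder is an assumed ranking used only for illustration.
var typeOrder = map[string]int{"network": 0, "zmount": 1, "zmachine": 2}

// diff compares current and next workloads by name and returns sorted ops:
// removes first in reverse type order, then adds and updates in forward order.
func diff(current, next map[string]workload) ([]op, error) {
	var ops []op
	for name, cur := range current {
		nw, ok := next[name]
		switch {
		case !ok:
			ops = append(ops, op{"remove", name, cur.Type})
		case nw.Type != cur.Type:
			return nil, fmt.Errorf("workload %q: type change %s -> %s is not allowed", name, cur.Type, nw.Type)
		case nw.Version != cur.Version:
			ops = append(ops, op{"update", name, cur.Type})
		}
	}
	for name, nw := range next {
		if _, ok := current[name]; !ok {
			ops = append(ops, op{"add", name, nw.Type})
		}
	}
	sort.Slice(ops, func(i, j int) bool {
		a, b := ops[i], ops[j]
		ra, rb := a.Kind == "remove", b.Kind == "remove"
		if ra != rb {
			return ra // removes come first
		}
		if typeOrder[a.Type] != typeOrder[b.Type] {
			if ra {
				return typeOrder[a.Type] > typeOrder[b.Type]
			}
			return typeOrder[a.Type] < typeOrder[b.Type]
		}
		return a.Name < b.Name // deterministic tie-break
	})
	return ops, nil
}

func main() {
	current := map[string]workload{"net": {"network", 1}, "disk": {"zmount", 1}, "vm": {"zmachine", 1}}
	next := map[string]workload{"net": {"network", 1}, "vm": {"zmachine", 2}, "disk2": {"zmount", 1}}
	ops, err := diff(current, next)
	fmt.Println(ops, err)
}
```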
+
+## Interfaces
+
+### Engine
+
+```go
+type Engine interface {
+	Provision(ctx context.Context, wl gridtypes.Deployment) error
+	Deprovision(ctx context.Context, twin uint32, id uint64, reason string) error
+	Pause(ctx context.Context, twin uint32, id uint64) error
+	Resume(ctx context.Context, twin uint32, id uint64) error
+	Update(ctx context.Context, update gridtypes.Deployment) error
+	Storage() Storage
+	Twins() Twins
+	Admins() Twins
+}
+```
+
+### Provisioner
+
+Operates at the per-workload level. Returned by `primitives.NewPrimitivesProvisioner`.
+
+```go
+type Provisioner interface {
+	Initialize(ctx context.Context) error
+	Provision(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error)
+	Deprovision(ctx context.Context, wl *gridtypes.WorkloadWithID) error
+	Pause(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error)
+	Resume(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error)
+	Update(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error)
+	CanUpdate(ctx context.Context, typ gridtypes.WorkloadType) bool
+}
+```
+
+### Manager
+
+The interface each workload type must implement (see [primitives](../primitives/readme.md)):
+
+```go
+type Manager interface {
+	Provision(ctx context.Context, wl *gridtypes.WorkloadWithID) (interface{}, error)
+	Deprovision(ctx context.Context, wl *gridtypes.WorkloadWithID) error
+}
+```
+
+Optional extensions: `Initializer`, `Updater`, `Pauser`.
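
Reviewer sketch (not part of the commit): optional extensions of this kind are usually detected with a type assertion. The `manager` and `updater` interfaces below are simplified, hypothetical stand-ins with made-up method signatures; they only show the detection pattern, not the real `Manager` or `Updater` definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// manager is a simplified stand-in for the required interface.
type manager interface{ Provision(name string) error }

// updater is a simplified stand-in for an optional extension.
type updater interface{ Update(name string) error }

// diskManager supports both provisioning and updates.
type diskManager struct{}

func (diskManager) Provision(string) error { return nil }
func (diskManager) Update(string) error    { return nil }

// networkManager only supports provisioning.
type networkManager struct{}

func (networkManager) Provision(string) error { return nil }

// canUpdate reports whether m also implements the optional updater interface.
func canUpdate(m manager) bool {
	_, ok := m.(updater)
	return ok
}

// update dispatches to the extension when present and fails otherwise.
func update(m manager, name string) error {
	u, ok := m.(updater)
	if !ok {
		return errors.New("workload type does not support updates")
	}
	return u.Update(name)
}

func main() {
	fmt.Println("disk can update:", canUpdate(diskManager{}))
	fmt.Println("network can update:", canUpdate(networkManager{}))
	fmt.Println("network update:", update(networkManager{}, "net0"))
}
```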
+
+### Storage
+
+Persists deployment state. Primary implementation uses BoltDB.
+
+```go
+type Storage interface {
+	Create(deployment gridtypes.Deployment) error
+	Update(twin uint32, deployment uint64, fields ...Field) error
+	Delete(twin uint32, deployment uint64) error
+	Get(twin uint32, deployment uint64) (gridtypes.Deployment, error)
+	Error(twin uint32, deployment uint64, err error) error
+	Add(twin uint32, deployment uint64, workload gridtypes.Workload) error
+	Remove(twin uint32, deployment uint64, name gridtypes.Name) error
+	Transaction(twin uint32, deployment uint64, workload gridtypes.Workload) error
+	Changes(twin uint32, deployment uint64) (changes []gridtypes.Workload, err error)
+	Current(twin uint32, deployment uint64, name gridtypes.Name) (gridtypes.Workload, error)
+	Twins() ([]uint32, error)
+	ByTwin(twin uint32) ([]uint64, error)
+	Capacity(exclude ...Exclude) (StorageCapacity, error)
+}
+```
+
+## Storage Backend
+
+### BoltDB (`provision/storage/`)
+
+The primary storage backend uses BoltDB with an append-only transaction log:
+
+```
+<twin_id> (bucket)
+└── "global" (bucket) — sharable workload name → deployment ID
+└── <deployment_id> (bucket)
+    ├── "version"
+    ├── "metadata"
+    ├── "description"
+    ├── "signature_requirement"
+    ├── "workloads" (bucket) — name → type (active workload index)
+    └── "transactions" (bucket) — sequence → JSON(workload + result)
+```
+
+Every state change appends a new entry to the `transactions` bucket. `Current()` scans backward to find the latest state for each workload. `Remove()` deletes from the active `workloads` index but historical entries remain.
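
Reviewer sketch (not part of the commit): the append-only behaviour described above can be modelled in memory with a slice as the log and a map as the active index. The `entry` and `deploymentLog` types are hypothetical; they mirror the `workloads` and `transactions` buckets only in spirit and do not use BoltDB.

```go
package main

import "fmt"

// entry is a hypothetical transaction record: a workload name and its state.
type entry struct {
	Name  string
	State string
}

// deploymentLog models the active index plus the append-only transaction log.
type deploymentLog struct {
	active       map[string]bool
	transactions []entry
}

func newDeploymentLog() *deploymentLog {
	return &deploymentLog{active: map[string]bool{}}
}

// Append records a new state; earlier entries are never rewritten.
func (l *deploymentLog) Append(e entry) {
	l.active[e.Name] = true
	l.transactions = append(l.transactions, e)
}

// Current scans backward for the latest entry of an active workload.
func (l *deploymentLog) Current(name string) (entry, bool) {
	if !l.active[name] {
		return entry{}, false
	}
	for i := len(l.transactions) - 1; i >= 0; i-- {
		if l.transactions[i].Name == name {
			return l.transactions[i], true
		}
	}
	return entry{}, false
}

// Remove drops the workload from the active index and keeps its history.
func (l *deploymentLog) Remove(name string) { delete(l.active, name) }

func main() {
	l := newDeploymentLog()
	l.Append(entry{Name: "vm", State: "init"})
	l.Append(entry{Name: "vm", State: "ok"})
	cur, _ := l.Current("vm")
	fmt.Println("current state:", cur.State)

	l.Remove("vm")
	_, found := l.Current("vm")
	fmt.Println("found after remove:", found, "history entries:", len(l.transactions))
}
```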
+
+### Filesystem (`provision/storage.fs/`)
+
+Legacy filesystem-based storage. Each deployment is a versioned JSON file. Used for migration to BoltDB.
+
+## Context Enrichment
+
+The engine injects values into the context before calling provisioner methods:
+
+| Accessor | Value | Description |
+|----------|-------|-------------|
+| `GetEngine(ctx)` | `Engine` | Access to the engine (and storage) |
+| `GetDeploymentID(ctx)` | `(twin, deployment)` | Current deployment IDs |
+| `GetDeployment(ctx)` | `Deployment` | Fresh deployment from storage |
+| `GetWorkload(ctx, name)` | `Workload` | Last state of a workload in the deployment |
+| `GetContract(ctx)` | `NodeContract` | TFChain contract for this deployment |
+| `IsRentedNode(ctx)` | `bool` | Whether the node has an active rent contract |
+
+## Error Handling
+
+| Error | Meaning |
+|-------|---------|
+| `ErrNoActionNeeded` | Workload already running correctly; skip writing a transaction |
+| `ErrDeploymentExists` | Storage conflict: deployment already exists |
+| `ErrDeploymentNotExists` | Deployment not found |
+| `ErrWorkloadNotExist` | Workload not found |
+| `ErrDeploymentUpgradeValidationError` | Upgrade diff failed validation |
+| `ErrInvalidVersion` | Version number is wrong |
+
+Workload failures are expressed via `gridtypes.Result` with `State = StateError`, not as Go errors. Special response types allow managers to communicate specific states:
+
+- `Ok()` — explicit success (normally returning `nil` error suffices)
+- `Paused()` — workload is paused
+- `UnChanged(err)` — update failed but workload still running with previous config
+
+## Authentication
+
+### Twins (`auth.go`)
+
+- **`substrateTwins`**: Fetches twin ed25519 public keys from TFChain via `substrateGateway.GetTwin()`. Caches up to 1024 entries in an LRU cache.
+- **`substrateAdmins`**: Authorizes the farm owner twin only. Used for admin-only operations.
+
+### HTTP Middleware (`mw/`)

-It accepts new deployment over `rmb` and tries to bring them to reality by running a series of provisioning workflows based on the workload `type`.
+JWT-based authentication for the HTTP API layer:
+- Validates JWT signed with ed25519 (audience `"zos"`, max 2-minute expiry)
+- Injects twin ID and public key into the request context

-`provisiond` knows about all available daemons and it contacts them over `zbus` to ask for the needed services. The pull everything together and update the deployment with the workload state.
+## Interface

-If node was restarted, `provisiond` tries to bring all active workloads back to original state.
-## Supported workload
+The zbus-exposed interface:

-0-OS currently support 8 type of workloads:
-- network
-- `zmachine` (virtual machine)
-- `zmount` (disk): usable only by a `zmachine`
-- `public-ip` (v4 and/or v6): usable only by a `zmachine`
-- [`zdb`](https://github.com/threefoldtech/0-DB) `namespace`
-- [`qsfs`](https://github.com/threefoldtech/quantum-storage)
-- `zlogs`
-- `gateway`
+```go
+type Provision interface {
+	DecommissionCached(id string, reason string) error
+	GetWorkloadStatus(id string) (gridtypes.ResultState, bool, error)
+	CreateOrUpdate(twin uint32, deployment gridtypes.Deployment, update bool) error
+	Get(twin uint32, contractID uint64) (gridtypes.Deployment, error)
+	List(twin uint32) ([]gridtypes.Deployment, error)
+	Changes(twin uint32, contractID uint64) ([]gridtypes.Workload, error)
+	ListTwins() ([]uint32, error)
+	ListPublicIPs() ([]string, error)
+	ListPrivateIPs(twin uint32, network gridtypes.Name) ([]string, error)
+}
+```

pkg/network/nr/net_resource.go

Lines changed: 1 addition & 1 deletion
@@ -606,7 +606,7 @@ func (nr *NetResource) ConfigureWG(privateKey string) error {
 // Delete removes all the interfaces and namespaces created by the Create method
 func (nr *NetResource) Delete() error {
 	log.Info().Str("network-id", nr.ID()).Str("subnet", nr.resource.Subnet.String()).Msg("deleting network resource")
-
+
 	netnsName, err := nr.Namespace()
 	if err != nil {
 		return err

pkg/provision/provisiner.go

Lines changed: 5 additions & 5 deletions
@@ -122,7 +122,7 @@ func (p *mapProvisioner) Initialize(ctx context.Context) error {
 // Provision implements provision.Provisioner
 func (p *mapProvisioner) Provision(ctx context.Context, wl *gridtypes.WorkloadWithID) (result gridtypes.Result, err error) {
 	log.Info().Str("workload-id", string(wl.ID)).Str("workload-type", string(wl.Type)).Msg("provisioning workload")
-
+
 	manager, ok := p.managers[wl.Type]
 	if !ok {
 		return result, fmt.Errorf("unknown workload type '%s' for reservation id '%s'", wl.Type, wl.ID)
@@ -139,7 +139,7 @@ func (p *mapProvisioner) Provision(ctx context.Context, wl *gridtypes.WorkloadWi
 // Decommission implementation for provision.Provisioner
 func (p *mapProvisioner) Deprovision(ctx context.Context, wl *gridtypes.WorkloadWithID) error {
 	log.Info().Str("workload-id", string(wl.ID)).Str("workload-type", string(wl.Type)).Msg("deprovisioning workload")
-
+
 	manager, ok := p.managers[wl.Type]
 	if !ok {
 		return fmt.Errorf("unknown workload type '%s' for reservation id '%s'", wl.Type, wl.ID)
@@ -151,7 +151,7 @@ func (p *mapProvisioner) Deprovision(ctx context.Context, wl *gridtypes.Workload
 // Pause a workload
 func (p *mapProvisioner) Pause(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error) {
 	log.Info().Str("workload-id", string(wl.ID)).Str("workload-type", string(wl.Type)).Msg("pausing workload")
-
+
 	if wl.Result.State != gridtypes.StateOk {
 		return wl.Result, fmt.Errorf("can only pause workloads in ok state")
 	}
@@ -180,7 +180,7 @@ func (p *mapProvisioner) Pause(ctx context.Context, wl *gridtypes.WorkloadWithID
 // Resume a workload
 func (p *mapProvisioner) Resume(ctx context.Context, wl *gridtypes.WorkloadWithID) (gridtypes.Result, error) {
 	log.Info().Str("workload-id", string(wl.ID)).Str("workload-type", string(wl.Type)).Msg("resuming workload")
-
+
 	if wl.Result.State != gridtypes.StatePaused {
 		return wl.Result, fmt.Errorf("can only resume workloads in paused state")
 	}
@@ -208,7 +208,7 @@ func (p *mapProvisioner) Resume(ctx context.Context, wl *gridtypes.WorkloadWithI
 // Provision implements provision.Provisioner
 func (p *mapProvisioner) Update(ctx context.Context, wl *gridtypes.WorkloadWithID) (result gridtypes.Result, err error) {
 	log.Info().Str("workload-id", string(wl.ID)).Str("workload-type", string(wl.Type)).Msg("updating workload")
-
+
 	manager, ok := p.managers[wl.Type]
 	if !ok {
 		return result, fmt.Errorf("unknown workload type '%s' for reservation id '%s'", wl.Type, wl.ID)

pkg/vm/manager.go

Lines changed: 1 addition & 1 deletion
@@ -574,7 +574,7 @@ func (m *Module) Run(vm pkg.VM) (pkg.MachineInfo, error) {
 	logEvent := log.Info().
 		Str("vm-id", vm.Name).
 		Str("hostname", vm.Hostname).
-		Uint8("cpu", uint8(vm.CPU)).
+		Uint8("cpu", vm.CPU).
 		Uint64("memory-bytes", uint64(vm.Memory))

 	logEvent = logInterfaceDetails(logEvent, nics, &vm.Network)
