Commit a345a1b

Document cache, SSH, image, and analytics features
1 parent 5eeb1cb commit a345a1b

2 files changed: +150 -0 lines changed
docs/configuration.md

Lines changed: 15 additions & 0 deletions
@@ -17,6 +17,7 @@ This document aggregates the environment variables and helper tooling required t
| `NIMBUS_METRICS_TOKEN` | Bearer token required for `/metrics`; if unset, access is restricted to loopback clients. | optional |
| `NIMBUS_CACHE_TOKEN_TTL` | Seconds before cache tokens expire. | `3600` |
| `NIMBUS_CACHE_SHARED_SECRET` | HMAC secret for cache token minting. | required |
| `NIMBUS_SSH_SESSION_SECRET` | HMAC secret used to mint and verify SSH debugging tokens. | `local-ssh-secret` |
| `NIMBUS_AGENT_TOKEN_SECRET` | Secret used to mint/verify agent bearer tokens. | required |
| `NIMBUS_AGENT_TOKEN_RATE_LIMIT` | Maximum agent token mint operations per interval. | `15` |
| `NIMBUS_AGENT_TOKEN_RATE_INTERVAL` | Interval window (seconds) for token mint rate limiting. | `60` |
@@ -32,6 +33,20 @@ This document aggregates the environment variables and helper tooling required t
| `NIMBUS_CONTROL_PLANE_TOKEN` | Bearer token issued by the control plane. | required |
| `NIMBUS_AGENT_REDIS_URL` | Optional Redis URL for local coordination/caching. | optional |
| `NIMBUS_CACHE_PROXY_URL` | Cache proxy base URL for artifact downloads. | optional |
| `NIMBUS_NEAR_CACHE_ENABLE` | Enable the embedded near-runner cache service. | `false` |
| `NIMBUS_NEAR_CACHE_DIR` | Filesystem root for near-runner cache data. | `/var/lib/nimbus/near-cache` |
| `NIMBUS_NEAR_CACHE_BIND` | Bind address for the FastAPI near-cache server. | `0.0.0.0` |
| `NIMBUS_NEAR_CACHE_ADVERTISE` | Host/IP advertised to jobs for cache access. | `127.0.0.1` |
| `NIMBUS_NEAR_CACHE_PORT` | Fixed port for the cache listener (random within range when unset). | unset |
| `NIMBUS_NEAR_CACHE_PORT_START` | Lower bound of random cache port allocation range. | `38000` |
| `NIMBUS_NEAR_CACHE_PORT_END` | Upper bound of random cache port allocation range. | `39000` |
| `NIMBUS_NEAR_CACHE_S3_BUCKET` | Optional S3 bucket for cache read-through/write-through. | unset |
| `NIMBUS_NEAR_CACHE_S3_ENDPOINT` | S3-compatible endpoint URL for cache fallbacks. | unset |
| `NIMBUS_NEAR_CACHE_S3_REGION` | Region identifier used with the cache S3 endpoint. | unset |
| `NIMBUS_NEAR_CACHE_S3_WRITE_THROUGH` | Enable uploading cached artifacts back to S3. | `false` |
| `NIMBUS_NEAR_CACHE_MOUNT_TAG` | Virtio-fs mount tag exposed to Firecracker guests. | `nimbus-cache` |
| `NIMBUS_NEAR_CACHE_MOUNT_PATH` | Mount path inside guests when virtio-fs is available. | `/mnt/nimbus-cache` |
| `NIMBUS_NEAR_CACHE_VIRTIOFSD` | Absolute path to the `virtiofsd` binary for guest mounts. | unset |
| `NIMBUS_AGENT_STATE_DATABASE_URL` | Async SQLAlchemy URL for the host agent state store. | `postgresql+asyncpg://localhost/nimbus_agent_state` |
| `NIMBUS_LOG_SINK_URL` | Logging pipeline ingest endpoint. | optional |
| `NIMBUS_AGENT_METRICS_HOST` | Prometheus metrics listener host. | `0.0.0.0` |

docs/nimbus-feature-guide.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Nimbus Feature Guide

This document explains the new platform capabilities that extend Nimbus beyond the baseline GitHub-compatible runner experience. Each section describes the architecture, configuration, and operational workflows for the feature set.

## Near-Runner Cache

The host agent now embeds a lightweight HTTP cache alongside every worker. Two access modes are supported:

1. **Loopback HTTP** – Jobs can use the advertised `cache.endpoint.*` metadata keys to reach the cache from either the host or guest network.
2. **Virtio-fs passthrough** – When `virtiofsd` is available, the cache directory is mounted directly into Firecracker guests, allowing POSIX access.

### Lifecycle

- The `NearRunnerCacheManager` spins up a FastAPI app on a dedicated port as part of the host agent boot sequence.
- Cache tokens minted by the control plane guard every request; scope is validated per organization and operation (push/pull).
- Writes are persisted to the local cache directory and optionally mirrored to S3 for durability.
- Reads fall back to S3 automatically when the local artifact is missing.
- When virtio-fs is enabled, the Firecracker launcher starts `virtiofsd` alongside the microVM and configures the device over MMDS metadata.
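
The write and read paths above can be sketched as follows. This is a minimal illustration, not the `NearRunnerCacheManager` code: `RemoteStore`, `InMemoryRemote`, and `NearCacheSketch` are hypothetical names, the remote is a dictionary standing in for S3, token checks are omitted, and copying a remote hit back to local disk is an assumption about what "read-through" means here.

```python
from pathlib import Path
from typing import Optional, Protocol


class RemoteStore(Protocol):
    """Hypothetical stand-in for an S3 client; not a Nimbus interface."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, data: bytes) -> None: ...


class InMemoryRemote:
    """Dictionary-backed remote used only to exercise the sketch."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.objects.get(key)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


class NearCacheSketch:
    """Local directory cache with optional remote fallback and mirroring."""

    def __init__(self, root: Path, remote: Optional[RemoteStore] = None,
                 write_through: bool = False) -> None:
        self.root = root
        self.remote = remote
        self.write_through = write_through

    def _path(self, key: str) -> Path:
        # Flat layout; reject separators so a key cannot escape the root.
        if "/" in key or "\\" in key or key in {"", ".", ".."}:
            raise ValueError(f"invalid cache key: {key!r}")
        return self.root / key

    def write(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)  # always persist locally first
        if self.remote is not None and self.write_through:
            self.remote.put(key, data)  # optional mirror for durability

    def read(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        if path.exists():
            return path.read_bytes()
        if self.remote is None:
            return None
        data = self.remote.get(key)  # fall back to the remote on a local miss
        if data is not None:
            # Assumption: populate the local copy so the next read is local.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        return data
```

Writing locally before mirroring keeps the fast path independent of the remote; a production version would also need to decide how to surface remote upload failures.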

### Configuration

Enable and tune the cache with the following environment variables (documented in [configuration.md](./configuration.md)):

- `NIMBUS_NEAR_CACHE_ENABLE`, `NIMBUS_NEAR_CACHE_DIR`, `NIMBUS_NEAR_CACHE_BIND`, `NIMBUS_NEAR_CACHE_ADVERTISE`
- Optional port range and S3 parameters: `NIMBUS_NEAR_CACHE_PORT`, `NIMBUS_NEAR_CACHE_PORT_START`, `NIMBUS_NEAR_CACHE_PORT_END`, `NIMBUS_NEAR_CACHE_S3_*`
- Guest mount controls: `NIMBUS_NEAR_CACHE_MOUNT_TAG`, `NIMBUS_NEAR_CACHE_MOUNT_PATH`, `NIMBUS_NEAR_CACHE_VIRTIOFSD`

**Operational checklist**

- Ensure the cache directory resides on SSD/NVMe media for best performance.
- Package `virtiofsd` with your host image when expecting guest mounts.
- Configure IAM permissions for the S3 bucket when write-through is enabled.
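
To make the port settings concrete, the sketch below shows one way the fixed-port and random-range variables could interact. `choose_cache_port` is a hypothetical helper, not a Nimbus function; the defaults `38000` and `39000` come from configuration.md, while treating both bounds as inclusive is an assumption.

```python
import random
from typing import Mapping, Optional


def choose_cache_port(env: Mapping[str, str],
                      rng: Optional[random.Random] = None) -> int:
    """Return the fixed port when set, else a random port from the range."""
    rng = rng or random.Random()
    fixed = env.get("NIMBUS_NEAR_CACHE_PORT")
    if fixed:
        return int(fixed)
    start = int(env.get("NIMBUS_NEAR_CACHE_PORT_START", "38000"))
    end = int(env.get("NIMBUS_NEAR_CACHE_PORT_END", "39000"))
    if start > end:
        raise ValueError(f"empty port range: {start}..{end}")
    # Assumption: both bounds are inclusive.
    return rng.randint(start, end)
```

In practice it would be called as `choose_cache_port(os.environ)`. A real allocator would also need to confirm the chosen port is free before binding.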

## SSH Debugging Sessions

The control plane exposes a token-gated SSH workflow for live troubleshooting of Firecracker guests.

### Flow

1. An admin calls `POST /api/ssh/sessions` with the target job ID.
2. The control plane mints an HMAC token (`NIMBUS_SSH_SESSION_SECRET`) and reserves a host port.
3. The host agent polls `GET /api/agents/ssh/sessions` and, once a session is assigned, configures DNAT rules using the reserved port.
4. When the microVM reports its guest IP, the agent activates the session via `POST /api/ssh/sessions/{id}/activate`.
5. The CLI (or user) connects through the gateway using the bearer token. Closing the session or TTL expiry revokes the token server-side.
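
Step 2 and the TTL check in step 5 can be illustrated with a generic HMAC-signed token. The wire format here (base64url JSON payload, a dot, base64url signature) and the claim names `sid`, `job`, and `exp` are assumptions chosen for the sketch; the guide does not specify the actual Nimbus token layout, and server-side revocation is not modelled.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def mint_session_token(secret: bytes, session_id: str, job_id: int,
                       ttl_seconds: int, now: Optional[float] = None) -> str:
    """Sign a small JSON payload with HMAC-SHA256 (hypothetical format)."""
    issued = time.time() if now is None else now
    claims = {"sid": session_id, "job": job_id, "exp": int(issued) + ttl_seconds}
    payload = json.dumps(claims, separators=(",", ":"), sort_keys=True).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_session_token(secret: bytes, token: str,
                         now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature matches and the TTL has not passed."""
    try:
        payload_part, signature_part = token.split(".")
        payload = base64.urlsafe_b64decode(payload_part)
        signature = base64.urlsafe_b64decode(signature_part)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if claims["exp"] <= current:
        return None  # expired
    return claims
```

Because the signature is checked before the payload is parsed, a tampered payload never reaches `json.loads`.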

### Required settings

- Control plane: `NIMBUS_SSH_SESSION_SECRET`, `NIMBUS_SSH_PORT_START`, `NIMBUS_SSH_PORT_END`, `NIMBUS_SSH_SESSION_TTL`
- Host agent: `NIMBUS_SSH_ENABLE`, `NIMBUS_SSH_POLL_INTERVAL`, and `NIMBUS_SSH_AUTHORIZED_KEY`

**Operational tips**

- Rotate `NIMBUS_SSH_SESSION_SECRET` alongside other control-plane secrets.
- Port ranges should avoid conflicts with existing ingress rules.
- Agents automatically purge expired sessions; monitor the `Expired SSH sessions cleaned up` log line for drift.

## Prebuilt Runner Images

Nimbus ships with curated base images that mimic common GitHub runner stacks while incorporating platform hardening.

### Registry

The `nimbus.runners.images` module tracks canonical aliases. Current mappings:

| Alias | Image reference |
| --- | --- |
| `ubuntu-2404` | `nimbus/ubuntu-2404-runner:latest` |
| `ubuntu-2204` | `nimbus/ubuntu-2204-runner:latest` |
| `node-22` | `nimbus/node-22-runner:latest` |
| `python-312` | `nimbus/python-312-runner:latest` |

### Consumption patterns

- Jobs can opt in with a label: `image:ubuntu-2204` or `image:python-312`.
- Without explicit labels, the Docker executor maps generic tags (e.g. `node`, `python`) to the maintained images.
- The control plane exposes `GET /api/runners/images` for UI and tooling to discover available aliases.
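
The label-to-image lookup might look like the sketch below. The alias table mirrors the registry listed in this guide, but `resolve_image` and `GENERIC_TAGS` are hypothetical: the guide does not say which maintained image each generic tag maps to, so the `node` and `python` targets are assumptions, and this is not the `nimbus.runners.images` API.

```python
from typing import Iterable, Optional

# Alias table as listed in this guide.
RUNNER_IMAGES = {
    "ubuntu-2404": "nimbus/ubuntu-2404-runner:latest",
    "ubuntu-2204": "nimbus/ubuntu-2204-runner:latest",
    "node-22": "nimbus/node-22-runner:latest",
    "python-312": "nimbus/python-312-runner:latest",
}

# Assumption: which alias each generic tag falls back to.
GENERIC_TAGS = {"node": "node-22", "python": "python-312"}

IMAGE_PREFIX = "image:"


def resolve_image(labels: Iterable[str]) -> Optional[str]:
    """Prefer an explicit image:<alias> label, then a generic tag."""
    labels = list(labels)
    for label in labels:
        if label.startswith(IMAGE_PREFIX):
            alias = label[len(IMAGE_PREFIX):]
            if alias not in RUNNER_IMAGES:
                raise KeyError(f"unknown image alias: {alias}")
            return RUNNER_IMAGES[alias]
    for label in labels:
        alias = GENERIC_TAGS.get(label)
        if alias is not None:
            return RUNNER_IMAGES[alias]
    return None  # caller decides the default image
```

Raising on an unknown alias, rather than silently falling back, makes a mistyped label visible at scheduling time.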

**Operational tips**

- Keep the image repository mirrored in your private registry if outbound pulls are restricted.
- Use scheduled CI to validate language runtimes and security updates within these images.

## Hardware-Aware Scheduling

Host agents now include a hardware snapshot with every lease request. The control plane uses these metrics to respect job requirements expressed as labels.

### Metrics collected

- CPU core count and average MHz (via `/proc/cpuinfo`)
- Total memory in MB (from `/proc/meminfo`)
- NVMe presence (`/sys/block/nvme*` probe)
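
A rough sketch of how these three probes could be implemented on Linux follows. The function names and the dictionary keys (`cpu_cores`, `cpu_mhz`, `memory_mb`, `has_nvme`) are hypothetical and do not reflect the agent's `_hardware_snapshot()` output. Counting `processor` entries yields logical CPUs, which is an assumption about what "core count" means here.

```python
import glob
from pathlib import Path


def parse_cpuinfo(text: str) -> tuple[int, float]:
    """Return (logical CPU count, average MHz) from /proc/cpuinfo text."""
    cpus = 0
    mhz_values = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            cpus += 1
        elif key == "cpu MHz":  # absent on some architectures
            mhz_values.append(float(value))
    average = sum(mhz_values) / len(mhz_values) if mhz_values else 0.0
    return cpus, average


def parse_mem_total_mb(text: str) -> int:
    """Return MemTotal from /proc/meminfo text, converted from kB to MB."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    return 0


def hardware_snapshot_sketch() -> dict:
    """Collect the three metrics on a Linux host (hypothetical field names)."""
    cpus, mhz = parse_cpuinfo(Path("/proc/cpuinfo").read_text())
    return {
        "cpu_cores": cpus,
        "cpu_mhz": mhz,
        "memory_mb": parse_mem_total_mb(Path("/proc/meminfo").read_text()),
        "has_nvme": bool(glob.glob("/sys/block/nvme*")),
    }
```

Keeping the parsers separate from the file reads lets them be tested with captured text on any platform.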

### Supported labels

| Label | Requirement |
| --- | --- |
| `cpu-high` | Host must report ≥ 8 cores |
| `cpu-medium` | Host must report ≥ 4 cores |
| `memory-high` | Host must report ≥ 32 GB RAM |
| `storage-nvme` | Host must expose NVMe storage |
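
The table can be read as a set of predicates over the hardware snapshot. The sketch below assumes hypothetical snapshot keys (`cpu_cores`, `memory_mb`, `has_nvme`) and treats 32 GB as 32 × 1024 MB; the guide does not state which unit convention the control plane uses, and labels outside the table are assumed to impose no hardware requirement.

```python
from typing import Callable, Iterable, Mapping

# One predicate per label in the table; snapshot keys are hypothetical.
LABEL_REQUIREMENTS: dict[str, Callable[[Mapping], bool]] = {
    "cpu-high": lambda hw: hw["cpu_cores"] >= 8,
    "cpu-medium": lambda hw: hw["cpu_cores"] >= 4,
    # Assumption: 32 GB means 32 * 1024 MB.
    "memory-high": lambda hw: hw["memory_mb"] >= 32 * 1024,
    "storage-nvme": lambda hw: bool(hw["has_nvme"]),
}


def host_satisfies(labels: Iterable[str], hardware: Mapping) -> bool:
    """True when every hardware label on the job is met by the host."""
    return all(
        LABEL_REQUIREMENTS[label](hardware)
        for label in labels
        if label in LABEL_REQUIREMENTS  # other labels are ignored here
    )
```

Note that a machine sold as 32 GB usually reports slightly less in `MemTotal`, so a strict threshold may exclude it; whether Nimbus applies a tolerance is not stated in this guide.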

### Workflow

- Agents attach `hardware` metrics to `JobLeaseRequest`.
- The control plane quick-scans the Redis queue, skipping jobs whose labels are not satisfied by the requesting host.
- When no capable host is available, the job remains queued, preserving order.
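
The scan-and-skip behaviour can be sketched with an in-memory list in place of the Redis queue. `lease_next_job` and the job dictionary shape are hypothetical; the real control plane operates on Redis and `JobLeaseRequest` objects, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Optional


def lease_next_job(queue: list[dict],
                   can_run: Callable[[dict], bool]) -> Optional[dict]:
    """Remove and return the first queued job the requesting host can run.

    Jobs the host cannot satisfy are left in place, so their relative
    order is preserved for a more capable host.
    """
    for index, job in enumerate(queue):
        if can_run(job):
            return queue.pop(index)
    return None  # nothing suitable; the queue is unchanged
```

For example, a host without 8 cores would pass a `can_run` that rejects jobs labelled `cpu-high`; such jobs stay at the head of the queue while later jobs are leased.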

**Operational tips**

- Align autoscaling groups with these profiles (e.g. dedicated NVMe group) and spread agent IDs accordingly.
- Expose additional metrics in `_hardware_snapshot()` if new labels are introduced.

## Self-Service Analytics

Nimbus now surfaces near-real-time execution insights through a programmatic API and the web dashboard.

### Control plane API

- `GET /api/analytics/jobs?days=<n>&org_id=<id>` returns daily buckets of job outcomes, filtered per organization when requested.
- Each bucket contains a `date`, `total`, and per-status counts.
- Authentication uses the existing admin bearer token scheme.
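
As an illustration of consuming this endpoint, the sketch below builds the request URL and derives a failure rate from the returned buckets. The helper names are hypothetical, and the per-status key `failed` is an assumption, since the guide lists only `date`, `total`, and unnamed per-status counts. The HTTP call itself is omitted; a client would send the URL with an `Authorization: Bearer <admin token>` header.

```python
from typing import Iterable, Mapping, Optional
from urllib.parse import urlencode


def analytics_url(base_url: str, days: int, org_id: Optional[int] = None) -> str:
    """Build the analytics query URL; org_id is optional per the guide."""
    params = {"days": days}
    if org_id is not None:
        params["org_id"] = org_id
    return f"{base_url.rstrip('/')}/api/analytics/jobs?{urlencode(params)}"


def failure_rate(buckets: Iterable[Mapping], failed_key: str = "failed") -> float:
    """Fraction of failed jobs across all buckets (status key is assumed)."""
    buckets = list(buckets)
    total = sum(bucket["total"] for bucket in buckets)
    failed = sum(bucket.get(failed_key, 0) for bucket in buckets)
    return failed / total if total else 0.0
```

A scheduled job could compare `failure_rate` over the last day against a longer window to flag failure spikes.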

### Dashboard page

- The web UI includes an **Analytics** tab that queries the API when a control-plane base URL and admin token are configured in settings.
- Users can review rolling success/failure totals and identify regressions without leaving the dashboard.

**Operational tips**

- Use the API to feed downstream BI tools or to trigger alerts on failure spikes.
- Front-end fetch retries follow standard `fetch` semantics; configure a CDN or cache if latency becomes an issue.

---

For deployment runbooks and additional operational context, consult [operations.md](./operations.md) and [EXECUTOR_SYSTEM.md](./EXECUTOR_SYSTEM.md).
