# Nimbus Feature Guide

This document explains the new platform capabilities that extend Nimbus beyond the baseline GitHub-compatible runner experience. Each section describes the architecture, configuration, and operational workflows for the feature set.

## Near-Runner Cache

The host agent now embeds a lightweight HTTP cache alongside every worker. Two access modes are supported:

1. **Loopback HTTP** – Jobs can use the advertised `cache.endpoint.*` metadata keys to reach the cache from either the host or guest network (see the sketch after this list).
2. **Virtio-fs passthrough** – When `virtiofsd` is available, the cache directory is mounted directly into Firecracker guests, allowing POSIX access.

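A minimal job-side pull over loopback HTTP might look like the sketch below. The environment variable names, URL layout, and org/key values are illustrative assumptions; in practice, resolve the address from the advertised `cache.endpoint.*` metadata keys.

```python
# Hypothetical job-side pull over loopback HTTP. Variable names, the URL
# layout, and the org/key below are illustrative, not the real cache API.
import os
import urllib.request

endpoint = os.environ.get("NIMBUS_CACHE_ENDPOINT", "http://127.0.0.1:42000")  # assumed env var
token = os.environ.get("NIMBUS_CACHE_TOKEN", "")                              # assumed env var

req = urllib.request.Request(
    f"{endpoint}/cache/acme-org/node-modules.tar.zst",  # illustrative org and key
    headers={"Authorization": f"Bearer {token}"},
)
with urllib.request.urlopen(req) as resp, open("node-modules.tar.zst", "wb") as out:
    out.write(resp.read())
```
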
### Lifecycle

- The `NearRunnerCacheManager` spins up a FastAPI app on a dedicated port as part of the host agent boot sequence.
- Cache tokens minted by the control plane guard every request; scope is validated per organization and operation (push/pull).
- Writes are persisted to the local cache directory and optionally mirrored to S3 for durability.
- Reads fall back to S3 automatically when the local artifact is missing (a minimal read-fallback sketch follows this list).
- When virtio-fs is enabled, the Firecracker launcher starts `virtiofsd` alongside the microVM and configures the device over MMDS metadata.

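The following is a minimal sketch of the read path with S3 fallback, assuming a hypothetical `/cache/{org_id}/{key}` route, an `Authorization` header carrying the cache token, and `boto3` for the mirror; the real `NearRunnerCacheManager` routes and helpers may differ.

```python
# Sketch only: route shape, header, bucket name, and token check are assumptions.
from pathlib import Path

import boto3
from fastapi import FastAPI, Header, HTTPException
from fastapi.responses import FileResponse

app = FastAPI()
CACHE_DIR = Path("/var/lib/nimbus/near-cache")  # e.g. NIMBUS_NEAR_CACHE_DIR
S3_BUCKET = "nimbus-near-cache-mirror"          # illustrative bucket name


@app.get("/cache/{org_id}/{key}")
def pull(org_id: str, key: str, authorization: str = Header(default="")) -> FileResponse:
    if not authorization:  # stand-in for the control-plane token/scope check
        raise HTTPException(status_code=401, detail="missing cache token")
    local = CACHE_DIR / org_id / key
    if local.exists():
        return FileResponse(local)
    # Local miss: repopulate from the S3 mirror, then serve the file.
    local.parent.mkdir(parents=True, exist_ok=True)
    try:
        boto3.client("s3").download_file(S3_BUCKET, f"{org_id}/{key}", str(local))
    except Exception:
        raise HTTPException(status_code=404, detail="artifact not found") from None
    return FileResponse(local)
```
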
### Configuration

Enable and tune the cache with the following environment variables, documented in [configuration.md](./configuration.md); an illustrative combination is sketched after this list:

- `NIMBUS_NEAR_CACHE_ENABLE`, `NIMBUS_NEAR_CACHE_DIR`, `NIMBUS_NEAR_CACHE_BIND`, `NIMBUS_NEAR_CACHE_ADVERTISE`
- Optional port range and S3 parameters: `NIMBUS_NEAR_CACHE_PORT`, `NIMBUS_NEAR_CACHE_PORT_START`, `NIMBUS_NEAR_CACHE_PORT_END`, `NIMBUS_NEAR_CACHE_S3_*`
- Guest mount controls: `NIMBUS_NEAR_CACHE_MOUNT_TAG`, `NIMBUS_NEAR_CACHE_MOUNT_PATH`, `NIMBUS_NEAR_CACHE_VIRTIOFSD`

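For example, a host agent process could be launched with values along these lines. Only the variable names come from the list above; every value is a placeholder for your deployment, and the expected value formats should be confirmed against [configuration.md](./configuration.md).

```python
# Illustrative values only; adjust paths, addresses, and ports to your hosts.
import os

os.environ.update({
    "NIMBUS_NEAR_CACHE_ENABLE": "1",
    "NIMBUS_NEAR_CACHE_DIR": "/var/lib/nimbus/near-cache",  # prefer SSD/NVMe media
    "NIMBUS_NEAR_CACHE_BIND": "0.0.0.0",
    "NIMBUS_NEAR_CACHE_ADVERTISE": "10.0.0.5",              # address reachable from guests
    "NIMBUS_NEAR_CACHE_PORT_START": "42000",
    "NIMBUS_NEAR_CACHE_PORT_END": "42100",
    "NIMBUS_NEAR_CACHE_MOUNT_TAG": "nimbus-cache",
    "NIMBUS_NEAR_CACHE_MOUNT_PATH": "/mnt/nimbus-cache",
    "NIMBUS_NEAR_CACHE_VIRTIOFSD": "/usr/libexec/virtiofsd",
})
```
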
**Operational checklist**

- Ensure the cache directory resides on SSD/NVMe media for best performance.
- Package `virtiofsd` with your host image if you expect guest mounts.
- Configure IAM permissions for the S3 bucket when write-through is enabled.

## SSH Debugging Sessions

The control plane exposes a token-gated SSH workflow for live troubleshooting of Firecracker guests.

### Flow

1. An admin calls `POST /api/ssh/sessions` with the target job ID (see the sketch after this list).
2. The control plane mints an HMAC token (`NIMBUS_SSH_SESSION_SECRET`) and reserves a host port.
3. The host agent polls `GET /api/agents/ssh/sessions` and, once a session is assigned, configures DNAT rules using the reserved port.
4. When the microVM reports its guest IP, the agent activates the session via `POST /api/ssh/sessions/{id}/activate`.
5. The CLI (or user) connects through the gateway using the Bearer token. Closing the session or TTL expiry revokes the token server-side.

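An admin-side sketch of steps 1 and 5 follows, assuming a `requests`-based client, a `job_id` request field, and `host`/`port`/`token` keys in the response. The payload shape is not documented here, so treat the field names as placeholders.

```python
# Sketch: create a debugging session, then surface the connection details.
# Field names (job_id, host, port, token) are assumptions, not confirmed.
import requests

CONTROL_PLANE = "https://nimbus.example.com"  # illustrative base URL
ADMIN_TOKEN = "..."                           # existing admin bearer token

resp = requests.post(
    f"{CONTROL_PLANE}/api/ssh/sessions",
    json={"job_id": "job-1234"},              # illustrative job ID
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
session = resp.json()

# Once the host agent activates the session, connect through the gateway,
# presenting the session token in whatever way your gateway expects.
print(f"ssh -p {session['port']} runner@{session['host']}")
print(f"session token: {session['token']}")
```
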
### Required settings

- Control plane: `NIMBUS_SSH_SESSION_SECRET`, `NIMBUS_SSH_PORT_START`, `NIMBUS_SSH_PORT_END`, `NIMBUS_SSH_SESSION_TTL`
- Host agent: `NIMBUS_SSH_ENABLE`, `NIMBUS_SSH_POLL_INTERVAL`, and `NIMBUS_SSH_AUTHORIZED_KEY`

**Operational tips**

- Rotate `NIMBUS_SSH_SESSION_SECRET` alongside other control-plane secrets.
- Choose port ranges that do not conflict with existing ingress rules.
- Agents automatically purge expired sessions; monitor the `Expired SSH sessions cleaned up` log line for drift.

## Prebuilt Runner Images

Nimbus ships with curated base images that mimic common GitHub runner stacks while incorporating platform hardening.

### Registry

The `nimbus.runners.images` module tracks canonical aliases. Current mappings:

| Alias | Image reference |
| --- | --- |
| `ubuntu-2404` | `nimbus/ubuntu-2404-runner:latest` |
| `ubuntu-2204` | `nimbus/ubuntu-2204-runner:latest` |
| `node-22` | `nimbus/node-22-runner:latest` |
| `python-312` | `nimbus/python-312-runner:latest` |

### Consumption patterns

- Jobs can opt in with a label: `image:ubuntu-2204` or `image:python-312`.
- Without explicit labels, the Docker executor maps generic tags (e.g. `node`, `python`) to the maintained images.
- The control plane exposes `GET /api/runners/images` so UI and tooling can discover available aliases (see the sketch after this list).

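A small discovery sketch against that endpoint, assuming an unauthenticated GET and a JSON object mapping alias to image reference; the real response shape and auth requirements may differ.

```python
# Sketch: list runner image aliases exposed by the control plane.
# The response is assumed to be a JSON object of alias -> image reference.
import requests

CONTROL_PLANE = "https://nimbus.example.com"  # illustrative base URL

resp = requests.get(f"{CONTROL_PLANE}/api/runners/images", timeout=10)
resp.raise_for_status()
for alias, image in resp.json().items():
    print(f"{alias:>12} -> {image}")
```
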
**Operational tips**

- Keep the image repository mirrored in your private registry if outbound pulls are restricted.
- Use scheduled CI to validate language runtimes and security updates within these images.

## Hardware-Aware Scheduling

Host agents now include a hardware snapshot with every lease request. The control plane uses these metrics to respect job requirements expressed as labels.

### Metrics collected

- CPU core count and average MHz (via `/proc/cpuinfo`)
- Total memory in MB (from `/proc/meminfo`)
- NVMe presence (`/sys/block/nvme*` probe)

### Supported labels

| Label | Requirement |
| --- | --- |
| `cpu-high` | Host must report ≥ 8 cores |
| `cpu-medium` | Host must report ≥ 4 cores |
| `memory-high` | Host must report ≥ 32 GB RAM |
| `storage-nvme` | Host must expose NVMe storage |

### Workflow

- Agents attach `hardware` metrics to `JobLeaseRequest`.
- The control plane quick-scans the Redis queue, skipping jobs whose labels are not satisfied by the requesting host (a minimal matching sketch follows this list).
- When no capable host is available, the job remains queued, preserving order.

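The sketch below shows one plausible way to collect the snapshot and evaluate the label table. Helper names, dictionary keys, and the `satisfies` check mirror the descriptions above but are not the actual `_hardware_snapshot()` implementation.

```python
# Illustrative snapshot + label check; field names and helpers are assumptions.
import glob
import re


def hardware_snapshot() -> dict:
    with open("/proc/cpuinfo") as f:
        cpuinfo = f.read()
    cores = len(re.findall(r"^processor\s*:", cpuinfo, re.MULTILINE))
    mhz = [float(v) for v in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo, re.MULTILINE)]
    with open("/proc/meminfo") as f:
        mem_kb = int(re.search(r"MemTotal:\s*(\d+) kB", f.read()).group(1))
    return {
        "cpu_cores": cores,
        "cpu_mhz_avg": sum(mhz) / len(mhz) if mhz else 0.0,
        "memory_mb": mem_kb // 1024,
        "has_nvme": bool(glob.glob("/sys/block/nvme*")),
    }


def satisfies(labels: list[str], hw: dict) -> bool:
    """Return True when the host meets every hardware label on the job."""
    checks = {
        "cpu-high": hw["cpu_cores"] >= 8,
        "cpu-medium": hw["cpu_cores"] >= 4,
        "memory-high": hw["memory_mb"] >= 32 * 1024,
        "storage-nvme": hw["has_nvme"],
    }
    return all(checks.get(label, True) for label in labels)
```
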
**Operational tips**

- Align autoscaling groups with these profiles (e.g. dedicated NVMe group) and spread agent IDs accordingly.
- Expose additional metrics in `_hardware_snapshot()` if new labels are introduced.

## Self-Service Analytics

Nimbus now surfaces near-real-time execution insights through a programmatic API and the web dashboard.

### Control plane API

- `GET /api/analytics/jobs?days=<n>&org_id=<id>` returns daily buckets of job outcomes, filtered per organization when requested (see the sketch after this list).
- Each bucket contains a `date`, a `total`, and per-status counts.
- Authentication uses the existing admin bearer token scheme.

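A consumer sketch that pulls a week of buckets and flags days with a high failure rate. The response is assumed to be a JSON array, and the per-status key `failed` is a placeholder since only `date` and `total` are documented above.

```python
# Sketch: fetch daily buckets and flag failure spikes. "failed" is an assumed
# per-status key; "date" and "total" come from the description above.
import requests

CONTROL_PLANE = "https://nimbus.example.com"  # illustrative base URL
ADMIN_TOKEN = "..."                           # admin bearer token

resp = requests.get(
    f"{CONTROL_PLANE}/api/analytics/jobs",
    params={"days": 7},
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
for bucket in resp.json():                    # assumed: a list of daily buckets
    failed = bucket.get("failed", 0)
    rate = failed / bucket["total"] if bucket["total"] else 0.0
    if rate > 0.2:                            # arbitrary alerting threshold
        print(f"{bucket['date']}: {rate:.0%} of jobs failed")
```
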
### Dashboard page

- The web UI includes an **Analytics** tab that queries the API when a control-plane base URL and admin token are configured in settings.
- Users can review rolling success/failure totals and identify regressions without leaving the dashboard.

**Operational tips**

- Use the API to feed downstream BI tools or to trigger alerts on failure spikes.
- Front-end requests rely on standard `fetch` semantics (no custom retry logic); configure a CDN or cache in front of the API if latency becomes an issue.

---

For deployment runbooks and additional operational context, consult [operations.md](./operations.md) and [EXECUTOR_SYSTEM.md](./EXECUTOR_SYSTEM.md).