| 1 | +# Nimbus Threat Model (STRIDE Baseline) |
| 2 | + |
| 3 | +This document captures the initial STRIDE analysis for Nimbus components. It highlights current mitigations and the blockers that must be resolved before a production release. |
| 4 | + |
| 5 | +## 1. System Overview |
| 6 | + |
| 7 | +- **Assets** |
| 8 | + - GitHub webhook secrets, agent JWT/signing keys, cache tokens. |
| 9 | + - Job definitions, artifacts, logs, ClickHouse telemetry. |
| 10 | + - Rootfs images, Docker/GPU executor environments. |
| 11 | + - Compliance export logs, RBAC policy definitions. |
| 12 | +- **Trust Boundaries** |
| 13 | + - Internet ▶️ Control Plane (GitHub webhooks, admin/API clients). |
| 14 | + - Control Plane ▶️ Redis/Postgres/ClickHouse. |
| 15 | + - Control Plane ▶️ Host Agents (lease issuance, fencing). |
| 16 | + - Host Agents ▶️ Executors (Firecracker, Docker, GPU). |
| 17 | + - Tenants ▶️ Cache Proxy / Docker Cache / Logging endpoints. |
| 18 | + |
| 19 | +## 2. STRIDE Breakdown |
| 20 | + |
| 21 | +| Component | Spoofing | Tampering | Repudiation | Information Disclosure | DoS | Elevation of Privilege | |
| 22 | +|-----------|----------|-----------|-------------|------------------------|-----|------------------------| |
| 23 | +| GitHub Webhook Ingress | ✅ HMAC + timestamp + delivery nonce. | ❌ Need replay fuzzing & idempotency tests. | ✅ Structured logging. | ✅ Payload minima, but fuzzing pending. | ⚠️ Risk of burst -> Redis backlog. | ⚠️ No step-up auth for admin endpoints. | |
| 24 | +| Job Lease Service | ✅ Fence tokens, agent auth. | ⚠️ Missing property tests for lease monotonicity. | ✅ Audit tables. | ⚠️ Lease data in Redis (no encryption). | ⚠️ Rate-limit coverage low. | ⚠️ Missing agent capability verification tests. | |
| 25 | +| Cache Proxy | ✅ Bearer cache tokens. | ⚠️ Need fuzz tests for key sanitiser + eviction. | ✅ Request logging. | ⚠️ S3/local backend quota bypass risk. | ⚠️ Circuit breaker tuning untested. | ⚠️ Tokens scoped by org but no policy proofs. | |
| 26 | +| Docker Cache | ✅ Token scope enforcement. | ⚠️ Metadata tamper via partial uploads. | ✅ Audit events. | ⚠️ Org isolation not fuzzed. | ⚠️ Potential blob storm. | ⚠️ No attestation enforcement. | |
| 27 | +| Logging Pipeline | ✅ Cache token scope. | ⚠️ No fuzzing of ClickHouse payload. | ✅ Ingest logs. | ⚠️ Need row-level security enforcement tests. | ⚠️ Backpressure + batching thresholds untested. | ⚠️ Query policy lacks deny-by-default tests. | |
| 28 | +| Host Agent | ✅ Control-plane JWTs. | ⚠️ Egress policy bypass surfaces (curl, DNS). | ✅ Job status logs. | ⚠️ GPU telemetry leaks (NVML, MIG). | ⚠️ Warm pool exhaustion -> DoS. | ⚠️ Supply-chain checks best-effort only. | |
| 29 | +| Web Dashboard | ✅ SSO / RBAC (docs). | ⚠️ No permission matrix regression tests. | ✅ HTTP access logs. | ⚠️ Potential cross-tenant data leakage. | ⚠️ Unbounded queries vs ClickHouse. | ⚠️ UI step-up, scoped tokens not validated. | |
| 30 | + |
| 31 | +Legend: ✅ covered, ⚠️ gap, ❌ missing mitigation. |
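
The "HMAC + timestamp + delivery nonce" control in the webhook row can be illustrated with a minimal sketch. It assumes the `sha256=<hex>` signature format that GitHub sends in `X-Hub-Signature-256` and uses the `X-GitHub-Delivery` GUID as the nonce. The `ReplayGuard` class, the `sign` helper, the five-minute skew window, and the in-memory nonce set are all hypothetical and do not describe the actual Nimbus ingress code. The timestamp is assumed to come from signed content; an unsigned timestamp would add no protection.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, body: bytes) -> str:
    """Build a signature in the `sha256=<hex>` form used by X-Hub-Signature-256."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


class ReplayGuard:
    """Hypothetical verifier: HMAC check, freshness window, one-time delivery IDs.

    The nonce set lives in process memory here; a multi-instance deployment
    would need a shared store with expiry.
    """

    def __init__(self, secret: bytes, max_skew_seconds: int = 300):
        self.secret = secret
        self.max_skew_seconds = max_skew_seconds
        self.seen_deliveries = set()

    def verify(self, body: bytes, signature: str, timestamp: float,
               delivery_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expected = sign(self.secret, body)
        # Constant-time comparison avoids leaking the digest through timing.
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return False
        # Reject deliveries outside the freshness window.
        if abs(now - timestamp) > self.max_skew_seconds:
            return False
        # Reject a delivery ID that has already been accepted.
        if delivery_id in self.seen_deliveries:
            return False
        self.seen_deliveries.add(delivery_id)
        return True
```

The signature is checked first so that unauthenticated requests cannot fill the nonce set.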
| 32 | + |
| 33 | +## 3. Priority Gaps |
| 34 | + |
| 35 | +1. **Policy Verification** |
| 36 | + - Formalise RBAC/OPA policies and create golden allow/deny fixture tests. |
| 37 | + - Add CI suite to run policy regression on every commit. |
| 38 | + |
| 39 | +2. **Egress Enforcement** |
| 40 | + - Add netns + iptables unit tests that confirm traffic to metadata endpoints and deny-listed destinations is blocked. |
| 41 | + - Automate red-team scenarios (DNS tunnelling, curl-in-image, sidecar pivot) as integration tests. |
| 42 | + |
| 43 | +3. **Replay & Idempotency** |
| 44 | + - Property-based tests to ensure webhook replays do not duplicate jobs. |
| 45 | + - Agent lease state machine tests for monotonic fence tokens and exactly-once completion semantics. |
| 46 | + |
| 47 | +4. **Multi-tenant Isolation** |
| 48 | + - ClickHouse row-level security with regression harness. |
| 49 | + - Cache/Docker registries enforcing org quotas with adversarial testing. |
| 50 | + |
| 51 | +5. **Secrets & Key Management** |
| 52 | + - Store secrets in Vault/KMS; rotate via automation; add dry-run restore tests. |
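
The invariants named under item 3 (Replay & Idempotency) can be stated concretely. The sketch below is a toy model, not the Nimbus lease service: `LeaseTable` and `check_invariants` are hypothetical names, and a seeded `random` loop stands in for a real property-based framework. It checks three properties: a replayed delivery ID maps to the same job, fence tokens strictly increase per job, and a job completes at most once and only with the current token.

```python
import random


class LeaseTable:
    """Toy in-memory model of job creation, leasing, and completion."""

    def __init__(self):
        self.jobs_by_delivery = {}  # delivery_id -> job_id
        self.fence = {}             # job_id -> latest fence token issued
        self.completed = set()
        self._next_job = 0

    def enqueue(self, delivery_id):
        # A replayed delivery returns the existing job instead of a new one.
        if delivery_id in self.jobs_by_delivery:
            return self.jobs_by_delivery[delivery_id]
        self._next_job += 1
        self.jobs_by_delivery[delivery_id] = self._next_job
        self.fence[self._next_job] = 0
        return self._next_job

    def lease(self, job_id):
        # Completed jobs cannot be leased again.
        if job_id in self.completed:
            return None
        self.fence[job_id] += 1
        return self.fence[job_id]

    def complete(self, job_id, token):
        # Only the holder of the newest token may complete, and only once.
        if job_id in self.completed or token < 1 or token != self.fence[job_id]:
            return False
        self.completed.add(job_id)
        return True


def check_invariants(seed, steps=200):
    """Drive random operations and assert the three invariants throughout."""
    rng = random.Random(seed)
    table = LeaseTable()
    last_token = {}
    completions = {}
    for _ in range(steps):
        delivery_id = f"d{rng.randrange(5)}"
        job_id = table.enqueue(delivery_id)
        assert table.enqueue(delivery_id) == job_id  # replay is idempotent
        action = rng.choice(["lease", "complete_current", "complete_stale"])
        if action == "lease":
            token = table.lease(job_id)
            if token is not None:
                assert token > last_token.get(job_id, 0)  # strictly increasing
                last_token[job_id] = token
        elif action == "complete_current":
            if table.complete(job_id, table.fence[job_id]):
                completions[job_id] = completions.get(job_id, 0) + 1
        else:
            # A token older than the current one must always be rejected.
            assert not table.complete(job_id, table.fence[job_id] - 1)
    assert all(count == 1 for count in completions.values())
    return True
```

Running `check_invariants` over a range of seeds gives a cheap regression check; a property-based library would add shrinking of failing sequences.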
| 53 | + |
| 54 | +## 4. Next Actions |
| 55 | + |
| 56 | +- [ ] Produce OPA policy suite (`policy/` + `tests/policy`) with deny-by-default. |
| 57 | +- [ ] Implement fuzz/property frameworks per `docs/testing-hardening-roadmap.md`. |
| 58 | +- [ ] Add network namespace simulation tests validating OfflineEgressEnforcer. |
| 59 | +- [ ] Document GPU sharing strategy (MIG/MPS) and enforce per-job restrictions. |
| 60 | +- [ ] Update runbooks with SLOs, DR, backup procedures. |
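
As an illustration of the deny-by-default policy suite in the first checklist item, golden allow/deny fixtures could take roughly the following shape. This is a Python stand-in for an OPA policy, not Rego, and every role, action, and org name in it is hypothetical. A request is allowed only if it matches an explicit rule and stays within the caller's org; everything else is denied.

```python
# Explicit allow rules; anything not listed here is denied.
ALLOW_RULES = {
    ("admin", "jobs:cancel"),
    ("admin", "logs:read"),
    ("developer", "logs:read"),
}


def is_allowed(role, action, subject_org, resource_org):
    """Deny by default: require same-org access and an explicit allow rule."""
    if subject_org != resource_org:
        return False
    return (role, action) in ALLOW_RULES


# Golden fixtures: (role, action, subject_org, resource_org) -> expected decision.
GOLDEN = [
    (("admin", "jobs:cancel", "org-a", "org-a"), True),
    (("developer", "logs:read", "org-a", "org-a"), True),
    (("developer", "jobs:cancel", "org-a", "org-a"), False),   # no such rule
    (("admin", "logs:read", "org-a", "org-b"), False),         # cross-tenant
    (("unknown-role", "logs:read", "org-a", "org-a"), False),  # unknown role
]


def run_golden():
    """Return the fixtures whose decision differs from the expected one."""
    return [case for case, expected in GOLDEN if is_allowed(*case) != expected]
```

A CI job would fail whenever `run_golden()` returns a non-empty list, so any policy change that flips a decision has to update the fixtures explicitly.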
| 61 | + |
| 62 | +Update this document as mitigations land. Once all ⚠️ and ❌ items are addressed, review it with security engineering before declaring the system production-ready. |