docs: capture initial stride threat model
1 parent fe6fb7b commit 26e30f3

docs/threat-model.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# Nimbus Threat Model (STRIDE Baseline)

This document captures the initial STRIDE analysis for Nimbus components. It highlights current mitigations and the blockers that must be resolved before a production release.

## 1. System Overview

- **Assets**
  - GitHub webhook secrets, agent JWT/signing keys, cache tokens.
  - Job definitions, artifacts, logs, ClickHouse telemetry.
  - Rootfs images, Docker/GPU executor environments.
  - Compliance export logs, RBAC policy definitions.
- **Trust Boundaries**
  - Internet ▶️ Control Plane (GitHub webhooks, admin/API clients).
  - Control Plane ▶️ Redis/Postgres/ClickHouse.
  - Control Plane ▶️ Host Agents (lease issuance, fencing).
  - Host Agents ▶️ Executors (Firecracker, Docker, GPU).
  - Tenants ▶️ Cache Proxy / Docker Cache / Logging endpoints.

## 2. STRIDE Breakdown

| Component | Spoofing | Tampering | Repudiation | Information Disclosure | Denial of Service | Elevation of Privilege |
|-----------|----------|-----------|-------------|------------------------|-------------------|------------------------|
| GitHub Webhook Ingress | ✅ HMAC + timestamp + delivery nonce. | ❌ Need replay fuzzing & idempotency tests. | ✅ Structured logging. | ✅ Payload minima, but fuzzing pending. | ⚠️ Risk of burst -> Redis backlog. | ⚠️ No step-up auth for admin endpoints. |
| Job Lease Service | ✅ Fence tokens, agent auth. | ⚠️ Missing property tests for lease monotonicity. | ✅ Audit tables. | ⚠️ Lease data in Redis (no encryption). | ⚠️ Rate-limit coverage low. | ⚠️ Missing agent capability verification tests. |
| Cache Proxy | ✅ Bearer cache tokens. | ⚠️ Need fuzz tests for key sanitiser + eviction. | ✅ Request logging. | ⚠️ S3/local backend quota bypass risk. | ⚠️ Circuit breaker tuning untested. | ⚠️ Tokens scoped by org but no policy proofs. |
| Docker Cache | ✅ Token scope enforcement. | ⚠️ Metadata tamper via partial uploads. | ✅ Audit events. | ⚠️ Org isolation not fuzzed. | ⚠️ Potential blob storm. | ⚠️ No attestation enforcement. |
| Logging Pipeline | ✅ Cache token scope. | ⚠️ No fuzzing of ClickHouse payload. | ✅ Ingest logs. | ⚠️ Need row-level security enforcement tests. | ⚠️ Backpressure + batching thresholds untested. | ⚠️ Query policy lacks deny-by-default tests. |
| Host Agent | ✅ Control-plane JWTs. | ⚠️ Egress policy bypass surfaces (curl, DNS). | ✅ Job status logs. | ⚠️ GPU telemetry leaks (NVML, MIG). | ⚠️ Warm pool exhaustion -> DoS. | ⚠️ Supply-chain checks best-effort only. |
| Web Dashboard | ✅ SSO / RBAC (docs). | ⚠️ No permission matrix regression tests. | ✅ HTTP access logs. | ⚠️ Potential cross-tenant data leakage. | ⚠️ Unbounded queries vs ClickHouse. | ⚠️ UI step-up, scoped tokens not validated. |

Legend: ✅ covered, ⚠️ gap, ❌ missing mitigation.
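
To make the "HMAC + timestamp + delivery nonce" entry for webhook ingress concrete, here is a minimal Python sketch of such a check. It is an illustration under stated assumptions, not the Nimbus implementation: `sign`, `ReplayGuard`, and `verify_webhook` are hypothetical names, the in-memory nonce store would need to be a shared store in a multi-replica deployment, and the signing scheme (MAC over timestamp, delivery ID, and body) is assumed for illustration. GitHub's own `X-Hub-Signature-256` header covers only the request body.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def sign(secret: bytes, delivery_id: str, sent_at: int, body: bytes) -> str:
    """Hypothetical scheme: the MAC binds timestamp, delivery ID, and body."""
    did = delivery_id.encode()
    # Length-prefix the delivery ID so the three fields cannot be re-split.
    message = f"{sent_at}.{len(did)}.".encode() + did + body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    """In-memory record of recently seen delivery IDs (assumption: one process)."""

    def __init__(self, max_age_seconds: int = 300) -> None:
        self.max_age_seconds = max_age_seconds
        self._seen: Dict[str, float] = {}

    def first_sighting(self, delivery_id: str, now: float) -> bool:
        # A delivery can be recorded up to one window before its timestamp
        # (clock skew) and replayed up to one window after it, so IDs are
        # retained for two windows before being pruned.
        cutoff = now - 2 * self.max_age_seconds
        self._seen = {d: t for d, t in self._seen.items() if t >= cutoff}
        if delivery_id in self._seen:
            return False
        self._seen[delivery_id] = now
        return True


def verify_webhook(
    secret: bytes,
    body: bytes,
    signature: str,
    delivery_id: str,
    sent_at: int,
    guard: ReplayGuard,
    now: Optional[float] = None,
) -> bool:
    now = time.time() if now is None else now
    expected = sign(secret, delivery_id, sent_at, body)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    # Reject deliveries outside the freshness window.
    if abs(now - sent_at) > guard.max_age_seconds:
        return False
    # Reject a delivery ID that has already been accepted.
    return guard.first_sighting(delivery_id, now)
```

The signature is checked before the nonce is recorded, so unauthenticated requests cannot fill the replay store. A replay fuzzing suite of the kind listed under Tampering could drive `verify_webhook` with mutated bodies, timestamps, and repeated delivery IDs.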

## 3. Priority Gaps

1. **Policy Verification**
   - Formalise RBAC/OPA policies and create golden allow/deny fixture tests.
   - Add CI suite to run policy regression on every commit.

2. **Egress Enforcement**
   - Netns + iptables unit tests ensuring metadata endpoints and deny-list hits are blocked.
   - Red-team scenarios (DNS tunnelling, curl-in-image, sidecar pivot) automated in integration tests.

3. **Replay & Idempotency**
   - Property-based tests to ensure webhook replays do not duplicate jobs.
   - Agent lease state machine tests for monotonic fence tokens and exactly-once completion semantics.

4. **Multi-tenant Isolation**
   - ClickHouse row-level security with regression harness.
   - Cache/Docker registries enforcing org quotas with adversarial testing.

5. **Secrets & Key Management**
   - Store secrets in Vault/KMS; rotate via automation; add dry-run restore tests.
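
As an illustration of the lease state machine tests named under Replay & Idempotency, the following Python sketch checks two properties against a toy model: fence tokens only ever increase, and a job completes at most once even when stale tokens are presented. `LeaseTable` and `check_lease_properties` are hypothetical and do not reflect the actual Job Lease Service. The check uses seeded random sequences from the standard library; a property-testing library could generate and shrink the sequences instead.

```python
import random
from typing import Dict, List, Optional


class LeaseTable:
    """Toy lease state machine (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._next_fence = 1
        self._current: Dict[str, int] = {}   # job_id -> fence of the live lease
        self.completed: Dict[str, int] = {}  # job_id -> fence that completed it

    def acquire(self, job_id: str) -> Optional[int]:
        # A completed job can never be leased again.
        if job_id in self.completed:
            return None
        fence = self._next_fence
        self._next_fence += 1
        self._current[job_id] = fence  # supersedes any earlier lease
        return fence

    def complete(self, job_id: str, fence: int) -> bool:
        # Only the holder of the newest fence token may complete the job.
        if job_id in self.completed or self._current.get(job_id) != fence:
            return False
        self.completed[job_id] = fence
        return True


def check_lease_properties(seed: int, steps: int = 200) -> None:
    rng = random.Random(seed)
    table = LeaseTable()
    issued: Dict[str, List[int]] = {}
    completions: Dict[str, int] = {}
    last_fence = 0
    for _ in range(steps):
        job = rng.choice(["job-a", "job-b", "job-c"])
        if rng.random() < 0.6:
            fence = table.acquire(job)
            if fence is not None:
                # Property 1: fence tokens are strictly monotonic.
                assert fence > last_fence
                last_fence = fence
                issued.setdefault(job, []).append(fence)
        elif issued.get(job):
            # Present a previously issued token, which may be stale.
            fence = rng.choice(issued[job])
            if table.complete(job, fence):
                # Property 2: only the newest token completes, and only once.
                assert fence == issued[job][-1]
                completions[job] = completions.get(job, 0) + 1
                assert completions[job] == 1


if __name__ == "__main__":
    for seed in range(100):
        check_lease_properties(seed)
```

Fixing the seed keeps any failing sequence reproducible, which matters when such a check runs in CI.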

## 4. Next Actions

- [ ] Produce OPA policy suite (`policy/` + `tests/policy`) with deny-by-default.
- [ ] Implement fuzz/property frameworks per `docs/testing-hardening-roadmap.md`.
- [ ] Add network namespace simulation tests validating OfflineEgressEnforcer.
- [ ] Document GPU sharing strategy (MIG/MPS) and enforce per-job restrictions.
- [ ] Update runbooks with SLOs, DR, backup procedures.
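
The deny-by-default policy suite in the first action can be pictured as a table of golden allow/deny fixtures run against the policy on every commit. The Python sketch below stands in for that idea only: a real suite would evaluate the OPA policies themselves, and the `Request` shape, `ALLOW_RULES`, role names, and org names here are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Request(NamedTuple):
    role: str
    action: str
    subject_org: str   # org the caller belongs to
    resource_org: str  # org that owns the resource


# Hypothetical allow-list: only these (role, action) pairs are ever permitted.
ALLOW_RULES = {
    ("admin", "read"),
    ("admin", "write"),
    ("viewer", "read"),
}


def is_allowed(req: Request) -> bool:
    # Deny by default: cross-org access is refused outright, and anything
    # not explicitly listed in ALLOW_RULES falls through to a deny.
    if req.subject_org != req.resource_org:
        return False
    return (req.role, req.action) in ALLOW_RULES


# Golden fixtures: each request is paired with the decision it must produce.
GOLDEN_FIXTURES: List[Tuple[Request, bool]] = [
    (Request("admin", "write", "org-1", "org-1"), True),
    (Request("viewer", "read", "org-1", "org-1"), True),
    (Request("viewer", "write", "org-1", "org-1"), False),
    (Request("admin", "read", "org-1", "org-2"), False),    # cross-tenant
    (Request("unknown", "read", "org-1", "org-1"), False),  # unlisted role
    (Request("admin", "delete", "org-1", "org-1"), False),  # unlisted action
]


def run_policy_regression() -> List[str]:
    """Return a description of every fixture whose decision has drifted."""
    failures = []
    for req, expected in GOLDEN_FIXTURES:
        actual = is_allowed(req)
        if actual != expected:
            failures.append(f"{req}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    problems = run_policy_regression()
    for line in problems:
        print(line)
    raise SystemExit(1 if problems else 0)
```

The deny fixtures matter as much as the allow fixtures: a rule change that widens access shows up as a failing deny case rather than passing silently.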

This document should be updated as mitigations land. Once all ⚠️ and ❌ items are addressed, review with security engineering before declaring the system production-ready.
