Add security documentation: threat model, attack surface, recovery procedures #10

@Evrard-Nil

Problem

The project lacks security documentation:

  • No threat model documented
  • Attack surface not defined
  • No incident response procedures
  • Key rotation process not documented
  • No security considerations for operators

CLAUDE.md has good build/architecture docs but doesn't cover security operations.

Impact

High - Without security documentation, operators cannot:

  • Deploy the service securely
  • Respond to security incidents
  • Understand the security guarantees
  • Perform key rotation safely

Solution

Create SECURITY.md with the following sections:

1. Threat Model

## Threat Model

### Assets
- Signing keys (ECDSA, Ed25519) - stored in dstack KMS
- Authentication tokens - stored in environment
- Cached signatures - in-memory TTL cache
- TDX quotes and GPU attestations
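
To make the "cached signatures" asset concrete, here is a minimal sketch of an in-memory TTL cache. It is not the proxy's actual cache: `TtlCache`, its methods, and the choice to pass `now` explicitly (for deterministic testing) are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory cache whose entries expire after a fixed TTL.
pub struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl TtlCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Store `signature` under `chat_id`, stamped with the caller-supplied time.
    pub fn insert(&mut self, chat_id: &str, signature: &str, now: Instant) {
        self.entries
            .insert(chat_id.to_string(), (now, signature.to_string()));
    }

    /// Return the signature only while it is younger than the TTL.
    /// Expired entries are evicted on read.
    pub fn get(&mut self, chat_id: &str, now: Instant) -> Option<String> {
        let expired = match self.entries.get(chat_id) {
            Some((stored_at, _)) => now.saturating_duration_since(*stored_at) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(chat_id);
            return None;
        }
        self.entries.get(chat_id).map(|(_, sig)| sig.clone())
    }
}
```

The point for the threat model is that anything in such a cache lives in process memory until it expires or is evicted, so it is exposed to the same memory-disclosure risks as tokens and keys.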

### Threat Actors
- External attackers without credentials
- External attackers with stolen credentials
- Malicious insiders with valid tokens
- Compromised backend services
- Side-channel attackers

### Threats
1. **Key Compromise**: If signing keys leak, attacker can forge signatures
2. **Token Theft**: Stolen auth token enables full proxy access
3. **DoS Attacks**: Resource exhaustion via flooding
4. **Signature Forgery**: Invalid signatures accepted by clients
5. **Path Traversal**: Accessing unintended backend endpoints
6. **Timing Attacks**: Token leakage via timing side-channels
7. **Memory Disclosure**: Token/key leakage via memory dumps
8. **Supply Chain**: Compromised dependencies
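
To illustrate threat 6, a minimal sketch of a token comparison that does not return early on the first mismatching byte. Both function names and the `Bearer ` header format are assumptions, not the proxy's confirmed behavior.

```rust
/// Compare two byte strings without returning early on the first mismatch.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // The length check does return early; this sketch accepts leaking the length.
    if a.len() != b.len() {
        return false;
    }
    // Accumulate the XOR of every byte pair so all bytes are always visited.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical helper: check an `Authorization` header value against the expected token.
pub fn bearer_token_matches(header_value: &str, expected_token: &str) -> bool {
    match header_value.strip_prefix("Bearer ") {
        Some(presented) => constant_time_eq(presented.as_bytes(), expected_token.as_bytes()),
        None => false,
    }
}
```

A hand-rolled loop like this gives no guarantee against compiler optimizations, so a production implementation would more likely rely on a vetted crate such as `subtle`.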

2. Attack Surface

## Attack Surface

### External Attack Surface
- HTTP endpoints (authenticated)
  - `/v1/chat/completions` - JSON/streaming chat
  - `/v1/completions` - Text completion
  - `/v1/embeddings`, `/v1/rerank`, `/v1/score` - ML endpoints
  - `/v1/images/generations`, `/v1/images/edits` - Image endpoints
  - `/v1/audio/transcriptions` - Audio endpoint
  - `/v1/signature/{chat_id}` - Signature retrieval
  - `/v1/attestation/report` - TEE attestation
  - `/*` (catch-all) - Arbitrary path forwarding

- HTTP endpoints (unauthenticated)
  - `/`, `/version` - Health checks
  - `/v1/metrics`, `/v1/models` - Info endpoints
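
Because the catch-all `/*` route forwards arbitrary paths, it is where the path traversal threat is most relevant. The following is a minimal sketch of a conservative pre-forwarding check; `is_safe_forward_path` and its rejection rules are assumptions for illustration, not the proxy's existing logic.

```rust
/// Return true only if `path` is safe to forward under this sketch's rules.
pub fn is_safe_forward_path(path: &str) -> bool {
    // Must be absolute, with no backslashes or control bytes.
    if !path.starts_with('/')
        || path.contains('\\')
        || path.bytes().any(|b| b < 0x20 || b == 0x7f)
    {
        return false;
    }
    // Ignore any query string when inspecting path segments.
    let path_only = path.split('?').next().unwrap_or("");
    let lowered = path_only.to_ascii_lowercase();
    // Reject percent-encoded dots, slashes, and backslashes rather than decoding them.
    if lowered.contains("%2e") || lowered.contains("%2f") || lowered.contains("%5c") {
        return false;
    }
    // Reject "." and ".." segments, which could escape the intended prefix.
    !path_only.split('/').any(|seg| seg == "." || seg == "..")
}
```

Rejecting encoded separators outright is stricter than decoding and normalizing, but it avoids any disagreement between how the proxy and the backend interpret the same path.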

### Internal Attack Surface
- dstack KMS API - Key retrieval
- Backend vLLM/sglang service - Request forwarding
- Python subprocess - GPU attestation
- File system - Git revision file

### Network Surface
- Inbound: 0.0.0.0:8000 (configurable)
- Outbound: Backend URL, dstack KMS, Python subprocess (GPU attestation)

3. Incident Response

## Incident Response

### Token Compromise
1. Immediately rotate `TOKEN` environment variable
2. Restart proxy service
3. Audit logs for unauthorized access
4. Revoke and reissue tokens to legitimate clients

### Key Compromise
1. **DO NOT** restart - this will generate new keys and invalidate all signatures
2. Contact dstack team to rotate KMS keys
3. Coordinate with clients to update to new signing addresses
4. Archive old signatures with compromise timestamp

### DoS Attack
1. Check metrics for unusual request patterns
2. Enable rate limiting if not already active
3. Block attacking IPs at firewall/load balancer
4. Scale horizontally if needed
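
For step 2, a minimal sketch of a token-bucket rate limiter, one common way to bound request rates. `TokenBucket` and its parameters are hypothetical, and nothing here implies the proxy already ships this mechanism.

```rust
use std::time::Instant;

/// Hypothetical token bucket: allows bursts up to `capacity`,
/// then a sustained rate of `refill_per_sec` requests per second.
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: capacity as f64,
            refill_per_sec,
            tokens: capacity as f64,
            last_refill: now,
        }
    }

    /// Refill according to elapsed time, then try to spend one token.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In practice one bucket would be kept per client (for example per token or per source IP), and a request for which `try_acquire` returns `false` would be rejected with HTTP 429.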

### Signature Verification Failures
1. Check that signing keys are initialized correctly
2. Verify dstack KMS is reachable
3. Check backend is not returning corrupted responses
4. Review recent code changes to signing logic

4. Key Rotation

## Key Rotation Procedures

### Planned Rotation

**Warning**: Key rotation invalidates all cached signatures!

1. Schedule maintenance window (cache TTL + buffer)
2. Let cache expire naturally (default: 20 minutes)
3. Work with dstack team to rotate keys in KMS:
   ```bash
   dstack-cli rotate-key MODEL_NAME/ecdsa-signing-key
   dstack-cli rotate-key MODEL_NAME/ed25519-signing-key
   ```
4. Restart proxy service to load new keys
5. Verify new signing addresses in logs
6. Update documentation/contracts with new addresses
7. Notify clients of address change

### Emergency Rotation (Key Compromise)

  1. Immediately rotate in KMS
  2. Restart proxy (accept brief downtime)
  3. Notify all clients ASAP
  4. Post-incident review

### Testing Key Rotation

```bash
# In dev mode, test rotation process
DEV=1 cargo run  # generates random keys
# Check signing addresses in logs, verify they're different each run
```

5. Security Checklist for Deployment

## Deployment Security Checklist

- [ ] `TOKEN` is strong (32+ random characters)
- [ ] `TOKEN` is unique per environment
- [ ] `DEV=1` is NOT set in production
- [ ] TLS terminates at load balancer
- [ ] Rate limiting is configured
- [ ] Backend URL points to internal network (not internet)
- [ ] Monitoring/alerting is configured
- [ ] Log aggregation is enabled
- [ ] Regular security updates scheduled
- [ ] Backup procedure documented
- [ ] Incident response plan reviewed
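
Two of the checklist items (`TOKEN` length and `DEV=1`) can be checked mechanically. The sketch below shows one possible pre-flight check; `check_deployment` is a hypothetical helper, and it takes the variable values as arguments so it can be tested without touching the real environment.

```rust
/// Return one finding per checklist item that this sketch can verify and that fails.
/// `token` and `dev` are the values of the `TOKEN` and `DEV` environment variables, if set.
pub fn check_deployment(token: Option<&str>, dev: Option<&str>) -> Vec<&'static str> {
    let mut findings = Vec::new();
    match token {
        None => findings.push("TOKEN is not set"),
        // Length only; this does not measure how random the token is.
        Some(t) if t.chars().count() < 32 => {
            findings.push("TOKEN is shorter than 32 characters")
        }
        Some(_) => {}
    }
    if dev == Some("1") {
        findings.push("DEV=1 is set");
    }
    findings
}
```

An operator could call it as `check_deployment(std::env::var("TOKEN").ok().as_deref(), std::env::var("DEV").ok().as_deref())` and refuse to start when the result is non-empty. The remaining checklist items (TLS termination, rate limiting, monitoring) depend on infrastructure and still need manual review.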

File to Create

SECURITY.md at repository root

Additional Files

Consider also creating:

  • docs/OPERATIONS.md - Operational runbook
  • docs/KEY_ROTATION.md - Detailed rotation procedures

Metadata

Assignees: No one assigned

Labels: P1: High (High priority - should fix soon), documentation (Improvements or additions to documentation)
