SystemManager implements defense-in-depth security for control plane gateway deployments:
- Network Layer: Tailscale ACLs control WHO can reach the gateway
- Application Layer: Bearer tokens + scopes control WHAT they can do
- Policy Gate: Capability-based authorization prevents "LLM imagination" risk
- Target Registry: Explicit allowlisting of managed systems
- Audit Layer: Comprehensive logging tracks WHO did WHAT WHERE
Critical: Gateway access ≠ target access. All operations are explicitly authorized.
The control plane gateway architecture reduces operational overhead by deploying one trusted node per network segment rather than agents on every target.
graph TD
A[Threat: Compromised Gateway] -- Network Access --> B[Tailscale ACLs]
B -- Application Access --> C[Bearer Tokens + Scopes]
C -- Operation Authorization --> D[Policy Gate]
D -- Target Validation --> E[Target Registry]
E -- Capability Enforcement --> F[Execution Layer]
F -- All Operations --> G[Audit Trail]
H[Threat: LLM Imagination] -- Prevented by --> I[Capability Allowlisting]
I -- Explicit Authorization --> J[Target Registry]
K[Benefit: Reduced Blast Radius] -- Gateway Isolation --> L[Segment-Per-Gateway]
L -- Target Separation --> M[Limited Impact]
- Segment Isolation: Compromise affects only gateway, not all targets
- Target Separation: Gateways manage specific target groups
- Network Segmentation: Gateways deployed per network segment
- Explicit Allowlisting: Only explicitly configured operations are permitted
- Parameter Validation: Operation parameters validated against constraints
- LLM Imagination Prevention: Prevents AI from "imagining" unauthorized operations
- Single Point of Control: Updates and configuration managed centrally
- Audit Centralization: All operations logged through gateway
- Secrets Management: Credentials stored only on gateway
SystemManager gateways must be deployed in isolated environments (Proxmox LXC containers, dedicated VMs, or secure jump hosts).
- ✅ Safe: Gateway running in isolated LXC container with Tailscale
- ❌ UNSAFE: Gateway running on production target systems
- ❌ UNSAFE: Gateway with broad network access to all targets
- ❌ UNSAFE: Gateway without proper credential isolation
- ✅ Safe: SSH keys stored securely on gateway with limited permissions
- ❌ UNSAFE: SSH keys with root access to all targets
- ❌ UNSAFE: SSH keys stored insecurely or shared across gateways
- ✅ Safe: Docker socket access with limited container capabilities
- ❌ UNSAFE: Docker socket with full root-equivalent access
- ❌ UNSAFE: Docker API tokens with broad permissions
- ✅ Safe: API tokens with least-privilege scopes
- ❌ UNSAFE: API tokens with administrative permissions
- ❌ UNSAFE: API tokens stored in plaintext configuration
- ✅ Safe: Only explicitly configured capabilities are permitted
- ❌ UNSAFE: Broad capabilities that allow "LLM imagination" risk
- ❌ UNSAFE: Capabilities without parameter validation
- ✅ Safe: Targets explicitly defined in registry with constraints
- ❌ UNSAFE: Dynamic target discovery without validation
- ❌ UNSAFE: Targets without capability restrictions
- Reduced Blast Radius: Compromise affects only gateway segment
- Target Separation: Gateways manage specific target groups
- Network Segmentation: Gateways deployed per network segment
- Failover Capability: Multiple gateways can manage critical targets
- Load Distribution: Operations spread across gateways
- Maintenance Isolation: Update gateways without affecting all targets
The Policy Gate enforces security through explicit capability allowlisting, preventing "LLM imagination" risk where AI assistants might attempt unauthorized operations.
Each target in the registry defines explicit capabilities:
# Example target with limited capabilities
web-server-01:
id: "web-server-01"
capabilities:
- "system:read" # View system metrics
- "network:read" # Check network status
- "container:read" # List containers
constraints:
sudo_policy: "none" # No sudo access
# Example target with broader capabilities
docker-host-01:
id: "docker-host-01"
capabilities:
- "system:read"
- "container:read"
- "container:control" # Start/stop containers
- "stack:deploy" # Deploy Docker stacks
constraints:
sudo_policy: "limited" # Limited sudo access# Policy Gate validates every operation
await policy_gate.authorize(
operation="restart_container",
target="docker-host-01",
tier="control",
parameters={"container": "nginx"}
)
# Authorization checks:
# 1. Target exists in registry
# 2. Target has required capability
# 3. Operation parameters are valid
# 4. User has appropriate scopes
# 5. Operation is within constraints- Explicit Allowlisting: AI can only perform explicitly configured operations
- Parameter Validation: Operation parameters validated against constraints
- Target Isolation: Operations limited to specific targets
- Network Layer: Tailscale ACLs control gateway access
- Application Layer: Bearer tokens control user access
- Policy Layer: Capability allowlisting controls operation access
- Target Layer: Target-specific constraints limit impact
| Risk Level | Examples | Policy Gate Controls |
|---|---|---|
| low | get_system_status, list_containers | Target capability validation |
| moderate | ping_host, dns_lookup | Network access validation |
| high | manage_container, file_operations | Capability + parameter validation |
| critical | install_package, pull_docker_image | Multi-layer approval + validation |
# Authentication
SYSTEMMANAGER_REQUIRE_AUTH=true # Enforce bearer tokens
SYSTEMMANAGER_SHARED_SECRET=your-secret # HMAC secret for tokens
SYSTEMMANAGER_JWT_SECRET=jwt-secret # Or use JWT
# Approval System
SYSTEMMANAGER_ENABLE_APPROVAL=true # Enable approval gates
SYSTEMMANAGER_APPROVAL_WEBHOOK=https://approval.example.com
# Audit Logging
SYSTEMMANAGER_AUDIT_LOG=/var/log/systemmanager/audit.logCreate tokens with scopes using scripts/mint_token.py:
# Read-only token (safe for monitoring)
python scripts/mint_token.py \
--agent "grafana-datasource" \
--scopes "readonly" \
--ttl 30d
# Admin token (use sparingly, short TTL)
python scripts/mint_token.py \
--agent "ops-automation" \
--scopes "admin" \
--ttl 1h
# Custom scopes (principle of least privilege)
python scripts/mint_token.py \
--agent "container-manager" \
--scopes "container:read,container:write,network:read" \
--ttl 7d{
"timestamp": "2025-11-15T20:30:00Z",
"tool": "install_package",
"subject": "ops-automation",
"args": {"package_name": "nginx", "auto_approve": true},
"result_status": "success",
"scopes": ["admin"],
"risk_level": "critical",
"approved": true,
"tailscale": {
"tailscale_node": "dev1",
"tailscale_user": "[email protected]",
"tailscale_tags": ["tag:systemmanager-client"],
"tailnet": "example.ts.net"
}
}# Find all critical operations
jq 'select(.risk_level == "critical")' /var/log/systemmanager/audit.log
# Find operations by a specific Tailscale user
jq 'select(.tailscale.tailscale_user == "[email protected]")' audit.log
# Find failed authorization attempts (lateral movement detection)
jq 'select(.result_status == "error" and .error | contains("Insufficient"))' audit.log
# Find unapproved critical operations
jq 'select(.risk_level == "critical" and .approved == false)' audit.logCombine application audit logs with Tailscale network logs:
- Enable Tailscale flow logs in admin console
- Correlate by timestamp + source IP + tailnet user
- Detect anomalies:
- Network connection without successful auth
- Multiple auth failures from same node
- Unusual source node for specific tools
- Deploy gateway in isolated LXC container
- Configure Tailscale ACLs for gateway access
- Define target registry with limited capabilities
- Use
readonlyscope for monitoring tools - Configure target-specific constraints
- All of minimum security
- Deploy multiple gateways per network segment
- Configure overlapping target sets for redundancy
- Set
SYSTEMMANAGER_REQUIRE_AUTH=true - Generate scoped tokens with short TTLs
- Tag gateway with
tag:systemmanager-gateway - Only allow
tag:systemmanager-clientto connect - Monitor gateway audit logs daily
- Rotate gateway tokens monthly
- Validate target registry configuration regularly
- All of recommended security
- Deploy gateways with segment isolation
- Implement approval webhook for critical operations
- Use separate tokens per gateway segment
- Token TTL ≤ 24 hours for critical gateways
- Send gateway audit logs to SIEM
- Alert on gateway authorization failures
- Correlate with Tailscale flow logs
- Review gateway audit trail weekly
- Implement "break glass" procedure for gateway emergencies
- Define explicit capabilities per target
- Use least-privilege principle for capabilities
- Validate operation parameters against constraints
- Regularly review and update target capabilities
- Test capability enforcement with dry-run mode
- Store SSH keys securely with limited permissions
- Use Docker socket access with container constraints
- Configure API tokens with least-privilege scopes
- Rotate target credentials regularly
- Monitor target connectivity health
Attack: Attacker compromises a node already in your tailnet
Mitigations:
- Tailscale ACL denies access (must be tagged appropriately)
- Bearer token required (attacker doesn't have it)
- Audit log shows connection from unexpected node
Attack: Token leaked via logs, config file, environment variable
Mitigations:
- Token has limited scopes (principle of least privilege)
- Token expires quickly (short TTL)
- Audit log shows usage from unexpected Tailscale user/node
- Approval required for critical operations
Attack: Authorized user attempts malicious operation
Mitigations:
- Scopes limit what token can do
- Critical operations require approval
- Audit log records who did what with Tailscale identity
- Tailscale flow logs provide additional evidence
Attack: Use http_request_test to scan internal network
Mitigations:
- Tool requires
network:diagscope - Tool marked as
requires_approval=true - Implement URL allowlist in production
- Audit log tracks all HTTP requests made
If you detect suspicious activity:
- Immediate: Revoke compromised token (update
SYSTEMMANAGER_SHARED_SECRET) - Immediate: Review recent audit logs for scope of compromise
- Within 1 hour: Check Tailscale flow logs for lateral movement
- Within 4 hours: Rotate all tokens
- Within 24 hours: Review and strengthen ACLs
- Post-mortem: Update monitoring/alerting based on lessons learned
Planned security improvements:
- mTLS client certificates
- Integration with external IdP (Okta, Azure AD)
- Real-time alerting on critical operations
- Webhook-based approval workflows
- Path-based file access controls
- Rate limiting per token
- Token usage analytics
- Automated token rotation
{ "tagOwners": { "tag:systemmanager-server": ["group:ops"], "tag:systemmanager-client": ["group:ops", "group:automation"], }, "acls": [ // Only tagged clients can reach the MCP server { "action": "accept", "src": ["tag:systemmanager-client"], "dst": ["tag:systemmanager-server:8080"], }, // Deny everything else { "action": "deny", "src": ["*"], "dst": ["tag:systemmanager-server:*"], }, ], "services": { "svc:systemmanager-mcp": { "protocol": "tcp", "port": 8080, "tags": ["tag:systemmanager-server"], "allowedTags": ["tag:systemmanager-client"], }, }, }