Architecture

Deep dive into the openclaw-tenancy system design, component interactions, and infrastructure decisions.


Architecture Overview

flowchart TB
    TG(["☁ Telegram"])

    subgraph cp ["Control Plane"]
        ALB["ALB · HTTPS"]
        R["Router ×2 · :9090"]
        O["Orchestrator ×2"]
    end

    subgraph stores ["Data Stores"]
        Redis[("Redis<br/><sub>cache · lock · config</sub>")]
        DDB[("DynamoDB<br/><sub>tenant registry</sub>")]
    end

    subgraph dp ["Data Plane · Kata VMs · bare metal"]
        direction LR
        TP1["🧠 Tenant Pod<br/><sub>{tenantID}</sub>"]
        TP2["🧠 Tenant Pod<br/><sub>{tenantID}</sub>"]
        WP["💤 Warm Pool ×N"]
    end

    subgraph aws ["AWS Managed"]
        direction LR
        S3[("S3<br/><sub>state sync · ABAC</sub>")]
        BR["Bedrock<br/><sub>LLM inference</sub>"]
    end

    TG -- "POST /tg/{tenantID}" --> ALB
    ALB --> R
    R -. "cache check/set" .-> Redis
    R -- "POST /wake" --> O
    R -- "forward :8787" --> TP1
    O -. "tenant CRUD" .-> DDB
    O -. "distributed lock" .-> Redis
    O -- "create pod" --> TP1
    O -- "consume" --> WP
    TP1 -. "s3 sync" .-> S3
    TP1 -. "InvokeModel" .-> BR

    style cp fill:none,stroke:#3b82f6,stroke-width:2px
    style stores fill:none,stroke:#818cf8,stroke-width:1px,stroke-dasharray:5 5
    style dp fill:none,stroke:#f59e0b,stroke-width:2px
    style aws fill:none,stroke:#a78bfa,stroke-width:1px,stroke-dasharray:5 5

AWS Infrastructure

flowchart TB
    subgraph internet [" "]
        direction LR
        User(["👤 Telegram User"])
        TG(["☁ Telegram API"])
    end

    subgraph aws ["AWS"]
        subgraph vpc ["VPC · 10.0.0.0/16"]
            ALB["⚡ Application Load Balancer<br/><sub>HTTPS · ACM cert</sub>"]

            subgraph eks ["EKS Cluster"]
                subgraph ns_tenants ["namespace: tenants"]
                    direction TB
                    subgraph mgmt ["Management Pods"]
                        direction LR
                        R["Router ×2<br/><sub>Deployment · arm64/amd64</sub>"]
                        O["Orchestrator ×2<br/><sub>Deployment · arm64/amd64<br/>Leader Election via Lease</sub>"]
                        RD[("Redis<br/><sub>Deployment ×1</sub>")]
                    end

                    subgraph metal ["Bare Metal Nodes · c6i/c7i.metal · amd64"]
                        direction LR
                        TP1["🧠 Tenant Pod<br/><sub>Kata VM (QEMU)<br/>:8787 webhook</sub>"]
                        TP2["🧠 Tenant Pod<br/><sub>Kata VM (QEMU)</sub>"]
                        WP["💤 Warm Pool<br/><sub>Deployment ×N<br/>sleep ∞ · image prefetch</sub>"]
                    end

                    NP{{"NetworkPolicy<br/><sub>VPC CNI eBPF<br/>standard mode</sub>"}}
                end

                KP["Karpenter<br/><sub>NodePool: kata-metal<br/>on-demand · amd64 only</sub>"]

                PI["Pod Identity Agent<br/><sub>session tags enabled</sub>"]
            end
        end

        subgraph aws_svc ["AWS Managed Services"]
            direction LR
            DDB[("DynamoDB<br/><sub>tenant-registry<br/>PAY_PER_REQUEST</sub>")]
            S3[("S3<br/><sub>state bucket<br/>tenants/{id}/ prefix<br/>ABAC via session tags</sub>")]
            ECR[("ECR<br/><sub>orchestrator<br/>router · openclaw</sub>")]
            BR["Bedrock<br/><sub>Claude · on-demand</sub>"]
        end
    end

    User -- message --> TG
    TG -- "webhook POST" --> ALB
    ALB --> R
    R --> RD
    R -- "POST /wake" --> O
    R -- "forward" --> TP1

    O --> DDB
    O --> RD
    O -. "create/delete" .-> TP1
    O -. "consume" .-> WP
    KP -. "provisions" .-> metal

    PI -. "inject credentials
+ session tags" .-> TP1

    TP1 -. "aws s3 sync" .-> S3
    TP1 -. "InvokeModel" .-> BR

    NP -. "allow Router→Pod
ports 8787+18789
block cross-tenant" .-> metal

    style internet fill:none,stroke:none
    style aws fill:none,stroke:#f59e0b,stroke-width:2px
    style vpc fill:none,stroke:#64748b,stroke-width:1px,stroke-dasharray:5 5
    style eks fill:none,stroke:#3b82f6,stroke-width:2px
    style ns_tenants fill:none,stroke:#3b82f6,stroke-width:1px,stroke-dasharray:5 5
    style mgmt fill:none,stroke:#64748b,stroke-width:1px,stroke-dasharray:3 3
    style metal fill:none,stroke:#f59e0b,stroke-width:1px,stroke-dasharray:3 3
    style aws_svc fill:none,stroke:#a78bfa,stroke-width:1px,stroke-dasharray:5 5

Key Infrastructure Decisions

| Decision | Choice | Reason |
| --- | --- | --- |
| Instance type | c6i/c7i.metal (amd64) | Kata needs /dev/kvm; Graviton metal lacks KVM |
| Node provisioning | Karpenter | Auto-scale bare metal on demand, avoid idle cost |
| Container runtime | kata-qemu | VM-level tenant isolation (guest kernel per pod) |
| NetworkPolicy engine | VPC CNI eBPF standard mode | Native EKS support, validated with Kata on EKS 1.35 / VPC CNI v1.21.1 |
| State storage | S3 + aws s3 sync | S3 CSI (mountpoint-s3) is write-once FUSE, can't overwrite |
| Data isolation | S3 ABAC via Pod Identity session tags | Zero extra IAM Roles; ${aws:PrincipalTag/kubernetes-pod-name} restricts prefix |
| Tenant registry | DynamoDB PAY_PER_REQUEST | Multi-replica concurrent R/W, cross-pod persistence |
| Image build | docker buildx multi-arch | Cluster has both amd64 + arm64 nodes |
| Image registry | ECR (private) | Same region, no cross-region pull latency |

Message Flow

sequenceDiagram
    autonumber
    actor User
    participant TG as Telegram
    participant Router
    participant Redis
    participant Orch as Orchestrator
    participant Pod as Tenant Pod

    User->>TG: Send message
    TG->>Router: POST /tg/{tenantID}
    Router-->>TG: 200 OK (immediate ack)

    Router->>Redis: GET router:endpoint:{tenantID}

    alt Cache HIT — fast path
        Redis-->>Router: pod_ip
        Router->>Pod: POST pod_ip:8787/telegram-webhook
        Pod-->>User: LLM reply via Bot API
    else Cache MISS or stale IP
        Redis-->>Router: (nil)
        Router->>Orch: POST /wake/{tenantID}

        Note over Orch: Acquire Redis NX lock,<br/>check DynamoDB, warm pool...

        Orch-->>Router: pod_ip (+ bot_token)

        opt True cold start (no prior cached IP)
            Router->>User: 🏗 Starting up, please wait...
        end

        loop probe webhook port — every 3s, up to 5m30s
            Router->>Pod: TCP connect pod_ip:8787 (any HTTP response = ready)
            Pod-->>Router: connection success or refused
        end

        Router->>Pod: POST pod_ip:8787/telegram-webhook
        Router->>Redis: SET router:endpoint:{tenantID} pod_ip (re-cache)
        Pod-->>User: LLM reply via Bot API
    end

Timing

| Scenario | Total latency |
| --- | --- |
| Pod already running (cache hit) | ~2–5 s (LLM response time) |
| Warm pool hit (node pre-provisioned) | ~40–60 s (s3-restore ~3 s + OpenClaw init ~37 s) |
| Cold start (Karpenter provisions node) | ~3–5 min (metal node provision + above) |

Starting notification is sent only on true cold starts — when no cached IP existed before the wake call. Cache-miss retries (stale IP → re-wake) do not re-notify the user.

Cache preservation: once the webhook-port probe succeeds and the message has been forwarded, the Router re-sets router:endpoint:{tenantID} to keep the cache warm for subsequent messages.
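The routing decision described above can be sketched as a single function. This is a minimal illustration, not the Router's actual code: `handle_update`, its callable parameters (`wake`, `probe`, `forward`, `notify`) and the use of a plain dict in place of Redis are all hypothetical, and it assumes a stale IP is detected by a failed probe.

```python
from typing import Callable, Optional


def handle_update(
    tenant_id: str,
    cache: dict,
    wake: Callable[[str], str],
    probe: Callable[[str], bool],
    forward: Callable[[str], None],
    notify: Callable[[str], None],
    max_probes: int = 110,  # 110 probes x 3 s = 5 min 30 s
) -> Optional[str]:
    """Route one webhook update; return the pod IP used, or None on timeout."""
    key = f"router:endpoint:{tenant_id}"
    cached_ip = cache.get(key)

    # Fast path: a cached IP that still answers on the webhook port.
    if cached_ip is not None and probe(cached_ip):
        forward(cached_ip)
        return cached_ip

    # Slow path: cache miss or stale IP, so ask the orchestrator to wake the tenant.
    pod_ip = wake(tenant_id)
    if cached_ip is None:
        # Only a true cold start (no prior cached IP) notifies the user.
        notify(tenant_id)

    for _ in range(max_probes):
        if probe(pod_ip):
            forward(pod_ip)
            cache[key] = pod_ip  # re-cache for subsequent messages
            return pod_ip
        # A real router would sleep 3 s between probes here.
    return None
```

Passing the side effects in as callables keeps the sketch testable without a network; the stale-IP branch re-wakes without calling `notify`, matching the notification rule above.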


Pod Lifecycle

stateDiagram-v2
    [*] --> idle : create tenant

    idle --> running : message arrives / POST /wake
    running --> idle : idle timeout (30s tick)
    running --> auto_restart : pod dies within idle window
    auto_restart --> running : informer detects & re-wakes (~1s)
    running --> idle : pod missing in k8s (informer or safety-net)
    idle --> [*] : DELETE /tenants/{id}

State Transitions

| Component | Trigger | Transition |
| --- | --- | --- |
| API handler /wake | POST /wake/{id} | idle → running |
| Lifecycle controller | 30 s tick (leader only) | running → idle if now - last_active_at > idle_timeout_s |
| Reconciler | K8s Informer (event-driven) + 5 min safety-net | running → idle if pod not found in k8s |
| Reconciler | pod dies within idle window (~1 s detection) | auto-restart → running |
| API handler /delete | DELETE /tenants/{id} | any → deleted |
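The lifecycle controller's idle rule (`now - last_active_at > idle_timeout_s`) can be illustrated with a small in-memory sketch. The `Tenant` dataclass and `check_idle_tenants` function are hypothetical stand-ins for the DynamoDB record and the leader's 30 s tick; pod deletion is omitted.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    tenant_id: str
    status: str            # "idle" or "running"
    last_active_at: float  # Unix seconds
    idle_timeout_s: float


def check_idle_tenants(tenants: list[Tenant], now: float) -> list[str]:
    """Flip running tenants to idle once they exceed their idle timeout."""
    stopped = []
    for t in tenants:
        if t.status == "running" and now - t.last_active_at > t.idle_timeout_s:
            t.status = "idle"
            stopped.append(t.tenant_id)
    return stopped
```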

Pod Spec

Each tenant pod ({tenantID}) has three containers (one init container, the main OpenClaw gateway, and an s3-sync sidecar):

flowchart LR
    subgraph Pod ["Pod · {tenantID}"]
        direction TB

        subgraph Init ["initContainer"]
            S3R[s3-restore<br/><i>aws-cli:2.15.30</i>]:::init
        end

        subgraph Main ["Containers"]
            direction LR
            GW["openclaw gateway<br/>:8787 webhook"]:::main
            SYNC["s3-sync sidecar<br/><i>aws-cli:2.15.30</i><br/>every 60s + PreStop final sync"]:::sidecar
        end

        Init --> Main
    end

    subgraph Volumes
        ED1["emptyDir · openclaw-state<br/>/root/.openclaw/"]:::vol
        ED2["emptyDir · workspace<br/>/openclaw-workspace/"]:::vol
        CM["ConfigMap · config-template<br/>/etc/openclaw/ (readOnly)"]:::vol
    end

    ED1 -.- GW
    ED1 -.- SYNC
    ED1 -.- S3R
    ED2 -.- GW
    ED2 -.- S3R
    CM -.- GW

    classDef init fill:#457b9d,stroke:#1d3557,color:#fff
    classDef main fill:#2d6a4f,stroke:#1b4332,color:#fff
    classDef sidecar fill:#e76f51,stroke:#9c4225,color:#fff
    classDef vol fill:#6c757d,stroke:#495057,color:#fff

Container Details

| Container | Image | Purpose |
| --- | --- | --- |
| s3-restore (init) | public.ecr.aws/aws-cli/aws-cli:2.15.30 | Restore state & workspace from S3 before OpenClaw starts. Excludes openclaw.json, *.lock |
| openclaw (main) | {ECR}/openclaw@sha256:{digest} (pinned) | OpenClaw Gateway, Telegram webhook mode. Config rendered from ConfigMap template via envsubst on start |
| s3-sync (sidecar) | public.ecr.aws/aws-cli/aws-cli:2.15.30 | aws s3 sync every 60 s. PreStop hook runs final sync via shared ConfigMap script (openclaw-sync-script). Excludes openclaw.json*, .workspace-state.json.*, workspace-state.json.tmp-* |

Volumes

| Name | Type | Mount | Notes |
| --- | --- | --- | --- |
| openclaw-state | emptyDir | /root/.openclaw/ | OpenClaw state (memory, sessions) |
| workspace | emptyDir | /openclaw-workspace/ | Agent workspace files |
| config-template | ConfigMap | /etc/openclaw/ (readOnly) | openclaw.json.tpl, rendered on start |

No PVCs. S3 CSI (mountpoint-s3) was evaluated and rejected — write-once FUSE, cannot overwrite or delete existing files.


State Persistence

S3 Layout

flowchart TB
    B(["📦 s3://{S3_BUCKET}/"])
    T["tenants/"]
    TID["{tenant_id}/"]
    ST["state/ → /root/.openclaw/<br/><sub>brain.db · sessions · memory</sub>"]
    WS["workspace/ → /openclaw-workspace/<br/><sub>agent work files</sub>"]

    B --> T --> TID
    TID --> ST
    TID --> WS

S3 ABAC (Attribute-Based Access Control)

Tenant pods use EKS Pod Identity with session tags to scope S3 access per tenant:

| Mechanism | Detail |
| --- | --- |
| Pod Identity Association | Service account openclaw-tenant → IAM role openclaw-tenant-pod |
| Session Tag | kubernetes-pod-name = {tenantID} (injected by Pod Identity webhook) |
| IAM Condition | ListBucket scoped by s3:prefix; Get/Put/Delete scoped by Resource ARN with ${aws:PrincipalTag/kubernetes-pod-name} |
| Effect | Each pod can only read/write its own tenants/{tenantID}/ prefix; enforced at IAM level |
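A policy with the shape described in the table might look like the following. This is a sketch built from the mechanisms listed above, not the project's actual policy: the function name `tenant_s3_policy`, the statement IDs, and the exact action list are assumptions.

```python
import json

# IAM policy variable resolved at request time from the Pod Identity session tag.
TAG = "${aws:PrincipalTag/kubernetes-pod-name}"


def tenant_s3_policy(bucket: str) -> dict:
    """Build an ABAC policy that confines a pod to tenants/<its pod name>/."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket acts on the bucket itself, so it is scoped by prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"tenants/{TAG}/*"]}},
            },
            {
                # Object actions are scoped by the object ARN.
                "Sid": "ObjectsInOwnPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/tenants/{TAG}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tenant_s3_policy("example-bucket"), indent=2))
```

Because the tag is resolved per session, one role and one policy cover every tenant, which is what allows the "zero extra IAM roles" design.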

Consistency Properties

  • Last-write-wins: aws s3 sync runs without --delete, so S3 accumulates files that were removed locally; OpenClaw tolerates the extras.
  • Loss window: if a pod is killed without graceful shutdown (OOM, node failure), up to 60 s of state is lost. On graceful termination the PreStop hook runs a final sync within the 120 s terminationGracePeriodSeconds. This is acceptable for agent memory, which is append-only.
  • openclaw.json excluded: it is always regenerated from the template via envsubst. A restored config would contain the wrong auth tokens.
  • Lock files excluded: .lock files left over from a previous pod lifetime cause "session file locked" errors.
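The exclusion rules can be made concrete by building the restore command line. This sketch assumes the init container syncs the state/ prefix into /root/.openclaw/ as shown in the S3 layout; the helper name `s3_restore_argv` is hypothetical, and the real script may pass additional flags.

```python
def s3_restore_argv(bucket: str, tenant_id: str) -> list[str]:
    """Build an `aws s3 sync` command that restores tenant state from S3."""
    src = f"s3://{bucket}/tenants/{tenant_id}/state/"
    argv = ["aws", "s3", "sync", src, "/root/.openclaw/"]
    # Skip the rendered config and stale lock files from a previous pod lifetime.
    for pattern in ("openclaw.json", "*.lock"):
        argv += ["--exclude", pattern]
    # Deliberately no --delete: sync only adds or overwrites files.
    return argv
```

The returned list can be handed to `subprocess.run(argv, check=True)` in an environment where the AWS CLI is installed.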

Warm Pool

The warm pool keeps nodes pre-provisioned so that a tenant wake skips Karpenter node provisioning, cutting start-up from ~3–4 min to ~40 s.

sequenceDiagram
    participant WPD as Warm Pool Deployment
    participant WP as Warm Pod (sleep ∞)
    participant Orch as Orchestrator
    participant TP as Tenant Pod

    Note over WPD: replicas = WARM_POOL_TARGET<br/>(Redis: otm config set warm-pool-target N)

    WPD->>WP: Create warm pod<br/>(node=ip-10-6-x, label warm=true)

    Note over Orch: Tenant wake arrives...

    Orch->>WP: Find Running pod with label warm=true
    Orch->>WP: Patch label: warm=true → warm=consuming
    Note over WPD: Selector loses pod → schedules replacement

    Orch->>WP: Delete warm pod (free node resources)
    Orch->>TP: Create {tenantID} pod<br/>(nodeName = same node)

    WPD->>WP: Create replacement warm pod (background)

Warm pods run sleep infinity — they pre-pull the openclaw image and hold the node but do not start OpenClaw. OpenClaw still needs ~37 s to initialize after the tenant pod starts.

Configuration: warm pool target is stored in Redis and adjustable at runtime:

otm config set warm-pool-target <N>
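The consume sequence from the diagram (relabel, delete, recreate on the same node) can be modelled in memory as follows. The pod dictionaries and the `consume_warm_pod` function are illustrative assumptions; a real orchestrator would issue the equivalent Kubernetes API calls.

```python
from typing import Optional


def consume_warm_pod(pods: list[dict], tenant_id: str) -> Optional[dict]:
    """Swap one Running warm pod for a tenant pod pinned to the same node."""
    for pod in pods:
        if pod["labels"].get("warm") == "true" and pod["phase"] == "Running":
            # Relabel first so the Deployment selector stops counting this pod
            # and schedules a replacement in the background.
            pod["labels"]["warm"] = "consuming"
            node = pod["node"]
            pods.remove(pod)  # delete the warm pod to free node resources
            tenant_pod = {
                "name": tenant_id,
                "labels": {"app": "openclaw"},
                "phase": "Pending",
                "node": node,  # nodeName = same node, so the image is already pulled
            }
            pods.append(tenant_pod)
            return tenant_pod
    return None  # no warm pod available; fall back to a normal cold start
```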

High Availability

The Orchestrator runs 2 replicas, coordinated by the mechanisms below.

Redis Wake Lock

Prevents duplicate pod creation for the same tenant.

sequenceDiagram
    participant A as Replica A
    participant Redis
    participant B as Replica B
    participant DDB as DynamoDB

    A->>Redis: SET tenant:waking:{tenantID} "1" NX EX 240s
    Redis-->>A: OK (lock acquired)

    B->>Redis: SET tenant:waking:{tenantID} "1" NX EX 240s
    Redis-->>B: nil (lock held)

    Note over A: Creates pod, updates DynamoDB...
    A->>DDB: status=running, pod_ip=...
    A->>Redis: DEL tenant:waking:{tenantID}

    loop Poll until running
        B->>DDB: GET tenant status
        DDB-->>B: status=running, pod_ip
    end
    B-->>B: Return pod_ip to caller
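The wake lock relies on Redis `SET key value NX EX 240`, which succeeds only when the key is absent. The sketch below uses a small in-memory stand-in (`FakeRedis`, hypothetical, with an explicit `now` argument instead of a real clock) so the logic can run without a server; `try_wake` and the dict-based registry are likewise illustrative.

```python
class FakeRedis:
    """In-memory stand-in for the subset of Redis used by the wake lock."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None, now=0.0):
        current = self._data.get(key)
        if current is not None and current[1] is not None and current[1] <= now:
            current = None  # the previous lock has expired
        if nx and current is not None:
            return None  # lock is held by someone else
        self._data[key] = (value, None if ex is None else now + ex)
        return True

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def try_wake(redis, registry, tenant_id, create_pod, now=0.0):
    """Create the tenant pod if this replica wins the lock, else return None."""
    lock_key = f"tenant:waking:{tenant_id}"
    if not redis.set(lock_key, "1", nx=True, ex=240, now=now):
        return None  # another replica is waking; the caller polls the registry
    try:
        pod_ip = create_pod(tenant_id)
        registry[tenant_id] = {"status": "running", "pod_ip": pod_ip}
        return pod_ip
    finally:
        redis.delete(lock_key)  # release even if pod creation fails
```

The 240 s expiry bounds how long a crashed replica can block wakes for a tenant.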

Kubernetes Lease Leader Election

Only one replica runs the idle timeout loop.

| Parameter | Value |
| --- | --- |
| Lease name | orchestrator-leader (tenants namespace) |
| Duration | 15 s |
| Renew | 10 s |
| Retry | 2 s |
| Leader runs | checkIdleTenants() every 30 s |
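As a rough model of how the 15 s lease duration decides leadership, the sketch below treats a lease as expired once it has gone unrenewed for longer than its duration. It is a simplification under stated assumptions: the `Lease` dataclass and `try_acquire` function are hypothetical, and the renew deadline and retry period from the table are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    renewed_at: float  # Unix seconds of the last successful renewal


def try_acquire(
    lease: Optional[Lease], candidate: str, now: float, duration_s: float = 15.0
) -> Lease:
    """Return the lease as it stands after the candidate's attempt at `now`."""
    expired = lease is None or now - lease.renewed_at > duration_s
    if expired or lease.holder == candidate:
        # Either take over an expired lease or renew our own.
        return Lease(holder=candidate, renewed_at=now)
    return lease  # someone else holds a live lease
```

Only the replica whose name is in `holder` would run the idle-timeout loop.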

Reconciler (All Replicas)

Event-driven via K8s SharedInformer watching app=openclaw pods. Runs on every replica (idempotent).

Event-driven path (~1s response):

  • Pod DELETE → check DynamoDB, auto-restart if within idle window
  • Pod UPDATE (phase → Failed/Succeeded, or IP change) → reconcile single tenant

Safety-net full reconcile (every 5 min, configurable via RECONCILER_INTERVAL):

  • Stale running tenant: pod missing in k8s → reset DynamoDB to idle
  • Orphan pod: {tenantID} pod with no running tenant in DynamoDB → delete (90s grace)
  • IP drift: pod IP changed → update Redis cache + DynamoDB
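The three safety-net cases can be expressed as one pass over the registry and the pod list. This is an in-memory sketch only: `reconcile` and its dict arguments are hypothetical stand-ins for DynamoDB, the Kubernetes pod list and the Redis cache, and the 90 s orphan grace period is omitted.

```python
def reconcile(registry: dict, pods: dict, cache: dict) -> list[str]:
    """registry: tenant_id -> {"status", "pod_ip"}; pods: tenant_id -> pod IP."""
    actions = []
    for tenant_id, rec in registry.items():
        if rec["status"] != "running":
            continue
        pod_ip = pods.get(tenant_id)
        if pod_ip is None:
            # Stale running tenant: the pod is gone, so reset to idle.
            rec["status"] = "idle"
            rec["pod_ip"] = None
            actions.append(f"reset:{tenant_id}")
        elif pod_ip != rec["pod_ip"]:
            # IP drift: update both the registry and the router cache.
            rec["pod_ip"] = pod_ip
            cache[f"router:endpoint:{tenant_id}"] = pod_ip
            actions.append(f"ip-drift:{tenant_id}")
    for tenant_id in list(pods):
        rec = registry.get(tenant_id)
        if rec is None or rec["status"] != "running":
            # Orphan pod: no running tenant claims it, so delete it.
            del pods[tenant_id]
            actions.append(f"orphan:{tenant_id}")
    return actions
```

Running the function twice on the same inputs produces no further actions on the second pass, which is the idempotence that lets every replica run it.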

Infrastructure

EKS Cluster

| Item | Value |
| --- | --- |
| Cluster | <EKS_CLUSTER>, us-west-2, account <AWS_ACCOUNT_ID> |
| Namespace | tenants |
| Ingress | <DOMAIN> → ALB → Router:9090 |
| Runtime | Kata Containers (kata-qemu): VM-level isolation for all tenant pods |

Kata Containers / Bare Metal

Kata requires hardware virtualization (/dev/kvm). Only .metal EC2 instances expose this.

Karpenter NodePool kata-metal provisions c/m/r gen-6+ metal nodes with:

  • Devmapper thin-pool snapshotter (required by kata-qemu)
  • Taint kata-runtime=true:NoSchedule — only kata-tolerating pods schedule here

NetworkPolicy

Network isolation uses VPC CNI NetworkPolicy (NETWORK_POLICY_ENFORCING_MODE=standard). Validated compatible with both runc and Kata Containers on EKS 1.35 / VPC CNI v1.21.1 (8/8 tests passed for Kata pods). No Calico needed.

| Policy | Effect |
| --- | --- |
| tenant-pod-isolation | Ingress: only app=router on :8787. Egress: DNS, Pod Identity Agent, IMDS, all external (except VPC/Service CIDRs) |
| orchestrator-policy | Egress to Redis, K8s API server (172.20.0.1:443), Pod Identity Agent, external HTTPS |
| router-policy | Ingress from VPC (ALB). Egress to orchestrator, Redis, tenant pods, external HTTPS |
| redis-policy | Ingress from orchestrator + router only. No egress |
| warm-pool-policy | No ingress, no egress (sleep infinity) |

Key egress rules: the tenant policy must explicitly allow the Pod Identity Agent (169.254.170.23:80,443) and IMDS (169.254.169.254:80). Without these rules, AWS credential retrieval fails inside tenant pods.
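A partial manifest for the tenant policy, expressed as a Python dict, shows where those two egress rules sit. This is a sketch under assumptions: the pod selector label app=openclaw is taken from the reconciler description, the function name is hypothetical, and the DNS and external-egress rules from the table are left out for brevity.

```python
def tenant_pod_isolation() -> dict:
    """Partial NetworkPolicy: router-only ingress plus credential egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-pod-isolation", "namespace": "tenants"},
        "spec": {
            "podSelector": {"matchLabels": {"app": "openclaw"}},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": "router"}}}],
                    "ports": [{"protocol": "TCP", "port": 8787}],
                }
            ],
            "egress": [
                {
                    # Pod Identity Agent: serves role credentials and session tags.
                    "to": [{"ipBlock": {"cidr": "169.254.170.23/32"}}],
                    "ports": [
                        {"protocol": "TCP", "port": 80},
                        {"protocol": "TCP", "port": 443},
                    ],
                },
                {
                    # Instance metadata service.
                    "to": [{"ipBlock": {"cidr": "169.254.169.254/32"}}],
                    "ports": [{"protocol": "TCP", "port": 80}],
                },
            ],
        },
    }
```

Serializing the dict with `json.dumps` gives a document that `kubectl apply -f` accepts, since Kubernetes manifests may be JSON as well as YAML.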

IAM

| Role | Service Account | Permissions |
| --- | --- | --- |
| openclaw-tenant-pod | openclaw-tenant | Bedrock: InvokeModel, InvokeModelWithResponseStream; S3: read/write on <S3_BUCKET> (scoped by ABAC session tags) |

Security

BotToken Handling

| Aspect | Detail |
| --- | --- |
| Stored in | DynamoDB bot_token field (encrypted at rest) |
| Redacted from | All public API responses |
| Internal access | GET /tenants/{id}/bot_token, used by Router for Telegram notifications |
| Pod access | TELEGRAM_BOT_TOKEN env var (used by OpenClaw) |

Tenant Isolation

| Layer | Mechanism |
| --- | --- |
| Compute | Each tenant in dedicated Kata VM (QEMU): hardware-enforced memory/CPU isolation |
| Storage | Dedicated S3 prefix per tenant (tenants/{tenantID}/), enforced by ABAC session tags |
| Network | VPC CNI NetworkPolicy (eBPF standard mode): deny lateral movement, allow only required egress |
| IAM | Pod Identity session tags + IAM condition on ${aws:PrincipalTag/kubernetes-pod-name} |

Shared IAM role caveat: All tenant pods share one IAM role (single service account). CloudTrail cannot attribute Bedrock usage per tenant — application-level tracking is needed for billing.