
WorkCenter Architecture

Version: 0.1.0-draft
Date: 2026-03-06

Overview

WorkCenter is the operations layer for Claude Code agent teams. It provides lifecycle management, state persistence, real-time observability, and operational controls for multi-agent software development workflows running on self-hosted Kubernetes infrastructure.

WorkCenter does not replace Claude Code. It wraps the existing agent team primitives — claude --print, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, and the ~/.claude/ state directory — with the tooling needed to run them reliably at scale on bare-metal clusters.


1. Claude Code Agent Teams: Native Primitives

This section documents what Claude Code provides natively (as of v2.1.x). WorkCenter builds on top of these primitives. Understanding the boundary is critical.

1.1 Agent Lifecycle

| Primitive | Mechanism | Notes |
| --- | --- | --- |
| Spawn agent | claude --print with --output-format stream-json | Headless. Env var CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 enables team tools. |
| Team creation | TeamCreate tool (internal) | Creates ~/.claude/teams/{name}/config.json. |
| Team deletion | TeamDelete tool (internal) | Removes team state. |
| Agent join | Agent tool with team_name and name params | Registers member in config.json members[]. |
| Agent shutdown | SendMessage with type: shutdown_request | Recipient responds with shutdown_response (approve/reject). |
| Nudge (message) | SendMessage with type: message | Delivered to ~/.claude/teams/{team}/inboxes/{agent}.json. |
| Broadcast | SendMessage with type: broadcast | Sends to all team members. Expensive (N deliveries). |

1.2 Task System

| Primitive | Tool | Notes |
| --- | --- | --- |
| Create task | TaskCreate | Fields: subject, description, owner, blocks, blockedBy. |
| Update task | TaskUpdate | Status: pending -> in_progress -> completed (or deleted). Supports activeForm, metadata, dependency edges. |
| List tasks | TaskList | Returns summary: id, subject, status, owner, blockedBy. |
| Get task | TaskGet | Returns full task including description, blocks, blockedBy. |

Tasks are stored at ~/.claude/tasks/{team_or_session_id}/{id}.json.
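The blocks/blockedBy edges imply a simple scheduling rule: a task is actionable once every dependency has finished. A minimal sketch of that check (ReadyTasks is a hypothetical consumer-side helper, not a Claude Code tool):

```go
package main

// Task mirrors the subset of the on-disk task schema needed for
// scheduling; the full schema has more fields.
type Task struct {
	ID        string   `json:"id"`
	Status    string   `json:"status"`
	BlockedBy []string `json:"blockedBy"`
}

// ReadyTasks returns pending tasks whose blockedBy dependencies are all
// completed (or deleted), i.e. tasks an agent could pick up next.
func ReadyTasks(tasks []Task) []Task {
	done := make(map[string]bool)
	for _, t := range tasks {
		if t.Status == "completed" || t.Status == "deleted" {
			done[t.ID] = true
		}
	}
	var ready []Task
	for _, t := range tasks {
		if t.Status != "pending" {
			continue
		}
		ok := true
		for _, dep := range t.BlockedBy {
			if !done[dep] {
				ok = false
				break
			}
		}
		if ok {
			ready = append(ready, t)
		}
	}
	return ready
}
```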

1.3 State File Schemas

Team config (~/.claude/teams/{name}/config.json):

{
  "name": "string",
  "description": "string",
  "createdAt": 1772804253786,
  "leadAgentId": "team-lead@{name}",
  "leadSessionId": "uuid",
  "members": [
    {
      "agentId": "{name}@{team}",
      "name": "string",
      "agentType": "team-lead | general-purpose",
      "model": "claude-opus-4-6",
      "prompt": "string (full system prompt)",
      "color": "string",
      "planModeRequired": false,
      "joinedAt": 1772804304872,
      "tmuxPaneId": "string",
      "cwd": "/absolute/path",
      "subscriptions": [],
      "backendType": "in-process"
    }
  ]
}

Task (~/.claude/tasks/{team}/{id}.json):

{
  "id": "string",
  "subject": "string",
  "description": "string",
  "activeForm": "string (present-continuous for spinner)",
  "status": "pending | in_progress | completed | deleted",
  "blocks": ["task_id", ...],
  "blockedBy": ["task_id", ...],
  "owner": "agent_name",
  "metadata": {}
}

Inbox (~/.claude/teams/{team}/inboxes/{agent}.json):

[
  {
    "from": "agent_name",
    "text": "string (may contain JSON-encoded structured messages)",
    "timestamp": "ISO8601",
    "color": "string",
    "read": false
  }
]
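A WorkCenter-side sketch of these schemas as Go types (field names follow the JSON keys shown above; createdAt/joinedAt are epoch milliseconds and inbox timestamps are ISO8601 strings, matching the examples — DecodeInbox is a hypothetical helper):

```go
package main

import "encoding/json"

// TeamMember mirrors one entry in config.json members[].
type TeamMember struct {
	AgentID          string   `json:"agentId"`
	Name             string   `json:"name"`
	AgentType        string   `json:"agentType"`
	Model            string   `json:"model"`
	Prompt           string   `json:"prompt"`
	Color            string   `json:"color"`
	PlanModeRequired bool     `json:"planModeRequired"`
	JoinedAt         int64    `json:"joinedAt"` // epoch ms
	TmuxPaneID       string   `json:"tmuxPaneId"`
	Cwd              string   `json:"cwd"`
	Subscriptions    []string `json:"subscriptions"`
	BackendType      string   `json:"backendType"`
}

// TeamConfig mirrors ~/.claude/teams/{name}/config.json.
type TeamConfig struct {
	Name          string       `json:"name"`
	Description   string       `json:"description"`
	CreatedAt     int64        `json:"createdAt"` // epoch ms
	LeadAgentID   string       `json:"leadAgentId"`
	LeadSessionID string       `json:"leadSessionId"`
	Members       []TeamMember `json:"members"`
}

// InboxMessage mirrors one entry in an inbox file.
type InboxMessage struct {
	From      string `json:"from"`
	Text      string `json:"text"`
	Timestamp string `json:"timestamp"` // ISO8601
	Color     string `json:"color"`
	Read      bool   `json:"read"`
}

// DecodeInbox parses the contents of an inbox file (a JSON array).
func DecodeInbox(data []byte) ([]InboxMessage, error) {
	var msgs []InboxMessage
	err := json.Unmarshal(data, &msgs)
	return msgs, err
}
```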

1.4 What Claude Code Does NOT Provide

  • No persistent daemon — each claude --print invocation is ephemeral
  • No process supervision or restart-on-failure
  • No centralized event stream — state changes are file mutations
  • No web UI or HTTP API
  • No multi-node coordination — state directory is local to one filesystem
  • No access controls — any process with filesystem access can read/write state
  • No cost tracking or audit logging
  • No retention or cleanup policies

These gaps define WorkCenter's scope.


2. Platform Components

┌─────────────────────────────────────────────────────────────┐
│                        Dashboard UI                         │
│                 (embedded static frontend)                   │
└────────────────┬──────────────────────┬─────────────────────┘
                 │ HTTP/WS              │ REST
┌────────────────▼──────────────────────▼─────────────────────┐
│                     Dashboard Server                        │
│               (Go binary, port 8080)                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │ REST API │  │ WS Hub   │  │ Auth     │  │ Static     │  │
│  │ Handler  │  │          │  │ Middleware│  │ Embed      │  │
│  └────┬─────┘  └────┬─────┘  └──────────┘  └────────────┘  │
└───────┼──────────────┼──────────────────────────────────────┘
        │              │
┌───────▼──────────────▼──────────────────────────────────────┐
│                     Orchestrator                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐ │
│  │ Agent        │  │ State        │  │ Event Bus          │ │
│  │ Lifecycle    │  │ Layer        │  │ (fsnotify ->       │ │
│  │ Manager      │  │              │  │  typed events)     │ │
│  └──────┬───────┘  └──────┬───────┘  └────────┬───────────┘ │
└─────────┼──────────────────┼───────────────────┼────────────┘
          │                  │                   │
          │ exec             │ read/write        │ watch
          ▼                  ▼                   ▼
    ┌──────────┐     ┌─────────────┐     ┌─────────────┐
    │ claude   │     │ ~/.claude/  │     │ ~/.claude/  │
    │ --print  │     │ teams/      │     │ tasks/      │
    │ process  │     │ tasks/      │     │ teams/      │
    └──────────┘     └─────────────┘     └─────────────┘
                            │
                     ┌──────▼──────┐
                     │ iSCSI PVC   │
                     │ (persistent │
                     │  storage)   │
                     └─────────────┘

2.1 Orchestrator

Purpose: Manage Claude Code agent process lifecycle. The orchestrator is the core daemon that spawns, monitors, and terminates claude --print processes.

Runs as: Kubernetes Deployment (single replica for the MVP; leader election is the path to HA).

Responsibilities:

  1. Spawn: Execute claude --print --output-format stream-json with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 set. Pass team name, agent name, system prompt, model, and working directory as arguments.
  2. Monitor: Track process health via stdout stream parsing. Detect crashes, hangs (no output within configurable timeout), and unexpected exits.
  3. Kill: Send SIGTERM to agent process. If no exit within grace period (default 10s), SIGKILL.
  4. Restart: Re-spawn agents that crash, with exponential backoff (1s, 2s, 4s, ..., max 60s). Configurable per-agent restart policy (always, on-failure, never).
  5. Nudge: Write messages to agent inbox files or invoke SendMessage via a new claude --print session targeting the team.

Process table (in-memory, reconstructed from state on startup):

type AgentProcess struct {
    TeamName     string
    AgentName    string
    PID          int
    Cmd          *exec.Cmd
    StartedAt    time.Time
    RestartCount int
    Status       AgentStatus // running, stopped, crashed, starting
    StdoutPipe   io.ReadCloser
}

Key design decision: One orchestrator manages one ~/.claude/ state directory. Multi-node requires shared filesystem (iSCSI/NFS), not distributed consensus.
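How the spawn step assembles its command can be sketched as follows. This is a sketch only: the flag names follow this document's own examples and are not verified against any particular claude CLI version, and buildSpawnCmd is a hypothetical helper.

```go
package main

import (
	"os"
	"os/exec"
)

// buildSpawnCmd assembles (but does not start) the claude invocation
// the orchestrator would exec for a new agent, with the team-tools env
// var layered onto the inherited environment.
func buildSpawnCmd(prompt, model, cwd string) *exec.Cmd {
	cmd := exec.Command("claude",
		"--print",
		"--output-format", "stream-json",
		"--system-prompt", prompt,
		"--model", model,
	)
	cmd.Dir = cwd
	cmd.Env = append(os.Environ(), "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1")
	return cmd
}
```

The caller would then wire cmd.StdoutPipe() into the stream parser before cmd.Start(), and record the resulting PID in the process table.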

2.2 State Layer

Purpose: Read and write Claude Code's native state files. This is a library, not a service.

Design principle: WorkCenter does not define its own state format. It reads and writes the same JSON files that Claude Code uses natively. This means a running WorkCenter instance and a manual claude session can coexist on the same state directory.

Operations:

| Operation | File | Method |
| --- | --- | --- |
| List teams | ~/.claude/teams/*/config.json | Glob + parse |
| Get team | ~/.claude/teams/{name}/config.json | Read + parse |
| Create team | ~/.claude/teams/{name}/config.json | Write (atomic via rename) |
| List tasks | ~/.claude/tasks/{team}/*.json | Glob + parse |
| Get task | ~/.claude/tasks/{team}/{id}.json | Read + parse |
| Update task | ~/.claude/tasks/{team}/{id}.json | Read-modify-write (file lock) |
| Read inbox | ~/.claude/teams/{team}/inboxes/{agent}.json | Read + parse |
| Write inbox | ~/.claude/teams/{team}/inboxes/{agent}.json | Append (file lock) |

Concurrency: File-level locking via flock(2). The state layer must handle concurrent access from both the orchestrator and claude processes.

State directory: Configurable via WORKCENTER_STATE_DIR env var. Default: ~/.claude/.

type StateLayer struct {
    BaseDir string // e.g., /home/workcenter/.claude
}

func (s *StateLayer) ListTeams() ([]TeamConfig, error)
func (s *StateLayer) GetTeam(name string) (*TeamConfig, error)
func (s *StateLayer) ListTasks(team string) ([]Task, error)
func (s *StateLayer) GetTask(team, id string) (*Task, error)
func (s *StateLayer) UpdateTask(team, id string, fn func(*Task)) error
func (s *StateLayer) ReadInbox(team, agent string) ([]InboxMessage, error)
func (s *StateLayer) WriteInbox(team, agent string, msg InboxMessage) error

2.3 Event Bus

Purpose: Convert filesystem mutations into typed, ordered events. Decouple state changes from consumers (dashboard, CLI, webhooks).

Implementation: fsnotify watches the ~/.claude/ directory tree. On each file write/create/delete, the event bus:

  1. Identifies the file type (team config, task, inbox) from the path.
  2. Reads the new file contents.
  3. Diffs against cached previous state.
  4. Emits a typed event.
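Step 1 is pure path pattern matching against the layout documented in Section 1. A sketch (ClassifyPath and FileKind are hypothetical names):

```go
package main

import (
	"path/filepath"
	"strings"
)

// FileKind is what the event bus infers from a changed path.
type FileKind int

const (
	KindUnknown FileKind = iota
	KindTeamConfig
	KindTask
	KindInbox
)

// ClassifyPath maps a path under the state directory to a file kind,
// the team name, and (for tasks/inboxes) the task id or agent name.
func ClassifyPath(baseDir, path string) (FileKind, string, string) {
	rel, err := filepath.Rel(baseDir, path)
	if err != nil {
		return KindUnknown, "", ""
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	switch {
	case len(parts) == 3 && parts[0] == "teams" && parts[2] == "config.json":
		return KindTeamConfig, parts[1], ""
	case len(parts) == 4 && parts[0] == "teams" && parts[2] == "inboxes":
		return KindInbox, parts[1], strings.TrimSuffix(parts[3], ".json")
	case len(parts) == 3 && parts[0] == "tasks":
		return KindTask, parts[1], strings.TrimSuffix(parts[2], ".json")
	}
	return KindUnknown, "", ""
}
```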

Event types:

type EventType string

const (
    EventTeamCreated    EventType = "team_created"
    EventTeamUpdated    EventType = "team_updated"
    EventTeamDeleted    EventType = "team_deleted"
    EventTaskCreated    EventType = "task_created"
    EventTaskUpdated    EventType = "task_updated"
    EventTaskDeleted    EventType = "task_deleted"
    EventAgentJoined    EventType = "agent_joined"
    EventAgentLeft      EventType = "agent_left"
    EventAgentStatus    EventType = "agent_status"
    EventMessage        EventType = "message"
)

type Event struct {
    Type      EventType       `json:"type"`
    Team      string          `json:"team"`
    Agent     string          `json:"agent,omitempty"`
    TaskID    string          `json:"taskId,omitempty"`
    Timestamp time.Time       `json:"timestamp"`
    Payload   json.RawMessage `json:"payload"`
}

Delivery: Fan-out to registered subscribers. Each WebSocket connection and each CLI --follow session is a subscriber. Events are not persisted by the bus — they are derived from state file changes which are already persistent.

Debouncing: File writes may trigger multiple fsnotify events. The bus debounces with a 50ms window per file path before emitting.

2.4 Dashboard

Purpose: Web UI for operational visibility and control. Single Go binary that serves both the REST API and embedded static frontend.

Architecture: The dashboard server is a Go HTTP server. The frontend is a static SPA (HTML + JS + CSS) embedded in the binary via embed.FS. No Node.js build step in production. The frontend is vanilla JS or, at most, a minimal reactive library (e.g., Preact); nothing heavier.

Port: 8080 (configurable via WORKCENTER_PORT).

Pages:

| Route | View | Description |
| --- | --- | --- |
| / | Team list | All teams with member count, task progress |
| /teams/{name} | Team detail | Agent status cards, task board, message log |
| /teams/{name}/tasks | Task graph | DAG visualization of task dependencies |
| /teams/{name}/messages | Message log | Chronological message stream |

Real-time updates: WebSocket connection at /ws receives events from the event bus. The frontend applies events to local state without polling.

Controls (via REST API, rendered as buttons/forms in UI):

  • Spawn agent into team
  • Kill agent
  • Send nudge message to agent
  • Create/update/delete task
  • Create/delete team

2.5 CLI

Purpose: Terminal-native interface to the same REST API the dashboard uses. For operators who prefer kubectl-style workflows.

Binary: workcenter (same module, different cmd/ entrypoint).

Commands:

workcenter team list
workcenter team get <name>
workcenter team create <name> [--description <desc>]
workcenter team delete <name>

workcenter agent spawn <team> <name> [--model <model>] [--prompt <prompt>]
workcenter agent list <team>
workcenter agent kill <team> <name>
workcenter agent nudge <team> <name> <message>

workcenter task list <team>
workcenter task get <team> <id>
workcenter task create <team> --subject <s> [--description <d>] [--owner <o>]
workcenter task update <team> <id> [--status <s>] [--owner <o>]

workcenter logs <team> [--follow]
workcenter status

Output formats: --output text (default, human-readable tables), --output json, --output yaml.

Connection: --server <url> (default: http://localhost:8080).

2.6 Agent SDK

Purpose: Go library for building custom orchestration workflows. Exposes WorkCenter primitives as importable functions.

Package: github.com/workcenter/workcenter/pkg/sdk

Core types:

// Client connects to a WorkCenter server.
type Client struct { /* ... */ }

func NewClient(serverURL string, opts ...Option) *Client

func (c *Client) SpawnAgent(ctx context.Context, req SpawnRequest) (*Agent, error)
func (c *Client) KillAgent(ctx context.Context, team, name string) error
func (c *Client) NudgeAgent(ctx context.Context, team, name, msg string) error
func (c *Client) ListTeams(ctx context.Context) ([]Team, error)
func (c *Client) ListTasks(ctx context.Context, team string) ([]Task, error)
func (c *Client) Subscribe(ctx context.Context) (<-chan Event, error)

Use cases:

  • Custom CI/CD pipelines that spawn review teams
  • Scheduled batch workflows (nightly refactoring runs)
  • Integration with external systems (Slack bot, GitHub webhook)

3. Interfaces

3.1 REST API

Base path: /api/v1

| Method | Path | Description | Request Body | Response |
| --- | --- | --- | --- | --- |
| GET | /health | Health check | - | {"status":"ok"} |
| GET | /teams | List teams | - | [TeamSummary] |
| POST | /teams | Create team | {name, description} | TeamConfig |
| GET | /teams/{name} | Get team | - | TeamConfig |
| DELETE | /teams/{name} | Delete team | - | 204 |
| GET | /teams/{name}/agents | List agents | - | [AgentStatus] |
| POST | /teams/{name}/agents | Spawn agent | {name, model, prompt, cwd} | AgentStatus |
| DELETE | /teams/{name}/agents/{agent} | Kill agent | - | 204 |
| POST | /teams/{name}/agents/{agent}/nudge | Send message | {message} | 202 |
| GET | /teams/{name}/tasks | List tasks | - | [Task] |
| POST | /teams/{name}/tasks | Create task | {subject, description, owner, blocks, blockedBy} | Task |
| GET | /teams/{name}/tasks/{id} | Get task | - | Task |
| PATCH | /teams/{name}/tasks/{id} | Update task | {status, owner, ...} | Task |
| DELETE | /teams/{name}/tasks/{id} | Delete task | - | 204 |
| GET | /teams/{name}/messages | Get messages | ?agent=<name> (query) | [InboxMessage] |

All endpoints return JSON. Errors use standard HTTP status codes with {"error": "message"} body.

3.2 WebSocket Protocol

Endpoint: /ws

Connection: Client connects and optionally sends a subscribe message:

{"type": "subscribe", "teams": ["workcenter"]}

If no subscribe message is sent, the client receives events for all teams.
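The server-side filtering rule this implies is small enough to state as code (wantsEvent is a hypothetical helper; subscribed is the teams list from the client's last subscribe message, nil if none was sent):

```go
package main

// wantsEvent applies the subscription rule: an empty/absent team filter
// means the client receives events for all teams; otherwise only events
// for listed teams pass.
func wantsEvent(subscribed []string, eventTeam string) bool {
	if len(subscribed) == 0 {
		return true
	}
	for _, t := range subscribed {
		if t == eventTeam {
			return true
		}
	}
	return false
}
```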

Server -> Client messages (Event objects as defined in Section 2.3):

{
  "type": "task_updated",
  "team": "workcenter",
  "taskId": "1",
  "timestamp": "2026-03-06T13:38:35.786Z",
  "payload": {
    "id": "1",
    "subject": "Write ARCHITECTURE.md",
    "status": "in_progress",
    "owner": "architect"
  }
}
{
  "type": "agent_status",
  "team": "workcenter",
  "agent": "architect",
  "timestamp": "2026-03-06T13:38:35.786Z",
  "payload": {
    "name": "architect",
    "status": "running",
    "pid": 12345,
    "startedAt": "2026-03-06T13:38:00.000Z",
    "restartCount": 0
  }
}

Client -> Server messages:

| Type | Fields | Description |
| --- | --- | --- |
| subscribe | teams: []string | Filter events to specific teams |
| ping | - | Keepalive (server responds with pong) |

Keepalive: Server sends ping every 30s. Client must respond with pong within 10s or connection is closed.


4. Deployment Model

4.1 Target Environment

  • Architecture: ARM64 (Raspberry Pi 4/5, CM3588+, similar SBCs)
  • Kubernetes: K3s or standard kubeadm clusters
  • Storage: iSCSI StorageClass for persistent state (ReadWriteOnce)
  • Registry: Self-hosted Forgejo container registry (insecure HTTP)
  • GitOps: ArgoCD for declarative deployment
  • External access: Cloudflare Tunnel (no ingress controller required)

4.2 Kubernetes Resources

# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: workcenter
---
# PVC for ~/.claude/ state
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workcenter-state
  namespace: workcenter
spec:
  storageClassName: iscsi
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
# Dashboard + Orchestrator Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workcenter
  namespace: workcenter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: workcenter
  template:
    metadata:
      labels:
        app: workcenter
    spec:
      containers:
      - name: workcenter
        image: 192.168.8.197:30080/tim/workcenter:latest
        ports:
        - containerPort: 8080
        env:
        - name: WORKCENTER_STATE_DIR
          value: /state
        - name: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS
          value: "1"
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: workcenter-secrets
              key: anthropic-api-key
        volumeMounts:
        - name: state
          mountPath: /state
      volumes:
      - name: state
        persistentVolumeClaim:
          claimName: workcenter-state
---
# Service (NodePort)
apiVersion: v1
kind: Service
metadata:
  name: workcenter
  namespace: workcenter
spec:
  type: NodePort
  selector:
    app: workcenter
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30800

4.3 Container Image

Multi-stage build. Final image is distroless (no shell, no package manager). The claude CLI must be installed in the image for the orchestrator to spawn agents.

FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o /out/workcenter ./cmd/workcenter
RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o /out/dashboard ./cmd/dashboard

FROM node:22-slim AS claude
RUN npm install -g @anthropic-ai/claude-code

FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/dashboard /dashboard
COPY --from=build /out/workcenter /workcenter
COPY --from=claude /usr/local/lib/node_modules /usr/local/lib/node_modules
COPY --from=claude /usr/local/bin/node /usr/local/bin/node
COPY --from=claude /usr/local/bin/claude /usr/local/bin/claude
ENV PATH="/usr/local/bin:$PATH"
EXPOSE 8080
ENTRYPOINT ["/dashboard"]

Note: The claude CLI requires Node.js. The distroless base is augmented with node and the claude package from a separate build stage. This adds ~100MB to image size but avoids requiring a full OS layer.

4.4 ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: workcenter
  namespace: argocd
spec:
  project: default
  source:
    repoURL: http://192.168.8.197:30080/tim/workcenter.git
    targetRevision: HEAD
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: workcenter
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

5. Security Model

5.1 MVP (Single-Tenant)

The MVP assumes a single trusted operator with full access. No authentication or authorization on the REST API or WebSocket.

API key management: The Anthropic API key is stored as a Kubernetes Secret and injected via environment variable. It is never exposed through the REST API or dashboard.

Network boundary: The dashboard listens on a ClusterIP or NodePort. External access (if desired) goes through Cloudflare Tunnel with Cloudflare Access for authentication.

5.2 Enterprise (Multi-Tenant)

| Feature | Implementation |
| --- | --- |
| Authentication | OIDC/SSO via reverse proxy (e.g., oauth2-proxy) |
| Authorization | RBAC: roles per team (admin, operator, viewer) |
| Audit log | Append-only log of all API mutations with actor, timestamp, diff |
| Secret rotation | API key rotation without agent restart |
| Network policy | K8s NetworkPolicy restricting pod-to-pod traffic |

6. MVP Scope

6.1 Included in MVP (Open-Core, Free)

| Component | Deliverable |
| --- | --- |
| Orchestrator | Spawn, kill, restart agents. Process monitoring. |
| State Layer | Read/write Claude Code native state files. |
| Event Bus | fsnotify watcher with typed events. |
| Dashboard | Web UI with team list, agent status, task board, message log. Spawn/kill/nudge controls. |
| CLI | workcenter command with team/agent/task subcommands. |
| K8s manifests | Deployment, Service, PVC, Namespace. |
| Container image | Multi-stage Dockerfile for ARM64. |

6.2 Excluded from MVP (Enterprise, Paid)

| Feature | Rationale |
| --- | --- |
| RBAC | Single-tenant MVP needs no access control. |
| Audit logging | Adds write amplification; unnecessary for solo operators. |
| SSO/OIDC | Cloudflare Access covers single-user external auth. |
| Multi-cluster federation | Requires distributed state; out of scope for v1. |
| Cost tracking | Requires API response parsing for token counts; deferred. |
| Retention policies | Manual cleanup is acceptable for MVP. |
| Agent SDK | Library release after the API stabilizes. |

7. Data Flow

7.1 Spawn Agent

User (Dashboard/CLI)
  │
  ▼
REST API: POST /api/v1/teams/{name}/agents
  │
  ▼
Orchestrator.SpawnAgent()
  ├── Write team config (add member to config.json)
  ├── exec: claude --print --output-format stream-json \
  │         --system-prompt "..." --model claude-opus-4-6
  │         (env: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
  ├── Start goroutine: read stdout stream, detect crashes
  └── Update process table
  │
  ▼
Event Bus detects config.json change
  │
  ▼
WebSocket: {"type": "agent_joined", "team": "...", "agent": "..."}
  │
  ▼
Dashboard UI updates agent card

7.2 Task Update (by Agent)

Claude agent writes ~/.claude/tasks/{team}/{id}.json
  │
  ▼
Event Bus (fsnotify) detects file write
  ├── Read new file contents
  ├── Diff against cached state
  └── Emit: {"type": "task_updated", ...}
  │
  ▼
WebSocket subscribers receive event
  │
  ▼
Dashboard UI updates task board

7.3 Nudge Agent

User (Dashboard/CLI)
  │
  ▼
REST API: POST /api/v1/teams/{name}/agents/{agent}/nudge
  │
  ▼
State Layer: append message to
  ~/.claude/teams/{team}/inboxes/{agent}.json
  │
  ▼
Claude agent's idle notification picks up new inbox message
  │
  ▼
Agent processes the nudge and resumes work

8. Open Questions

| # | Question | Impact | Default Assumption |
| --- | --- | --- | --- |
| 1 | Should the orchestrator and dashboard be one binary or two? | Deployment complexity vs. separation of concerns | One binary, two cmd/ entrypoints; can run as a single process with a subcommand |
| 2 | How to handle claude CLI versioning inside the container? | Breaking changes in state format | Pin to a specific npm version in the Dockerfile |
| 3 | Is iSCSI ReadWriteOnce sufficient, or do we need ReadWriteMany? | HA/multi-replica | RWO (single-replica orchestrator) for MVP |
| 4 | Should the event bus persist events for replay? | Late-joining dashboard clients miss events | No: clients do a full state read on connect, then apply events |
| 5 | How to authenticate the claude CLI inside the container? | API key injection | ANTHROPIC_API_KEY env var from a K8s Secret |
| 6 | Should WorkCenter manage CLAUDE.md and project settings? | Reproducibility | Deferred; pass via --system-prompt for now |

9. Glossary

| Term | Definition |
| --- | --- |
| Agent | A running claude --print process that is a member of a team. |
| Team | A named group of agents with a shared state directory, task list, and message inboxes. |
| Task | A unit of work with subject, description, status, owner, and dependency edges. |
| Nudge | A message sent to an agent's inbox to prompt action. |
| State directory | The ~/.claude/ filesystem tree containing team configs, tasks, and inboxes. |
| Event | A typed notification derived from a state file mutation. |
| Orchestrator | The WorkCenter daemon that manages agent process lifecycle. |
| Dashboard | The web UI and HTTP server for operational control. |