Skip to content

Latest commit

 

History

History
813 lines (678 loc) · 51.9 KB

File metadata and controls

813 lines (678 loc) · 51.9 KB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenClaw on AgentCore Runtime — a multi-channel AI messaging bot (Telegram, Slack) running as per-user serverless containers on AWS Bedrock AgentCore Runtime. Each user gets their own microVM with workspace persistence. A Router Lambda handles webhook ingestion from Telegram and Slack (text and images), resolves user identity via DynamoDB, and invokes per-user AgentCore sessions. Image uploads are stored in S3 and passed to Bedrock as multimodal content.

Tech Stack

  • Infrastructure: CDK v2 (Python), 7 stacks
  • Runtime: Bedrock AgentCore Runtime (serverless ARM64 container, VPC mode, per-user sessions)
  • Channel Ingestion: Router Lambda behind API Gateway HTTP API (Telegram webhook, Slack Events API, image uploads)
  • Multimodal: Image upload support — photos downloaded by Router Lambda, stored in S3, fetched by proxy, sent to Bedrock as multimodal content
  • Messaging: OpenClaw (Node.js) — headless mode, messages bridged via WebSocket
  • Tools & Skills: Built-in tool groups (full profile) + 5 ClawHub skills + 5 custom skills (S3 user files, EventBridge cron, ClawHub manage, API keys, agentcore-browser) + 2 built-in shim tools (web_fetch, web_search)
  • Scheduling: EventBridge Scheduler for recurring tasks — cron executor Lambda warms sessions and delivers responses to channels
  • Per-User File Storage: S3-backed per-user file isolation via custom s3-user-files skill
  • Workspace Persistence: .openclaw/ directory synced to/from S3 per user
  • AI Model: MiniMax M2.1 via Bedrock ConverseStream (configurable via default_model_id in cdk.json, default minimax.minimax-m2.1)
  • Identity: DynamoDB identity table (channel→user mapping, cross-channel binding) + Cognito User Pool
  • Observability: CloudWatch dashboards + alarms, Bedrock invocation logging
  • Token Monitoring: Lambda + DynamoDB (single-table) + CloudWatch custom metrics
  • API Key Management: Dual-mode storage — native file-based (S3-synced) or AWS Secrets Manager (KMS-encrypted, CloudTrail-auditable) via api-keys skill
  • Security: VPC endpoints, KMS CMK, Secrets Manager, cdk-nag. SECURITY.md is a thin policy pointer; docs/security.md is the single source of truth for the full security architecture

Architecture

  Telegram webhook / Slack Events API
              |
  +-----------v-----------+
  |   Router Lambda       |  <-- API Gateway HTTP API, async self-invoke
  |   - User resolution   |      DynamoDB identity table
  |   - Session mgmt      |      Cross-channel binding
  |   - Channel dispatch   |
  +-----------+-----------+
              |
  +-----------v-----------+
  | InvokeAgentRuntime    |  <-- Per-user session (runtimeSessionId)
  | (session per user)    |
  +-----------+-----------+
              |
  +-----------v-----------+
  | AgentCore Runtime     |  <-- Per-user microVM (ARM64, VPC mode)
  |                       |
  | agentcore-contract.js (8080) -- /ping (Healthy), /invocations
  |   -> boot: pre-fetch secrets from Secrets Manager
  |   -> first /invocations (parallel):
  |     1. Start proxy (18790) + OpenClaw (18789) + restore .openclaw/
  |     2. Wait for proxy only (~5s)
  |     3. Lightweight agent handles messages immediately
  |   -> background: OpenClaw starts (~1-2 min)
  |   -> handoff: once OpenClaw ready, route via WebSocket bridge
  |   -> SIGTERM: save .openclaw/ to S3
  |                       |
  | lightweight-agent.js  -- warm-up shim (proxy -> Bedrock, 17 tools: s3-user-files, eventbridge-cron, clawhub-manage, api-keys, web_fetch, web_search)
  | agentcore-proxy.js    (18790) -- OpenAI -> Bedrock ConverseStream
  | OpenClaw Gateway      (18789) -- headless, no channels
  +-----------+-----------+
              |
  +-----------v-----------+
  |   Amazon Bedrock      |
  |   ConverseStream API  |
  |   MiniMax M2.1      |
  +-----------------------+

  +-----------------------+        +------------------------+
  | S3 User Files         |        | S3 Workspace Sync      |
  | {namespace}/file.md   |        | {namespace}/.openclaw/  |
  | Via s3-user-files      |        | Restored on init,      |
  | skill                 |        | saved periodically     |
  +-----------------------+        +------------------------+

  +------------------------------------------+
  | S3 Image Uploads                         |
  | {namespace}/_uploads/img_*.{jpeg,png,...} |
  | Router Lambda uploads, proxy fetches     |
  | for Bedrock multimodal ConverseStream    |
  +------------------------------------------+

  +------------------------------------------------------+
  | EventBridge Scheduler (Cron Jobs)                    |
  |                                                      |
  | openclaw-cron schedule group                         |
  |   -> Cron Lambda (openclaw-cron-executor)            |
  |     1. Warm up user's AgentCore session              |
  |     2. Send cron message via AgentCore               |
  |     3. Deliver response to Telegram/Slack            |
  +------------------------------------------------------+

  Supporting: VPC, KMS, Secrets Manager, Cognito,
             CloudWatch, DynamoDB, CloudTrail

Project Structure

openclaw-on-agentcore/
  app.py                          # CDK app entry point (7 stacks)
  cdk.json                        # Configuration (model, budgets, sessions, cron)
  requirements.txt                # Python deps (aws-cdk-lib, cdk-nag)
  stacks/
    __init__.py                   # Shared helper (RetentionDays converter)
    vpc_stack.py                  # VPC, subnets, NAT, 7 VPC endpoints, flow logs
    security_stack.py             # KMS CMK, Secrets Manager, Cognito, optional CloudTrail
    agentcore_stack.py            # Runtime, WorkloadIdentity, ECR, S3, IAM
    router_stack.py               # Router Lambda + API Gateway HTTP API + DynamoDB identity
    observability_stack.py        # Dashboards, alarms, Bedrock logging
    token_monitoring_stack.py     # Lambda processor, DynamoDB, token analytics
    cron_stack.py                 # EventBridge Scheduler, Cron executor Lambda, IAM
  bridge/
    Dockerfile                    # Container image (node:22-slim, ARM64, clawhub skills)
    entrypoint.sh                 # Startup: configure IPv4, start contract server
    agentcore-contract.js         # AgentCore HTTP contract with hybrid routing (shim + OpenClaw)
    lightweight-agent.js          # Warm-up agent shim (s3-user-files + eventbridge-cron + clawhub-manage + api-keys tools)
    lightweight-agent.test.js     # Lightweight agent unit tests (node:test, 110 tests)
    agentcore-proxy.js            # OpenAI -> Bedrock ConverseStream adapter + Identity + multimodal images
    image-support.test.js         # Image support unit tests (node:test)
    content-extraction.test.js    # Content block extraction tests (node:test)
    workspace-sync.js             # .openclaw/ directory S3 sync (restore/save/periodic)
    scoped-credentials.js         # Per-user STS session-scoped credentials (S3, Secrets Manager, DynamoDB)
    scoped-credentials.test.js    # Scoped credentials unit tests (node:test)
    workspace-sync.test.js        # Workspace sync credential tests (node:test)
    force-ipv4.js                 # DNS patch for Node.js 22 IPv6 issue
    skills/
      s3-user-files/              # Custom per-user file storage skill (S3-backed)
        SKILL.md                  # OpenClaw skill manifest
        common.js                 # Shared utilities (sanitize, buildKey, validation)
        read.js / write.js        # Read/write files in user's S3 namespace
        list.js / delete.js       # List/delete files in user's S3 namespace
      eventbridge-cron/           # Cron scheduling skill (EventBridge Scheduler)
        SKILL.md                  # OpenClaw skill manifest
        common.js                 # Shared utilities (schedule group, DynamoDB helpers)
        create.js / update.js     # Create/update EventBridge schedules
        list.js / delete.js       # List/delete schedules
      clawhub-manage/             # ClawHub skill installer (install/uninstall/list)
        SKILL.md                  # OpenClaw skill manifest
        common.js                 # Skill name validation
        install.js / uninstall.js # Install/uninstall ClawHub skills
        list.js                   # List installed skills
      api-keys/                   # Dual-mode API key management (native + Secrets Manager)
        SKILL.md                  # OpenClaw skill manifest
        common.js                 # Shared validation (userId, keyName)
        native.js / secret.js    # Native file CRUD / Secrets Manager CRUD
        retrieve.js              # Unified lookup (SM first, native fallback)
        migrate.js               # Move keys between backends
      agentcore-browser/          # Headless browser skill (optional, enable_browser=true)
        SKILL.md                  # OpenClaw skill manifest
        common.js                 # Session file reader, S3 upload helper
        navigate.js               # Navigate to URL, return title + content
        screenshot.js             # Capture PNG screenshot, upload to S3
        interact.js               # Click, type, scroll, wait on elements
  lambda/
    token_metrics/index.py        # Bedrock log -> DynamoDB + CloudWatch metrics
    router/index.py               # Webhook router (Telegram + Slack, image uploads)
    router/test_image_upload.py        # Image upload unit tests (pytest)
    router/test_content_extraction.py  # Content block extraction tests (pytest)
    router/test_markdown_html.py       # Markdown-to-HTML conversion tests (pytest)
    cron/index.py                      # Cron executor (warmup, invoke, deliver to channel)
  scripts/
    setup-telegram.sh             # Telegram webhook + admin allowlist (one-step)
    setup-slack.sh                # Slack Event Subscriptions + admin allowlist
    manage-allowlist.sh           # Add/remove/list users in the allowlist
  tests/
    e2e/                          # E2E tests (simulated Telegram webhooks + CloudWatch logs)
  docs/
    architecture.md               # Detailed architecture diagrams
    architecture-detailed.md      # Technical deep-dive (sequence diagrams, container internals, data flows)
    security.md                   # Complete security architecture (single source of truth — threat model, 10 defense layers, operations runbook)

CDK Stacks (7 stacks)

Stack Key Resources Dependencies
OpenClawVpc VPC (2 AZ), subnets, NAT, 7 VPC endpoints, flow logs None
OpenClawSecurity KMS CMK, Secrets Manager (8 secrets incl. webhook validation + feishu), Cognito User Pool, optional CloudTrail None
OpenClawAgentCore Execution Role, Security Group, S3 bucket (Runtime/Endpoint managed by Starter Toolkit) Vpc, Security
OpenClawRouter Lambda, API Gateway HTTP API (explicit routes, throttling), DynamoDB identity table AgentCore, Security
OpenClawObservability Operations dashboard, alarms, SNS, Bedrock invocation logging None
OpenClawTokenMonitoring DynamoDB (single-table, 4 GSIs), Lambda processor, analytics dashboard Observability
OpenClawCron EventBridge Scheduler group, Cron executor Lambda, Scheduler IAM role AgentCore, Router, Security

Expected Commands

Hybrid Deploy (CDK + Starter Toolkit)

Deployment uses a 3-phase hybrid model: CDK for infrastructure, Starter Toolkit for Runtime/container.

# Full deploy via script (recommended)
./scripts/deploy.sh                  # all 3 phases
./scripts/deploy.sh --phase1         # CDK foundation only
./scripts/deploy.sh --runtime-only   # Starter Toolkit only
./scripts/deploy.sh --phase3         # CDK dependent stacks only

Phase 1: CDK foundation stacks

source .venv/bin/activate
export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export CDK_DEFAULT_REGION=us-west-2
cdk deploy OpenClawVpc OpenClawSecurity OpenClawAgentCore OpenClawObservability --require-approval never

Phase 2: Starter Toolkit (Runtime + Docker build)

# Configure (first time only)
agentcore configure --name openclaw_agent --entrypoint bridge/agentcore-contract.js \
  --execution-role <ROLE_ARN> --region us-west-2 --vpc \
  --subnets <SUBNET_IDS> --security-groups <SG_ID> \
  --deployment-type container --language typescript --non-interactive

# Deploy (builds Docker image locally or via CodeBuild, creates/updates Runtime)
agentcore deploy --agent openclaw_agent --local-build --auto-update-on-conflict \
  --env "BEDROCK_MODEL_ID=global.anthropic.claude-opus-4-6-v1" \
  --env "S3_USER_FILES_BUCKET=openclaw-user-files-..." ...

# Update cdk.json with runtime_id and runtime_endpoint_id from toolkit output

Phase 3: CDK dependent stacks

cdk deploy OpenClawRouter OpenClawCron OpenClawTokenMonitoring --require-approval never

Other CDK commands

cdk synth                                    # synthesize + cdk-nag checks
cdk diff                                     # preview changes
cdk destroy --all                            # tear down (does NOT destroy Starter Toolkit resources)
agentcore destroy --agent openclaw_agent     # destroy Starter Toolkit resources

Webhook Setup (Telegram)

The setup script registers the webhook and adds you to the allowlist in one step:

./scripts/setup-telegram.sh

Or manually:

# Get Router API Gateway URL
API_URL=$(aws cloudformation describe-stacks \
  --stack-name OpenClawRouter \
  --query "Stacks[0].Outputs[?OutputKey=='ApiUrl'].OutputValue" \
  --output text --region $CDK_DEFAULT_REGION)

# Get webhook secret (for Telegram request validation)
WEBHOOK_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id openclaw/webhook-secret \
  --region $CDK_DEFAULT_REGION --query SecretString --output text)

# Set up Telegram webhook with secret_token for validation
TELEGRAM_TOKEN=$(aws secretsmanager get-secret-value \
  --secret-id openclaw/channels/telegram \
  --region $CDK_DEFAULT_REGION --query SecretString --output text)
curl "https://api.telegram.org/bot${TELEGRAM_TOKEN}/setWebhook?url=${API_URL}webhook/telegram&secret_token=${WEBHOOK_SECRET}"

# Add yourself to the allowlist (find your ID via @userinfobot on Telegram)
./scripts/manage-allowlist.sh add telegram:YOUR_TELEGRAM_USER_ID

Channel Setup

# Store Telegram bot token
aws secretsmanager update-secret \
  --secret-id openclaw/channels/telegram \
  --secret-string 'BOT_TOKEN' \
  --region $CDK_DEFAULT_REGION

# Store Slack credentials (JSON: bot token + signing secret for HMAC validation)
aws secretsmanager update-secret \
  --secret-id openclaw/channels/slack \
  --secret-string '{"botToken":"xoxb-YOUR-BOT-TOKEN","signingSecret":"YOUR-SIGNING-SECRET"}' \
  --region $CDK_DEFAULT_REGION

Slack Setup (Event Subscriptions + Allowlist)

./scripts/setup-slack.sh

This displays the webhook URL for Slack Event Subscriptions, prompts for your Slack member ID, and adds you to the allowlist.

Feishu Setup (Event Subscriptions + Allowlist)

./scripts/setup-feishu.sh

This displays the webhook URL, guides you through Feishu developer console setup (app creation, permissions, events, publishing), stores credentials in Secrets Manager, and adds you to the allowlist. Store credentials format: {"appId":"...","appSecret":"...","verificationToken":"...","encryptKey":"..."}

Deploy New Bridge Version (Starter Toolkit — preferred)

The project uses CDK + AgentCore Starter Toolkit hybrid deployment:

  • CDK manages infrastructure (VPC, Lambda, DynamoDB, S3, etc.)
  • Starter Toolkit (agentcore CLI) manages the AgentCore Runtime (container image, lifecycle config)

Config file: .bedrock_agentcore.yaml (in repo root, on deploy/starter-toolkit-hybrid branch)

# 1. Build image locally
docker build --platform linux/arm64 -t openclaw-bridge:v${TAG} bridge/

# 2. Push to ECR (starter toolkit ECR repo, NOT the CDK one)
ECR_REPO=<ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/bedrock-agentcore-openclaw_agent
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $ECR_REPO
docker tag openclaw-bridge:v${TAG} ${ECR_REPO}:v${TAG}
docker push ${ECR_REPO}:v${TAG}

# 3. Update AgentCore Runtime to use new image
#    CRITICAL: update-agent-runtime is a FULL REPLACE — any field you omit gets cleared!
#    You MUST include --environment-variables every time, or all env vars will be wiped.
aws bedrock-agentcore-control update-agent-runtime \
  --agent-runtime-id <RUNTIME_ID> \
  --agent-runtime-artifact "{\"containerConfiguration\":{\"containerUri\":\"${ECR_REPO}:v${TAG}\"}}" \
  --role-arn "arn:aws:iam::<ACCOUNT_ID>:role/openclaw-agentcore-execution-role-us-west-2" \
  --network-configuration '{"networkMode":"VPC","networkModeConfig":{"securityGroups":["<SECURITY_GROUP_ID>"],"subnets":["<PRIVATE_SUBNET_1>","<PRIVATE_SUBNET_2>"]}}' \
  --environment-variables '{
    "AWS_REGION":"us-west-2",
    "BEDROCK_AGENTCORE_MEMORY_ID":"<MEMORY_ID>",
    "BEDROCK_AGENTCORE_MEMORY_NAME":"openclaw_agent_mem",
    "BEDROCK_MODEL_ID":"global.anthropic.claude-opus-4-6-v1",
    "COGNITO_CLIENT_ID":"<COGNITO_CLIENT_ID>",
    "COGNITO_PASSWORD_SECRET_ID":"openclaw/cognito-password-secret",
    "COGNITO_USER_POOL_ID":"<COGNITO_USER_POOL_ID>",
    "CRON_LAMBDA_ARN":"arn:aws:lambda:us-west-2:<ACCOUNT_ID>:function:openclaw-cron-executor",
    "CRON_LEAD_TIME_MINUTES":"5",
    "EVENTBRIDGE_ROLE_ARN":"arn:aws:iam::<ACCOUNT_ID>:role/openclaw-cron-scheduler-role-us-west-2",
    "EVENTBRIDGE_SCHEDULE_GROUP":"openclaw-cron",
    "EXECUTION_ROLE_ARN":"arn:aws:iam::<ACCOUNT_ID>:role/openclaw-agentcore-execution-role-us-west-2",
    "GATEWAY_TOKEN_SECRET_ID":"openclaw/gateway-token",
    "IDENTITY_TABLE_NAME":"openclaw-identity",
    "S3_USER_FILES_BUCKET":"openclaw-user-files-<ACCOUNT_ID>-us-west-2",
    "SUBAGENT_BEDROCK_MODEL_ID":"global.anthropic.claude-opus-4-6-v1"
  }' \
  --region us-west-2

# 4. Verify update completed
aws bedrock-agentcore-control get-agent-runtime \
  --agent-runtime-id <RUNTIME_ID> --region us-west-2 \
  --query '{status:status,version:agentRuntimeVersion,image:agentRuntimeArtifact.containerConfiguration.containerUri}'

# 5. New sessions will use the new image automatically (per-user idle termination)

Starter Toolkit CLI (agentcore)

# Check runtime status
agentcore status

# Deploy via CodeBuild (cloud build, no local Docker needed)
agentcore deploy -a openclaw_agent --auto-update-on-conflict --image-tag v${TAG}

# Deploy with local build (requires Docker)
agentcore deploy -a openclaw_agent --local-build --auto-update-on-conflict --image-tag v${TAG}

# Stop a runtime session
agentcore stop-session

# Invoke runtime for testing
agentcore invoke -a openclaw_agent

Stop a Running Session

Required when deploying a new image and you need the user's next message to start a fresh session with the new version. Also useful for debugging.

# 1. Get the session ID from DynamoDB (look up by user's internal ID)
aws dynamodb query --table-name openclaw-identity --region us-west-2 \
  --key-condition-expression "PK = :pk AND SK = :sk" \
  --expression-attribute-values '{":pk":{"S":"USER#<internalUserId>"},":sk":{"S":"SESSION"}}' \
  --query 'Items[0].sessionId.S' --output text

# 2. Stop the session via data plane API
aws bedrock-agentcore stop-runtime-session \
  --agent-runtime-arn "arn:aws:bedrock-agentcore:us-west-2:<ACCOUNT_ID>:runtime/<RUNTIME_ID>" \
  --runtime-session-id "<sessionId>" \
  --region us-west-2

The session receives SIGTERM, saves workspace to S3, and shuts down. The next message from that user triggers a new session with the latest image.

Key IDs (us-west-2)

  • Runtime ID: <RUNTIME_ID>
  • Runtime ARN: arn:aws:bedrock-agentcore:us-west-2:<ACCOUNT_ID>:runtime/<RUNTIME_ID>
  • ECR Repo: <ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/bedrock-agentcore-openclaw_agent
  • Execution Role: arn:aws:iam::<ACCOUNT_ID>:role/openclaw-agentcore-execution-role-us-west-2
  • Control Plane API: aws bedrock-agentcore-control (get/update/list runtime, endpoints, sessions)
  • Data Plane API: aws bedrock-agentcore (invoke-agent-runtime, stop-runtime-session)

Deploy New Bridge Version (CDK — legacy, infrastructure only)

# CDK no longer manages the AgentCore Runtime container image.
# Use CDK only for infrastructure changes (Lambda, DynamoDB, VPC, etc.)
source .venv/bin/activate && cdk deploy --all --require-approval never

⚠️ Always run ./post-deploy.sh after any CDK deploy. This applies auto-delete=no tags to all OpenClaw resources to prevent automated cleanup processes from deleting production infrastructure. The script is gitignored (internal only) — if it doesn't exist, ask the user to restore it.

Bridge Tests

cd bridge && node --test proxy-identity.test.js       # identity + workspace tests
cd bridge && node --test image-support.test.js         # image upload + multimodal tests
cd bridge && node --test lightweight-agent.test.js     # lightweight agent tools + buildToolArgs tests
cd bridge && node --test subagent-routing.test.js      # subagent model routing + detection tests
cd bridge && node --test content-extraction.test.js    # recursive content block extraction tests
cd bridge && node --test scoped-credentials.test.js    # per-user STS credential scoping tests
cd bridge && node --test workspace-sync.test.js        # workspace sync credential tests
cd bridge && node --test agentcore-browser.test.js     # browser skill unit tests
cd bridge && node --test browser-lifecycle.test.js     # browser session lifecycle tests
cd bridge/skills/s3-user-files && AWS_REGION=$CDK_DEFAULT_REGION node --test common.test.js  # S3 skill tests

Router Lambda Tests

cd lambda/router && python -m pytest test_image_upload.py -v        # image upload unit tests
cd lambda/router && python -m pytest test_content_extraction.py -v  # content block extraction tests
cd lambda/router && python -m pytest test_markdown_html.py -v       # markdown-to-HTML conversion tests
cd lambda/router && python -m pytest test_screenshot_handling.py -v # screenshot marker detection + delivery tests

E2E Tests

cd tests/e2e && python -m pytest bot_test.py -v                    # simulated Telegram webhook tests (requires deployed stack)
cd tests/e2e && python -m pytest bot_test.py -v -k TestBrowserFeature  # browser E2E tests (requires enable_browser=true)

Runtime Operations

# Get runtime status (via Starter Toolkit)
agentcore status --agent openclaw_agent --verbose

# Get container-level status (openclawReady, proxyReady, uptime, logs)
agentcore invoke '{"action":"status"}' -a openclaw_agent

# Send test chat to trigger init + verify end-to-end
agentcore invoke '{"action":"chat","userId":"test","actorId":"test:123","channel":"test","message":"hello"}' -a openclaw_agent

# Stop a specific session (forces new container on next invocation)
# IMPORTANT: update-agent-runtime changes env vars but does NOT replace running containers.
# You must stop-session to force a fresh container with updated env vars.
agentcore stop-session -a openclaw_agent -s <SESSION_ID>

# Find session ID from Router Lambda logs
aws logs filter-log-events --log-group-name /openclaw/lambda/router --region $CDK_DEFAULT_REGION \
  --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \
  --filter-pattern "session=ses_" --query 'events[-1].message' --output text

# Update runtime config (env vars, image, subnets) — bypasses Starter Toolkit limitations
aws bedrock-agentcore-control update-agent-runtime \
  --agent-runtime-id <RUNTIME_ID> \
  --role-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/openclaw-agentcore-execution-role-$CDK_DEFAULT_REGION" \
  --agent-runtime-artifact '{"containerConfiguration":{"containerUri":"<ECR_URI>:<TAG>"}}' \
  --network-configuration '{"networkMode":"VPC","networkModeConfig":{"subnets":["<SUBNET1>","<SUBNET2>"],"securityGroups":["<SG>"]}}' \
  --environment-variables '{...}' \
  --region $CDK_DEFAULT_REGION

# Check DynamoDB identity table
aws dynamodb scan --table-name openclaw-identity --region $CDK_DEFAULT_REGION

# Check Router Lambda errors (last 5 min)
aws logs filter-log-events --log-group-name /openclaw/lambda/router --region $CDK_DEFAULT_REGION \
  --start-time $(python3 -c "import time; print(int((time.time()-300)*1000))") \
  --filter-pattern "ERROR" --query 'events[*].message' --output text

# Check ECR images
aws ecr describe-images --repository-name bedrock-agentcore-openclaw_agent --region $CDK_DEFAULT_REGION \
  --query 'imageDetails[*].{tag:imageTags[0],size:imageSizeInBytes,pushed:imagePushedAt}' --output table

Docker Build & Push (local)

# Build ARM64 image from current branch
cd bridge && sudo docker build --platform linux/arm64 -t openclaw-bridge:latest .

# Test locally before pushing
sudo docker run --rm -d --name test -p 8080:8080 -e AWS_REGION=$CDK_DEFAULT_REGION openclaw-bridge:latest
curl http://localhost:8080/ping   # should return {"status":"Healthy",...}
sudo docker stop test

# Login + push to ECR
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region $CDK_DEFAULT_REGION | sudo docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$CDK_DEFAULT_REGION.amazonaws.com
sudo docker tag openclaw-bridge:latest $ACCOUNT.dkr.ecr.$CDK_DEFAULT_REGION.amazonaws.com/bedrock-agentcore-openclaw_agent:local-build
sudo docker push $ACCOUNT.dkr.ecr.$CDK_DEFAULT_REGION.amazonaws.com/bedrock-agentcore-openclaw_agent:local-build

Key Configuration (cdk.json)

Parameter Default Description
account (empty) AWS account ID. Falls back to CDK_DEFAULT_ACCOUNT
region us-west-2 AWS region. Falls back to CDK_DEFAULT_REGION
default_model_id global.anthropic.claude-opus-4-6-v1 Bedrock model ID. The global. prefix routes to any available region
runtime_id (empty) AgentCore Runtime ID from Starter Toolkit. Populated by deploy script after agentcore deploy
runtime_endpoint_id (empty) AgentCore Runtime Endpoint ID. Typically DEFAULT when using Starter Toolkit
image_version 1 Bridge container version tag. Bump to force container redeploy
cloudwatch_log_retention_days 30 Log retention
daily_token_budget 1000000 Token budget alarm threshold
daily_cost_budget_usd 5 Cost budget alarm threshold
token_ttl_days 90 DynamoDB TTL
user_files_ttl_days 365 S3 per-user file expiration
session_idle_timeout 1800 Per-user session idle timeout (seconds)
session_max_lifetime 28800 Per-user session max lifetime (seconds)
workspace_sync_interval_seconds 300 .openclaw/ S3 sync interval
router_lambda_timeout_seconds 300 Router Lambda timeout
router_lambda_memory_mb 256 Router Lambda memory
registration_open false If true, any user can register. If false, only allowlisted users
cron_lambda_timeout_seconds 600 Cron executor Lambda timeout (must exceed warmup time)
cron_lambda_memory_mb 256 Cron executor Lambda memory
enable_cloudtrail false Deploy a dedicated CloudTrail trail (S3 bucket + trail). Off by default — most accounts already have one
cron_lead_time_minutes 5 Minutes before schedule time to start warmup
subagent_model_id (empty) Bedrock model for sub-agents. Empty = use default_model_id
enable_browser false Enable headless Chromium browser. Sets BROWSER_IDENTIFIER env var on container

Container Startup Sequence

  1. entrypoint.sh: Configure Node.js IPv4 DNS patch, start contract server
  2. agentcore-contract.js (port 8080): Responds to /ping with Healthy immediately
  3. On first /invocations with action: chat or action: warmup (lazy init):
    • Fetch secrets from Secrets Manager (gateway token, Cognito secret)
    • Create STS scoped credentials restricting S3 + Secrets Manager + DynamoDB to user's namespace
    • Configure workspace-sync with scoped credentials
    • Start agentcore-proxy.js (port 18790) with USER_ID/CHANNEL env vars
    • Start OpenClaw gateway (port 18789) with scoped credentials env (no container credentials)
    • Restore .openclaw/ from S3 via workspace-sync.js in background
    • Start credential refresh timer (45 min interval)
    • If BROWSER_IDENTIFIER set: create browser session via AgentCore Browser API, write session file to /tmp/agentcore-browser-session.json
    • Wait for proxy only (~5s)
  4. Warm-up phase (t=~10s to ~1-2min): lightweight-agent.js handles messages via proxy -> Bedrock (supports s3-user-files, eventbridge-cron, clawhub-manage, api-keys, web_fetch, web_search tools)
  5. Handoff (~1-2min): OpenClaw becomes ready, all subsequent messages route via WebSocket bridge
  6. After handoff: Full OpenClaw features — web_fetch, web_search (built-in), 5 ClawHub skills (Jina reader, deep-research-pro, etc.), sub-agent support, session management
  7. action: warmup: Triggers init only; returns {ready: true} when OpenClaw is ready (used by cron Lambda to pre-warm sessions)
  8. action: cron: Sends a cron message via the WebSocket bridge (same as chat but intended for scheduled tasks)
  9. action: status: Returns current init state ({openclawReady, proxyReady, uptime}) without triggering init
  10. SIGTERM: Save .openclaw/ to S3, kill child processes, exit

DynamoDB Identity Table Schema

Table: openclaw-identity (PAY_PER_REQUEST, TTL on ttl attribute)

PK SK Purpose
CHANNEL#telegram:123456789 PROFILE Channel→user lookup
USER#user_abc123 PROFILE User profile
USER#user_abc123 CHANNEL#telegram:123456789 User's bound channels
USER#user_abc123 SESSION Current session
BIND#ABC123 BIND Cross-channel bind code (10 min TTL)
ALLOW#telegram:123456789 ALLOW User allowlist entry
USER#user_abc123 CRON#schedule-name User's cron schedule metadata (expression, message, timezone, channel)

Cross-channel binding: User says "link accounts" on Telegram → gets 6-char code → enters code on Slack → both channels route to same user/session.

User Allowlist

When registration_open is false (default), only users with an ALLOW# record in DynamoDB can register. Existing users (already have a CHANNEL# record) are always allowed. Cross-channel binding bypasses the allowlist since it links to an already-approved user.

Unauthorized users who message the bot receive a rejection message that includes their channel ID (e.g. telegram:123456), so they can share it with the admin for onboarding.

First-User Bootstrap

After initial deployment, no users exist. The easiest path is the setup script, which registers the webhook and adds you to the allowlist in one step:

./scripts/setup-telegram.sh

Alternatively, if you don't know your Telegram user ID:

  1. Message the bot from Telegram
  2. The bot replies with a rejection message showing your ID, e.g. telegram:123456
  3. Add yourself to the allowlist:
    ./scripts/manage-allowlist.sh add telegram:123456
  4. Message the bot again — you are now registered

Adding New Users

When someone wants access to the bot:

  1. They message the bot and receive: "Your ID: telegram:789012. Send this ID to the bot admin to request access."
  2. The admin adds them:
    ./scripts/manage-allowlist.sh add telegram:789012
  3. The user messages the bot again — they are now registered

Managing the Allowlist

# Add a user to the allowlist
./scripts/manage-allowlist.sh add telegram:123456

# Remove a user
./scripts/manage-allowlist.sh remove telegram:123456

# List all allowed users
./scripts/manage-allowlist.sh list

Only the first channel identity needs to be allowlisted. When a user binds a second channel (e.g. Slack) via link, the new channel maps to their existing approved user — no separate allowlist entry needed.

Gotchas

AgentCore Runtime

  • Hybrid deploy: Runtime/Endpoint/ECR managed by Starter Toolkit (agentcore deploy), not CDK. CDK manages IAM Role, SG, S3, Lambda, etc. See ./scripts/deploy.sh for the 3-phase flow
  • ARM64 required: Build with --platform linux/arm64. This machine is ARM64 native — use --local-build mode
  • Docker Hub rate limit: Dockerfile uses public.ecr.aws/docker/library/node:22-slim (ECR Public Gallery) instead of Docker Hub to avoid anonymous pull rate limits in CodeBuild
  • IAM role names are region-suffixed: openclaw-agentcore-execution-role-{region} and openclaw-cron-scheduler-role-{region} to avoid cross-region conflicts (IAM roles are global)
  • Trust policy self-assume: Uses AccountRootPrincipal() + ArnEquals condition (not ArnPrincipal) to avoid chicken-and-egg during role creation
  • update-agent-runtime is a FULL REPLACE: Omitting --environment-variables wipes ALL env vars. Always include the full env vars JSON in every update call. This is the most common deployment mistake — the container starts but init fails because secrets/config env vars are missing
  • Resource names: Must match ^[a-zA-Z][a-zA-Z0-9_]{0,47}$ — underscores, not hyphens
  • Health check timing: Contract server on port 8080 must start within seconds
  • Per-user sessions: Contract returns Healthy (not HealthyBusy) — allows natural idle termination
  • Session recreation: InvokeAgentRuntime with terminated session creates new microVM; workspace restored on init
  • VPC endpoints: bedrock-agentcore-runtime endpoint not available in all regions
  • Starter Toolkit source_path: Must point to bridge/ directory (contains Dockerfile and all COPY sources). If source_path is project root, COPY commands fail because paths are relative to bridge/

IAM / Bedrock

  • Cross-region inference: Model minimax.minimax-m2.1 uses a global cross-region inference profile that routes to any available region — IAM uses arn:aws:bedrock:*::foundation-model/* and inference-profile wildcards
  • Inference profile ARN: Separate from foundation model — arn:aws:bedrock:{region}:{account}:inference-profile/*

Node.js 22 + VPC

  • IPv6 issue: Node.js 22 Happy Eyeballs fails in VPCs without IPv6 — force-ipv4.js patches dns.lookup() to force IPv4
  • NODE_OPTIONS: --dns-result-order=ipv4first --no-network-family-autoselection -r /app/force-ipv4.js

CDK

  • logs.RetentionDays is an enum — use helper in stacks/__init__.py
  • Cross-stack cyclic deps: use string ARN params + add_to_policy() instead of grant_*()
  • Empty cdk.json account: falls back to CDK_DEFAULT_ACCOUNT env var via app.py

OpenClaw

  • Startup takes ~1-2 minutes (plugin registration); lightweight agent shim handles messages during this time
  • Correct start command: openclaw gateway run --port 18789 --verbose (no --bind lan — localhost binding sufficient since both processes run in the same container)
  • Tool profile: Uses "full" profile with a deny list. Do NOT use "basic" (undocumented, may disable web tools). Documented profiles: minimal, coding, messaging, full
  • Deny list: ["write", "edit", "apply_patch", "read", "browser", "canvas", "cron", "gateway"] — local writes use S3 skill, read blocked to prevent credential reads, no browser/UI in container, EventBridge replaces built-in cron. exec is NOT denied — skills like clawhub-manage need it; scoped STS credentials limit blast radius
  • Sub-agent sandbox: Must be "off" — no Docker inside AgentCore microVMs. MicroVMs already provide per-user isolation
  • Sub-agent model: Configurable via SUBAGENT_BEDROCK_MODEL_ID env var (from subagent_model_id in cdk.json). Empty = use same as main model. Subagents use a distinct model name (bedrock-agentcore-subagent) so the proxy can detect and count them separately
  • skipBootstrap removed: No longer a valid config key — OpenClaw rejects unknown keys and exits with code 1
  • skills.allowBundled: Must be an array (e.g., [] for none, ["*"] for all), not a boolean. Set to [] for fast startup
  • ClawHub skill paths: clawhub install installs to managed skills path — OpenClaw scans this automatically. Custom skills in /skills/ loaded via extraDirs
  • ClawHub VirusTotal flags: Some skills flagged for external API calls — use --no-input --force for non-interactive Docker builds
  • 5 ClawHub skills installed: jina-reader, deep-research-pro, telegram-compose, transcript, task-decomposer (reduced from 8 — duckduckgo-search, hackernews, news-feed removed to optimize cold start; web search handled by lightweight agent's built-in web_search tool)
  • Image updates: New sessions use new image automatically (no keepalive restart needed)
  • WebSocket bridge protocol: Connect → auth (type:req, method:connect, protocol:3, auth:{token}) → agent.chat → streaming deltas → final
  • OpenClaw 2026.3.2 WebSocket origin enforcement: OpenClaw enforces origin checks on all WebSocket connections that carry an Origin header. The ws Node.js library must use the origin option (not headers.Origin) to set the header correctly for the HTTP upgrade request. Config: controlUi: { enabled: false, allowInsecureAuth: true, dangerouslyDisableDeviceAuth: true, allowedOrigins: ["*"] }. Without both the origin option on the client and allowedOrigins in config, connections fail with "Auth failed: origin not allowed"
  • Workspace sync overwrites config: The .openclaw/ S3 sync can overwrite openclaw.json with stale configs. openclaw.json is excluded from sync via SKIP_PATTERNS — config is always programmatically generated by writeOpenClawConfig()

Cognito Identity

  • Self-signup disabled — users auto-provisioned by proxy via AdminCreateUser
  • Passwords: HMAC-SHA256(secret, actorId).slice(0, 32) — deterministic, never stored
  • Usernames are channel-prefixed: telegram:123456789
  • JWT tokens cached per user with 60s early refresh

Router Lambda

  • API Gateway HTTP API: Only explicit routes exposed (POST /webhook/telegram, POST /webhook/slack, GET /health). Rate limiting: burst 50, sustained 100 req/s
  • Webhook validation: Telegram uses X-Telegram-Bot-Api-Secret-Token header (set via secret_token on setWebhook). Slack uses X-Slack-Signature HMAC-SHA256 with 5-minute replay window
  • Async dispatch: Self-invokes with InvocationType=Event for actual processing; returns 200 immediately to webhook
  • Slack: Handles url_verification challenge synchronously; ignores retries via x-slack-retry-num header
  • Cold start latency: First message to a new user triggers microVM creation; lightweight agent responds in ~10-15s while OpenClaw starts in background (~1-2 min)
  • Typing indicator + progress message: Telegram typing indicator sent every 4s while waiting; after 30s of waiting, a one-time progress message ("Working on your request...") is sent to both Telegram and Slack so users know the bot is still working during long subagent tasks
  • Content block extraction: _extract_text_from_content_blocks() recursively unwraps nested [{"type":"text","text":"..."}] JSON — subagent responses (deep-research-pro, task-decomposer) can wrap content multiple levels deep
  • Markdown-to-HTML conversion: _markdown_to_telegram_html() converts markdown to Telegram-compatible HTML before sending. Handles bold, italic, strikethrough, code blocks, inline code, headers, links, blockquotes, horizontal rules, and markdown tables (rendered as monospace <pre> blocks with aligned columns). Uses parse_mode: "HTML" (not "Markdown" v1 which is too strict for AI-generated content)
  • Cross-channel binding: "link accounts" generates 6-char code in DynamoDB with 10-min TTL
  • Image uploads: Telegram photos and Slack file attachments (JPEG, PNG, GIF, WebP, max 3.75 MB) are downloaded by the Router Lambda, uploaded to S3 under {namespace}/_uploads/, and passed to AgentCore as a structured message {text, images[{s3Key, contentType}]}
  • Telegram captions: message.get("text", "") or message.get("caption", "") — photos use caption, not text
  • Secret cache TTL: Secrets Manager values cached for 15 minutes (was indefinite). Rotated secrets reflected within 15 min without container restart

Image Upload Flow

  • Router Lambda downloads image from channel API (Telegram getFile / Slack url_private_download), uploads to S3 {namespace}/_uploads/img_{ts}_{hex}.{ext}
  • Contract server converts structured message to bridge text with [OPENCLAW_IMAGES:[...]] marker appended
  • Proxy extracts marker via regex, fetches image bytes from S3 (with namespace validation to prevent cross-user reads), builds Bedrock multimodal content blocks ({image: {format, source: {bytes}}})
  • Supported types: image/jpeg, image/png, image/gif, image/webp (max 3.75 MB per Bedrock limit)
  • Security: S3 key validated against user's namespace prefix + path traversal (..) rejection. Format validated against VALID_BEDROCK_FORMATS set
  • Slack prerequisite: Bot needs files:read OAuth scope to download image files

Workspace Sync

  • S3 prefix: {namespace}/.openclaw/ where namespace = actorId.replace(/:/g, "_")
  • Periodic saves: Every 5 min (configurable via WORKSPACE_SYNC_INTERVAL_MS)
  • SIGTERM grace: 10s max for final save before exit (AgentCore gives 15s total)
  • Skip patterns: node_modules/, .cache/, *.log, files > 10MB
  • Same S3 bucket: Uses S3_USER_FILES_BUCKET (shared with s3-user-files skill)

EventBridge Cron Scheduling

  • Schedule group: All schedules created under openclaw-cron group in EventBridge Scheduler
  • Schedule naming: openclaw-{namespace}-{shortId} (e.g., openclaw-telegram_123456789-87a86927)
  • DynamoDB storage: Schedule metadata stored as CRON# SK under the user's PK in the identity table
  • Cron executor Lambda: Warms up the user's AgentCore session (sends action: warmup), then sends the cron message (sends action: cron), then delivers the response to the user's chat channel
  • Lead time: Cron Lambda invoked with cron_lead_time_minutes (default 5 min) to allow session warmup before the scheduled time
  • Environment variables: Container receives EVENTBRIDGE_SCHEDULE_GROUP, CRON_LAMBDA_ARN, EVENTBRIDGE_ROLE_ARN, IDENTITY_TABLE_NAME, CRON_LEAD_TIME_MINUTES for the eventbridge-cron skill

AgentCore Browser (Optional)

  • Opt-in: Only active when enable_browser=true in cdk.json and BROWSER_IDENTIFIER is set as container env var
  • Session file: /tmp/agentcore-browser-session.json — written by contract server on init, read by skill scripts
  • Session timeout: 1 hour (BROWSER_SESSION_TIMEOUT_SECONDS = 3600). Session recreated automatically on expiry
  • Screenshot delivery: Screenshots uploaded to {namespace}/_screenshots/ in S3, embedded in response as [SCREENSHOT:key] marker. Router Lambda detects markers and delivers as photos to Telegram/Slack
  • Not available during warm-up: Browser skill requires full OpenClaw startup — the lightweight agent does not include browser tools
  • Lifecycle: startBrowserSession() called during init (parallel with proxy/OpenClaw start), stopBrowserSession() called on SIGTERM
  • Skill scripts: navigate.js (CDP Page.navigate), screenshot.js (CDP Page.captureScreenshot → S3 upload), interact.js (click/type/scroll/wait via CDP)

Per-User Identity Resolution

  • Priority order: (0) USER_ID env var (set by contract server) → (1) x-openclaw-actor-id header → (2) OpenAI user field → (3) message envelope parsing → (4) message name field → (5) fallback default-user
  • Per-user sessions: Contract server sets USER_ID env var when starting proxy, so identity is always resolved from environment in per-user mode
  • S3-backed isolation: User files in s3://openclaw-user-files-{account}-{region}/{namespace}/
  • Namespace immutability: System-determined from channel identity, cannot be changed by user request
  • actorId vs namespace: actorId uses colon format (telegram:123456789), namespace uses underscore format (telegram_123456789). Skill scripts (s3-user-files, eventbridge-cron) expect namespace format. The lightweight agent's chat() converts via userId.replace(/:/g, "_") before passing to tools. The proxy and workspace sync also use namespace format for S3 keys

Per-User Credential Isolation

  • STS session-scoped credentials: On init, the contract server calls STS:AssumeRole on the execution role with a minimal session policy that restricts S3 access to {namespace}/*. Other services (DynamoDB, Scheduler, SecretsManager) use Resource: "*" in the session policy — the execution role's own policy provides the actual resource-level restrictions. This design keeps the session policy under the AWS 2048-byte packed limit (long policies with per-resource Conditions easily exceed this)
  • Session policy size limit: AWS STS AssumeRole session policies have a 2048-byte packed limit. If exceeded, AssumeRole fails with "Packed policy consumes N% of allotted space". The current policy is ~668 bytes (well under limit). Adding Condition blocks (e.g., dynamodb:LeadingKeys, s3:prefix) quickly blows past the limit — avoid them in session policies
  • Credential files: Scoped credentials written to /tmp/scoped-creds/ in credential_process format. OpenClaw uses AWS_CONFIG_FILE + AWS_SDK_LOAD_CONFIG=1 to pick them up
  • OpenClaw env isolation: OpenClaw spawned with explicit env that excludes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, and AWS_CONTAINER_CREDENTIALS_FULL_URI
  • Credential refresh: 45-minute interval timer re-assumes the role and updates credential files (STS self-assume max duration is 1 hour)
  • Trust policy condition: STS self-assume trust requires sts:RoleSessionName matching "scoped-*" prefix, preventing unconditioned re-assumption
  • Zero-access fallback: If EXECUTION_ROLE_ARN is not set or STS fails, OpenClaw starts with zero AWS access (all credential env vars stripped). Tools will fail gracefully but no cross-user data access is possible
  • Proxy keeps full credentials: The proxy process is trusted code and retains full execution role credentials for Bedrock, Cognito, and S3 image access (with application-level namespace enforcement)

Workflow Conventions

Branch Awareness

Always confirm which git branch you are on BEFORE making any code changes or deployments. If the user specifies a branch, switch to it first and verify with git branch --show-current. Never assume the current branch is correct.

Deployment Target

Default deployment region is us-west-2. Hybrid deployment: CDK manages infrastructure (VPC, IAM, S3, Lambda, etc.), Starter Toolkit manages Runtime/Endpoint/ECR/Docker build. Use ./scripts/deploy.sh for the full 3-phase flow. After deploying, verify the old session/container is replaced — stale sessions can mask fixes.

Git Operations

Never push to any remote (GitHub, GitLab, or otherwise) without explicit user confirmation. Always ask before pushing.

Planning vs Implementation

When asked to create a plan, produce it concisely in ONE iteration. Do not endlessly revise or research unless asked. If the user says 'implement', move directly to code changes — do not re-plan. If a plan is approved, begin implementation immediately.

Adding a New Channel (Checklist)

To add a new messaging channel (e.g., WhatsApp, Discord, LINE), follow the Feishu implementation as a reference:

1. Secrets Manager (CDK: security_stack.py)

  • Add a new secret for the channel bot token/credentials
  • Export the secret name for cross-stack reference

2. Router Lambda (lambda/router/index.py)

  • Add credential fetching function (e.g., _get_feishu_credentials())
  • Add webhook validation function (e.g., validate_feishu_webhook())
  • Add message sending function (e.g., send_feishu_message())
  • Add progress notification function (e.g., _feishu_progress_notify())
  • Add main handler function (e.g., handle_feishu())
  • Wire into the Lambda handler: sync path (webhook validation + async dispatch) and async path (message processing)
  • Handle channel-specific features: event decryption (Feishu AES-256-CBC), signature verification, bot mention stripping (group chat), image download, etc.

3. API Gateway Route (CDK: router_stack.py)

  • Add POST /webhook/<channel> route
  • Pass the new secret name as Lambda environment variable

4. Cron Lambda (lambda/cron/index.py)

  • Add response delivery function for the new channel (e.g., send_feishu_message())

5. Setup Script (scripts/setup-<channel>.sh)

  • Interactive script: display webhook URL, prompt for credentials, store in Secrets Manager, add user to allowlist

6. Tests (lambda/router/test_<channel>.py)

  • Webhook validation, event parsing, message sending, edge cases

Key design decisions:

  • Webhook validation: Each channel has its own signature/token verification. Fail-closed (reject if validation fails)
  • Async dispatch: Return 200 immediately to the webhook, self-invoke Lambda asynchronously for processing (prevents webhook timeouts)
  • User ID format: <channel>:<platform_user_id> (e.g., feishu:ou_xxxx, telegram:123456)
  • Event encryption: Some platforms (Feishu) encrypt webhook events. Decrypt in the handler using platform-provided keys. Use system libcrypto (ctypes) for AES to avoid native dependency issues across Lambda architectures

Deployment Gotchas (Learned the Hard Way)

Starter Toolkit + CDK Hybrid

  • ECR repo naming: Starter Toolkit creates repos with bedrock-agentcore- prefix (e.g., bedrock-agentcore-openclaw_agent). CDK IAM policies must include this pattern — mismatch causes "initialization exceeded 120s" (misleading error, actually ECR permission denied)
  • update-agent-runtime does NOT replace running containers: Env var changes only apply to NEW sessions. Always agentcore stop-session after updating runtime env vars
  • Starter Toolkit --local-build skips CodeBuild: Useful for pre-pushed images. Default mode always triggers CodeBuild which rebuilds and overwrites the image tag
  • Starter Toolkit VPC subnet changes: "Immutable" via agentcore configure, but actually mutable via direct aws bedrock-agentcore-control update-agent-runtime API
  • CodeBuild Docker Hub rate limit: Dockerfile must use public.ecr.aws/docker/library/node:22-slim instead of Docker Hub

VPC + Bedrock

  • Cross-region inference profiles work through VPC endpoints: global.anthropic.claude-opus-4-6-v1 works fine through bedrock-runtime VPC endpoint (despite initial suspicion otherwise)
  • Security group egress: TCP 443 only is sufficient — DNS uses VPC internal resolver (not affected by SG)

Session Management

  • Session ID is deterministic: ses_{userId}_{hash} — same user always gets same session ID, so stop-session with the correct ID is essential after config changes
  • Cold start timing: VPC mode ~30-60s for ENI creation + image pull. First message triggers init (proxy + OpenClaw startup)

Git Worktree Guide

This project uses git worktrees for parallel branch development:

# Current worktrees
git worktree list

# The deploy branch is checked out at ~/g-repo/openclaw-deploy (worktree)
# The main repo is at ~/g-repo/sample-host-openclaw-on-amazon-bedrock-agentcore

# When done with the deploy branch, merge to main and clean up:
cd ~/g-repo/sample-host-openclaw-on-amazon-bedrock-agentcore
git checkout main
git merge deploy/starter-toolkit-hybrid
git worktree remove ~/g-repo/openclaw-deploy   # removes worktree directory
git branch -d deploy/starter-toolkit-hybrid     # delete branch if fully merged

# Or keep the worktree for continued work — no cleanup needed

Project Context

This is a Python/Node.js project (OpenClaw on AWS Bedrock AgentCore). Key components: Telegram bot, Slack Socket Mode, CDK infrastructure, Docker/ECR deployments, S3 workspace, per-user memory isolation. Subagents are OpenClaw-native running on the same AgentCore runtime — they are NOT separate Bedrock agents.