Jz/fresh account deployment fix #53

Open

jeffzeng-aws wants to merge 43 commits into anthropics:main from KBB99:jz/fresh-account-deployment-fix
Conversation

@jeffzeng-aws

No description provided.

zealoushacker and others added 30 commits January 30, 2026 07:33
Add infrastructure-as-code support so the agent can define serverless
backends (Lambda + API Gateway + DynamoDB) via CDK, with CI/CD handling
deployment and atomic SSM-based state signaling for async handoff.

- Install CDK CLI, AWS CLI v2, esbuild in Dockerfile
- Add CDK/AWS security validators (block deploy, allow synth + read-only)
- New deploy-infrastructure.yml workflow with atomic SSM JSON state
- Update deploy-preview.yml with workflow_call trigger + VITE_API_URL
- Add GitHubInfraDeployRole with scoped allowlist + explicit denies
- Add AgentCore read-only infra verification permissions
- Update prompts and BUILD_PLAN for serverless full-stack architecture
- Add infrastructure-aware prompt construction in agent harness
- Add API client utility and React Query to frontend scaffold
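The "atomic SSM-based state signaling" above means the whole deploy state is written as one JSON document in a single PutParameter call, so a reader never observes a half-updated state. A minimal sketch of that write, with `ssm_put` standing in for boto3's `ssm_client.put_parameter` and the parameter name and field names being illustrative assumptions, not the PR's actual schema:

```python
import json
import time

def signal_deploy_state(ssm_put, status, error=""):
    """Write the full deploy state as one JSON document in a single
    PutParameter call, so consumers always see a consistent snapshot.
    ssm_put stands in for boto3's ssm_client.put_parameter; the
    parameter name and fields here are illustrative assumptions."""
    state = {
        "status": status,           # e.g. "deploying", "succeeded", "failed"
        "error": error,             # populated only on failure
        "updated_at": int(time.time()),
    }
    ssm_put(
        Name="/canopy/deploy-state",  # hypothetical parameter name
        Value=json.dumps(state),
        Type="String",
        Overwrite=True,
    )
    return state
```

Because the state is a single parameter value rather than several, the agent-side poller gets either the old snapshot or the new one, never a mix.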

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update all region references from us-west-2 to us-east-1
- Configure Bedrock model access (CLAUDE_CODE_USE_BEDROCK=1)
- Set AgentCore runtime ID (claude_code_reinvent-1eBYMO7kHw)
- Make CDK stack resilient to missing AgentCore role (conditional)
- Import existing VPC via context to avoid VPC limit
- Add account ID suffix to S3 bucket names for uniqueness
- Remove backup plan (use EFS automatic backups instead)
- Update GitHub repo to KBB99 fork

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent structures monorepo projects with frontend/ workspace,
so the Vite build output lands at generated-app/frontend/dist/
instead of generated-app/dist/. Also fix the Vite config and
BrowserRouter patching to check both locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
workflow_dispatch triggers run from main, which doesn't have
the generated app. Check out the agent-runtime branch explicitly
when not triggered by a push event.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `make reset` target to wipe all agent state (branch, issues, SSM,
  S3, CloudFront) for clean restarts
- Restructure README.md with Quick Start, Creating a New Project,
  Resetting the Agent, and Configuration sections
- Create CLAUDE.md with architecture overview, PROJECT_NAME flow,
  CDK context variables, and common issues
- Update DEFAULT_MODEL to us.anthropic.claude-opus-4-6-v1 in Makefile
  and bedrock_entrypoint.py
- Add AgentCoreXRayPolicy to CDK stack granting xray:PutTraceSegments
  and xray:PutTelemetryRecords to fix OTEL trace export 403 errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hemas

The Canopy agent was gravitating toward frontend scaffolding and never
building backend/infra. Root cause: the BUILD_PLAN had no phasing and
the API spec was just a route list with no request/response schemas.

Changes:
- Replace <api_specification> with <api_contract> containing full Zod
  schemas for every entity, inferred types, and a typed endpoint map
- Add shared/ workspace package as the single source of truth for types
- Enforce phased execution: shared contract → infra + backend → frontend
- Remove all Dexie/IndexedDB fallback references (replaced by API-first)
- Update monorepo structure, implementation order, and critical paths
- Add "Phased Execution" section to system_prompt.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The critical step between CDK deploy and agent launch — rebuilding
and pushing the Docker image to ECR — was undocumented. Added a
"Deploying Changes" section with the full deployment sequence and a
table showing which changes require an image rebuild vs CDK deploy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When agent-runtime branch doesn't exist, the agent now clones from
BASE_BRANCH (env var, default: main) instead of hardcoded main. This
allows testing prompt/code changes on feature branches without merging.

Usage: make launch BASE_BRANCH=kb/improved-harness

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SDK env dict only had CLAUDE_CODE_USE_BEDROCK and AWS_REGION,
stripping IAM credentials needed for Bedrock auth. This caused
"Invalid API key" errors on every call. Now forwards static creds
and container credential endpoint env vars.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add permission_mode="bypassPermissions" to ClaudeAgentOptions for
  non-interactive container operation. Without this, the CLI defaults
  to prompting for interactive permission on each tool use, which fails
  in headless environments and may cause the "Invalid API key" errors
  we've been seeing.
- Remove redundant AWS credential forwarding from env dict. The SDK
  merges env with os.environ ({**os.environ, **env}), so the subprocess
  inherits all parent env vars automatically. Only Bedrock-specific
  overrides are needed.
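The merge semantics the commit relies on can be shown in a few lines; `build_subprocess_env` is a hypothetical helper name illustrating the `{**os.environ, **env}` behavior described above, not a function from the SDK:

```python
import os

def build_subprocess_env(overrides):
    """Mirror the SDK's {**os.environ, **env} merge: the subprocess
    inherits every parent variable (including IAM credentials from the
    container), and only the explicitly passed keys are overridden."""
    return {**os.environ, **overrides}

# Only Bedrock-specific overrides need to be passed explicitly:
env = build_subprocess_env({
    "CLAUDE_CODE_USE_BEDROCK": "1",
    "AWS_REGION": "us-east-1",
})
```

Because later keys win in a dict merge, anything listed in `overrides` shadows the parent value while everything else passes through untouched.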

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause of "Invalid API key" errors: The Docker container runs as
root, and Claude CLI refuses --dangerously-skip-permissions (used by
permission_mode="bypassPermissions") when running as root/sudo.

Fixes:
- Add non-root 'agent' user to Dockerfile and switch to it
- Add permission_mode="bypassPermissions" to ClaudeAgentOptions for
  non-interactive autonomous operation
- Fix AWS CLI install URL to auto-detect ARM64 vs x86_64 architecture
- Update CLAUDE.md to document --platform linux/arm64 for docker build
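The Dockerfile performs the architecture detection in shell; the same selection logic can be sketched in Python (the published AWS CLI v2 bundle names are `awscli-exe-linux-x86_64.zip` and `awscli-exe-linux-aarch64.zip`; the helper name is illustrative):

```python
import platform

def awscli_install_url():
    """Pick the AWS CLI v2 bundle matching the build machine, mirroring
    the uname-based detection the Dockerfile fix performs in shell."""
    machine = platform.machine().lower()
    arch = "aarch64" if machine in ("arm64", "aarch64") else "x86_64"
    return f"https://awscli.amazonaws.com/awscli-exe-linux-{arch}.zip"
```

This matters for this PR because the AgentCore container is built with `--platform linux/arm64`, so a hardcoded x86_64 URL would install a binary that cannot run in the image.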

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The handler's async generator monitoring loop may not execute if the
AgentCore framework stops consuming the generator after the initial
streaming response. This left the post-commit hook as the only push
mechanism, which can silently fail.

Added a polling push loop directly in run_agent_background() that
runs every PUSH_INTERVAL_SECONDS (default 300s) in the same thread
as the agent subprocess. This ensures commits get pushed regardless
of whether the handler's generator is consumed.
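A simplified stand-in for that loop, with `push_fn` wrapping whatever `git push` invocation the real harness uses (the function and parameter names here are illustrative, not the harness's actual API):

```python
import threading

def polling_push_loop(push_fn, stop_event, interval_seconds=300):
    """Call push_fn every interval_seconds until stop_event is set,
    then push once more so the agent's final commits are not lost.
    Event.wait doubles as the sleep and returns True once stop is set."""
    while not stop_event.wait(timeout=interval_seconds):
        push_fn()
    push_fn()  # final push after the agent subprocess exits
```

Using `Event.wait` as the sleep means a shutdown signal interrupts the interval immediately instead of waiting out the full five minutes, and the trailing push covers commits made between the last tick and agent exit.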

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause of all agent failures (issues #5-#13): the update-runtime-env
Makefile target was missing CLAUDE_CODE_USE_BEDROCK and AWS_REGION from
the environment-variables JSON. Every time we ran `make update-runtime-env`,
those critical vars were wiped, causing Claude CLI to try the Anthropic API
(not Bedrock) and fail silently.

Also adds Python logging to bedrock_entrypoint.py so agent subprocess output
is captured by OTEL auto-instrumentation and visible in CloudWatch. Previously
all print() output was invisible.

Changes:
- Makefile: Add CLAUDE_CODE_USE_BEDROCK=1 and AWS_REGION to update-runtime-env
- bedrock_entrypoint.py: Replace print() with logging.getLogger() in critical
  paths, pipe subprocess stdout through logger for OTEL visibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds sections covering: what's working (full pipeline), what's not
(agent ignores phased execution), key config values, step-by-step
quick start for running a new test, monitoring commands, and pitfalls
discovered during debugging (env var wipe, print vs logging, ARM64,
non-root requirement, commit timing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld order

The agent was skipping shared/ and backend/ phases because the system prompt
incentivized UI quality and 200+ tests upfront, causing it to build a
frontend-only app. This replaces grading language with phase-gate evaluation,
reduces test count from 200 to ~50 weighted toward backend/infra, adds the
Phased Execution section to the canopy prompt, and simplifies test verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…100% frontend test bias

The agent wrote 220 tests that were ALL frontend UI tests because the
security hook required a Playwright screenshot to mark any test as passing.
With no equivalent verification path for backend tests, the agent rationally
skipped non-frontend tests entirely.

Changes:
- Add backend-verify.cjs scaffold script for shared/infra/backend test verification
- Add alternative -result.txt verification path in security.py (alongside screenshot path)
- Allowlist AWS data-plane commands (DynamoDB scan/query, CloudWatch logs, Lambda invoke)
- Add IAM policy for canopy-* scoped DynamoDB, CloudWatch Logs, and Lambda access
- Add Backend Test Verification section and Resource Naming Convention to prompts
- Update initial/continuation messages with backend-verify.cjs guidance and VITE_API_URL wiring
- Block Write tool from forging -result.txt files (must use backend-verify.cjs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CDK deploy was attaching IAM policies (Secrets Manager, SSM,
CloudWatch, etc.) to AmazonBedrockAgentCoreSDKRuntime instead of
claude-code-agentcore-role, which is the role the container actually
assumes. This caused the container to silently crash on startup when
it couldn't access Secrets Manager for the GitHub token.

- Add AGENTCORE_ROLE_NAME and VPC_ID as Makefile variables with defaults
- Wire them into deploy-infra target so make deploy-infra just works
- Document the pitfall in CLAUDE.md lessons learned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The workflow failed with "Cannot find asset at generated-app/backend/dist"
because CDK references backend Lambda code as a bundled asset but the
workflow never built the shared or backend packages before running tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent was writing CDK infrastructure and backend handlers simultaneously,
then never checking if CI/CD deployed the stack. The frontend was never wired
to the real API. Now:

- Phase 2 split into 2a (CDK + stubs, commit) and 2b (wait for deploy, then
  implement real handlers)
- Agent must poll SSM deploy-state and wait for "succeeded" before writing
  full backend handlers or frontend API calls
- VITE_API_URL wiring is part of Phase 3
- CLAUDE.md updated with current state (issue #22 success, remaining issues)
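The "poll SSM and wait" step can be sketched as a small state machine. Here `get_state` stands in for reading the SSM parameter (e.g. via boto3's `ssm.get_parameter`), and the JSON field names are assumptions about the workflow's schema rather than the PR's exact format:

```python
import json

def wait_for_deploy(get_state, pause, max_polls=60):
    """Poll a deploy-state fetcher until it reports a terminal status.
    get_state returns the JSON string the CI/CD workflow writes to SSM;
    pause is called between polls (e.g. lambda: time.sleep(30))."""
    for _ in range(max_polls):
        state = json.loads(get_state())
        if state.get("status") == "succeeded":
            return state
        if state.get("status") == "failed":
            raise RuntimeError(f"deploy failed: {state.get('error', '')}")
        pause()
    raise TimeoutError("deploy did not reach a terminal state")
```

Treating "failed" as terminal (with the error surfaced) rather than retrying forever is what keeps the agent from spinning in the poll loop when a deploy stalls.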

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r polished frontend

Update CLAUDE.md to reflect prompt changes from 3744400 and note next step
is a live agent test. Update grading in both system prompts to explicitly
call out that localStorage fallback = incomplete and deployed API = highest
score, reinforcing the CDK-first deployment flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…builds

deploy-infrastructure: Replace hardcoded npm ci with lockfile-aware fallback
(npm ci if lock exists, npm install otherwise) for infrastructure, shared,
and backend install steps. The agent doesn't generate lockfiles.
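The fallback rule is simple enough to state in a few lines. The workflow implements it in shell; this is an equivalent sketch with an illustrative helper name:

```python
from pathlib import Path

def npm_install_cmd(package_dir):
    """Choose the install command the way the updated workflow does:
    `npm ci` requires package-lock.json, so fall back to `npm install`
    when the agent didn't generate a lockfile."""
    lockfile = Path(package_dir) / "package-lock.json"
    return "npm ci" if lockfile.exists() else "npm install"
```

`npm ci` fails hard when no lockfile exists, so without this guard every agent-generated package broke the install step.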

deploy-preview: Add shared package build step before frontend build (frontend
imports from @canopy/shared). Add fallback to build frontend/ directly when
root-level npm run build fails (Vite can't resolve index.html via workspace
delegation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ructure

The "Signal deploy failed" handler runs on failure() but previously had no
AWS credentials when build/test steps failed (credentials were configured
after synth). Now credentials are configured right after Node setup so the
failure handler can always write status to SSM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove separate shared/backend build steps from deploy-infrastructure
workflow. CDK's NodejsFunction bundles with esbuild at synth time,
resolving @canopy/shared imports automatically.

Update prompts to explicitly require NodejsFunction (not lambda.Function
with Code.fromAsset) and ensure shared/package.json has a build script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vite resolves workspace package imports directly via esbuild. No need
to pre-build shared/ before the frontend build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Early agent commits (shared/, infrastructure/) trigger deploy-preview
but have no buildable frontend. Instead of failing, skip the deploy
and let later commits with frontend code trigger a successful build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion bundling

The agent's Lambda handlers imported uuid which caused esbuild bundling
failures in CI/CD (Docker fallback can't resolve node_modules). Prompts
now instruct the agent to use Node.js built-in crypto.randomUUID() and
configure NodejsFunction to only externalize @aws-sdk/*.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent polls SSM for deploy status but previously only saw
"status":"failed" with no details. Now the deploy-infrastructure
workflow captures build/test/synth/deploy output and includes the
last 40 lines in the SSM error field on failure. Prompts updated
to tell the agent to read the error field and fix the issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ts-node has path resolution issues on GitHub Actions runners where npx
picks up a cached version instead of the project-local one, causing
"Cannot find module ./canopy.ts" errors during cdk deploy. tsx is a
drop-in replacement that handles this reliably.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KBB99 and others added 13 commits February 20, 2026 16:48
When CDK deploy fails, the outputs file isn't created, causing a jq error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ting projects

Enhancement mode (triggered when agent-runtime already has generated-app/)
crashed with FileNotFoundError on system_prompt.txt because:
1. claude_code.py only ran setup_session_prompts() for fresh builds,
   so generated-app/prompts/ was never created in enhancement mode
2. bedrock_entrypoint.py didn't pass --project in enhancement mode,
   so prompts_dir resolved to the wrong directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The deploy-preview workflow was building the frontend with an empty
VITE_API_URL because the "Resolve API URL" step ran before AWS
credentials were configured. The raw IAM user creds lack
cloudformation:DescribeStacks, so the call silently failed and Vite
inlined shouldUseApi() as `return false`, causing the entire app to
fall back to localStorage.

- Move "Configure AWS credentials" before "Resolve API URL" and "Build"
- Add cloudformation:DescribeStacks to the preview deploy IAM role

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add read-only commands so the agent can diagnose deployment issues
instead of spinning in an SSM poll loop when deploys stall:
- cloudformation describe-stack-events
- cloudformation describe-stack-resources
- ssm get-parameters-by-path
- lambda get-function-url-config

Also update CLAUDE.md with issue #26-#28 postmortems and lessons learned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `| tee` pipes in install, test, synth, and deploy steps were
swallowing non-zero exit codes — tee always exits 0, so cdk deploy
failures didn't fail the step, which meant `failure()` was false and
"Signal deploy failed" never fired. SSM stayed stuck at "deploying"
and the agent polled forever.

Adding `set -o pipefail` ensures pipeline exit codes propagate correctly.
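The failure mode is easy to reproduce: in bash, a pipeline's exit code is the last command's unless pipefail is set. A small demonstration, driving bash from Python (assumes `bash` and `tee` are on PATH; the helper name is illustrative):

```python
import subprocess

def pipeline_exit_code(pipefail: bool) -> int:
    """Run a failing command piped through tee, as the workflow steps do,
    and return the pipeline's exit code."""
    prefix = "set -o pipefail; " if pipefail else ""
    return subprocess.run(
        ["bash", "-c", prefix + "false | tee /dev/null"],
    ).returncode
```

Without pipefail the pipeline reports tee's exit code (0) even though `false` failed; with pipefail it reports 1, which is what lets the step fail and the `failure()` handler fire.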

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent generates CDK code from scratch each run, so construct IDs
were inconsistent (e.g. CanopyMainTable vs CanopyTable). CloudFormation
sees different logical IDs as new resources and fails with "already
exists" when the physical name matches an existing resource.

Pin the exact construct IDs and route structure to match the deployed
stack so incremental CDK updates work across runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CF stack persisted across resets because make reset only deleted
the agent-runtime branch, not the deployed infrastructure. The next
agent run would generate fresh CDK code with different construct IDs,
causing "already exists" errors on deploy.

Now runs cdk destroy first (with cloudformation delete-stack fallback),
so each fresh run starts with a clean slate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Not needed — make reset now destroys the CF stack, so each fresh run
starts clean and the agent can use whatever construct IDs it wants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a user starts a Claude Code session without a specific task, CLAUDE.md
now instructs Claude to run an interactive onboarding flow: check prerequisites,
then either deploy Canopy or walk through creating a new project with a
BUILD_PLAN.md generated from a stack-agnostic template.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update runtime ID, AWS account ID, VPC, GitHub repo, and BASE_BRANCH
to point to the local deployment in account 405645222728.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove all hardcoded AWS account IDs, runtime IDs, VPC IDs, and repo
names so any deployer can fork and run this from a fresh AWS account.

Key changes:
- Makefile: clear account-specific defaults, add -include Makefile.local,
  add APP_STACK_NAME, dynamic AWS_ACCOUNT_ID/AGENT_RUNTIME_ARN, add
  create-runtime target (replaces broken agentcore launch CLI), replace
  canopy-app-stack with $(APP_STACK_NAME) in reset target
- Makefile.local.example: new template for per-deployer config
- claude-code-stack.ts: add AgentCoreECRPolicy, parameterize canopy-*
  resource ARNs via appName context, widen Bedrock policy to * regions
- bedrock_entrypoint.py: replace 5 hardcoded runtime ARN fallbacks with
  _resolve_agent_runtime_arn() helper (derives from AGENT_RUNTIME_ID + STS)
- agent-builder.yml: AWS_REGION/AGENTCORE_AGENT_ID from vars, derive
  account ID via STS at invoke time
- stop-agent-on-close.yml: same vars pattern, STS account lookup in Python
- deploy-preview.yml: AWS_REGION from vars, APP_CDK_STACK_NAME for stack
- deploy-infrastructure.yml: same
- ONBOARDING.md: replace runtime ID, repo, VPC, ECR, branch with placeholders
- README.md: fix deployment steps, add missing secrets/vars, fix defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>