A sandbox that injects failures and automatically remediates them through policies and controllers. The lab showcases policy-as-code, reliability automation, and hands-on systems engineering.
- A chaos experiment is started from the UI or the API.
- The backend records the experiment in DynamoDB and emits an EventBridge event.
- The policy engine records cluster events, decides on a remediation, and emits a remediation trigger.
- The remediation executor simulates the action and marks both the remediation and the experiment as resolved.
- The React dashboard polls experiments, remediations, and cluster events to render the timeline.
Cluster events can also be pushed into the system via POST /cluster/events, which forwards to EventBridge and is captured by the same policy flow.
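The decide-then-resolve portion of this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule table, the action names (`restart_deployment`, `scale_out`), and the `Remediation` shape are hypothetical and are not taken from the code in policy/.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule table mapping an event type to a remediation action.
# The real rules live in policy/ and may look quite different.
RULES = {
    "pod_deleted": "restart_deployment",
    "cpu_spike": "scale_out",
}


@dataclass
class Remediation:
    event_type: str
    action: str
    status: str = "pending"


def decide_remediation(event: dict) -> Optional[Remediation]:
    """Return a remediation for a known event type, or None if no rule matches."""
    event_type = event.get("eventType")
    action = RULES.get(event_type)
    if action is None:
        return None
    return Remediation(event_type=event_type, action=action)


def execute(remediation: Remediation) -> Remediation:
    """Simulate the action (no real cluster call) and mark it resolved."""
    remediation.status = "resolved"
    return remediation
```

Keeping the rules in a plain lookup table keeps the decision step easy to audit, which fits the policy-as-code theme; a real engine would likely need richer matching than an exact event-type lookup.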
- Frontend: React control panel for experiments, policy status, remediation logs
- Alerts: Flutter app for mobile push-style notifications
- Backend API: Node.js Lambda handlers behind API Gateway
- Policy engine: Python Lambda that decides remediations
- Kubernetes: EKS cluster + Gatekeeper/Kyverno + Chaos Mesh samples
- Observability: CloudWatch logs, EventBridge event routing
- Storage: DynamoDB (experiments, remediations, cluster events) + S3 configs
- infra/: AWS CDK app (EKS, API Gateway, Lambda, DynamoDB, EventBridge, S3)
- backend/: Node Lambda handlers for the REST API
- policy/: Python remediation engine
- frontend/: React control panel
- alerts/: Flutter alerts UI
- k8s/: Chaos Mesh and policy samples
- data-models/: canonical data models
- POST /experiments: start chaos test
- GET /experiments: list experiments
- POST /experiments/{id}/stop: stop experiment
- GET /remediations: remediation history
- GET /cluster/events: recent cluster events
- POST /cluster/events: ingest cluster events into EventBridge
- AWS account with permissions to deploy CDK resources
- Node.js 18 and npm
- AWS CLI + CDK v2 (npm install -g aws-cdk)
- kubectl 1.28+ for cluster access
Build the frontend so the CDK stack can upload frontend/dist to the static site bucket.
cd frontend
npm install
VITE_API_BASE_URL=https://your-api.execute-api.region.amazonaws.com/prod npm run build
cd infra
npm install
npx cdk bootstrap
npm run deploy
The stack outputs the API URL and the static site URL. After deploy, check infra/cdk-outputs.json or the CDK outputs in your terminal for:
- ApiUrl
- FrontendUrl
- ClusterName
If this is your first deploy, you can deploy once to get the API URL, then rebuild the frontend with VITE_API_BASE_URL and deploy again to update the static site.
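To avoid copying the API URL by hand between the two deploys, the outputs file can be read with a short script. The sketch below assumes the file follows the usual CDK outputs layout (a JSON object keyed by stack name, each holding output keys and values) and that the output key is exactly `ApiUrl`; `find_output` and `load_outputs` are hypothetical helper names.

```python
import json
from pathlib import Path


def find_output(outputs: dict, key: str) -> str:
    """Search every stack in a CDK outputs mapping for the given output key."""
    for stack_outputs in outputs.values():
        if key in stack_outputs:
            return stack_outputs[key]
    raise KeyError(f"{key} not found in any stack outputs")


def load_outputs(path: str = "infra/cdk-outputs.json") -> dict:
    """Load the outputs file written during deploy (path taken from this README)."""
    return json.loads(Path(path).read_text())
```

Calling `find_output(load_outputs(), "ApiUrl")` from the repository root would then return the value to use for VITE_API_BASE_URL when rebuilding the frontend.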
The workflow at .github/workflows/deploy.yml deploys the CDK stack and uploads the built frontend to S3.
Configure these GitHub secrets:
- AWS_ROLE_ARN: IAM role for GitHub OIDC to assume
- VITE_API_BASE_URL: API Gateway base URL used at build time
Push to main or trigger the workflow manually to deploy. If you create a brand new API URL, update VITE_API_BASE_URL and rerun the workflow to redeploy the frontend with the new endpoint. Update AWS_REGION in the workflow if you deploy outside us-east-1.
Install your policy engines and Chaos Mesh, then apply the sample manifests in k8s/ to generate real cluster events.
aws eks update-kubeconfig --name <cluster-name> --region <region>
kubectl apply -f k8s/chaos
kubectl apply -f k8s/policies
See k8s/README.md for the specific Chaos Mesh and policy samples.
Start a CPU spike experiment:
curl -X POST "$API_URL/experiments" \
-H "Content-Type: application/json" \
-d '{"type":"cpu_spike","target":{"namespace":"default","deployment":"demo-app"}}'
Send a cluster event to EventBridge:
curl -X POST "$API_URL/cluster/events" \
-H "Content-Type: application/json" \
-d '{"eventType":"pod_deleted","source":"k8s","payload":{"pod":"demo-app-123"}}'
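The same two POST requests can be issued from Python with only the standard library. This is a minimal sketch: `build_post` is a hypothetical helper, the fallback URL is a placeholder, and the request bodies simply mirror the curl examples above.

```python
import json
import os
import urllib.request

# Falls back to a placeholder when API_URL is not set in the environment.
API_URL = os.environ.get("API_URL", "https://your-api.example.invalid/prod")


def build_post(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request against the API."""
    return urllib.request.Request(
        url=f"{API_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


start_experiment = build_post(
    "/experiments",
    {"type": "cpu_spike", "target": {"namespace": "default", "deployment": "demo-app"}},
)
send_event = build_post(
    "/cluster/events",
    {"eventType": "pod_deleted", "source": "k8s", "payload": {"pod": "demo-app-123"}},
)

# To actually send a request:
# with urllib.request.urlopen(start_experiment) as resp:
#     print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes the payloads easy to inspect or unit test without touching a deployed API.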
Fetch the remediation timeline:
curl "$API_URL/remediations"
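Combining this response with the experiment and cluster event lists gives the kind of timeline the dashboard renders. The sketch below shows one way to merge them; it assumes each record carries a `timestamp` field holding an ISO 8601 UTC string, which is an assumption and not confirmed by the API, and `build_timeline` is a hypothetical helper.

```python
from typing import Dict, List


def build_timeline(
    experiments: List[Dict],
    remediations: List[Dict],
    cluster_events: List[Dict],
) -> List[Dict]:
    """Merge the three record lists into one list ordered by timestamp."""
    tagged = []
    sources = (
        ("experiment", experiments),
        ("remediation", remediations),
        ("cluster_event", cluster_events),
    )
    for kind, records in sources:
        for record in records:
            # Tag each record with its origin so the UI can style it.
            tagged.append({**record, "kind": kind})
    # ISO 8601 UTC strings in a uniform format sort correctly as plain strings.
    return sorted(tagged, key=lambda item: item["timestamp"])
```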
Frontend:
cd frontend
npm install
VITE_API_BASE_URL=https://your-api.execute-api.region.amazonaws.com/prod npm run dev
Infrastructure:
cd infra
npm install
npm run deploy
Backend handlers can be built locally from backend/ with npm run build. See the package READMEs for deeper instructions.