Skip to content

noahspahn/Self-Healing-Kubernetes-Lab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Self-Healing Kubernetes Lab

A sandbox that injects failures and automatically remediates via policies and controllers. The lab showcases policy-as-code, reliability automation, and hands-on systems engineering.

How it works

  1. A chaos experiment is started from the UI or the API.
  2. The backend records the experiment in DynamoDB and emits an EventBridge event.
  3. The policy engine records cluster events, decides a remediation, and emits a remediation trigger.
  4. The remediation executor simulates the action and marks the remediation + experiment as resolved.
  5. The React dashboard polls experiments, remediations, and cluster events to render the timeline.

Cluster events can also be pushed into the system via POST /cluster/events, which forwards to EventBridge and is captured by the same policy flow.

Architecture

  • Frontend: React control panel for experiments, policy status, remediation logs
  • Alerts: Flutter app for mobile push-style notifications
  • Backend API: Node.js Lambda handlers behind API Gateway
  • Policy engine: Python Lambda that decides remediations
  • Kubernetes: EKS cluster + Gatekeeper/Kyverno + Chaos Mesh samples
  • Observability: CloudWatch logs, EventBridge event routing
  • Storage: DynamoDB (experiments, remediations, cluster events) + S3 configs

Repo layout

  • infra/ AWS CDK app (EKS, API Gateway, Lambda, DynamoDB, EventBridge, S3)
  • backend/ Node Lambda handlers for the REST API
  • policy/ Python remediation engine
  • frontend/ React control panel
  • alerts/ Flutter alerts UI
  • k8s/ Chaos Mesh and policy samples
  • data-models/ canonical data models

API endpoints

  • POST /experiments start chaos test
  • GET /experiments list experiments
  • POST /experiments/{id}/stop stop experiment
  • GET /remediations remediation history
  • GET /cluster/events recent cluster events
  • POST /cluster/events ingest cluster events into EventBridge

Prerequisites

  • AWS account with permissions to deploy CDK resources
  • Node.js 18 and npm
  • AWS CLI + CDK v2 (npm install -g aws-cdk)
  • kubectl 1.28+ for cluster access

Deploy with CDK

Build the frontend so the CDK stack can upload frontend/dist to the static site bucket.

cd frontend
npm install
VITE_API_BASE_URL=https://your-api.execute-api.region.amazonaws.com/prod npm run build
cd infra
npm install
npx cdk bootstrap
npm run deploy

The stack outputs the API URL and the static site URL. After deploy, check infra/cdk-outputs.json or the CDK outputs in your terminal for:

  • ApiUrl
  • FrontendUrl
  • ClusterName

If this is your first deploy, you can deploy once to get the API URL, then rebuild the frontend with VITE_API_BASE_URL and deploy again to update the static site.

GitHub Actions deployment

The workflow at .github/workflows/deploy.yml deploys the CDK stack and uploads the built frontend to S3.

Configure these GitHub secrets:

  • AWS_ROLE_ARN: IAM role for GitHub OIDC to assume
  • VITE_API_BASE_URL: API Gateway base URL used at build time

Push to main or trigger the workflow manually to deploy. If you create a brand new API URL, update VITE_API_BASE_URL and rerun the workflow to redeploy the frontend with the new endpoint. Update AWS_REGION in the workflow if you deploy outside us-east-1.

Kubernetes setup

Install your policy engines and Chaos Mesh, then apply the sample manifests in k8s/ to generate real cluster events.

aws eks update-kubeconfig --name <cluster-name> --region <region>

kubectl apply -f k8s/chaos
kubectl apply -f k8s/policies

See k8s/README.md for the specific Chaos Mesh and policy samples.

Using the API

Start a CPU spike experiment:

curl -X POST "$API_URL/experiments" \
  -H "Content-Type: application/json" \
  -d '{"type":"cpu_spike","target":{"namespace":"default","deployment":"demo-app"}}'

Send a cluster event to EventBridge:

curl -X POST "$API_URL/cluster/events" \
  -H "Content-Type: application/json" \
  -d '{"eventType":"pod_deleted","source":"k8s","payload":{"pod":"demo-app-123"}}'

Fetch the remediation timeline:

curl "$API_URL/remediations"

Local development

Frontend:

cd frontend
npm install
VITE_API_BASE_URL=https://your-api.execute-api.region.amazonaws.com/prod npm run dev

Infrastructure:

cd infra
npm install
npm run deploy

Backend handlers can be built locally from backend/ with npm run build. See the package READMEs for deeper instructions.

About

A sandbox that injects failures and automatically remediates via policies and controllers.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors