|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Repository Overview |
| 6 | + |
| 7 | +Configuration Anomaly Detection (CAD) is a Go-based system that reduces manual SRE effort by pre-investigating alerts, detecting cluster anomalies, and sending relevant communications to cluster owners. It integrates with PagerDuty webhooks and uses Tekton pipelines for automated remediation. |
| 8 | + |
| 9 | +## Development Commands |
| 10 | + |
| 11 | +### Building |
| 12 | +- `make build` - Build all subprojects (cadctl and interceptor) |
| 13 | +- `make build-cadctl` - Build only the cadctl binary to `./bin/cadctl` |
| 14 | +- `make build-interceptor` - Build only the interceptor binary to `./bin/interceptor` |
| 15 | + |
| 16 | +### Testing |
| 17 | +- `make test` - Run all tests for both cadctl and interceptor |
| 18 | +- `make test-cadctl` - Run unit tests for cadctl and pkg modules |
| 19 | +- `make test-interceptor` - Run unit tests for interceptor |
| 20 | +- `make test-interceptor-e2e` - Run e2e tests for interceptor |
| 21 | + |
| 22 | +### Linting |
| 23 | +- `make lint` - Lint all subprojects |
| 24 | +- `make lint-cadctl` - Lint cadctl using golangci-lint |
| 25 | +- `make lint-interceptor` - Lint interceptor using golangci-lint |
| 26 | + |
| 27 | +### Code Generation |
| 28 | +- `make generate-cadctl` - Generate mocks for cadctl using mockgen |
| 29 | + |
| 30 | +### Local Testing |
| 31 | +For testing against clusters: |
| 32 | +1. **Create a test cluster** - Manual tests that take a cluster ID need an existing cluster |
| 33 | +2. `./test/generate_incident.sh <alertname> <clusterid>` - Create test incident payload with the cluster ID |
| 34 | +3. `source test/set_stage_env.sh` - Export required environment variables from vault |
| 35 | +4. `./bin/cadctl investigate --payload-path payload` - Run investigation |
| 36 | + |
| 37 | +**Note**: Manual tests that take a cluster ID (such as the shell scripts above) only work against an existing cluster. Create the cluster first and pass its ID to the scripts; you can then trigger the PagerDuty alert for that cluster and have a local CAD run investigate it. |
| 38 | + |
| 39 | +## Architecture |
| 40 | + |
| 41 | +### Core Components |
| 42 | + |
| 43 | +**cadctl** - CLI tool implementing alert investigations and remediations |
| 44 | +- Entry point: `cadctl/main.go` |
| 45 | +- Commands: `cadctl/cmd/` |
| 46 | +- Investigations registry: `pkg/investigations/registry.go` |
| 47 | + |
| 48 | +**interceptor** - Tekton interceptor for webhook filtering |
| 49 | +- Entry point: `interceptor/main.go` |
| 50 | +- Filters PagerDuty webhooks and validates signatures |
| 51 | +- Determines if alerts have implemented handlers |
| 52 | + |
| 53 | +**investigations** - Modular alert investigation implementations |
| 54 | +- Location: `pkg/investigations/` |
| 55 | +- Each investigation implements the `Investigation` interface |
| 56 | +- Examples include `chgm`, `ccam`, and `apierrorbudgetburn` |
| 57 | + |
| 58 | +### Investigation Framework |
| 59 | + |
| 60 | +Investigations follow a consistent pattern: |
| 61 | +- Implement `Investigation` interface from `pkg/investigations/investigation/investigation.go` |
| 62 | +- Include `metadata.yaml` for RBAC permissions |
| 63 | +- Provide a `testing/` directory with manual test procedures |
| 64 | +- Are auto-registered in `pkg/investigations/registry.go` |
| 65 | + |
| 66 | +### Integrations |
| 67 | + |
| 68 | +Pre-initialized clients available in investigation resources: |
| 69 | +- **AWS** (`pkg/aws`) - Instance info, CloudTrail events |
| 70 | +- **OCM** (`pkg/ocm`) - Cluster info, service logs, limited support reasons |
| 71 | +- **PagerDuty** (`pkg/pagerduty`) - Alert info, incident management, notes |
| 72 | +- **K8s** (`pkg/k8s`) - Kubernetes API client |
| 73 | +- **osd-network-verifier** (`pkg/networkverifier`) - Network verification |
| 74 | + |
| 75 | +### Workflow |
| 76 | + |
| 77 | +1. PagerDuty webhook → Tekton EventListener |
| 78 | +2. Interceptor validates and filters webhooks |
| 79 | +3. If handler exists → PipelineRun starts |
| 80 | +4. Pipeline executes `cadctl investigate` |
| 81 | +5. Investigation runs and posts results to PagerDuty |
| 82 | + |
| 83 | +## Adding New Investigations |
| 84 | + |
| 85 | +1. `make bootstrap-investigation` - Generates boilerplate code and directory structure |
| 86 | +2. Implement investigation logic in generated files |
| 87 | +3. Add test objects/scripts to `testing/` directory |
| 88 | +4. Update investigation-specific README with testing procedures |
| 89 | +5. Follow progressive deployment: Informing Stage (read-only) → Actioning Stage (read/write) |
| 90 | + |
| 91 | +## Required Environment Variables |
| 92 | + |
| 93 | +For local development (available via `source test/set_stage_env.sh`): |
| 94 | +- `CAD_OCM_CLIENT_ID`, `CAD_OCM_CLIENT_SECRET`, `CAD_OCM_URL` - OCM client configuration |
| 95 | +- `CAD_PD_EMAIL`, `CAD_PD_PW`, `CAD_PD_TOKEN`, `CAD_PD_USERNAME` - PagerDuty authentication |
| 96 | +- `CAD_SILENT_POLICY` - PagerDuty silent policy |
| 97 | +- `PD_SIGNATURE` - PagerDuty webhook signature validation |
| 98 | +- `BACKPLANE_URL`, `BACKPLANE_INITIAL_ARN` - Backplane access |
| 99 | +- `CAD_PROMETHEUS_PUSHGATEWAY` - Metrics endpoint |
| 100 | + |
| 101 | +Optional: |
| 102 | +- `BACKPLANE_PROXY` - Optional in general, but required for local development |
| 103 | +- `CAD_EXPERIMENTAL_ENABLED=true` - Enable experimental investigations |
| 104 | +- `LOG_LEVEL` - Logging level (default: info) |