An autonomous AI agent for Kubernetes debugging and remediation, powered by Google's Agent Development Kit (ADK) and Gemini AI.
The Kubernetes Agent is an intelligent system that:
- Debugs Kubernetes pods automatically
- Analyzes logs, events, and metrics
- Identifies root causes of issues
- Creates GitHub pull requests with fixes
- Integrates with Argo Rollouts for canary analysis
Key capabilities:

- Pod Debugging: Analyze pod status, conditions, and container states
- Events: Retrieve and correlate cluster events
- Logs: Fetch and analyze container logs (including previous crashes)
- Metrics: Check resource usage and limits
- Resources: Inspect related deployments, services, and configmaps
- Git Operations: Clone, branch, commit, push (using JGit library)
- GitHub PRs: Automatically create pull requests with:
  - Root cause analysis
  - Code fixes
  - Testing recommendations
  - Links to Kubernetes resources
- REST API: Expose analysis capabilities via HTTP
- Integration: Works with `rollouts-plugin-metric-ai` for canary analysis
Architecture overview:

```
Argo Rollouts Analysis
        ↓
rollouts-plugin-metric-ai
        ↓ (A2A HTTP)
Kubernetes Agent (ADK)
├── K8s Tools (Fabric8 client)
├── Git Operations (JGit)
├── GitHub PR (GitHub API)
└── AI Analysis (Gemini)
```
Prerequisites:

- Java 17+
- Maven 3.8+
- Kubernetes cluster
- Google API Key (Gemini)
- GitHub Personal Access Token
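A quick sanity check of the toolchain before building:

```bash
java -version              # should report 17 or newer
mvn -version               # should report 3.8 or newer
kubectl version --client
```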
Build:

```bash
cd kubernetes-agent
mvn clean package
```

Set the required credentials:

```bash
export GOOGLE_API_KEY="your-google-api-key"
export GITHUB_TOKEN="your-github-token"
```

Run in interactive console mode:

```bash
java -jar target/kubernetes-agent-1.0.0.jar console
```

Run as an A2A server:

```bash
java -jar target/kubernetes-agent-1.0.0.jar
# Server starts on port 8080
# Health check: http://localhost:8080/a2a/health
```

Build and push the container image:

```bash
docker build -t csanchez/kubernetes-agent:latest .
docker push csanchez/kubernetes-agent:latest
```

Configure the secrets:

```bash
# Copy template
cp deployment/secret.yaml.template deployment/secret.yaml
# Edit secret.yaml and add your keys
# Then apply:
kubectl apply -f deployment/secret.yaml
```

Deploy:

```bash
# Update image in deployment/deployment.yaml if needed
kubectl apply -k deployment/
```

Verify:

```bash
# Check pods
kubectl get pods -n argo-rollouts | grep kubernetes-agent
# Check logs
kubectl logs -f deployment/kubernetes-agent -n argo-rollouts
# Test health endpoint
kubectl port-forward -n argo-rollouts svc/kubernetes-agent 8080:8080
curl http://localhost:8080/a2a/health
```

The test-agent.sh script supports both Kubernetes and local modes:

```bash
# Test agent running in Kubernetes (default)
./test-agent.sh k8s
# Test agent running locally on localhost:8080
./test-agent.sh local
# Use custom local URL
LOCAL_URL=http://localhost:9090 ./test-agent.sh local
# Use custom Kubernetes context
CONTEXT=my-k8s-context ./test-agent.sh k8s
```

The test script will:
- ✅ Check health endpoint
- ✅ Send a sample analysis request
- ✅ Verify no errors in logs (K8s mode only)
Example console session:

```
$ java -jar kubernetes-agent.jar console
You > Debug pod my-app-canary in namespace production
Agent > Analyzing pod my-app-canary in namespace production...
[Agent gathers debug info, logs, events...]
Root Cause: Container crashloop due to OOMKilled - memory limit too low
Recommendation:
1. Increase memory limit from 256Mi to 512Mi
2. Add resource requests to prevent overcommitment
3. Review memory usage patterns in logs
```

The agent exposes a REST API for other systems to use:
Endpoint: `POST /a2a/analyze`

Request:

```json
{
"userId": "argo-rollouts",
"prompt": "Analyze canary deployment issue. Namespace: rollouts-test-system, Pod: canary-demo-xyz",
"context": {
"namespace": "rollouts-test-system",
"podName": "canary-demo-xyz",
"stableLogs": "...",
"canaryLogs": "..."
}
}
```

Response:

```json
{
"analysis": "Detailed analysis text...",
"rootCause": "Identified root cause",
"remediation": "Suggested fixes",
"prLink": "https://github.com/owner/repo/pull/123",
"promote": false,
"confidence": 85
}
```
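For example, calling the endpoint from the command line, reusing the port-forward from the verification steps above (the namespace and pod name mirror the sample request; the log fields are omitted here):

```bash
curl -s -X POST http://localhost:8080/a2a/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "userId": "argo-rollouts",
    "prompt": "Analyze canary deployment issue. Namespace: rollouts-test-system, Pod: canary-demo-xyz",
    "context": {
      "namespace": "rollouts-test-system",
      "podName": "canary-demo-xyz"
    }
  }'
```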
To call the agent from Argo Rollouts, reference it in an AnalysisTemplate:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-with-agent
spec:
  metrics:
  - name: ai-analysis
    provider:
      plugin:
        ai-metric:
          # Use agent mode
          analysisMode: agent
          namespace: "{{args.namespace}}"
          podName: "{{args.canary-pod}}"
          # Fallback to default mode
          stablePodLabel: app=rollouts-demo,revision=stable
          canaryPodLabel: app=rollouts-demo,revision=canary
          model: gemini-3-flash-preview
```

The plugin will:

- Check if agent is healthy
- Send analysis request with logs
- Receive intelligent analysis
- Get PR link if fix was created
- Decide to promote or abort canary
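To wire this template into a Rollout, reference it from a canary analysis step. A minimal sketch; the Rollout name, labels, image, and literal arg values below are illustrative (in practice you would supply the canary pod name dynamically):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-demo
  namespace: rollouts-test-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
  strategy:
    canary:
      steps:
      - setWeight: 20
      - analysis:
          templates:
          - templateName: canary-analysis-with-agent
          args:
          - name: namespace
            value: rollouts-test-system
          - name: canary-pod
            value: canary-demo-xyz
```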
Environment variables:

| Variable | Required | Description |
|---|---|---|
| `GOOGLE_API_KEY` | Conditional | Google Gemini API key (required when using Gemini models) |
| `GITHUB_TOKEN` | Yes | GitHub personal access token |
| `GIT_USERNAME` | No | Git commit username (default: "kubernetes-agent") |
| `GIT_EMAIL` | No | Git commit email (default: "agent@example.com") |
| `GEMINI_MODEL` | Yes | Model to use (e.g., "gemini-3-flash-preview" or "gemma-3-1b-it") |
| `VLLM_API_BASE` | Conditional | vLLM server base URL (required when using gemma-* models) |
| `VLLM_API_KEY` | No | vLLM API key (default: "not-needed") |
| `K8S_AGENT_URL` | No | Agent URL for plugin (default: `http://kubernetes-agent.argo-rollouts.svc.cluster.local:8080`) |
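The sensitive values can also be created as a secret directly from the command line; the secret name here is an assumption, so match whatever deployment/secret.yaml.template defines:

```bash
kubectl create secret generic kubernetes-agent-secrets -n argo-rollouts \
  --from-literal=GOOGLE_API_KEY="your-google-api-key" \
  --from-literal=GITHUB_TOKEN="your-github-token"
```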
The agent supports two model modes: hosted Gemini, or a Gemma model served from a vLLM endpoint.

Gemini mode:

```yaml
env:
- name: GOOGLE_API_KEY
  value: "your-gemini-api-key"
- name: GEMINI_MODEL
  value: "gemini-3-flash-preview"
```

Gemma (vLLM) mode:

```yaml
env:
- name: GEMINI_MODEL
  value: "gemma-3-1b-it" # or gemma-2-9b-it; any model name starting with "gemma-"
- name: VLLM_API_BASE
  value: "http://gemma-1b-server.gemma-system.svc.cluster.local:8000" # or gemma-9b-server
- name: VLLM_API_KEY
  value: "not-needed" # Optional
```

Note: When `GEMINI_MODEL` starts with `gemma-`, the agent automatically uses the vLLM endpoint specified in `VLLM_API_BASE`. Available deployments: `gemma-1b-server` (Gemma 3 1B), `gemma-9b-server` (Gemma 2 9B). See deployment/gemma/README.md.
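Before pointing the agent at a vLLM server, you can confirm it is reachable from inside the cluster; vLLM exposes an OpenAI-compatible API, so `/v1/models` should list the loaded model:

```bash
kubectl run vllm-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://gemma-1b-server.gemma-system.svc.cluster.local:8000/v1/models
```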
Recommended settings for production:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
```

Agent not starting:

```bash
# Check logs
kubectl logs deployment/kubernetes-agent -n argo-rollouts
# Common issues:
# 1. Missing API keys - check secrets
# 2. Invalid service account - check RBAC
# 3. Out of memory - increase limits
```

Health check:

```bash
# Test endpoint directly
kubectl port-forward -n argo-rollouts svc/kubernetes-agent 8080:8080
curl http://localhost:8080/a2a/health
# Should return:
# {"status":"healthy","agent":"KubernetesAgent","version":"1.0.0"}# Check GitHub token permissions:
# - repo (full control)
# - workflow (if modifying GitHub Actions)
# Check logs for git errors:
kubectl logs deployment/kubernetes-agent -n argo-rollouts | grep -i "git\|github"
```

Security considerations:

- RBAC: Agent only has read access to K8s resources (no write)
- Secrets: Store API keys in Kubernetes secrets
- Network: Use NetworkPolicies to restrict egress (see the sketch after this list)
- Git: Use fine-grained personal access tokens
- Review: Always review PRs before merging
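As a starting point for the egress restriction, a minimal NetworkPolicy sketch; the pod label is an assumption (match your deployment), and you should tighten the destinations to your DNS, Kubernetes API, GitHub, and model endpoints:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kubernetes-agent-egress
  namespace: argo-rollouts
spec:
  podSelector:
    matchLabels:
      app: kubernetes-agent   # assumed label
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 53
      protocol: UDP           # DNS
  - ports:
    - port: 443
      protocol: TCP           # Kubernetes API, GitHub, LLM endpoints
```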
Project layout:

```
kubernetes-agent/
├── src/main/java/com/google/adk/samples/agents/k8sagent/
│   ├── KubernetesAgent.java   # Main agent
│   ├── tools/                 # K8s debugging tools
│   ├── remediation/           # Git and GitHub operations
│   └── a2a/                   # A2A REST controllers
├── deployment/                # Kubernetes manifests
├── pom.xml                    # Maven config
└── Dockerfile                 # Container image
```
Run the tests:

```bash
mvn test
```

Build a multi-arch image:

```bash
docker buildx build --platform linux/amd64,linux/arm64 \
  -t csanchez/kubernetes-agent:latest \
  --push .
```

Roadmap:

- Multi-cluster support
- Historical analysis (learn from past incidents)
- Cost optimization recommendations
- Security vulnerability detection
- Self-healing capabilities
- Slack/PagerDuty notifications
- Advanced code analysis before fixes
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For issues or questions:
- GitHub Issues: https://github.com/carlossg/kubernetes-agent/issues
- Documentation: See the `docs/` directory and inline README files
Additional development documentation is available in the docs/development/ directory:
- Maven plugin integration and logging
- Debug mode configuration
- Testing strategies
- Rate limiting and fork modes
The agent supports analyzing canary deployments with multiple LLMs running in parallel, using confidence-weighted voting for the final decision.
Enable multi-model analysis:

```yaml
- name: ENABLE_MULTI_MODEL
  value: "true"
- name: MODELS_TO_USE
  value: "gemini-2.5-flash,gemma-3-1b-it,gemma-2-9b-it" # Comma-separated (optional; only one vLLM model at a time when using VLLM_API_BASE)
- name: VOTING_STRATEGY
  value: "weighted" # Confidence-weighted voting
```

How it works:

- Parallel Execution: Each model analyzes independently and simultaneously
- Confidence-Weighted Voting (see the worked example after this list):
  - `promote_score = Σ(confidence/100)` for PROMOTE votes
  - `rollback_score = Σ(confidence/100)` for ROLLBACK votes
- Decision: PROMOTE if `promote_score > rollback_score`
- Consolidated Reporting: All model results preserved in response
- GitHub Issue Creation: On rollback, creates issue with:
- Voting breakdown (promote vs rollback scores)
- Individual model recommendations with confidence
- Detailed analyses from each model
- Timestamp and rollout metadata
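A worked example using the two votes from the sample response below, where both models vote ROLLBACK with confidences 85 and 72:

```
promote_score  = 0.00
rollback_score = 85/100 + 72/100 = 1.57
rollback_score > promote_score  →  final decision: ROLLBACK
```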
Benefits:

- Higher Reliability: Multiple perspectives reduce false positives/negatives
- Confidence Validation: Cross-validation between models
- Model Diversity: Different models may catch different types of issues
- Full Observability: Complete transparency into each model's reasoning
- Fast: Parallel execution means latency ≈ slowest model (~3-5s for 2 models)
Example multi-model response:

```json
{
  "analysis": "Multi-model analysis consensus:\n\n--- gemini-3-flash-preview ---\nDatabase connection timeout detected...\n\n--- gemma-3-1b-it ---\nHigh error rate in canary logs...",
  "rootCause": "gemini-3-flash-preview: Database connection timeout; gemma-3-1b-it: High error rate",
  "remediation": "- Increase database connection timeout\n- Add retry logic with exponential backoff",
  "promote": false,
  "confidence": 78,
  "modelResults": [
    {
      "modelName": "gemini-3-flash-preview",
      "promote": false,
      "confidence": 85,
      "executionTimeMs": 3245
    },
    {
      "modelName": "gemma-3-1b-it",
      "promote": false,
      "confidence": 72,
      "executionTimeMs": 2891
    }
  ],
  "votingRationale": "Confidence-weighted voting: Promote=0.00, Rollback=1.57. Final decision: ROLLBACK.\n\nIndividual model votes:\n- gemini-3-flash-preview: ROLLBACK (confidence: 85%)\n- gemma-3-1b-it: ROLLBACK (confidence: 72%)\n"
}
```
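To script against a saved response, e.g. to pull out the consensus decision and the per-model votes (assumes `jq` is installed and a `response.json` captured from the endpoint):

```bash
jq '{promote, confidence, votes: [.modelResults[] | {modelName, promote, confidence}]}' response.json
```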
Unified model comparison (choose models via parameters or interactively):

```bash
./test-models.sh                              # interactive: choose models
./test-models.sh gemini gemma-1b              # Gemini + Gemma 1B
./test-models.sh -m gemini,gemma-1b,gemma-9b
./test-models.sh --help
```

Available models: gemini, gemma-1b, gemma-9b. The script runs the same analyze request with each selected model (sequentially), saves `result_<model>.json`, and prints a comparison summary. Pass `--no-wait-vllm` to skip waiting for the vLLM servers when they are already up.
Multi-model parallel (one request, multiple models with weighted voting):

```bash
./test-multi-model.sh
```

Requires the Gemma deployments in `gemma-system`; see deployment/gemma/README.md. See docs/development/TEST_SCRIPTS.md for all test scripts.