Kubernetes AI Agent

An autonomous AI agent for Kubernetes debugging and remediation, powered by Google's Agent Development Kit (ADK) and Gemini AI.

Overview

The Kubernetes Agent is an intelligent system that:

  • Debugs Kubernetes pods automatically
  • Analyzes logs, events, and metrics
  • Identifies root causes of issues
  • Creates GitHub pull requests with fixes
  • Integrates with Argo Rollouts for canary analysis

Features

Kubernetes Debugging Tools

  • Pod Debugging: Analyze pod status, conditions, and container states
  • Events: Retrieve and correlate cluster events
  • Logs: Fetch and analyze container logs (including previous crashes)
  • Metrics: Check resource usage and limits
  • Resources: Inspect related deployments, services, and configmaps

Remediation Capabilities

  • Git Operations: Clone, branch, commit, and push (using the JGit library)
  • GitHub PRs: Automatically create pull requests with:
    • Root cause analysis
    • Code fixes
    • Testing recommendations
    • Links to Kubernetes resources

A2A (Agent-to-Agent) Communication

  • REST API: Exposes analysis capabilities via HTTP
  • Integration: Works with rollouts-plugin-metric-ai for canary analysis

Architecture

Argo Rollouts Analysis
	↓
rollouts-plugin-metric-ai
	↓ (A2A HTTP)
Kubernetes Agent (ADK)
	├── K8s Tools (Fabric8 client)
	├── Git Operations (JGit)
	├── GitHub PR (GitHub API)
	└── AI Analysis (Gemini)

Prerequisites

  • Java 17+
  • Maven 3.8+
  • Kubernetes cluster
  • Google API Key (Gemini)
  • GitHub Personal Access Token

Local Development

1. Build the project

cd kubernetes-agent
mvn clean package

2. Set environment variables

export GOOGLE_API_KEY="your-google-api-key"
export GITHUB_TOKEN="your-github-token"

3. Run locally (console mode)

java -jar target/kubernetes-agent-1.0.0.jar console

4. Run as server

java -jar target/kubernetes-agent-1.0.0.jar
# Server starts on port 8080
# Health check: http://localhost:8080/a2a/health
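
From another terminal, a quick smoke test of the health endpoint (the expected output matches the health check shown in the Troubleshooting section below):

curl http://localhost:8080/a2a/health
# {"status":"healthy","agent":"KubernetesAgent","version":"1.0.0"}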

Deployment to Kubernetes

1. Build Docker image

docker build -t csanchez/kubernetes-agent:latest .
docker push csanchez/kubernetes-agent:latest

2. Create secrets

# Copy template
cp deployment/secret.yaml.template deployment/secret.yaml

# Edit secret.yaml and add your keys
# Then apply:
kubectl apply -f deployment/secret.yaml
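
For reference, a filled-in secret might look like this minimal sketch (the metadata name and key names here are assumptions based on the environment variables documented below; deployment/secret.yaml.template is authoritative):

apiVersion: v1
kind: Secret
metadata:
  name: kubernetes-agent
  namespace: argo-rollouts
type: Opaque
stringData:
  GOOGLE_API_KEY: your-google-api-key
  GITHUB_TOKEN: your-github-token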

3. Deploy agent

# Update image in deployment/deployment.yaml if needed
kubectl apply -k deployment/

4. Verify deployment

# Check pods
kubectl get pods -n argo-rollouts | grep kubernetes-agent

# Check logs
kubectl logs -f deployment/kubernetes-agent -n argo-rollouts

# Test health endpoint
kubectl port-forward -n argo-rollouts svc/kubernetes-agent 8080:8080
curl http://localhost:8080/a2a/health

5. Run tests

The test-agent.sh script supports both Kubernetes and local modes:

# Test agent running in Kubernetes (default)
./test-agent.sh k8s

# Test agent running locally on localhost:8080
./test-agent.sh local

# Use custom local URL
LOCAL_URL=http://localhost:9090 ./test-agent.sh local

# Use custom Kubernetes context
CONTEXT=my-k8s-context ./test-agent.sh k8s

The test script will:

  1. ✅ Check health endpoint
  2. ✅ Send a sample analysis request
  3. ✅ Verify no errors in logs (K8s mode only)

Usage

Direct Console Mode

$ java -jar kubernetes-agent.jar console

You > Debug pod my-app-canary in namespace production

Agent > Analyzing pod my-app-canary in namespace production...
[Agent gathers debug info, logs, events...]

Root Cause: Container crashloop due to OOMKilled - memory limit too low

Recommendation:
1. Increase memory limit from 256Mi to 512Mi
2. Add resource requests to prevent overcommitment
3. Review memory usage patterns in logs

A2A Integration

The agent exposes a REST API for other systems to use:

Endpoint: POST /a2a/analyze

Request:

{
	"userId": "argo-rollouts",
	"prompt": "Analyze canary deployment issue. Namespace: rollouts-test-system, Pod: canary-demo-xyz",
	"context": {
		"namespace": "rollouts-test-system",
		"podName": "canary-demo-xyz",
		"stableLogs": "...",
		"canaryLogs": "..."
	}
}

Response:

{
	"analysis": "Detailed analysis text...",
	"rootCause": "Identified root cause",
	"remediation": "Suggested fixes",
	"prLink": "https://github.com/owner/repo/pull/123",
	"promote": false,
	"confidence": 85
}
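
With the port-forward from the deployment steps active, the endpoint can be exercised with curl (a sketch; request.json is assumed to hold a payload like the one above):

curl -X POST http://localhost:8080/a2a/analyze \
  -H "Content-Type: application/json" \
  -d @request.json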

Integration with Argo Rollouts

1. Configure Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-with-agent
spec:
  metrics:
    - name: ai-analysis
      provider:
        plugin:
          ai-metric:
            # Use agent mode
            analysisMode: agent
            namespace: "{{args.namespace}}"
            podName: "{{args.canary-pod}}"
            # Fallback to default mode
            stablePodLabel: app=rollouts-demo,revision=stable
            canaryPodLabel: app=rollouts-demo,revision=canary
            model: gemini-3-flash-preview

2. The plugin will automatically:

  1. Check if agent is healthy
  2. Send analysis request with logs
  3. Receive intelligent analysis
  4. Get PR link if fix was created
  5. Decide to promote or abort canary
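
For context, the template above would typically be referenced from a canary analysis step in the Rollout spec. A sketch (the step weight and arg values are illustrative):

strategy:
  canary:
    steps:
      - setWeight: 20
      - analysis:
          templates:
            - templateName: canary-analysis-with-agent
          args:
            - name: namespace
              value: rollouts-test-system
            - name: canary-pod
              value: canary-demo-xyz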

Configuration

Environment Variables

Variable        Required     Description
GOOGLE_API_KEY  Conditional  Google Gemini API key (required when using Gemini models)
GITHUB_TOKEN    Yes          GitHub personal access token
GIT_USERNAME    No           Git commit username (default: "kubernetes-agent")
GIT_EMAIL       No           Git commit email (default: "agent@example.com")
GEMINI_MODEL    Yes          Model to use (e.g., "gemini-3-flash-preview" or "gemma-3-1b-it")
VLLM_API_BASE   Conditional  vLLM server base URL (required when using gemma-* models)
VLLM_API_KEY    No           vLLM API key (default: "not-needed")
K8S_AGENT_URL   No           Agent URL for plugin (default: http://kubernetes-agent.argo-rollouts.svc.cluster.local:8080)

Model Configuration

The agent supports two modes:

1. Google Gemini API (Cloud)

env:
  - name: GOOGLE_API_KEY
    value: "your-gemini-api-key"
  - name: GEMINI_MODEL
    value: "gemini-3-flash-preview"

2. Local vLLM Gemma (Self-hosted)

env:
  - name: GEMINI_MODEL
    value: "gemma-3-1b-it"  # or gemma-2-9b-it; any model name starting with "gemma-"
  - name: VLLM_API_BASE
    value: "http://gemma-1b-server.gemma-system.svc.cluster.local:8000"  # or gemma-9b-server
  - name: VLLM_API_KEY
    value: "not-needed"  # Optional

Note: When GEMINI_MODEL starts with gemma-, the agent automatically uses the vLLM endpoint specified in VLLM_API_BASE. Available deployments: gemma-1b-server (Gemma 3 1B), gemma-9b-server (Gemma 2 9B). See deployment/gemma/README.md.

Resource Limits

Recommended settings for production:

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

Troubleshooting

Agent not starting

# Check logs
kubectl logs deployment/kubernetes-agent -n argo-rollouts

# Common issues:
# 1. Missing API keys - check secrets
# 2. Invalid service account - check RBAC
# 3. Out of memory - increase limits
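
If RBAC is the suspect, kubectl can check the agent's permissions by impersonating its service account (the service account name below is an assumption; adjust to your manifests):

kubectl auth can-i get pods -n argo-rollouts \
  --as=system:serviceaccount:argo-rollouts:kubernetes-agent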

Health check failing

# Test endpoint directly
kubectl port-forward -n argo-rollouts svc/kubernetes-agent 8080:8080
curl http://localhost:8080/a2a/health

# Should return:
# {"status":"healthy","agent":"KubernetesAgent","version":"1.0.0"}

PR creation failing

# Check GitHub token permissions:
# - repo (full control)
# - workflow (if modifying GitHub Actions)

# Check logs for git errors:
kubectl logs deployment/kubernetes-agent -n argo-rollouts | grep -i "git\|github"
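
If the GitHub CLI is installed and authenticated with the same token, it can confirm the token's scopes:

gh auth status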

Security Considerations

  1. RBAC: Agent only has read access to K8s resources (no write)
  2. Secrets: Store API keys in Kubernetes secrets
  3. Network: Use NetworkPolicies to restrict egress
  4. Git: Use fine-grained personal access tokens
  5. Review: Always review PRs before merging

Development

Project Structure

kubernetes-agent/
├── src/main/java/com/google/adk/samples/agents/k8sagent/
│   ├── KubernetesAgent.java          # Main agent
│   ├── tools/                        # K8s debugging tools
│   ├── remediation/                  # Git and GitHub operations
│   └── a2a/                          # A2A REST controllers
├── deployment/                       # Kubernetes manifests
├── pom.xml                           # Maven config
└── Dockerfile                        # Container image

Running Tests

mvn test

Building Multi-arch Images

docker buildx build --platform linux/amd64,linux/arm64 \
	-t csanchez/kubernetes-agent:latest \
	--push .

Roadmap

  • Multi-cluster support
  • Historical analysis (learn from past incidents)
  • Cost optimization recommendations
  • Security vulnerability detection
  • Self-healing capabilities
  • Slack/PagerDuty notifications
  • Advanced code analysis before fixes

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

For issues or questions, please open an issue in the repository.

Development Documentation

Additional development documentation is available in the docs/development/ directory:

  • Maven plugin integration and logging
  • Debug mode configuration
  • Testing strategies
  • Rate limiting and fork modes

Multi-Model Parallel Analysis

The agent supports analyzing canary deployments with multiple LLMs running in parallel, using confidence-weighted voting for the final decision.

Configuration

Enable multi-model analysis:

- name: ENABLE_MULTI_MODEL
  value: "true"
- name: MODELS_TO_USE
  value: "gemini-2.5-flash,gemma-3-1b-it,gemma-2-9b-it"  # Comma-separated (optional; only one vLLM model at a time when using VLLM_API_BASE)
- name: VOTING_STRATEGY
  value: "weighted"  # Confidence-weighted voting

How It Works

  1. Parallel Execution: Each model analyzes independently and simultaneously
  2. Confidence-Weighted Voting (see the worked example after this list):
    • promote_score = Σ(confidence/100) for PROMOTE votes
    • rollback_score = Σ(confidence/100) for ROLLBACK votes
    • Decision: PROMOTE if promote_score > rollback_score
  3. Consolidated Reporting: All model results preserved in response
  4. GitHub Issue Creation: On rollback, creates issue with:
    • Voting breakdown (promote vs rollback scores)
    • Individual model recommendations with confidence
    • Detailed analyses from each model
    • Timestamp and rollout metadata
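
Worked example, using the numbers from the sample response below (both models vote ROLLBACK):

promote_score  = 0                  (no PROMOTE votes)
rollback_score = 85/100 + 72/100 = 1.57
Decision: ROLLBACK (1.57 > 0.00)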

Benefits

  • Higher Reliability: Multiple perspectives reduce false positives/negatives
  • Confidence Validation: Cross-validation between models
  • Model Diversity: Different models may catch different types of issues
  • Full Observability: Complete transparency into each model's reasoning
  • Fast: Parallel execution means latency ≈ slowest model (~3-5s for 2 models)

Example Multi-Model Response

{
  "analysis": "Multi-model analysis consensus:\n\n--- gemini-3-flash-preview ---\nDatabase connection timeout detected...\n\n--- gemma-3-1b-it ---\nHigh error rate in canary logs...",
  "rootCause": "gemini-3-flash-preview: Database connection timeout; gemma-3-1b-it: High error rate",
  "remediation": "- Increase database connection timeout\n- Add retry logic with exponential backoff",
  "promote": false,
  "confidence": 78,
  "modelResults": [
    {
      "modelName": "gemini-3-flash-preview",
      "promote": false,
      "confidence": 85,
      "executionTimeMs": 3245
    },
    {
      "modelName": "gemma-3-1b-it",
      "promote": false,
      "confidence": 72,
      "executionTimeMs": 2891
    }
  ],
  "votingRationale": "Confidence-weighted voting: Promote=0.00, Rollback=1.57. Final decision: ROLLBACK.\n\nIndividual model votes:\n- gemini-3-flash-preview: ROLLBACK (confidence: 85%)\n- gemma-3-1b-it: ROLLBACK (confidence: 72%)\n"
}

Testing Multi-Model Locally

Unified model comparison (choose models via parameters or interactively):

./test-models.sh                          # interactive: choose models
./test-models.sh gemini gemma-1b          # Gemini + Gemma 1B
./test-models.sh -m gemini,gemma-1b,gemma-9b
./test-models.sh --help

Available models: gemini, gemma-1b, gemma-9b. The script runs the same analyze request against each selected model (sequentially), saves result_<model>.json, and prints a comparison summary. By default it waits for the vLLM servers to be ready; pass --no-wait-vllm to skip the wait when the servers are already up.

Multi-model parallel (one request, multiple models with weighted voting):

./test-multi-model.sh

Requires Gemma deployments in gemma-system; see deployment/gemma/README.md. See docs/development/TEST_SCRIPTS.md for all test scripts.
