A Kubernetes controller that enables automated responses to Kubernetes events by integrating with the Kagent platform.
The KAgent Hook Controller monitors Kubernetes events and triggers Kagent agents based on configurable hook definitions. It supports multiple event types per hook configuration and implements deduplication logic to prevent duplicate notifications.
- Multi-Event Monitoring: Monitor multiple Kubernetes event types (pod-restart, pod-pending, oom-kill, probe-failed) in a single hook configuration
- Basic Deduplication: Prevents duplicate notifications with 10-minute timeout logic
- Kagent Integration: Integrates with the Kagent platform for AI-agent-driven incident response (in principle it can talk to any A2A-enabled agent)
- Status Tracking: Provides real-time status updates and audit trails through Kubernetes events
- High Availability: Supports leader election for production deployments
sequenceDiagram
autonumber
participant K8s as Kubernetes API Server
participant HC as Hook Controller
participant Dedup as Dedup Manager
participant SM as Status Manager
participant KC as Kagent Controller (API)
participant Agent as K8s Agent
K8s->>HC: Event (e.g., BackOff, OOMKill)
HC->>HC: Map, filter, stale check (15m)
HC->>Dedup: ShouldProcessEvent(hook,event)
alt not duplicate
Dedup-->>HC: true
HC->>SM: RecordEventFiring
HC->>KC: POST /api/sessions (user_id)
KC-->>HC: 201 Session (contextId)
HC->>KC: A2A SendMessage(contextId, prompt+context)
KC-->>Agent: Dispatch message
Agent-->>KC: Create Task (taskId)
KC-->>HC: 200 OK
HC->>Dedup: MarkNotified(hook,event)
Note over HC,Dedup: Suppress repeats for 10 minutes
HC->>SM: RecordAgentCallSuccess(requestId)
else duplicate/within 10m
Dedup-->>HC: false
HC->>SM: RecordDuplicateEvent
end
For how agents respond with either a Message or a Task in A2A, see “Life of a Task” in the A2A protocol docs: https://a2a-protocol.org/latest/topics/life-of-a-task/
- Kubernetes cluster (v1.20+)
- kubectl configured to access your cluster
- Kagent installed in the cluster or reachable over the network
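A quick way to confirm these prerequisites before installing (the kagent namespace below assumes Kagent was installed with its default settings):

# Confirm kubectl can reach the cluster
kubectl cluster-info

# Confirm Kagent is running (assumes the default kagent namespace)
kubectl get pods -n kagent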
- Install via Helm (recommended):
git clone https://github.com/antweiss/khook.git
cd khook

# Install CRDs first
helm install khook-crds ./charts/khook-crds \
  --namespace kagent \
  --create-namespace

# Install controller
helm install khook ./charts/khook-controller \
  --namespace kagent \
  --create-namespace
One-liner (no checkout):
TMP_DIR="$(mktemp -d)" && \
  git clone --depth 1 https://github.com/antweiss/khook.git "$TMP_DIR/khook" && \
  helm install khook-crds "$TMP_DIR/khook/charts/khook-crds" \
    --namespace kagent \
    --create-namespace && \
  helm install khook "$TMP_DIR/khook/charts/khook-controller" \
    --namespace kagent \
    --create-namespace && \
  rm -rf "$TMP_DIR"
- Create Your First Hook:
apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: pod-monitoring-hook
  namespace: default
spec:
  eventConfigurations:
    - eventType: pod-restart
      agentId: kagent/incident-responder
      prompt: |
        AUTONOMOUS MODE: A pod has restarted. Analyze the cause and implement fixes immediately.
        INSTRUCTIONS:
        • Use all available tools to diagnose and resolve the issue
        • Annotate fixed resources with: kagentFix=<timestamp>
        • If patching fails: delete and recreate resources
        • Never ask for permission - assume autonomous execution
        • Complete remediation without human approval
    - eventType: oom-kill
      agentId: kagent/memory-analyzer
      prompt: |
        AUTONOMOUS MODE: OOM kill detected. Analyze memory usage and optimize immediately.
        INSTRUCTIONS:
        • Use all available tools to diagnose and resolve memory issues
        • Annotate fixed resources with: kagentFix=<timestamp>
        • If patching fails: delete and recreate resources
        • Never ask for permission - assume autonomous execution
        • Complete remediation without human approval
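Apply the manifest and confirm the controller has registered it (the file name below is just an example):

# Save the manifest above, e.g. as pod-monitoring-hook.yaml, then apply it
kubectl apply -f pod-monitoring-hook.yaml

# Verify the hook exists and inspect its status
kubectl get hooks -n default
kubectl describe hook pod-monitoring-hook -n default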
The controller supports monitoring the following Kubernetes event types:
Event Type | Description | Common Causes |
---|---|---|
`pod-restart` | Pod has been restarted | Application crashes, health check failures, resource limits |
`pod-pending` | Pod is stuck in pending state | Resource constraints, scheduling issues, image pull failures |
`oom-kill` | Pod was killed due to out-of-memory | Memory limits exceeded, memory leaks |
`probe-failed` | Liveness or readiness probe failed | Application not responding, configuration issues |
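If you want to see the raw cluster events behind these hook event types, you can query the Kubernetes events API directly. The reason values below (BackOff, Unhealthy, FailedScheduling) are the usual kubelet/scheduler reasons for these situations and are shown for illustration; the controller's exact mapping lives in its event-filtering code.

# Illustrative queries for the underlying Kubernetes events
kubectl get events -A --field-selector reason=BackOff          # crash-looping / restarting pods
kubectl get events -A --field-selector reason=Unhealthy        # failed liveness/readiness probes
kubectl get events -A --field-selector reason=FailedScheduling # pods stuck in Pending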
The controller will support reacting to additional Kubernetes event types.
It will also support reacting to non-Kubernetes events: task queues, Kafka, webhooks, database writes, and more.
Each Hook resource defines one or more event configurations:
apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: example-hook
  namespace: production
spec:
  eventConfigurations:
    - eventType: pod-restart              # Required: Event type to monitor
      agentId: kagent/incident-responder  # Required: Kagent agent ID
      prompt: "Custom prompt..."          # Required: Prompt template for the agent
The controller can be configured via environment variables:
Variable | Description | Default | Required |
---|---|---|---|
`KAGENT_BASE_URL` | Base URL for the Kagent API | `http://kagent-controller.kagent.svc.cluster.local:8083` | Yes |
`KAGENT_USER_ID` | User identity for A2A requests | `[email protected]` | Yes |
`LOG_LEVEL` | Logging level (debug, info, warn, error) | `info` | No |
`METRICS_PORT` | Port for metrics endpoint | `8080` | No |
`HEALTH_PORT` | Port for health checks | `8081` | No |
`LEADER_ELECTION` | Enable leader election | `true` | No |
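To see what the running controller is actually configured with, list the environment variables on its Deployment (the deploy/khook name matches the Helm install above; adjust if your release name differs):

# Show the controller's current environment configuration
kubectl set env -n kagent deploy/khook --list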
Monitor pod restarts and OOM kills in the default namespace:
apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: basic-pod-monitoring
  namespace: default
spec:
  eventConfigurations:
    - eventType: pod-restart
      agentId: pod-restart-analyzer
      prompt: |
        A pod named {{.ResourceName}} has restarted at {{.EventTime}}.
        Please analyze the restart reason and remediate in a fully autonomous manner. Use all available tools. Don't ask for approval.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
    - eventType: oom-kill
      agentId: memory-optimizer
      prompt: |
        Pod {{.ResourceName}} was killed due to OOM at {{.EventTime}}.
        Please analyze memory usage and resolve in a fully autonomous manner. Use all available tools. Don't ask for approval.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
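One way to exercise the pod-restart configuration in a sandbox namespace is to create a container that exits immediately, which produces the restart/back-off events the hook listens for. This is purely a test aid and not part of khook:

# Apply the hook, then create a pod that keeps crashing
kubectl apply -f basic-pod-monitoring.yaml
kubectl run crashpod -n default --image=busybox -- /bin/sh -c "exit 1"

# Watch the hook record the resulting pod-restart event in its status
kubectl get hook basic-pod-monitoring -n default -o jsonpath='{.status.activeEvents}'

# Clean up the test pod when done
kubectl delete pod crashpod -n default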
Comprehensive monitoring for production workloads:
apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: production-monitoring
  namespace: production
spec:
  eventConfigurations:
    - eventType: pod-restart
      agentId: incident-manager
      prompt: |
        PRODUCTION ALERT: Pod {{.ResourceName}} restarted at {{.EventTime}}.
        Priority: HIGH. Please investigate immediately and provide an incident response plan.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
    - eventType: pod-pending
      agentId: scheduling-analyzer
      prompt: |
        Pod {{.ResourceName}} is pending since {{.EventTime}}.
        Please analyze scheduling constraints and resource availability.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
    - eventType: probe-failed
      agentId: health-checker
      prompt: |
        Health probe failed for {{.ResourceName}} at {{.EventTime}}.
        Please check application health and configuration.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
    - eventType: oom-kill
      agentId: capacity-planner
      prompt: |
        CRITICAL: OOM kill for {{.ResourceName}} at {{.EventTime}}.
        Please analyze resource usage and update capacity planning.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
Lightweight monitoring for development environments:
apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: dev-monitoring
  namespace: development
spec:
  eventConfigurations:
    - eventType: pod-restart
      agentId: dev-helper
      prompt: |
        Dev pod {{.ResourceName}} restarted.
        Please provide quick debugging tips and common solutions.
        After analysis, use all available tools to try to resolve the issue. Annotate the updated resources with "kagentFix: <dateTime>".
        If a resource can't be patched, delete it and recreate it as needed. Don't ask for permission. Assume autonomous execution.
        Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.
The current Kagent setup does not require an API key. The controller identifies itself to Kagent with a user ID and reaches the API at the configured base URL.
- Configure via Helm values (recommended):
  - `.Values.kagent.apiUrl` (default: `http://kagent-controller.kagent.svc.cluster.local:8083`)
  - `.Values.kagent.userId` (default: `[email protected]`)
- Or set environment variables on the Deployment:
kubectl set env -n kagent deploy/khook \
  KAGENT_API_URL=http://kagent-controller.kagent.svc.cluster.local:8083 \
  [email protected]
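Changing the environment updates the pod template and triggers a new rollout; wait for it to complete before expecting the new settings to take effect:

kubectl -n kagent rollout status deploy/khook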
When events occur, the controller sends requests to the Kagent API:
{
"agentId": "kagent/incident-responder",
"prompt": "A pod has restarted. Please analyze...",
"context": {
"eventName": "pod-restart",
"eventTime": "2024-01-15T10:30:00Z",
"resourceName": "my-app-pod-123",
"namespace": "production",
"eventMessage": "Container my-app in pod my-app-pod-123 restarted"
}
}
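For debugging connectivity, you can mimic the session-creation step from the sequence diagram by hand. The /api/sessions path and user_id field come from the diagram above, but the exact request schema may differ between Kagent versions, and the command assumes the controller image ships a shell and curl (as the troubleshooting section below also assumes), so treat this as a connectivity check rather than a reference client:

# Illustrative connectivity check against the Kagent API from inside the controller pod
kubectl exec -n kagent deployment/khook-controller -- sh -c \
  'curl -s -X POST "$KAGENT_BASE_URL/api/sessions" -H "Content-Type: application/json" -d "{\"user_id\": \"$KAGENT_USER_ID\"}"'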
The controller implements robust error handling:
- Exponential Backoff: Failed API calls are retried with exponential backoff (max 3 attempts)
- Circuit Breaker: Prevents cascading failures during Kagent API outages
- Status Updates: Hook status reflects API call success/failure states
- Audit Trail: All API interactions are logged and emit Kubernetes events
Check hook status to see active events:
kubectl get hooks -o wide
kubectl describe hook my-hook
Example status output:
status:
  activeEvents:
    - eventType: pod-restart
      resourceName: my-app-pod-123
      firstSeen: "2024-01-15T10:30:00Z"
      lastSeen: "2024-01-15T10:30:00Z"
      status: firing
  lastUpdated: "2024-01-15T10:30:05Z"
The controller emits Kubernetes events for audit trails:
kubectl get events --field-selector involvedObject.kind=Hook
The controller exposes Prometheus metrics on port 8080:
- `khook_events_total`: Total number of events processed
- `khook_api_calls_total`: Total number of Kagent API calls
- `khook_api_call_duration_seconds`: API call duration histogram
- `khook_active_events`: Number of currently active events
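To eyeball these without a Prometheus server, port-forward the metrics port and fetch them directly (the /metrics path is the usual Prometheus convention; the deployment name matches the one used elsewhere in this README):

# Forward the metrics port locally and fetch khook metrics
kubectl -n kagent port-forward deployment/khook-controller 8080:8080 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:8080/metrics | grep '^khook_'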
Health check endpoints are available on port 8081:
- `/healthz`: Liveness probe
- `/readyz`: Readiness probe
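The same port-forward approach works for the health endpoints, which is handy when tuning liveness/readiness probe settings:

kubectl -n kagent port-forward deployment/khook-controller 8081:8081 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:8081/healthz
curl -s http://localhost:8081/readyz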
Symptoms: Hook is created but events are not being processed.
Possible Causes:
- Controller not running or not watching the namespace
- RBAC permissions missing
- Event types not matching actual Kubernetes events
Solutions:
# Check controller logs
kubectl logs -n kagent deployment/khook-controller
# Verify RBAC permissions
kubectl auth can-i get events --as=system:serviceaccount:kagent:khook-controller
# Check hook status
kubectl describe hook your-hook-name
Symptoms: Events are detected but Kagent API calls fail.
Possible Causes:
- Incorrect Kagent base URL or user ID configuration
- Network connectivity issues
- Kagent API endpoint unreachable
Solutions:
# Verify the controller's Kagent configuration (base URL and user ID)
kubectl set env -n kagent deploy/khook --list | grep KAGENT

# Test API connectivity from the controller pod (assumes curl is available in the image)
kubectl exec -n kagent deployment/khook-controller -- \
  sh -c 'curl -s "$KAGENT_BASE_URL/health"'
# Check controller logs for API errors
kubectl logs -n kagent deployment/khook-controller | grep "kagent-api"
Symptoms: Same event triggers multiple Kagent calls within 10 minutes.
Possible Causes:
- Controller restarts causing memory loss
- Multiple controller instances without leader election
- Clock skew issues
Solutions:
# Check controller restart count
kubectl get pods -n kagent
# Verify leader election is working
kubectl logs -n kagent deployment/khook-controller | grep "leader"
# Check system time synchronization
kubectl exec -n kagent deployment/khook-controller -- date
Symptoms: Controller pod consuming excessive memory.
Possible Causes:
- Large number of active events not being cleaned up
- Memory leak in event processing
- Insufficient resource limits
Solutions:
# Check active events across all hooks
kubectl get hooks -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.activeEvents}{"\n"}{end}'
# Monitor memory usage
kubectl top pod -n kagent
# Adjust resource limits
kubectl patch deployment -n kagent khook-controller -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
Enable debug logging for detailed troubleshooting:
kubectl set env deployment/khook-controller -n kagent LOG_LEVEL=debug
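Then follow the controller logs to see per-event processing detail:

kubectl logs -n kagent deployment/khook-controller -f --tail=100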
For additional support:
- Check the GitHub Issues
- Review the troubleshooting guide
- Join the Kagent community
- Go 1.21+
- Kubernetes cluster (kind/minikube for local development)
- kubectl configured
- Docker (for building images)
- Clone the repository:
git clone https://github.com/antweiss/khook.git
cd khook
- Install dependencies:
go mod download
- Run tests:
make test
- Build the binary:
make build
- Run locally (requires kubeconfig):
export KAGENT_BASE_URL=https://test.kagent.dev
make run
├── api/v1alpha2/ # API types and CRD definitions
├── cmd/ # Main application entry point
├── config/ # Kubernetes manifests and configuration
│ ├── crd/ # Custom Resource Definitions
│ ├── rbac/ # RBAC configurations
│ └── manager/ # Controller deployment manifests
├── docs/ # Additional documentation
├── examples/ # Example Hook configurations
├── internal/
│ ├── client/ # Kagent API client implementation
│ ├── config/ # Configuration management
│ ├── controller/ # Kubernetes controller logic
│ ├── deduplication/ # Event deduplication logic
│ ├── event/ # Event watching and filtering
│ ├── interfaces/ # Core interfaces
│ ├── logging/ # Logging utilities
│ ├── pipeline/ # Event processing pipeline
│ └── status/ # Status management
├── Makefile # Build and deployment targets
└── go.mod # Go module definition
# Run all tests
make test
# Run integration tests (requires cluster)
make test-integration
# Build binary
make build
# Build Docker image
make docker-build
# Deploy to cluster
make deploy
# Clean up
make undeploy
See API Reference for detailed documentation of the Hook CRD schema and status fields.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- ✅ Free Use: You can use this software for any purpose
- ✅ Free Extension & Editing: You can modify and extend the code
- ✅ Patent Protection: The license includes explicit patent protection clauses
- ⚠️ Commercial Redistribution: Allowed, but must comply with Apache 2.0 terms
- Personal Use: Completely free - use it for any personal projects
- Open Source Development: Modify and share your changes freely
- Commercial Use: You can use it commercially, but any redistributions must include the full license text
- Patent Protection: Contributors provide patent grants for their contributions
For the complete license text and full terms, see the LICENSE file.