A production-inspired AI Agent system that ingests logs and metrics, detects anomalies, and assists on-call engineers by summarizing incidents, suggesting likely root causes, deciding next actions, and escalating to humans when required, with observability and reliability in mind.
This project goes beyond simple LLM integration and demonstrates how to build a goal-driven AI agent where the LLM is treated as a tool, not the decision-maker.
The system implements a Virtual SRE Agent with:
- Goal-oriented behavior
- Explicit decision-making
- Multiple possible actions
- Stateful memory
- Human feedback loops
The agent decides:
- Whether to analyze an incident
- How deeply to analyze
- Which prompt version to use
- When to escalate to humans
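These decisions can be sketched as a pure function over incident severity and prior feedback. Everything below (the class name, the `Severity` enum, the two-strike threshold) is an illustrative assumption, not the project's actual code:

```java
// Illustrative sketch of the agent's action selection.
// All names and thresholds are assumptions for illustration only.
enum Severity { LOW, MEDIUM, HIGH }

enum AgentAction { SKIP_ANALYSIS, BASIC_SUMMARY, DEEP_ANALYSIS, REQUEST_HUMAN_REVIEW }

class DecisionSketch {
    // Escalate early when a service has accumulated repeated bad feedback;
    // otherwise pick analysis depth from severity.
    static AgentAction decide(Severity severity, int badFeedbackCount) {
        if (badFeedbackCount >= 2) return AgentAction.REQUEST_HUMAN_REVIEW;
        switch (severity) {
            case LOW:    return AgentAction.SKIP_ANALYSIS;  // AI not required
            case MEDIUM: return AgentAction.BASIC_SUMMARY;  // lightweight summary
            default:     return AgentAction.DEEP_ANALYSIS;  // detailed reasoning
        }
    }
}
```

The key property: the decision is made entirely in deterministic code, before any LLM is involved.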
Assist Site Reliability Engineers by summarizing incidents, suggesting likely causes, and escalating when AI confidence or trust is low.
Incident Created
↓
SRE AI Agent
↓
Decision (Action Selection)
↓
┌──────────────────────────────┐
│ SKIP_ANALYSIS │
│ BASIC_SUMMARY (LLM) │
│ DEEP_ANALYSIS (LLM) │
│ REQUEST_HUMAN_REVIEW │
└──────────────────────────────┘
↓
Incident Updated
The LLM is invoked only if the agent decides to do so.
| Action | Description |
|---|---|
| SKIP_ANALYSIS | Low severity incident, AI not required |
| BASIC_SUMMARY | Lightweight AI summary |
| DEEP_ANALYSIS | More detailed AI reasoning |
| REQUEST_HUMAN_REVIEW | Escalation due to low trust or repeated failures |
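Only two of the four actions ever reach the LLM. A hypothetical dispatcher makes that explicit (the nested enum, method names, and prompts below are assumptions, not the project's real API):

```java
import java.util.function.Function;

// Hypothetical dispatcher: the LLM callable is invoked only for the two
// analysis actions; SKIP_ANALYSIS and REQUEST_HUMAN_REVIEW never touch it.
class ActionDispatchSketch {
    enum AgentAction { SKIP_ANALYSIS, BASIC_SUMMARY, DEEP_ANALYSIS, REQUEST_HUMAN_REVIEW }

    static String handle(AgentAction action, Function<String, String> llm, String incident) {
        switch (action) {
            case BASIC_SUMMARY: return llm.apply("Summarize briefly: " + incident);
            case DEEP_ANALYSIS: return llm.apply("Analyze root cause: " + incident);
            case REQUEST_HUMAN_REVIEW: return "Escalated to on-call human";
            default: return "Skipped (low severity)";
        }
    }
}
```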
The agent maintains a per-service memory of negative human feedback.
- Repeated bad feedback alters future decisions
- The agent escalates earlier
- Unreliable AI paths are avoided
This enables learning without retraining models.
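One minimal way to realize this per-service memory is an in-memory counter with an escalation threshold. The class name, method names, and threshold value below are assumptions for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-service feedback memory. The threshold of 2 and the
// names here are assumptions; the real project may differ.
class AgentMemorySketch {
    private static final int ESCALATION_THRESHOLD = 2;
    private final Map<String, Integer> badFeedback = new ConcurrentHashMap<>();

    void recordBadFeedback(String service) {
        badFeedback.merge(service, 1, Integer::sum);
    }

    // True once a service has accumulated enough bad feedback that the
    // agent should stop trusting its own analysis and escalate earlier.
    boolean shouldEscalate(String service) {
        return badFeedback.getOrDefault(service, 0) >= ESCALATION_THRESHOLD;
    }
}
```

Because the adaptation lives in plain application state rather than model weights, behavior changes take effect immediately and are fully auditable.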
- LLMs are abstracted as tools
- The agent controls when and how they are invoked
- Failures never impact core system reliability
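To keep LLM failures from affecting core reliability, the tool call can be wrapped so any exception degrades to a safe fallback instead of propagating. A sketch under assumed names:

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical safe wrapper around an LLM call: any failure is caught and
// mapped to Optional.empty(), so the incident pipeline never breaks.
class SafeLlmToolSketch {
    private final Function<String, String> llm; // the underlying LLM call

    SafeLlmToolSketch(Function<String, String> llm) { this.llm = llm; }

    Optional<String> tryInvoke(String prompt) {
        try {
            return Optional.ofNullable(llm.apply(prompt));
        } catch (RuntimeException e) {
            // Best-effort: the caller proceeds without AI output.
            return Optional.empty();
        }
    }
}
```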
Humans can submit feedback that directly influences agent behavior.
- Feedback is stored
- Future decisions adapt
- Human judgment always overrides AI
src/main/java/com/example/agentai
│
├── AgentAiApplication.java
├── Incident.java
├── AgentAction.java
├── AgentMemory.java
├── PromptRegistry.java
├── LLMTool.java
├── IncidentStore.java
├── SREAgent.java
└── IncidentController.java
Run the application and exercise the API:

```shell
mvn spring-boot:run
curl -X POST "http://localhost:8080/incident?service=orders&severity=HIGH"
curl http://localhost:8080/incident/1
curl -X POST "http://localhost:8080/feedback/bad?service=orders"
```

- AI is best-effort and non-blocking
- Explicit decision boundaries
- Safe failure modes
- Human-in-the-loop governance
- Real OpenAI / local LLM integration
- Agent SLOs and metrics
- Multi-agent collaboration
- Cost-aware planning
- Prompt A/B testing
MIT