ai-incident-analyzer

Build a production-style AI system that ingests logs and metrics, detects anomalies, and uses an LLM to summarize incidents and suggest likely root causes, with observability and reliability in mind.

[Architecture diagram]

Sequence Diagram

[Sequence diagram]

AI Incident Analyzer with SRE AI Agent

A production-inspired AI Agent system that assists on-call engineers by analyzing incidents, deciding next actions, and escalating to humans when required.

This project goes beyond simple LLM integration and demonstrates how to build a goal-driven AI agent where the LLM is treated as a tool, not the decision-maker.


🚨 What Makes This an AI Agent (Not Just an LLM Call)

The system implements a Virtual SRE Agent with:

  • Goal-oriented behavior
  • Explicit decision-making
  • Multiple possible actions
  • Stateful memory
  • Human feedback loops

The agent decides:

  • Whether to analyze an incident
  • How deeply to analyze
  • Which prompt version to use
  • When to escalate to humans
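One of those decisions is prompt selection. As a minimal sketch of how a versioned prompt registry might back that choice (the class name PromptRegistry comes from the project tree, but the prompt texts, version keys, and selection rule here are illustrative assumptions):

```java
import java.util.Map;

// Sketch of a versioned prompt registry. Prompt texts and the
// action-to-version mapping are assumptions, not the project's actual values.
class PromptRegistry {
    private static final Map<String, String> PROMPTS = Map.of(
        "basic-v1", "Summarize this incident in two sentences: %s",
        "deep-v1",  "Analyze this incident and rank the likely root causes: %s"
    );

    // Deeper analysis selects a richer prompt template.
    static String promptFor(String action) {
        String key = "DEEP_ANALYSIS".equals(action) ? "deep-v1" : "basic-v1";
        return PROMPTS.get(key);
    }
}
```

Keeping prompts in a registry, rather than inline in the agent, is what makes the planned prompt A/B testing possible later.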

🧠 Agent Goal

Assist Site Reliability Engineers by summarizing incidents, suggesting likely causes, and escalating when AI confidence or trust is low.


🏗️ High-Level Flow

Incident Created
      ↓
SRE AI Agent
      ↓
Decision (Action Selection)
      ↓
┌──────────────────────────────┐
│ SKIP_ANALYSIS                │
│ BASIC_SUMMARY (LLM)          │
│ DEEP_ANALYSIS (LLM)          │
│ REQUEST_HUMAN_REVIEW         │
└──────────────────────────────┘
      ↓
Incident Updated

The LLM is invoked only if the agent decides to do so.
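The action-selection step above can be sketched in a few lines. The enum values match the README; the severity levels, thresholds, and bad-feedback limit are illustrative assumptions, not the project's actual logic:

```java
// Illustrative action selection for the SRE agent. Enum values come from
// the README; the severity strings and BAD_FEEDBACK_LIMIT are assumptions.
enum AgentAction { SKIP_ANALYSIS, BASIC_SUMMARY, DEEP_ANALYSIS, REQUEST_HUMAN_REVIEW }

class DecisionSketch {
    static final int BAD_FEEDBACK_LIMIT = 3; // assumed escalation threshold

    static AgentAction decide(String severity, int badFeedbackCount) {
        // Memory check first: repeated bad feedback short-circuits to a human.
        if (badFeedbackCount >= BAD_FEEDBACK_LIMIT) {
            return AgentAction.REQUEST_HUMAN_REVIEW;
        }
        switch (severity) {
            case "LOW":    return AgentAction.SKIP_ANALYSIS; // AI not required
            case "MEDIUM": return AgentAction.BASIC_SUMMARY; // cheap LLM call
            default:       return AgentAction.DEEP_ANALYSIS; // HIGH and above
        }
    }
}
```

Note that two of the four branches never touch the LLM at all, which is the point: the model is a tool the agent may or may not reach for.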


🤖 Agent Actions

Action                 Description
SKIP_ANALYSIS          Low-severity incident; AI not required
BASIC_SUMMARY          Lightweight AI summary
DEEP_ANALYSIS          More detailed AI reasoning
REQUEST_HUMAN_REVIEW   Escalation due to low trust or repeated failures

🧠 Agent Memory (Stateful Behavior)

The agent maintains memory of bad human feedback per service.

  • Repeated bad feedback alters future decisions
  • The agent escalates to humans earlier for affected services
  • Decision paths that previously produced poor results are avoided

This enables learning without retraining models.
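A minimal sketch of that per-service memory (the class name AgentMemory matches the project tree; the fields and method names here are assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-service feedback memory. Counts of bad human feedback
// accumulate per service and feed into future action selection.
class AgentMemory {
    private final Map<String, Integer> badFeedbackByService = new ConcurrentHashMap<>();

    void recordBadFeedback(String service) {
        badFeedbackByService.merge(service, 1, Integer::sum);
    }

    int badFeedbackCount(String service) {
        return badFeedbackByService.getOrDefault(service, 0);
    }

    // A service becomes "untrusted" once bad feedback reaches the limit,
    // which makes the agent escalate earlier for that service.
    boolean isUntrusted(String service, int limit) {
        return badFeedbackCount(service) >= limit;
    }
}
```

Because the adaptation lives in plain application state rather than model weights, behavior changes take effect on the very next incident.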


🤖 LLM as a Tool

  • LLMs are abstracted as tools
  • The agent controls when and how they are invoked
  • Failures never impact core system reliability
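One way to get that failure isolation is to put the model behind a small interface that degrades to "no summary" instead of throwing. LLMTool matches the project tree, but the method shape and the placeholder model call below are assumptions:

```java
import java.util.Optional;

// Sketch of the LLM-as-a-tool abstraction: callers get Optional.empty()
// on failure instead of an exception, so incident handling never blocks.
interface LLMTool {
    Optional<String> summarize(String incidentDetails);
}

class SafeLLMTool implements LLMTool {
    @Override
    public Optional<String> summarize(String incidentDetails) {
        try {
            return Optional.of(callModel(incidentDetails));
        } catch (RuntimeException e) {
            // Best-effort: swallow the failure; the incident is still recorded.
            return Optional.empty();
        }
    }

    private String callModel(String details) {
        // Placeholder for a real OpenAI or local-model call.
        return "Summary: " + details;
    }
}
```

The agent can then treat an empty result as one more signal when deciding whether to escalate.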

👩‍💻 Human-in-the-Loop Feedback

Humans can submit feedback that directly influences agent behavior.

  • Feedback is stored
  • Future decisions adapt
  • Human judgment always overrides AI

📦 Project Structure

src/main/java/com/example/agentai
│
├── AgentAiApplication.java
├── Incident.java
├── AgentAction.java
├── AgentMemory.java
├── PromptRegistry.java
├── LLMTool.java
├── IncidentStore.java
├── SREAgent.java
└── IncidentController.java

🚀 Running the Project

mvn spring-boot:run

Create an incident

curl -X POST "http://localhost:8080/incident?service=orders&severity=HIGH"

Fetch incident

curl http://localhost:8080/incident/1

Submit bad feedback

curl -X POST "http://localhost:8080/feedback/bad?service=orders"

🧪 Why This Design Is Production-Ready

  • AI is best-effort and non-blocking
  • Explicit decision boundaries
  • Safe failure modes
  • Human-in-the-loop governance

🛣️ Future Enhancements

  • Real OpenAI / local LLM integration
  • Agent SLOs and metrics
  • Multi-agent collaboration
  • Cost-aware planning
  • Prompt A/B testing

📄 License

MIT
