An autonomous AI agent that diagnoses root causes of failures in distributed systems — analyzing OpenTelemetry traces and infrastructure metrics using a resilient hybrid LLM pipeline (Gemini Flash → Ollama fallback).
When a microservice request slows down or fails, tracing the root cause across dozens of spans is tedious. RCA Agent automates the entire workflow:
- Receives a `traceId` via REST
- Fetches the full span tree from Grafana Tempo
- Enriches each span with historical baseline data from H2
- Selects the optimal LLM and prompt strategy based on availability
- Returns a structured JSON report: root cause, anomaly type, confidence score, and actionable recommendation
```
POST /api/analyze/{traceId}
→ { rootCause, anomalyType, confidence, recommendation, anomalyFactor }
```
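As a minimal sketch, the call above could be built from Java with the JDK's built-in HTTP client. The class name `AnalyzeRequests`, the base URL parameter, and the empty request body are assumptions for illustration; only the method and path come from the endpoint shown above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds the analyze request for a given trace. */
final class AnalyzeRequests {
    private AnalyzeRequests() {}

    /** Builds POST {baseUrl}/api/analyze/{traceId} with an empty body (assumed). */
    static HttpRequest analyze(String baseUrl, String traceId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/analyze/" + traceId))
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the body parsed as the JSON report.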
The RCA Agent follows a hexagonal architecture. It ingests telemetry from the Kotlin services through an OTel/Prometheus/Tempo pipeline and processes distributed traces and metrics through specialized adapters to analyze system health. Finally, it uses a Hybrid LLM Layer (Gemini Flash as primary, Ollama Llama 3.2 as fallback) via LangChain4j to automate root cause diagnostics.
The image outlines the agent's analysis logic: it identifies a 15.5x latency spike in POST /payments and prunes the trace tree after confirming that the child spans are fast. Given the 3100ms delay as context, the LLM diagnoses a DATABASE_SLOW_QUERY within the local service. The agent then generates a report with 0.94 confidence that recommends adding an index to resolve the anomaly.
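The pruning step described above can be sketched roughly as follows. This is not the agent's actual code: the `Span` record, the `SpanPruner` class, the 3x cutoff, and the child span in the usage note are hypothetical. The 200ms baseline is inferred from the 3100ms duration and the 15.5x factor.

```java
import java.util.List;

/** A span with its measured duration, historical baseline, and child spans. */
record Span(String name, long durationMs, long baselineMs, List<Span> children) {
    /** Ratio of observed duration to baseline, e.g. 3100 / 200 = 15.5. */
    double anomalyFactor() {
        return (double) durationMs / baselineMs;
    }
}

final class SpanPruner {
    private SpanPruner() {}

    /** Hypothetical cutoff: a span counts as anomalous at 3x its baseline or more. */
    static final double THRESHOLD = 3.0;

    /** Descends into the most anomalous child until every child is fast. */
    static Span locate(Span span) {
        Span worst = null;
        for (Span child : span.children()) {
            if (child.anomalyFactor() >= THRESHOLD
                    && (worst == null || child.anomalyFactor() > worst.anomalyFactor())) {
                worst = child;
            }
        }
        // No anomalous child: the delay originates in this span itself.
        return worst == null ? span : locate(worst);
    }
}
```

For a `POST /orders` root whose only slow child is `POST /payments` (3100ms against a 200ms baseline) with fast children of its own, `locate` returns the `POST /payments` span, which is the context handed to the LLM.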
| Layer | Technology |
|---|---|
| Agent | Java 21, Spring Boot 3.3, LangChain4j |
| Microservices | Kotlin 2.0, Spring Boot 3.3 |
| Tracing | OpenTelemetry Java Agent, Grafana Tempo |
| Metrics | Prometheus, Grafana |
| Messaging | Apache Kafka |
| LLM (primary) | Google Gemini Flash |
| LLM (fallback) | Ollama llama3.2 (local, CPU) |
| Persistence | H2 in-memory |
| Resilience | Resilience4j — Circuit Breaker, Retry, Bulkhead, TimeLimiter |
- Docker Desktop — minimum 8GB RAM (Settings → Resources → Memory → 8192 MB)
- Gemini API key (free tier): aistudio.google.com/apikey
```bash
cp .env.example .env
# Set GEMINI_API_KEY in .env
# If left empty, the agent auto-degrades to Ollama mode
```
⚠️ Never commit `.env`; it is gitignored. `.env.example` must never contain real keys.
```bash
docker compose up -d
./scripts/demo.sh
```
The script injects an anomaly, fires a real order request, captures the traceId from the response, waits for Tempo to index the trace, calls the agent, and prints the RCA report.
| Scenario | Command | Description |
|---|---|---|
| Slow payment | `./scripts/demo.sh` | 3s latency injected in payment-service |
| Error storm | `./scripts/demo.sh errors` | 100% error rate on payment-service |
| Cascade failure | `./scripts/demo.sh cascade` | Latency + errors across multiple services |
End-to-end execution of the cascade failure scenario — latency injected in payment-service + error rate in inventory-service:
```
./scripts/demo.sh cascade
═══════════════════════════════════════════════════
RCA Agent Demo — scenario: cascade
═══════════════════════════════════════════════════
▶ Step 0: Waiting for RCA Agent to be ready...
✓ Ready
▶ Step 1: Injecting anomaly...
✓ payment-service: 3000ms latency injected
✓ inventory-service: 80% error rate injected
▶ Step 2: Firing request to order-service...
HTTP/1.1 200
{"orderId":"5acdca3e-1c15-4373-b2d6-9e1f43b10b1f","status":"FAILED","traceId":"215490281486b332bfabdcdfaef23eeb"}
✓ trace_id: 215490281486b332bfabdcdfaef23eeb
▶ Step 3: Waiting for Tempo to index trace...
✓ Trace indexed after 0 retries
▶ Step 4: Calling RCA agent...
{
  "traceId": "215490281486b332bfabdcdfaef23eeb",
  "rootCause": "The POST /orders in the order-service is failing due to high latency downstream",
  "anomalySpan": "POST /orders",
  "durationMs": 3313,
  "baselineMs": 250,
  "anomalyFactor": 13.252,
  "anomalyType": "HIGH_LATENCY_DOWNSTREAM",
  "recommendation": "Optimize the database connection to improve performance",
  "confidence": 0.8,
  "highConfidence": true,
  "anomaly": true
}
▶ Step 5: Resetting all anomaly injections...
✓ All injections reset
═══════════════════════════════════════════════════
Demo complete — scenario: cascade
═══════════════════════════════════════════════════
```
```bash
TRACE_ID=$(curl -s -X POST http://localhost:8081/orders \
  -H "Content-Type: application/json" \
  -d '{"productId": "p1", "quantity": 2}' | jq -r '.traceId')
curl -s http://localhost:8080/api/analyze/$TRACE_ID | jq
```
```json
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "rootCause": "Slow SQL execution on payments table — full table scan detected",
  "anomalySpan": "db.query SELECT payments",
  "durationMs": 3980,
  "baselineMs": 45,
  "anomalyFactor": 88.4,
  "recommendation": "ANALYZE payments; add composite index on (order_id, status)",
  "confidence": 0.94
}
```
```
Request → [Gemini Flash] ── success ──────────────────────→ RCA Report
                └── quota / timeout / error ──→ [Ollama llama3.2] ── success → RCA Report
                                                      └── fail ──→ Deterministic fallback report
```
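The fallback order in the diagram above could be expressed along these lines. This is a sketch under assumptions, not the project's implementation: `FallbackChain` and its signature are hypothetical, and each analyzer is modeled as a plain function from trace ID to report text.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical chain: primary model, then fallback model, then a deterministic report. */
final class FallbackChain {
    private FallbackChain() {}

    /** Tries each analyzer in order and returns the first report that succeeds. */
    static String analyze(String traceId,
                          List<Function<String, String>> analyzers,
                          Function<String, String> deterministicFallback) {
        for (Function<String, String> analyzer : analyzers) {
            try {
                return analyzer.apply(traceId);
            } catch (RuntimeException e) {
                // Quota, timeout, or any other error: move on to the next analyzer.
            }
        }
        // Every model failed, so build the report without an LLM.
        return deterministicFallback.apply(traceId);
    }
}
```

Passing the Gemini call first and the Ollama call second reproduces the order shown in the diagram; the deterministic report is reached only when both throw.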
- Dynamic Prompt Strategy: two prompt variants optimized per model:
  - Standard prompt → Gemini: full span tree with OTel attributes, system metrics, multi-rule classification
  - Lite prompt → Ollama: minimal context designed for 1b parameter constraints
- Native JSON Mode: Ollama is configured with `.format("json")`, enforcing structured output at the inference level, not just as a prompt instruction
- Quota Circuit Breaker: if Gemini returns a 429, the agent stops cloud calls for 60 seconds and routes exclusively to Ollama using an `AtomicLong` timestamp
The Gemini Flash free tier has aggressive rate limits (15 RPM), which initially caused cascading failures. The agent now specifically detects 429 RESOURCE_EXHAUSTED errors and automatically skips Gemini for 60 seconds, falling back to Ollama.
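A minimal sketch of the 60-second skip window, assuming an `AtomicLong` that stores the time until which Gemini is blocked. The class name `QuotaGate`, its method names, and the injected clock are hypothetical and exist only to make the idea testable.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical gate that blocks cloud calls for a fixed window after a 429. */
final class QuotaGate {
    private static final long COOLDOWN_MS = 60_000;

    // Epoch millis until which cloud calls are skipped; 0 means never blocked.
    private final AtomicLong blockedUntilMs = new AtomicLong(0);
    private final LongSupplier clock;

    QuotaGate(LongSupplier clock) {
        this.clock = clock;
    }

    /** Call when the cloud model answers 429 RESOURCE_EXHAUSTED. */
    void recordQuotaExhausted() {
        blockedUntilMs.set(clock.getAsLong() + COOLDOWN_MS);
    }

    /** True once the cooldown has elapsed, or if no 429 was ever recorded. */
    boolean cloudAllowed() {
        return clock.getAsLong() >= blockedUntilMs.get();
    }
}
```

In production the clock would be `System::currentTimeMillis`. A single atomic timestamp keeps the check lock-free, at the cost of not tracking failure rates the way a full circuit breaker does.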
Additionally, older models like gemini-1.5-flash have been retired, leading to 404 errors. This project is configured for gemini-3.1-flash-lite (current stable), but you can switch to gemini-2.5-flash or gemini-3-flash by updating LLM_STANDARD_MODEL in your .env.
| Pattern | Applied to | Configuration |
|---|---|---|
| Circuit Breaker | Tempo | Opens at 100% failure rate, recovers after 30s |
| Retry + Exponential Backoff | LLM | 3 attempts, 1s base, 2x multiplier |
| Bulkhead | LLM | Max 5 concurrent calls, 2s wait |
| TimeLimiter | Tempo / LLM | 5s / 60s hard timeout |
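
To make the retry row concrete, the sketch below computes the wait schedule implied by "3 attempts, 1s base, 2x multiplier". It is plain Java for illustration and not the Resilience4j configuration itself; `BackoffSchedule` is a hypothetical name.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the waits between retry attempts. */
final class BackoffSchedule {
    private BackoffSchedule() {}

    /** Returns one wait per retry, so maxAttempts - 1 entries in total. */
    static List<Duration> waits(int maxAttempts, Duration base, double multiplier) {
        List<Duration> waits = new ArrayList<>();
        double millis = base.toMillis();
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier; // grow the wait before the next retry
        }
        return waits;
    }
}
```

With the values from the table, `waits(3, Duration.ofSeconds(1), 2.0)` yields waits of 1s and 2s: three attempts mean at most two pauses before the call is given up.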
Ollama 1b model — at 1 billion parameters, llama3.2:1b has a strong prior toward DATABASE_SLOW_QUERY regardless of actual span data. This is a fundamental model capability constraint, not a prompt engineering problem. The lite prompt and JSON mode mitigate parse failures but cannot fix reasoning quality. For accurate multi-signal classification use llama3.2:3b or higher, or configure Gemini.
LangChain4j Ollama client — buffers the full response before returning. Streaming for structured outputs was not available at time of implementation, adding noticeable latency on CPU inference.
Baseline cold start — on first run with no history in H2, all services default to 200ms baseline. This may produce false positives for fast-completing spans until enough requests are processed.
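The cold-start behaviour can be illustrated with a small in-memory stand-in for the H2-backed baseline. `BaselineStore` is hypothetical, and using the running mean as the baseline is an assumption; only the 200ms default comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory baseline: mean duration per service, 200ms until data exists. */
final class BaselineStore {
    static final double DEFAULT_BASELINE_MS = 200.0;

    // Per service: index 0 holds the sum of durations, index 1 the sample count.
    private final Map<String, double[]> stats = new HashMap<>();

    void record(String service, double durationMs) {
        double[] s = stats.computeIfAbsent(service, k -> new double[2]);
        s[0] += durationMs;
        s[1] += 1;
    }

    /** Returns the learned mean, or the default when the service has no history. */
    double baselineMs(String service) {
        double[] s = stats.get(service);
        return s == null ? DEFAULT_BASELINE_MS : s[0] / s[1];
    }
}
```

Until `record` has been called for a service, every anomaly factor for it is computed against 200ms regardless of how the service normally behaves, which is why early reports should be read with care.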
Tempo API — the adapter uses /api/traces/ (v1) which returns batches[]. SpanTreeMapper handles both batches and resourceSpans transparently. If your Tempo instance exposes /api/v2/, update the URI in TempoTraceAdapter.
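Handling both response shapes might look like the sketch below. It assumes the JSON body has already been parsed into nested maps and lists; `TempoPayloads` is a hypothetical name and does not reflect how `SpanTreeMapper` is actually written.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper that reads the span groups under either top-level key. */
final class TempoPayloads {
    private TempoPayloads() {}

    /** Returns the entries under "batches" if present, otherwise under "resourceSpans". */
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> resourceSpans(Map<String, Object> body) {
        Object spans = body.containsKey("batches")
                ? body.get("batches")
                : body.get("resourceSpans");
        // A body with neither key is treated as an empty trace.
        return spans == null ? List.of() : (List<Map<String, Object>>) spans;
    }
}
```

Normalizing at this single point lets the rest of the mapping code ignore which API version produced the payload.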
| Tool | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| RCA Agent API | http://localhost:8080/api/analyze/{traceId} | — |
| Prometheus | http://localhost:9090 | — |
| Tempo | http://localhost:3200 | — |
```
rca-agent/
├── agent/                          # Java 21: the RCA agent
│   └── src/main/java/com/rcaagent/
│       ├── adapters/in/            # REST controllers (inbound)
│       ├── adapters/out/           # Tempo, Prometheus, LLM, H2 (outbound)
│       ├── application/            # Use case orchestration
│       ├── domain/                 # Pure domain objects, no framework deps
│       ├── infrastructure/         # Spring config, health indicators
│       └── ports/                  # Port interfaces (in/out)
├── services/                       # Kotlin 2.0 microservices
│   ├── order-service/
│   ├── payment-service/
│   ├── inventory-service/
│   └── notification-service/
├── infra/                          # Tempo, Prometheus, Grafana, OTel config
├── scripts/
│   ├── demo.sh                     # End-to-end demo with anomaly injection
│   └── benchmark.sh                # Accuracy evaluation runner
└── docker-compose.yml
```
| Variable | Default | Description |
|---|---|---|
| `LLM_MODE` | `gemini` | `gemini` or `ollama` |
| `GEMINI_API_KEY` | (empty) | Google AI Studio key. If empty, auto-routes to Ollama |
| `LLM_STANDARD_MODEL` | `gemini-3.1-flash-lite` | Gemini model name |
| `LLM_LOCAL_MODEL` | `llama3.2:1b` | Ollama model name |
| `TEMPO_URL` | `http://tempo:3200` | Tempo base URL |
| `PROMETHEUS_URL` | `http://prometheus:9090` | Prometheus base URL |
| `OLLAMA_URL` | `http://ollama:11434` | Ollama base URL |
| `RCA_CONFIDENCE_THRESHOLD` | `0.75` | Minimum confidence for high-confidence flag |
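
As a rough illustration of how these variables interact, the sketch below resolves the effective LLM mode and the high-confidence flag from an environment map. `AgentConfig` is hypothetical, and both the precedence of the empty-key rule over `LLM_MODE` and the use of `>=` for the threshold are assumptions.

```java
import java.util.Map;

/** Hypothetical resolver for the variables listed in the table above. */
final class AgentConfig {
    private AgentConfig() {}

    /** An empty or missing GEMINI_API_KEY forces Ollama; otherwise LLM_MODE applies. */
    static String effectiveMode(Map<String, String> env) {
        String key = env.getOrDefault("GEMINI_API_KEY", "");
        if (key.isBlank()) {
            return "ollama";
        }
        return env.getOrDefault("LLM_MODE", "gemini");
    }

    /** Assumes the flag is set when confidence reaches the threshold (default 0.75). */
    static boolean highConfidence(Map<String, String> env, double confidence) {
        double threshold =
                Double.parseDouble(env.getOrDefault("RCA_CONFIDENCE_THRESHOLD", "0.75"));
        return confidence >= threshold;
    }
}
```

Calling `AgentConfig.effectiveMode(System.getenv())` would read the real environment. With the default threshold, the 0.8 confidence from the cascade demo is flagged as high confidence, matching the `highConfidence: true` field in that report.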
Built following SOLID principles and Hexagonal Architecture to demonstrate the viability of AI agents in platform engineering.
Mar10-Labs — GitHub

