Agentic RCA Platform

Cloud-agnostic, Kubernetes-first, Apache-2.0 open-source platform for alert-driven root-cause analysis (RCA).

Product Boundaries (v1)

RCA-only output (top-3 hypotheses + evidence + confidence), no autonomous remediation.
Read-only investigations across connectors.
Slack + Jira publishing supported.
Compare-first rollout for agentic stages (compare -> active).
Formal eval gating and human-adjudication support.

What Is Implemented

Temporal-orchestrated investigation workflow with six stages:
1. resolve_service_identity
2. build_investigation_plan
3. collect_evidence
4. synthesize_rca_report
5. publish_report
6. emit_eval_event
Agentic resolver/planner runtime with:
- model routing (primary/fallback),
- tool-calling loop with limits,
- Pydantic output validation,
- compare-mode diff capture,
- strict fail behavior in active mode.
Tool registry across built-in connector tools + MCP-discovered tools.
Settings APIs for MCP servers, prompt profiles, and rollout mode.
Web UI console (services/web-ui) with:
- past/ongoing incident views,
- workflow run timeline/inspector,
- interactive mapper (drag/pan/zoom/minimap/reset),
- mapper layout persistence per tenant+user+workflow key,
- settings pages for connectors, LLM routes, MCP, prompts, rollout.
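The compare-first rollout and primary/fallback model routing listed above can be illustrated with a minimal sketch. It is not the platform's actual runtime: `StageResult`, `call_with_fallback`, and `run_stage` are hypothetical names, models are stood in for by plain callables, and the only behavior taken from this README is that compare mode captures a diff while active mode fails strictly.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Model = Callable[[str], dict]  # hypothetical: a model call that returns a parsed dict


@dataclass
class StageResult:
    output: dict                  # what the workflow continues with
    agent_output: Optional[dict]  # what the agent proposed, if it produced anything
    diff: dict                    # key -> (baseline value, agent value) where they differ


def call_with_fallback(primary: Model, fallback: Model, prompt: str) -> dict:
    """Try the primary route; if it raises, try the fallback route once."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)


def run_stage(mode: str, baseline: dict, primary: Model, fallback: Model,
              prompt: str) -> StageResult:
    if mode == "compare":
        # Compare mode: the baseline output stays authoritative and the
        # agent's answer is only recorded for inspection.
        try:
            agent_output = call_with_fallback(primary, fallback, prompt)
        except Exception:
            return StageResult(output=baseline, agent_output=None, diff={})
        keys = sorted(set(baseline) | set(agent_output))
        diff = {k: (baseline.get(k), agent_output.get(k))
                for k in keys if baseline.get(k) != agent_output.get(k)}
        return StageResult(output=baseline, agent_output=agent_output, diff=diff)
    if mode == "active":
        # Active mode: the agent's answer is used, and a failure of both
        # routes propagates instead of silently falling back to the baseline.
        agent_output = call_with_fallback(primary, fallback, prompt)
        return StageResult(output=agent_output, agent_output=agent_output, diff={})
    raise ValueError(f"unknown rollout mode: {mode}")
```

In this sketch a stage can stay in compare mode until the recorded diffs look acceptable, which mirrors the compare -> active progression described under Product Boundaries.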

Tech Stack

Runtime: Python
Orchestration: Temporal
UI: Next.js + TypeScript + React Flow
Data defaults: PostgreSQL + Redis (platform target), in-memory store for local dev
Deployment: Kubernetes (Helm + CRDs)
License: Apache-2.0

Repository Layout

services/ingest-api
services/orchestrator
services/analysis-engine
services/eval-service
services/web-ui
platform_core
connectors/core/newrelic
connectors/core/azure
connectors/core/otel
sdk/plugin-sdk-python
charts/rca-platform
crds/
evals/golden-datasets/
examples/

Local Development

Prerequisites

Python 3.11+
Node.js 18+
Docker (for running Temporal locally)

One-Command Local Stack (Docker Compose)

Start all local services (Temporal, ingest API, worker, web UI):

docker compose -f docker-compose.local.yml up -d --build

Or via make:

make compose-up

Stop the stack:

make compose-down

Tail logs:

make compose-logs

Service URLs:

Web UI: http://localhost:3001
Ingest API: http://localhost:8000
Temporal UI: http://localhost:8080

1) Install dependencies

python3 -m venv .venv
source .venv/bin/activate
make setup
make web-install

2) Start Temporal

docker compose -f infra/temporal/docker-compose.yml up -d

3) Run services (3 terminals)

Terminal A:

.venv/bin/python -m uvicorn services.ingest-api.app.main:app --host 0.0.0.0 --port 8000

Terminal B:

.venv/bin/python -m services.orchestrator.app.worker

Terminal C:

cd services/web-ui
npm run dev -- --port 3001

4) Open local UIs

Web UI: http://localhost:3001
Temporal UI: http://localhost:8080
Ingest health: http://localhost:8000/v1/health

5) Seed demo incidents

TS=$(date -u +"%Y%m%dT%H%M%SZ")
curl -sS -X POST http://localhost:8000/v1/alerts \
  -H 'content-type: application/json' \
  --data "{
    \"source\":\"newrelic\",
    \"severity\":\"critical\",
    \"incident_key\":\"demo-checkout-$TS\",
    \"entity_ids\":[\"service-checkout\"],
    \"timestamps\":{\"triggered_at\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"},
    \"raw_payload_ref\":\"newrelic://demo-checkout-$TS\",
    \"raw_payload\":{\"condition\":\"error_rate_spike\"}
  }"

Tip: to keep incidents in the running state for the Ongoing view demo, stop the worker temporarily before posting alerts.
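For scripted seeding, the same request can be sent from Python. The sketch below uses only the standard library and mirrors the fields of the curl example; `build_demo_alert` and `post_alert` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_demo_alert(now: datetime) -> dict:
    """Build the same demo payload as the curl example for a given UTC time."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return {
        "source": "newrelic",
        "severity": "critical",
        "incident_key": f"demo-checkout-{stamp}",
        "entity_ids": ["service-checkout"],
        "timestamps": {"triggered_at": now.strftime("%Y-%m-%dT%H:%M:%SZ")},
        "raw_payload_ref": f"newrelic://demo-checkout-{stamp}",
        "raw_payload": {"condition": "error_rate_spike"},
    }


def post_alert(alert: dict, base_url: str = "http://localhost:8000") -> int:
    """POST the alert to /v1/alerts and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/v1/alerts",
        data=json.dumps(alert).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With the ingest API running, `post_alert(build_demo_alert(datetime.now(timezone.utc)))` should behave like the curl command. If API_KEY is set on the backend, the request also has to carry the key; the Grafana example in this README passes it as an apiKey query parameter.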

API Surface (Current)

Core Investigation APIs

POST /v1/alerts
POST /v1/alerts/newrelic
POST /v1/alerts/grafana
GET /v1/investigations
GET /v1/investigations/{id}
GET /v1/investigations/{id}/events (SSE)
POST /v1/investigations/{id}/runs
GET /v1/investigations/{id}/runs
GET /v1/investigations/{id}/runs/{run_id}
GET /v1/investigations/{id}/runs/{run_id}/events (SSE)
POST /v1/investigations/{id}/rerun
POST /v1/internal/runs/events (internal callback)
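The two events endpoints marked (SSE) stream Server-Sent Events. As a rough sketch of how a client might consume them, the parser below follows the generic SSE wire format (event: and data: fields, comment lines starting with a colon, a blank line ending each event). The event names and payload shapes this API emits are not documented here, so `stage_completed` in the usage note is a made-up example.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event": ..., "data": ...} dicts from raw SSE text lines."""
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any data arrived.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)
```

Feeding it the lines `event: stage_completed`, `data: {"stage": "collect_evidence"}` and a blank line yields one dict with that event name and the JSON string as data; decoding the data with `json.loads` is left to the caller because the payload format is an assumption.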

Settings APIs

GET /v1/settings/connectors
PUT /v1/settings/connectors/{provider}
POST /v1/settings/connectors/{provider}/test
GET /v1/settings/llm-routes
PUT /v1/settings/llm-routes
GET /v1/settings/mcp-servers
PUT /v1/settings/mcp-servers/{server_id}
POST /v1/settings/mcp-servers/{server_id}/test
GET /v1/settings/mcp-servers/{server_id}/tools
GET /v1/settings/agent-prompts
PUT /v1/settings/agent-prompts/{stage_id}
GET /v1/settings/agent-rollout
PUT /v1/settings/agent-rollout

UI Layout APIs

GET /v1/ui/workflow-layouts/{workflow_key}
PUT /v1/ui/workflow-layouts/{workflow_key}

Ops APIs

GET /v1/me
GET /v1/metrics
GET /v1/health

Environment Variables

Backend:

TEMPORAL_AUTOSTART_ENABLED (default true)
TEMPORAL_ADDRESS (default localhost:7233)
TEMPORAL_TASK_QUEUE (default rca-investigations)
ORCHESTRATOR_EVENT_BASE_URL (default http://localhost:8000)
ORCHESTRATOR_EVENT_TOKEN (optional)
API_KEY (optional; if set, required by API)
CORS_ALLOW_ORIGINS (default http://localhost:3000; the local web UI runs on port 3001, so add http://localhost:3001 when using it)
RCA_MODEL_ALIAS_CODEX (required when LLM route uses friendly alias codex)
RCA_MODEL_ALIAS_CLAUDE (required when LLM route uses friendly alias claude)
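The backend defaults above can be summarized as a small settings loader. This is only an illustration of the documented defaults, not the platform's configuration code: `BackendSettings` and `load_backend_settings` are hypothetical names, and both the boolean parsing and the comma-separated format for CORS_ALLOW_ORIGINS are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class BackendSettings:
    temporal_autostart_enabled: bool
    temporal_address: str
    temporal_task_queue: str
    orchestrator_event_base_url: str
    orchestrator_event_token: Optional[str]
    api_key: Optional[str]
    cors_allow_origins: Tuple[str, ...]


def load_backend_settings(env: Mapping[str, str] = os.environ) -> BackendSettings:
    """Read backend variables, falling back to the defaults listed in this README."""
    autostart = env.get("TEMPORAL_AUTOSTART_ENABLED", "true")
    origins = env.get("CORS_ALLOW_ORIGINS", "http://localhost:3000")
    return BackendSettings(
        # Assumption: common truthy spellings are accepted.
        temporal_autostart_enabled=autostart.strip().lower() in {"1", "true", "yes"},
        temporal_address=env.get("TEMPORAL_ADDRESS", "localhost:7233"),
        temporal_task_queue=env.get("TEMPORAL_TASK_QUEUE", "rca-investigations"),
        orchestrator_event_base_url=env.get(
            "ORCHESTRATOR_EVENT_BASE_URL", "http://localhost:8000"
        ),
        orchestrator_event_token=env.get("ORCHESTRATOR_EVENT_TOKEN") or None,
        api_key=env.get("API_KEY") or None,
        # Assumption: multiple origins are separated by commas.
        cors_allow_origins=tuple(o.strip() for o in origins.split(",") if o.strip()),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before starting the services.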

Web UI:

NEXT_PUBLIC_API_BASE_URL (default http://localhost:8000)
INTERNAL_API_BASE_URL (optional server-side override for containerized web UI)
NEXT_PUBLIC_API_KEY (optional)
NEXT_PUBLIC_DEFAULT_TENANT (default default)
NEXT_PUBLIC_DEFAULT_ROLE (default admin)
NEXT_PUBLIC_DEFAULT_USER (default web-ui)

MCP auth helpers:

NEW_RELIC_API_KEY (recommended for New Relic MCP when secret_ref_key=NEW_RELIC_API_KEY)
MCP_NEWRELIC_INCLUDE_TAGS (optional comma-separated include tags forwarded as include-tags header)
GRAFANA_MCP_API_KEY (optional for Grafana MCP endpoints that require API key bearer auth)
GRAFANA_URL (used by local grafana/mcp-grafana sidecar, defaults to local demo URL)
GRAFANA_SERVICE_ACCOUNT_TOKEN (Grafana token for local grafana/mcp-grafana sidecar)
GRAFANA_ORG_ID (optional org id for local grafana/mcp-grafana)
JAEGER_BASE_URL (used by local jaeger-mcp sidecar, defaults to http://host.docker.internal:50734)
JAEGER_API_PREFIX (defaults to /jaeger/ui/api)
JAEGER_TIMEOUT_SECONDS (default 10)

New Relic MCP Setup (Claude/Codex Tool Access)

Add your New Relic user/API key to backend env:
- NEW_RELIC_API_KEY=...
In Settings -> MCP Server Registry, add:
- server_id: newrelic
- base_url: https://mcp.newrelic.com/mcp/
- secret_ref_key: NEW_RELIC_API_KEY
- enabled: true
Click Test and then Load Tools in the MCP section.
Keep agent rollout in compare until traces/outputs look correct, then switch to active.

Notes:

MCP client now uses streamable HTTP JSON-RPC (initialize, tools/list, tools/call).
For compatibility, legacy /tools and /invoke fallback is still supported.
The client sends both Authorization: Bearer <token> and Api-Key: <token> when a token is resolved.
PUT /v1/settings/llm-routes validates model aliases; unresolved codex/claude aliases return HTTP 400.
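To make the notes above concrete, here is a hedged sketch of what the JSON-RPC 2.0 request bodies and the dual auth headers could look like. The helper names are hypothetical and the real client may build its requests differently; the Accept header reflects the usual streamable HTTP convention of accepting both JSON and an event stream, and the tool name in the usage note is invented.

```python
import json
from typing import Optional


def auth_headers(token: Optional[str]) -> dict:
    """Headers for an MCP request; both auth headers are sent when a token exists."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
        headers["Api-Key"] = token
    return headers


def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request such as initialize or tools/list."""
    body = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        body["params"] = params
    return json.dumps(body)


def tools_call_request(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a tools/call request for a single tool invocation."""
    return jsonrpc_request(
        request_id, "tools/call", {"name": tool_name, "arguments": arguments}
    )
```

For example, `jsonrpc_request(1, "tools/list")` produces the listing request, and `tools_call_request(2, "example_tool", {"query": "..."})` shows the shape of a call; actual tool names come from the tools/list response of the registered server.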

Grafana Alerting + Grafana MCP Setup

Configure the Grafana webhook contact point to call:
- POST /v1/alerts/grafana
- Example public URL: https://<your-tunnel-domain>/v1/alerts/grafana?apiKey=<API_KEY>
The Grafana webhook payload is normalized into canonical AlertEnvelope (source=grafana) and starts the same Temporal workflow.
Optional local Grafana MCP sidecar:
- docker compose -f docker-compose.local.yml -f docker-compose.grafana-mcp.local.yml up -d grafana-mcp
- or make compose-up-grafana-mcp
- Then register MCP server in Settings:
  - server_id: grafana
  - base_url: http://grafana-mcp:8000/mcp
  - secret_ref_key: leave blank for local sidecar unless you added endpoint auth.

Jaeger MCP Setup (Trace Access for Agentic Stages)

Start local Jaeger MCP sidecar:
- docker compose -f docker-compose.local.yml -f docker-compose.jaeger-mcp.local.yml up -d --build jaeger-mcp
- or make compose-up-jaeger-mcp
Register in Settings -> MCP Server Registry:
- server_id: jaeger
- base_url: http://jaeger-mcp:8000/mcp
- secret_ref_key: leave blank (local sidecar has no auth by default)
Click Test and Load Tools.
Optional: start both local MCP sidecars together:
- make compose-up-all-mcp
Re-register both MCP servers in the ingest settings after a backend restart:
- make bootstrap-local-mcp

Testing

PYTHONPATH=. .venv/bin/pytest -q
cd services/web-ui && npm run build

Kubernetes Assets

Helm chart: charts/rca-platform
CRDs:
- ConnectorConfig
- ModelRoute
- InvestigationPolicy
- CatalogSource
- EvalPolicy

Documentation

API summary: docs/api.md
Architecture: docs/architecture.md
OpenAPI: docs/openapi.yaml
