Cloud-agnostic, Kubernetes-first, Apache-2.0 open-source platform for alert-driven root-cause analysis (RCA).
- RCA-only output (top-3 hypotheses + evidence + confidence), no autonomous remediation.
- Read-only investigations across connectors.
- Slack + Jira publishing supported.
- Compare-first rollout for agentic stages (`compare` -> `active`).
- Formal eval gating and human-adjudication support.
- Temporal-orchestrated investigation workflow with six stages:
  `resolve_service_identity` -> `build_investigation_plan` -> `collect_evidence` -> `synthesize_rca_report` -> `publish_report` -> `emit_eval_event`
- Agentic resolver/planner runtime with:
  - model routing (primary/fallback),
  - tool-calling loop with limits,
  - Pydantic output validation,
  - compare-mode diff capture,
  - strict fail behavior in active mode.
- Tool registry across built-in connector tools + MCP-discovered tools.
- Settings APIs for MCP servers, prompt profiles, and rollout mode.
- Web UI console (`services/web-ui`) with:
  - past/ongoing incident views,
  - workflow run timeline/inspector,
  - interactive mapper (drag/pan/zoom/minimap/reset),
  - mapper layout persistence per tenant+user+workflow key,
  - settings pages for connectors, LLM routes, MCP, prompts, rollout.
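The RCA-only output contract above (top-3 hypotheses with evidence and confidence) can be pictured as a validated model. The platform uses Pydantic for output validation; the sketch below uses only stdlib dataclasses so it is self-contained, and every field name here is an assumption for illustration, not the platform's actual schema:

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One ranked root-cause hypothesis (illustrative field set)."""
    summary: str
    evidence: list[str]
    confidence: float  # expected in [0.0, 1.0]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")


@dataclass
class RcaReport:
    """RCA-only output: ranked hypotheses, no remediation actions."""
    incident_key: str
    hypotheses: list[Hypothesis]  # ranked, at most three

    def __post_init__(self) -> None:
        if not 1 <= len(self.hypotheses) <= 3:
            raise ValueError("report must carry between one and three hypotheses")


report = RcaReport(
    incident_key="demo-checkout",
    hypotheses=[
        Hypothesis("Error-rate spike after deploy", ["deploy marker", "5xx burst"], 0.72),
    ],
)
```

The `__post_init__` checks stand in for the Pydantic validation the runtime performs: an out-of-range confidence or a fourth hypothesis is rejected rather than published.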
- Runtime: Python
- Orchestration: Temporal
- UI: Next.js + TypeScript + React Flow
- Data defaults: PostgreSQL + Redis (platform target), in-memory store for local dev
- Deployment: Kubernetes (Helm + CRDs)
- License: Apache-2.0
```
services/ingest-api
services/orchestrator
services/analysis-engine
services/eval-service
services/web-ui
platform_core
connectors/core/newrelic
connectors/core/azure
connectors/core/otel
sdk/plugin-sdk-python
charts/rca-platform
crds/
evals/golden-datasets/
examples/
```
- Python 3.11+
- Node.js 18+
- Docker (for Temporal local)
Start all local services (Temporal, ingest API, worker, web UI):
```shell
docker compose -f docker-compose.local.yml up -d --build
```

Or via make:

```shell
make compose-up
```

Stop the stack:

```shell
make compose-down
```

Tail logs:

```shell
make compose-logs
```

Service URLs:

- Web UI: http://localhost:3001
- Ingest API: http://localhost:8000
- Temporal UI: http://localhost:8080
```shell
python3 -m venv .venv
source .venv/bin/activate
make setup
make web-install
```

Start Temporal:

```shell
docker compose -f infra/temporal/docker-compose.yml up -d
```

Terminal A:

```shell
.venv/bin/python -m uvicorn services.ingest-api.app.main:app --host 0.0.0.0 --port 8000
```

Terminal B:

```shell
.venv/bin/python -m services.orchestrator.app.worker
```

Terminal C:

```shell
cd services/web-ui
npm run dev -- --port 3001
```

- Web UI: http://localhost:3001
- Temporal UI: http://localhost:8080
- Ingest health: http://localhost:8000/v1/health
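Once the three terminals are up, waiting for the ingest API to become healthy can be scripted. This is a small illustrative helper, not part of the repo: the probe callable is injected so the polling logic is independent of any HTTP client, under the assumption that `/v1/health` answers 200 when ready:

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `probe` until it returns True or the attempts are exhausted."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Hypothetical usage against the local ingest API:
# import urllib.request
# ok = wait_until_healthy(
#     lambda: urllib.request.urlopen("http://localhost:8000/v1/health").status == 200
# )
```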
```shell
TS=$(date -u +"%Y%m%dT%H%M%SZ")
curl -sS -X POST http://localhost:8000/v1/alerts \
  -H 'content-type: application/json' \
  --data "{
    \"source\":\"newrelic\",
    \"severity\":\"critical\",
    \"incident_key\":\"demo-checkout-$TS\",
    \"entity_ids\":[\"service-checkout\"],
    \"timestamps\":{\"triggered_at\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"},
    \"raw_payload_ref\":\"newrelic://demo-checkout-$TS\",
    \"raw_payload\":{\"condition\":\"error_rate_spike\"}
  }"
```

Tip: stop the worker temporarily before posting alerts if you want incidents to remain in the running state for the Ongoing view demo.
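The same demo alert can be built from Python instead of shell. This sketch mirrors the field set of the curl example above using only the stdlib; posting the body is left to your HTTP client of choice:

```python
import json
from datetime import datetime, timezone


def make_demo_alert(service: str = "checkout") -> dict:
    """Build an alert envelope matching the curl example's field set."""
    now = datetime.now(timezone.utc)
    ts = now.strftime("%Y%m%dT%H%M%SZ")
    incident_key = f"demo-{service}-{ts}"
    return {
        "source": "newrelic",
        "severity": "critical",
        "incident_key": incident_key,
        "entity_ids": [f"service-{service}"],
        "timestamps": {"triggered_at": now.strftime("%Y-%m-%dT%H:%M:%SZ")},
        "raw_payload_ref": f"newrelic://{incident_key}",
        "raw_payload": {"condition": "error_rate_spike"},
    }


body = json.dumps(make_demo_alert())
# POST `body` to http://localhost:8000/v1/alerts with content-type application/json
```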
- `POST /v1/alerts`
- `POST /v1/alerts/newrelic`
- `POST /v1/alerts/grafana`
- `GET /v1/investigations`
- `GET /v1/investigations/{id}`
- `GET /v1/investigations/{id}/events` (SSE)
- `POST /v1/investigations/{id}/runs`
- `GET /v1/investigations/{id}/runs`
- `GET /v1/investigations/{id}/runs/{run_id}`
- `GET /v1/investigations/{id}/runs/{run_id}/events` (SSE)
- `POST /v1/investigations/{id}/rerun`
- `POST /v1/internal/runs/events` (internal callback)

- `GET /v1/settings/connectors`
- `PUT /v1/settings/connectors/{provider}`
- `POST /v1/settings/connectors/{provider}/test`
- `GET /v1/settings/llm-routes`
- `PUT /v1/settings/llm-routes`
- `GET /v1/settings/mcp-servers`
- `PUT /v1/settings/mcp-servers/{server_id}`
- `POST /v1/settings/mcp-servers/{server_id}/test`
- `GET /v1/settings/mcp-servers/{server_id}/tools`
- `GET /v1/settings/agent-prompts`
- `PUT /v1/settings/agent-prompts/{stage_id}`
- `GET /v1/settings/agent-rollout`
- `PUT /v1/settings/agent-rollout`

- `GET /v1/ui/workflow-layouts/{workflow_key}`
- `PUT /v1/ui/workflow-layouts/{workflow_key}`

- `GET /v1/me`
- `GET /v1/metrics`
- `GET /v1/health`
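The `/events` endpoints above are SSE streams. A minimal parser for the wire format can look like the sketch below; it assumes standard `data:`-prefixed SSE frames carrying JSON, while the actual event payload shape (the `stage`/`status` keys shown) is a hypothetical example:

```python
import json


def parse_sse_events(raw: str) -> list[dict]:
    """Extract JSON payloads from the `data:` lines of an SSE stream."""
    events = []
    for block in raw.split("\n\n"):  # SSE frames are separated by blank lines
        for line in block.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events


stream = (
    'data: {"stage": "collect_evidence", "status": "running"}\n\n'
    'data: {"stage": "collect_evidence", "status": "done"}\n\n'
)
events = parse_sse_events(stream)
```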
Backend:
- `TEMPORAL_AUTOSTART_ENABLED` (default `true`)
- `TEMPORAL_ADDRESS` (default `localhost:7233`)
- `TEMPORAL_TASK_QUEUE` (default `rca-investigations`)
- `ORCHESTRATOR_EVENT_BASE_URL` (default `http://localhost:8000`)
- `ORCHESTRATOR_EVENT_TOKEN` (optional)
- `API_KEY` (optional; if set, required by the API)
- `CORS_ALLOW_ORIGINS` (default `http://localhost:3000`; set to include `http://localhost:3001` for the local web UI)
- `RCA_MODEL_ALIAS_CODEX` (required when an LLM route uses the friendly alias `codex`)
- `RCA_MODEL_ALIAS_CLAUDE` (required when an LLM route uses the friendly alias `claude`)
Web UI:
- `NEXT_PUBLIC_API_BASE_URL` (default `http://localhost:8000`)
- `INTERNAL_API_BASE_URL` (optional server-side override for the containerized web UI)
- `NEXT_PUBLIC_API_KEY` (optional)
- `NEXT_PUBLIC_DEFAULT_TENANT` (default `default`)
- `NEXT_PUBLIC_DEFAULT_ROLE` (default `admin`)
- `NEXT_PUBLIC_DEFAULT_USER` (default `web-ui`)
MCP auth helpers:
- `NEW_RELIC_API_KEY` (recommended for New Relic MCP when `secret_ref_key=NEW_RELIC_API_KEY`)
- `MCP_NEWRELIC_INCLUDE_TAGS` (optional comma-separated include tags forwarded as the `include-tags` header)
- `GRAFANA_MCP_API_KEY` (optional, for Grafana MCP endpoints that require API-key bearer auth)
- `GRAFANA_URL` (used by the local `grafana/mcp-grafana` sidecar; defaults to the local demo URL)
- `GRAFANA_SERVICE_ACCOUNT_TOKEN` (Grafana token for the local `grafana/mcp-grafana` sidecar)
- `GRAFANA_ORG_ID` (optional org id for the local `grafana/mcp-grafana` sidecar)
- `JAEGER_BASE_URL` (used by the local `jaeger-mcp` sidecar; defaults to `http://host.docker.internal:50734`)
- `JAEGER_API_PREFIX` (defaults to `/jaeger/ui/api`)
- `JAEGER_TIMEOUT_SECONDS` (default `10`)
- Add your New Relic user/API key to the backend env: `NEW_RELIC_API_KEY=...`
- In Settings -> MCP Server Registry, add:
  - `server_id`: `newrelic`
  - `base_url`: `https://mcp.newrelic.com/mcp/`
  - `secret_ref_key`: `NEW_RELIC_API_KEY`
  - `enabled`: `true`
- Click Test and then Load Tools in the MCP section.
- Keep agent rollout in `compare` until traces/outputs look correct, then switch to `active`.
Notes:
- The MCP client now uses streamable HTTP JSON-RPC (`initialize`, `tools/list`, `tools/call`).
- For compatibility, the legacy `/tools` and `/invoke` fallback is still supported.
- The client sends both `Authorization: Bearer <token>` and `Api-Key: <token>` when a token is resolved.
- `PUT /v1/settings/llm-routes` validates model aliases; unresolved `codex`/`claude` aliases return HTTP 400.
- Configure the Grafana webhook contact point to `POST /v1/alerts/grafana`.
- Example public URL: `https://<your-tunnel-domain>/v1/alerts/grafana?apiKey=<API_KEY>`
- The Grafana webhook payload is normalized into the canonical `AlertEnvelope` (`source=grafana`) and starts the same Temporal workflow.
- Optional local Grafana MCP sidecar:
  `docker compose -f docker-compose.local.yml -f docker-compose.grafana-mcp.local.yml up -d grafana-mcp`
  or `make compose-up-grafana-mcp`
- Then register the MCP server in Settings:
  - `server_id`: `grafana`
  - `base_url`: `http://grafana-mcp:8000/mcp`
  - `secret_ref_key`: leave blank for the local sidecar unless you added endpoint auth.
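The normalization step can be pictured as a mapping from Grafana's webhook payload onto the canonical envelope. This is a hypothetical sketch only: the real mapping lives in the ingest API, and the Grafana keys used here (`commonLabels`, `alerts`, `startsAt`) are the common webhook fields, not a guarantee of what the platform reads:

```python
def normalize_grafana_webhook(payload: dict) -> dict:
    """Map a Grafana webhook payload onto an AlertEnvelope-like dict (illustrative)."""
    labels = payload.get("commonLabels", {})
    first = (payload.get("alerts") or [{}])[0]
    return {
        "source": "grafana",
        "severity": labels.get("severity", "warning"),
        "incident_key": labels.get("alertname", "unknown"),
        "entity_ids": [labels["service"]] if "service" in labels else [],
        "timestamps": {"triggered_at": first.get("startsAt")},
        "raw_payload": payload,
    }


envelope = normalize_grafana_webhook({
    "status": "firing",
    "commonLabels": {"alertname": "checkout-errors", "severity": "critical", "service": "checkout"},
    "alerts": [{"startsAt": "2024-01-01T00:00:00Z"}],
})
```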
- Start the local Jaeger MCP sidecar:
  `docker compose -f docker-compose.local.yml -f docker-compose.jaeger-mcp.local.yml up -d --build jaeger-mcp`
  or `make compose-up-jaeger-mcp`
- Register in Settings -> MCP Server Registry:
  - `server_id`: `jaeger`
  - `base_url`: `http://jaeger-mcp:8000/mcp`
  - `secret_ref_key`: leave blank (the local sidecar has no auth by default)
- Click Test and Load Tools.
- Optional: start both local MCP sidecars together: `make compose-up-all-mcp`
- Re-register both MCP servers in ingest settings after a backend restart: `make bootstrap-local-mcp`
Run the backend tests:

```shell
PYTHONPATH=. .venv/bin/pytest -q
```

Build the web UI:

```shell
cd services/web-ui && npm run build
```

- Helm chart: `charts/rca-platform`
- CRDs: `ConnectorConfig`, `ModelRoute`, `InvestigationPolicy`, `CatalogSource`, `EvalPolicy`
- API summary: `docs/api.md`
- Architecture: `docs/architecture.md`
- OpenAPI: `docs/openapi.yaml`