TARSy is an intelligent Site Reliability Engineering system that automatically processes alerts through sequential agent chains, retrieves runbooks, and uses MCP (Model Context Protocol) servers to gather system information for comprehensive multi-stage incident analysis.
Inspired by the spirit of sci-fi AI, TARSy is your reliable companion for SRE operations.
- README.md: This file - project overview and quick start
- docs/architecture-overview.md: High-level architecture concepts and design principles
- docs/functional-areas-design.md: Functional areas design and architecture documentation
Before running TARSy, ensure you have the following tools installed:
- Python 3.13+ - Core backend runtime
- Node.js 18+ - Frontend development and build tools
- npm - Node.js package manager (comes with Node.js)
- uv - Modern Python package and project manager
  - Install: `pip install uv`
  - Alternative: `curl -LsSf https://astral.sh/uv/install.sh | sh`
Quick Check: Run `make check-prereqs` to verify all prerequisites are installed.
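If `make check-prereqs` is not available in your checkout, a quick manual check works too (assuming the standard CLI entry points for each tool):

```bash
# Verify each prerequisite is on PATH and meets the minimum version
python3 --version   # expect 3.13 or newer
node --version      # expect v18 or newer
npm --version
uv --version
```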
```bash
# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys (REQUIRED)
#    Edit backend/.env and set your API keys:
#    - GOOGLE_API_KEY (get from https://aistudio.google.com/app/apikey)
#    - GITHUB_TOKEN (get from https://github.com/settings/tokens)

# 3. Ensure Kubernetes/OpenShift access (REQUIRED)
#    See the [K8s Access Requirements](#k8s-access-reqs) section below for details

# 4. Start all services
make dev
```
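For step 2, a minimal `backend/.env` sketch (only the two API keys above are named by this README; `KUBECONFIG` is optional and covered in the K8s Access Requirements section below):

```bash
# backend/.env — minimal sketch; key names taken from the setup steps above
GOOGLE_API_KEY=your-google-ai-studio-key   # https://aistudio.google.com/app/apikey
GITHUB_TOKEN=your-github-token             # https://github.com/settings/tokens

# Optional: point TARSy at a specific cluster (see K8s Access Requirements)
KUBECONFIG=/path/to/your/kubeconfig
```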
Services will be available at:
- TARSy Dashboard: http://localhost:5173
- Manual Alert Submission: http://localhost:5173/submit-alert
- Backend API: http://localhost:8000 (docs at /docs)

Stop all services: `make stop`
- Configuration-Based Agents: Deploy new agents and chain definitions via YAML configuration without code changes
- Flexible Alert Processing: Accept arbitrary JSON payloads from any monitoring system
- Chain-Based Agent Architecture: Specialized agents with domain-specific tools and AI reasoning working in coordinated stages
- Comprehensive Audit Trail: Complete visibility into chain processing workflows with stage-level timeline reconstruction
- SRE Dashboard: Real-time monitoring and historical analysis with interactive chain timeline visualization
- Data Masking: Automatic protection of sensitive data in logs and responses
TARSy uses an AI-powered, chain-based architecture: alerts flow through sequential stages of specialized agents, each building on the previous stage's work and using domain-specific tools to produce comprehensive expert recommendations for engineers.
For high-level architecture concepts: See the [Architecture Overview](docs/architecture-overview.md)
- Alert arrives from monitoring systems with flexible JSON payload
- Orchestrator selects appropriate agent chain based on alert type
- Runbook downloaded automatically from GitHub for chain guidance
- Sequential stages execute, with each agent building on the previous stage's data and using AI to select and execute domain-specific tools
- Comprehensive multi-stage analysis provided to engineers with actionable recommendations
- Full audit trail captured with stage-level detail for monitoring and continuous improvement
```mermaid
sequenceDiagram
    participant MonitoringSystem
    participant Orchestrator
    participant AgentChains
    participant GitHub
    participant AI
    participant MCPServers
    participant Dashboard
    participant Engineer

    MonitoringSystem->>Orchestrator: Send Alert
    Orchestrator->>AgentChains: Assign Alert & Context
    AgentChains->>GitHub: Download Runbook
    loop Investigation Loop
        AgentChains->>AI: Investigate with LLM
        AI->>MCPServers: Query/Actuate as needed
    end
    AgentChains->>Dashboard: Send Analysis & Recommendations
    Engineer->>Dashboard: Review & Take Action
```
- Start All Services: Run `make dev` to start the backend and dashboard
- Submit an Alert: Use Manual Alert Submission at http://localhost:5173/submit-alert for testing TARSy in the dev environment
- Monitor via Dashboard: Watch real-time progress updates and historical analysis at http://localhost:5173
- View Results: See detailed processing timelines and comprehensive LLM analysis
- Stop Services: Run `make stop` when finished
Tip: Use `make urls` to see all available service endpoints and `make status` to check which services are running.
For testing with real OAuth authentication:
```bash
# Start all services with OAuth2-proxy authentication
make dev-auth-full
```

This mode adds an OAuth2-Proxy authentication layer for development testing.
For OAuth2-proxy setup instructions: See docs/oauth2-proxy-setup.md
The system now supports flexible alert types from any monitoring source:
- Kubernetes Agent: Processes alerts from Kubernetes clusters (namespaces, pods, services, etc.)
- Any Monitoring System: Accepts arbitrary JSON payloads from Prometheus, AWS CloudWatch, ArgoCD, Datadog, etc.
- Agent-Agnostic Processing: New alert types can be added by creating specialized agents and updating the agent registry
- LLM-Driven Analysis: Agents intelligently interpret any alert data structure without code changes to the core system
The LLM-driven approach with flexible data structures means diverse alert types can be handled from any monitoring source, as long as:
- A runbook exists for the alert type
- An appropriate specialized agent is available or can be created
- The MCP servers have relevant tools for the monitoring domain
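As a sketch, submitting an alert from an arbitrary monitoring source could look like the following; the `POST /alerts` endpoint appears in the API reference below, but the payload field names here are hypothetical, since TARSy accepts arbitrary JSON:

```bash
# Hypothetical payload: field names are illustrative, not a fixed schema —
# TARSy accepts arbitrary JSON from any monitoring source.
curl -X POST http://localhost:8000/alerts \
  -H "Content-Type: application/json" \
  -d '{
        "alert_type": "NamespaceTerminating",
        "data": {
          "cluster": "prod-east",
          "namespace": "payments",
          "message": "Namespace stuck in Terminating state"
        }
      }'
```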
TARSy requires read-only access to a Kubernetes or OpenShift cluster to analyze and troubleshoot Kubernetes infrastructure issues. The system uses the kubernetes-mcp-server, which connects to your cluster via kubeconfig.
TARSy does not use `oc` or `kubectl` commands directly. Instead, it:
- Uses Kubernetes MCP Server: Runs `kubernetes-mcp-server@latest` via npm
- Reads kubeconfig: Authenticates using your existing kubeconfig file
- Read-Only Operations: Configured with the `--read-only --disable-destructive` flags (see the standalone invocation below)
- No Modifications: Cannot create, update, or delete cluster resources
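For reference, the same server can be launched standalone with the flags listed above (assuming the default kubeconfig location):

```bash
# Mirror TARSy's read-only invocation of the Kubernetes MCP server
npx -y kubernetes-mcp-server@latest \
  --kubeconfig ~/.kube/config \
  --read-only --disable-destructive
```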
If you're already logged into your OpenShift/Kubernetes cluster:
```bash
# Verify your current access
oc whoami
oc cluster-info

# TARSy will automatically use your current kubeconfig
# Default location: ~/.kube/config or $KUBECONFIG
```
To use a specific kubeconfig file:
```bash
# Set in backend/.env
KUBECONFIG=/path/to/your/kubeconfig

# Or set environment variable
export KUBECONFIG=/path/to/your/kubeconfig
```
Common Issues:
```bash
# Check kubeconfig validity
oc cluster-info

# Verify TARSy can access the cluster:
# check backend logs for kubernetes-mcp-server errors
tail -f backend/logs/tarsy.log | grep kubernetes

# Test kubernetes-mcp-server independently
npx -y kubernetes-mcp-server@latest --kubeconfig ~/.kube/config --help
```
Permission Errors:
- Ensure your user/service account has at least the `view` cluster role (see the example below)
- Verify the kubeconfig points to the correct cluster
- Check network connectivity to the cluster API server
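If access is missing, granting the built-in read-only `view` role is usually enough; for example (the username `tarsy-user` is a hypothetical placeholder, and you need cluster-admin rights to grant roles):

```bash
# OpenShift: grant the built-in read-only "view" cluster role
oc adm policy add-cluster-role-to-user view tarsy-user

# Plain Kubernetes equivalent
kubectl create clusterrolebinding tarsy-view \
  --clusterrole=view --user=tarsy-user
```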
- `GET /` - Health check endpoint
- `GET /health` - Comprehensive health check with service status
- `POST /alerts` - Submit a new alert for processing
- `GET /alert-types` - Get supported alert types
- `GET /processing-status/{alert_id}` - Get processing status
- `WebSocket /ws/{alert_id}` - Real-time progress updates
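For example, checking service health and polling an alert's status from the command line (`<alert_id>` is a placeholder for your alert's id):

```bash
# Comprehensive health check with service status
curl http://localhost:8000/health

# Poll processing status for a submitted alert
curl http://localhost:8000/processing-status/<alert_id>
```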
- `GET /api/v1/history/sessions` - List alert processing sessions with filtering and pagination
- `GET /api/v1/history/sessions/{session_id}` - Get detailed session with chronological timeline
- `GET /api/v1/history/health` - History service health check and database status
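As a sketch (the pagination parameters below are assumptions, not confirmed names; the interactive docs at http://localhost:8000/docs list the actual query parameters):

```bash
# List processing sessions — the page/page_size params are hypothetical
curl "http://localhost:8000/api/v1/history/sessions?page=1&page_size=20"

# Fetch one session's detailed chronological timeline
curl http://localhost:8000/api/v1/history/sessions/<session_id>
```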
- Alert Types: Define any alert type in `config/agents.yaml` - no hardcoding required, just create corresponding runbooks
- MCP Servers: Update the `mcp_servers` configuration in `settings.py` or define it in `config/agents.yaml`
- Agents: Create traditional hardcoded agent classes extending `BaseAgent`, or define configuration-based agents in `config/agents.yaml` (see the sketch after this list)
- LLM Providers: Built-in providers work out of the box (OpenAI, Google, xAI, Anthropic). Add custom providers via `config/llm_providers.yaml` for proxy configurations or model overrides
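As an illustration of the configuration-based route, a new agent and chain definition might be appended like this; every YAML key shown is a hypothetical placeholder, not TARSy's actual schema (see the Architecture Overview for the real format):

```bash
# Illustrative only: these YAML keys are hypothetical, not TARSy's actual schema.
cat >> config/agents.yaml <<'EOF'
agents:
  database-agent:            # hypothetical agent name
    mcp_servers:
      - postgres-mcp         # hypothetical MCP server entry
alert_types:
  DatabaseSlowQuery:         # hypothetical alert type
    chain: [database-agent]  # hypothetical chain definition
EOF
```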
For detailed extensibility examples: See the Extensibility section in the [Architecture Overview](docs/architecture-overview.md)
```bash
# Run back-end and front-end (dashboard) tests
make test
```
The test suite includes comprehensive end-to-end integration tests covering the complete alert processing pipeline, agent specialization, error handling, and performance scenarios with full mocking of external services.