TARSy is an intelligent Site Reliability Engineering system that automatically processes alerts through sequential agent chains, retrieves runbooks, and uses MCP (Model Context Protocol) servers to gather system information for comprehensive multi-stage incident analysis.
Inspired by the spirit of sci-fi AI, TARSy is your reliable companion for SRE operations.
- README.md: This file - project overview and quick start
- docs/architecture-overview.md: High-level architecture concepts and design principles
- docs/functional-areas-design.md: Functional areas design and architecture documentation
Before running TARSy, ensure you have the following tools installed:
- Python 3.13+ - Core backend runtime
- Node.js 18+ - Frontend development and build tools
- npm - Node.js package manager (comes with Node.js)
- uv - Modern Python package and project manager
  - Install: `pip install uv`
  - Alternative: `curl -LsSf https://astral.sh/uv/install.sh | sh`
Quick Check: Run `make check-prereqs` to verify all prerequisites are installed.
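If `make check-prereqs` is not available in your checkout, a quick manual check works too (assuming the standard CLI entry points for each tool):

```bash
# Verify each prerequisite is on PATH and meets the minimum version
python3 --version   # expect 3.13 or newer
node --version      # expect v18 or newer
npm --version
uv --version
```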
```bash
# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys (REQUIRED)
#    Edit backend/.env and set your API keys:
#    - GOOGLE_API_KEY (get from https://aistudio.google.com/app/apikey)
#    - GITHUB_TOKEN (get from https://github.com/settings/tokens)

# 3. Ensure Kubernetes/OpenShift access (REQUIRED)
#    See the [K8s Access Requirements](#k8s-access-reqs) section below for details

# 4. Start all services
make dev
```
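For step 2, a minimal `backend/.env` sketch (only the two API keys above are named by this README; `KUBECONFIG` is optional and covered in the K8s Access Requirements section below):

```bash
# backend/.env — minimal sketch; key names taken from the setup steps above
GOOGLE_API_KEY=your-google-ai-studio-key   # https://aistudio.google.com/app/apikey
GITHUB_TOKEN=your-github-token             # https://github.com/settings/tokens

# Optional: point TARSy at a specific cluster (see K8s Access Requirements)
KUBECONFIG=/path/to/your/kubeconfig
```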
Services will be available at:
- TARSy Dashboard: http://localhost:5173
- Manual Alert Submission: http://localhost:5173/submit-alert
- Backend API: http://localhost:8000 (docs at /docs)

Stop all services: `make stop`
- Configuration-Based Agents: Deploy new agents and chain definitions via YAML configuration without code changes
- Flexible Alert Processing: Accept arbitrary JSON payloads from any monitoring system
- Chain-Based Agent Architecture: Specialized agents with domain-specific tools and AI reasoning working in coordinated stages
- Comprehensive Audit Trail: Complete visibility into chain processing workflows with stage-level timeline reconstruction
- SRE Dashboard: Real-time monitoring and historical analysis with interactive chain timeline visualization
- Data Masking: Automatic protection of sensitive data in logs and responses
TARSy uses an AI-powered, chain-based architecture: alerts flow through sequential stages of specialized agents, each building on the previous stage's work and using domain-specific tools to produce comprehensive expert recommendations for engineers.
For high-level architecture concepts: See the [Architecture Overview](docs/architecture-overview.md)
- Alert arrives from monitoring systems with flexible JSON payload
- Orchestrator selects appropriate agent chain based on alert type
- Runbook downloaded automatically from GitHub for chain guidance
- Sequential stages execute, with each agent building on the previous stage's data and using AI to select and execute domain-specific tools
- Comprehensive multi-stage analysis provided to engineers with actionable recommendations
- Full audit trail captured with stage-level detail for monitoring and continuous improvement
```mermaid
sequenceDiagram
    participant MonitoringSystem
    participant Orchestrator
    participant AgentChains
    participant GitHub
    participant AI
    participant MCPServers
    participant Dashboard
    participant Engineer

    MonitoringSystem->>Orchestrator: Send Alert
    Orchestrator->>AgentChains: Assign Alert & Context
    AgentChains->>GitHub: Download Runbook
    loop Investigation Loop
        AgentChains->>AI: Investigate with LLM
        AI->>MCPServers: Query/Actuate as needed
    end
    AgentChains->>Dashboard: Send Analysis & Recommendations
    Engineer->>Dashboard: Review & Take Action
```
- Start All Services: Run `make dev` to start the backend and dashboard
- Submit an Alert: Use Manual Alert Submission at http://localhost:5173/submit-alert for testing TARSy in the dev environment
- Monitor via Dashboard: Watch real-time progress updates and historical analysis at http://localhost:5173
- View Results: See detailed processing timelines and comprehensive LLM analysis
- Stop Services: Run `make stop` when finished
Tip: Use `make urls` to see all available service endpoints and `make status` to check which services are running.
For testing with real OAuth authentication:
```bash
# Start all services with OAuth2-proxy authentication
make dev-auth-full
```

This mode adds an OAuth2-Proxy authentication layer for development testing.
For OAuth2-proxy setup instructions: See docs/oauth2-proxy-setup.md
The system now supports flexible alert types from any monitoring source:
- Kubernetes Agent: Processes alerts from Kubernetes clusters (namespaces, pods, services, etc.)
- Any Monitoring System: Accepts arbitrary JSON payloads from Prometheus, AWS CloudWatch, ArgoCD, Datadog, etc.
- Agent-Agnostic Processing: New alert types can be added by creating specialized agents and updating the agent registry
- LLM-Driven Analysis: Agents intelligently interpret any alert data structure without code changes to the core system
The LLM-driven approach with flexible data structures means diverse alert types can be handled from any monitoring source, as long as:
- A runbook exists for the alert type
- An appropriate specialized agent is available or can be created
- The MCP servers have relevant tools for the monitoring domain
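As a sketch, submitting an alert from an arbitrary monitoring source could look like the following; the `POST /alerts` endpoint appears in the API reference below, but the payload field names here are hypothetical, since TARSy accepts arbitrary JSON:

```bash
# Hypothetical payload: field names are illustrative, not a fixed schema —
# TARSy accepts arbitrary JSON from any monitoring source.
curl -X POST http://localhost:8000/alerts \
  -H "Content-Type: application/json" \
  -d '{
        "alert_type": "NamespaceTerminating",
        "data": {
          "cluster": "prod-east",
          "namespace": "payments",
          "message": "Namespace stuck in Terminating state"
        }
      }'
```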
TARSy requires read-only access to a Kubernetes or OpenShift cluster to analyze and troubleshoot Kubernetes infrastructure issues. The system uses the kubernetes-mcp-server, which connects to your cluster via kubeconfig.
TARSy does not use `oc` or `kubectl` commands directly. Instead, it:
- Uses Kubernetes MCP Server: Runs `kubernetes-mcp-server@latest` via npm
- Reads kubeconfig: Authenticates using your existing kubeconfig file
- Read-Only Operations: Configured with the `--read-only --disable-destructive` flags (see the standalone invocation below)
- No Modifications: Cannot create, update, or delete cluster resources
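For reference, the same server can be launched standalone with the flags listed above (assuming the default kubeconfig location):

```bash
# Mirror TARSy's read-only invocation of the Kubernetes MCP server
npx -y kubernetes-mcp-server@latest \
  --kubeconfig ~/.kube/config \
  --read-only --disable-destructive
```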
If you're already logged into your OpenShift/Kubernetes cluster:
```bash
# Verify your current access
oc whoami
oc cluster-info

# TARSy will automatically use your current kubeconfig
# Default location: ~/.kube/config or $KUBECONFIG
```
To use a specific kubeconfig file:
```bash
# Set in backend/.env
KUBECONFIG=/path/to/your/kubeconfig

# Or set environment variable
export KUBECONFIG=/path/to/your/kubeconfig
```
Common Issues:
```bash
# Check kubeconfig validity
oc cluster-info

# Verify TARSy can access the cluster:
# check backend logs for kubernetes-mcp-server errors
tail -f backend/logs/tarsy.log | grep kubernetes

# Test kubernetes-mcp-server independently
npx -y kubernetes-mcp-server@latest --kubeconfig ~/.kube/config --help
```
Permission Errors:
- Ensure your user/service account has at least the `view` cluster role (see the example below)
- Verify the kubeconfig points to the correct cluster
- Check network connectivity to the cluster API server
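If access is missing, granting the built-in read-only `view` role is usually enough; for example (the username `tarsy-user` is a hypothetical placeholder, and you need cluster-admin rights to grant roles):

```bash
# OpenShift: grant the built-in read-only "view" cluster role
oc adm policy add-cluster-role-to-user view tarsy-user

# Plain Kubernetes equivalent
kubectl create clusterrolebinding tarsy-view \
  --clusterrole=view --user=tarsy-user
```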
- `GET /` - Health check endpoint
- `GET /health` - Comprehensive health check with service status
- `POST /alerts` - Submit a new alert for processing
- `GET /alert-types` - Get supported alert types
- `GET /processing-status/{alert_id}` - Get processing status
- `WebSocket /ws/{alert_id}` - Real-time progress updates
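For example, checking service health and polling an alert's status from the command line (`<alert_id>` is a placeholder for your alert's id):

```bash
# Comprehensive health check with service status
curl http://localhost:8000/health

# Poll processing status for a submitted alert
curl http://localhost:8000/processing-status/<alert_id>
```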
- `GET /api/v1/history/sessions` - List alert processing sessions with filtering and pagination
- `GET /api/v1/history/sessions/{session_id}` - Get detailed session with chronological timeline
- `GET /api/v1/history/health` - History service health check and database status
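As a sketch (the pagination parameters below are assumptions, not confirmed names; the interactive docs at http://localhost:8000/docs list the actual query parameters):

```bash
# List processing sessions — the page/page_size params are hypothetical
curl "http://localhost:8000/api/v1/history/sessions?page=1&page_size=20"

# Fetch one session's detailed chronological timeline
curl http://localhost:8000/api/v1/history/sessions/<session_id>
```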
- Alert Types: Define any alert type in `config/agents.yaml` - no hardcoding required, just create corresponding runbooks
- MCP Servers: Update the `mcp_servers` configuration in `settings.py` or define it in `config/agents.yaml`
- Agents: Create traditional hardcoded agent classes extending `BaseAgent`, or define configuration-based agents in `config/agents.yaml` (see the sketch after this list)
- LLM Providers: Built-in providers work out of the box (OpenAI, Google, xAI, Anthropic). Add custom providers via `config/llm_providers.yaml` for proxy configurations or model overrides
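As an illustration of the configuration-based route, a new agent and chain definition might be appended like this; every YAML key shown is a hypothetical placeholder, not TARSy's actual schema (see the Architecture Overview for the real format):

```bash
# Illustrative only: these YAML keys are hypothetical, not TARSy's actual schema.
cat >> config/agents.yaml <<'EOF'
agents:
  database-agent:            # hypothetical agent name
    mcp_servers:
      - postgres-mcp         # hypothetical MCP server entry
alert_types:
  DatabaseSlowQuery:         # hypothetical alert type
    chain: [database-agent]  # hypothetical chain definition
EOF
```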
For detailed extensibility examples: See the Extensibility section in the [Architecture Overview](docs/architecture-overview.md)
```bash
# Run back-end and front-end (dashboard) tests
make test
```
The test suite includes comprehensive end-to-end integration tests covering the complete alert processing pipeline, agent specialization, error handling, and performance scenarios with full mocking of external services.