Commit 029dc6a

add files
1 parent 758b1d2 commit 029dc6a

7 files changed

+291
-1
lines changed


ARCHITECTURE.md

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# ChaosPilot System Architecture

## Overview
ChaosPilot is a full-stack AI platform for log analysis, incident detection, and automated remediation. It leverages Google Agent Development Kit (ADK), Google Cloud services (BigQuery, Logging, AI Platform, Gemini), and a modular multi-agent architecture. The system is designed for security, extensibility, and modern DevOps.

---

## 1. High-Level Architecture

- **Frontend:** Angular SPA (TypeScript, TailwindCSS, RxJS)
- **Backend:** Python (FastAPI, async/await, Google ADK)
- **Agents:** Main agent manager orchestrates multiple ADK-compliant sub-agents (detector, planner, fixer, notifier, action recommender)
- **Data/AI:** Google BigQuery, Cloud Logging, Gemini LLM (Google AI Platform)
- **Authentication:** Supabase (user/session management)
- **DevOps:** Docker, `uv`, `hatch`, GCP deployment scripts

---

## 2. Google Agent Development Kit (ADK) Usage

- **Core Orchestration:**
  - All agent logic is built using ADK's async runtime and event-driven patterns.
  - The main agent manager (`/agent_manager/agent.py`) coordinates sub-agents, each inheriting from ADK base classes.
  - Sub-agents (in `/agent_manager/sub_agents/`) handle specialized tasks (detection, planning, fixing, notification, recommendations).
- **Toolbox Integration:**
  - `/mcp-toolbox/tools.yaml` defines tools and toolsets in ADK schema, enabling dynamic tool invocation and chaining.
- **Schema Compliance:**
  - All tool and agent definitions are kept in sync with ADK's open source schema, ensuring compatibility and reliability.
- **Open Source Contribution:**
  - Refactors and schema corrections to `tools.yaml` and agent code are suitable for upstream contribution to the ADK open source project.

---

## 3. Multi-Agent Orchestration

- **Agent Manager:**
  - Receives user requests and delegates to specialized sub-agents.
- **Agent Handoffs:**
  - Workflows are designed for agent handoff (e.g., detector → planner → fixer/notifier).
- **Dynamic Toolsets:**
  - Each agent can invoke tools from the ADK toolbox, with toolsets defined per agent type.
- **Frontend Visualization:**
  - The Angular frontend visualizes multi-agent workflows, showing handoffs, function calls, and responses in the chat UI.
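
The detector → planner → fixer/notifier handoff described above can be sketched in plain `asyncio`. This is an illustration only, not the project's ADK code: `Incident`, `detector`, `planner`, `fixer`, and `run_pipeline` are hypothetical names, and the `"ERROR"` substring check merely stands in for real log analysis.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Shared state passed from one agent to the next (hypothetical shape)."""
    logs: list[str]
    findings: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    trail: list[str] = field(default_factory=list)  # records each handoff


async def detector(incident: Incident) -> Incident:
    # Stand-in for real detection: flag every line containing "ERROR".
    incident.findings = [line for line in incident.logs if "ERROR" in line]
    incident.trail.append("detector")
    return incident


async def planner(incident: Incident) -> Incident:
    # One remediation step per finding.
    incident.plan = [f"investigate: {finding}" for finding in incident.findings]
    incident.trail.append("planner")
    return incident


async def fixer(incident: Incident) -> Incident:
    # Placeholder: a real fixer would act on incident.plan.
    incident.trail.append("fixer")
    return incident


async def run_pipeline(logs: list[str]) -> Incident:
    """Agent-manager role: run sub-agents in order, stopping early if nothing was detected."""
    incident = await detector(Incident(logs=logs))
    if not incident.findings:
        return incident
    for agent in (planner, fixer):
        incident = await agent(incident)
    return incident


result = asyncio.run(run_pipeline(["INFO boot", "ERROR db timeout"]))
```

The `trail` list is the kind of record a chat UI could use to display handoffs; how the real frontend obtains that information is not shown in this commit.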

---

## 4. Google Cloud & AI Services (including Gemini)

- **BigQuery:**
  - Stores logs and incident data. Agents query BigQuery for analytics and context retrieval.
- **Cloud Logging:**
  - Ingests and manages raw logs. Scripts in `/scripts/` support log injection and management.
- **Gemini LLM (Google AI Platform):**
  - Backend calls Gemini for advanced log analysis, incident classification, and remediation planning.
  - All LLM calls are backend-only. There is currently no retrieval-augmented generation (RAG) pipeline, embedding generation, or vector similarity search implemented in the codebase. If RAG is implemented in the future, it will follow strict security and privacy guidelines.
- **ADK Toolbox:**
  - All tools and toolsets are defined for use by agents, ensuring schema compliance and dynamic extensibility.

---

## 5. Security & Best Practices

- All API keys and secrets are stored in environment variables.
- No direct client access to LLM APIs.
- All communication is over HTTPS.
- Supabase authentication for all sensitive routes.
- Input/output sanitization, rate limiting, and audit logging at every step.
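
The list above names rate limiting without showing a mechanism. As one possible shape, here is a minimal sliding-window limiter keyed by user. The class name, parameters, and injectable clock are assumptions made for illustration; nothing here is taken from the ChaosPilot codebase.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key (e.g. a user ID)."""

    def __init__(self, limit: int, window: float, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self._calls = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key: str) -> bool:
        now = self.clock()
        calls = self._calls[key]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.limit:
            return False
        calls.append(now)
        return True
```

A backend route could call `allow(user_id)` before invoking the agent manager and reject the request when it returns `False`. State is held in memory, so this sketch only suits a single-process deployment.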

---

## 6. DevOps & Deployment

- **Local Development:** Use `uv` or `hatch` for environment management, run the backend with `uvicorn` and the frontend with the Angular CLI.
- **Production:** Build a Docker image, deploy to a cloud provider (GCP, Azure, etc.), and use managed databases and secure secret storage.
- **Scripts:** `/scripts` for GCP setup, IAM, log injection, etc.

---

## 7. Example End-to-End Flow

1. User logs in via Supabase (Angular frontend).
2. User triggers an action (e.g., "Analyze Error Logs").
3. Frontend sends authenticated request to FastAPI backend.
4. Backend authenticates and invokes the main ADK agent.
5. Agent manager delegates to the appropriate sub-agent.
6. Sub-agent queries BigQuery, retrieves relevant logs, and may send those logs or summaries to the LLM (Gemini or Azure OpenAI) for analysis.
7. Agent manager may hand off to other agents as needed.
8. Backend streams response to frontend, which visualizes the multi-agent workflow.

---

## 8. Google Tech, Open Source, and Published Content

- **Google Tech:**
  - Deep integration with Google Cloud (BigQuery, Logging, AI Platform, Gemini).
  - Full adoption of Google ADK for agent orchestration and tool management.
- **Open Source:**
  - Refactored and schema-corrected `tools.yaml` and agent code are suitable for contribution to the ADK open source project.
- **Published Content:**
  - The project journal (`xREADME.md`) and documentation provide a transparent record of technical decisions, suitable for publication as a case study or blog post.

---

## 9. Summary Table

| Layer    | Tech/Service         | Key Files/Dirs                | Google/ADK Usage                          |
|----------|----------------------|-------------------------------|-------------------------------------------|
| Frontend | Angular, Tailwind    | `/frontend/src/app/`          | Visualizes multi-agent ADK flows          |
| Backend  | FastAPI, ADK, Python | `/main.py`, `/agent_manager/` | ADK async agents, tool orchestration      |
| Data/AI  | BigQuery, Gemini     | `/mcp-toolbox/tools.yaml`     | BigQuery queries, Gemini LLM, ADK toolbox |
| Auth     | Supabase             | `/frontend`, `/main.py`       | -                                         |
| DevOps   | Docker, uv, hatch    | `/Dockerfile`, `/scripts/`    | GCP deployment scripts                    |

---

## 10. Visual Diagram

```
graph TD
  subgraph Frontend (Angular)
    A1["User<br/>Browser"]
    A2["Angular App<br/>(SPA)"]
  end
  subgraph Backend (Python/FastAPI)
    B1["API Gateway<br/>(FastAPI/Uvicorn)"]
    B2["Agent Manager"]
    B3["Sub-Agents<br/>(Detector, Planner, Fixer, etc.)"]
    B4["Session & Auth Service"]
    B6["BigQuery/Logging Service"]
  end
  subgraph Cloud & Data
    C1["Google BigQuery"]
    C2["Google Cloud Logging"]
    C3["Google AI Platform (Gemini)"]
    C4["Supabase<br/>(Auth, DB)"]
  end
  subgraph DevOps
    D1["Docker"]
    D2["CI/CD"]
  end

  A1-->|HTTPS|A2
  A2-->|REST/WebSocket|B1
  B1-->|Auth|B4
  B1-->|Agent Requests|B2
  B2-->|Delegate|B3
  B3-->|Data|B6
  B6-->|Query|C1
  B6-->|Logs|C2
  B2-->|LLM Analysis|C3
  B4-->|User/Session|C4
  B1-->|Streamed Response|A2
  D1-->|Containerize|B1
  D2-->|Deploy|D1
```

---

**Note:**
- There is currently no RAG/embedding/vector similarity service implemented. If this is a future goal, it will be added in a later version and clearly documented as such.

For more details, see the project journal (`xREADME.md`) and codebase documentation.

HOW_IT_WORKS.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# How ChaosPilot Works: Step-by-Step Script

This document provides a clear, step-by-step walkthrough of how the ChaosPilot application operates, from user interaction in the frontend to agent orchestration and AI analysis in the backend. Use this as a guide for onboarding, demos, or understanding the system flow.

---

## 1. User Login & Authentication

- The user navigates to the ChaosPilot web app (Angular frontend).
- The app prompts the user to log in using Supabase authentication.
- Upon successful login, the user is granted access to the dashboard, chat, history, and settings.

---

## 2. Initiating an AI Workflow (Example: Analyze Error Logs)

- The user sees a set of quick action buttons (e.g., "Analyze Error Logs", "Classify Incident", "Generate Fix Plan").
- The user clicks "Analyze Error Logs".
- The frontend sends an authenticated request to the backend (FastAPI server) to start the analysis workflow.

---

## 3. Backend Agent Orchestration

- The backend receives the request and verifies the user's authentication (via Supabase).
- The main agent manager (using Google ADK) is invoked.
- The agent manager delegates the task to the appropriate sub-agent (e.g., the detector agent for log analysis).

---

## 4. Data Query & AI Analysis

- The sub-agent queries Google BigQuery for recent error logs.
- The relevant logs or summaries are prepared for analysis.
- The backend sends the prepared data to the selected LLM (Gemini or Azure OpenAI) for advanced analysis and insights.
- The LLM returns its analysis (e.g., detected patterns, incident classification, recommendations).
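
The "prepared for analysis" step is not spelled out in this document. One plausible form, sketched here under that assumption, collapses duplicate log lines and caps the payload size before anything is sent to the LLM. `prepare_logs` and its `max_chars` budget are hypothetical names, not code from this commit.

```python
from collections import Counter


def prepare_logs(lines: list[str], max_chars: int = 2000) -> str:
    """Collapse duplicate log lines and cap the total size of the prompt payload."""
    counts = Counter(line.strip() for line in lines if line.strip())
    parts: list[str] = []
    used = 0
    # Most frequent messages first, so the budget goes to the dominant errors.
    for message, count in counts.most_common():
        entry = f"[x{count}] {message}"
        if used + len(entry) + 1 > max_chars:
            break
        parts.append(entry)
        used += len(entry) + 1  # +1 for the newline separator
    return "\n".join(parts)


payload = prepare_logs(["db timeout", "db timeout", "disk full", ""])
```

Deduplicating first keeps repeated errors from crowding out rarer ones, and the character cap is a crude stand-in for a real token budget.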

---

## 5. Multi-Agent Workflow (if needed)

- If the workflow requires further steps (e.g., generating a fix plan, recommending fixes), the agent manager hands off the task to other sub-agents (planner, fixer, etc.).
- Each sub-agent may query data, invoke tools, or call the LLM as needed.
- The results from each agent are collected and organized.

---

## 6. Streaming Results to the Frontend

- The backend streams the results of the agent workflow back to the frontend.
- The Angular app dynamically updates the chat UI, displaying:
  - Markdown-formatted analysis and reports
  - Structured data (tables, JSON)
  - Agent handoffs and function calls
  - Status updates and loading indicators
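
The wire format for streaming is not specified here. A minimal sketch, assuming newline-delimited JSON events, could look like the following; the event fields (`type`, `agent`, `text`) and both function names are illustrative, not the project's actual protocol.

```python
import asyncio
import json
from typing import AsyncIterator


async def stream_events(agent_results: list[tuple[str, str]]) -> AsyncIterator[str]:
    """Yield one newline-delimited JSON event per agent result, then a final 'done' event."""
    for agent, text in agent_results:
        yield json.dumps({"type": "agent_message", "agent": agent, "text": text}) + "\n"
        await asyncio.sleep(0)  # give other tasks a chance to run between chunks
    yield json.dumps({"type": "done"}) + "\n"


async def collect(results: list[tuple[str, str]]) -> list[dict]:
    # Stand-in for the frontend: parse each chunk as it arrives.
    return [json.loads(chunk) async for chunk in stream_events(results)]


events = asyncio.run(collect([("detector", "3 errors found"), ("planner", "restart db")]))
```

In a FastAPI backend, an async generator like `stream_events` could be wrapped in a `StreamingResponse`; whether ChaosPilot does this is not shown in the commit.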

---

## 7. User Experience & Further Actions

- The user reviews the AI-generated analysis and recommendations in the chat interface.
- The user can trigger additional actions (e.g., request a fix plan, escalate an incident, review history).
- All sensitive actions and data remain protected by authentication and backend-only processing.

---

## 8. Security & Best Practices

- All LLM/API calls are made from the backend only; the client never interacts directly with AI services.
- All communication is over HTTPS.
- User sessions and permissions are managed by Supabase.
- Raw logs and sensitive data are not exposed to the client. Log content leaves the backend only when logs or summaries are sent to the selected LLM for analysis (see step 4).

---

## 9. Summary Flow Diagram

```
User (Browser)
   │
   ▼
Angular Frontend (UI, Auth, Chat)
   │  (REST API call)
   ▼
FastAPI Backend (Python, ADK)
   │
   ├─► Agent Manager (Orchestrates sub-agents)
   │      │
   │      ├─► Detector Agent (queries BigQuery)
   │      ├─► Planner Agent (generates plans)
   │      └─► Fixer/Notifier Agents (as needed)
   │
   └─► LLM (Gemini/Azure) for analysis
   │
   ▼
Backend streams results
   │
   ▼
Angular Frontend (renders chat, tables, reports)
```

---

For more details, see the architecture and project journal files.

agent_manager/sub_agents/action_recommender/agent.py

Lines changed: 3 additions & 1 deletion
@@ -7,12 +7,14 @@
 from typing import Dict, List, Any
 from enum import Enum
 from toolbox_core import ToolboxSyncClient
+from agent_manager.config import TOOLBOX_URL
+
 from dotenv import load_dotenv
 
 
 load_dotenv()
 
-toolbox = ToolboxSyncClient("http://127.0.0.1:5000")
+toolbox = ToolboxSyncClient(TOOLBOX_URL)
 tools = toolbox.load_toolset("action_recommender_toolset")
 
 
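
The `config` module that supplies `TOOLBOX_URL` is not shown in this diff. A minimal sketch of what such a module might contain follows. The function name and the fallback behaviour are assumptions; the default simply reuses the address that was previously hardcoded.

```python
import os

# Assumed default: the address that was hardcoded in agent.py before this change.
DEFAULT_TOOLBOX_URL = "http://127.0.0.1:5000"


def get_toolbox_url(env=None) -> str:
    """Return TOOLBOX_URL from the environment, falling back to the local default."""
    env = os.environ if env is None else env
    url = env.get("TOOLBOX_URL", "").strip()
    return url or DEFAULT_TOOLBOX_URL


# Module-level constant, matching the name imported in agent.py.
TOOLBOX_URL = get_toolbox_url()
```

Because the import in `agent.py` runs before `load_dotenv()`, a module like this would need to load `.env` itself (or be imported after `load_dotenv()`) for a value defined only in `.env` to be visible.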


makefile

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Load .env file
include .env
export

deploy:
	gcloud run deploy $(AGENT_SERVICE_NAME) \
		--source . \
		--region $(GOOGLE_CLOUD_LOCATION) \
		--allow-unauthenticated \
		--port=8000 \
		--set-env-vars "GOOGLE_CLOUD_PROJECT=$(GOOGLE_CLOUD_PROJECT),GOOGLE_CLOUD_LOCATION=$(GOOGLE_CLOUD_LOCATION),GOOGLE_GENAI_USE_VERTEXAI=$(GOOGLE_GENAI_USE_VERTEXAI),MODEL=$(MODEL),TOOLBOX_URL=$(TOOLBOX_URL),GOOGLE_API_KEY=$(GOOGLE_API_KEY)"

delete:
	gcloud run services delete $(AGENT_SERVICE_NAME) --region $(GOOGLE_CLOUD_LOCATION)
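
The `--set-env-vars` value is one comma-separated list of `KEY=VALUE` pairs, so a stray space or an embedded comma is easy to introduce when the list is written by hand. As a hedged illustration, a small helper could assemble and validate that string; `build_env_vars` is a hypothetical name, and the model value in the test is a placeholder.

```python
def build_env_vars(values: dict[str, str]) -> str:
    """Join KEY=VALUE pairs with commas, rejecting input that would corrupt the list."""
    pairs = []
    for key, value in values.items():
        key = key.strip()
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid variable name: {key!r}")
        if "," in value:
            # With the default delimiter, a comma inside a value would be read as a separator.
            raise ValueError(f"{key} contains a comma, which would split the list")
        pairs.append(f"{key}={value}")
    return ",".join(pairs)
```

Generating the list this way is optional. The point is only that each key should appear with no whitespace around it.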

requirements.txt

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
google-adk==1.3.0
google-cloud-logging==3.12.1
google-cloud-bigquery==3.34.0
google-cloud-aiplatform==1.95.1
google-generativeai==0.4.1
litellm==1.72.7
toolbox-core==0.2.1
python-dotenv==1.1.0

set

Whitespace-only changes.
