| 1 | +# ChaosPilot System Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | +ChaosPilot is a full-stack AI platform for log analysis, incident detection, and automated remediation. It builds on the Google Agent Development Kit (ADK), Google Cloud services (BigQuery, Cloud Logging, AI Platform, Gemini), and a modular multi-agent architecture. The system is designed with security, extensibility, and modern DevOps practices in mind. |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## 1. High-Level Architecture |
| 9 | + |
| 10 | +- **Frontend:** Angular SPA (TypeScript, TailwindCSS, RxJS) |
| 11 | +- **Backend:** Python (FastAPI, async/await, Google ADK) |
| 12 | +- **Agents:** Main agent manager orchestrates multiple ADK-compliant sub-agents (detector, planner, fixer, notifier, action recommender) |
| 13 | +- **Data/AI:** Google BigQuery, Cloud Logging, Gemini LLM (Google AI Platform) |
| 14 | +- **Authentication:** Supabase (user/session management) |
| 15 | +- **DevOps:** Docker, `uv`, `hatch`, GCP deployment scripts |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## 2. Google Agent Development Kit (ADK) Usage |
| 20 | + |
| 21 | +- **Core Orchestration:** |
| 22 | + - All agent logic is built using ADK's async runtime and event-driven patterns. |
| 23 | + - The main agent manager (`/agent_manager/agent.py`) coordinates sub-agents, each inheriting from ADK base classes. |
| 24 | + - Sub-agents (in `/agent_manager/sub_agents/`) handle specialized tasks (detection, planning, fixing, notification, recommendations). |
| 25 | +- **Toolbox Integration:** |
| 26 | + - `/mcp-toolbox/tools.yaml` defines tools and toolsets in ADK schema, enabling dynamic tool invocation and chaining. |
| 27 | +- **Schema Compliance:** |
| 28 | + - All tool and agent definitions are kept in sync with ADK's open source schema, ensuring compatibility and reliability. |
| 29 | +- **Open Source Contribution:** |
| 30 | + - Refactors and schema corrections to `tools.yaml` and agent code are suitable for upstream contribution to the ADK open source project. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## 3. Multi-Agent Orchestration |
| 35 | + |
| 36 | +- **Agent Manager:** |
| 37 | + - Receives user requests and delegates to specialized sub-agents. |
| 38 | +- **Agent Handoffs:** |
| 39 | + - Workflows are designed for agent handoff (e.g., detector → planner → fixer/notifier). |
| 40 | +- **Dynamic Toolsets:** |
| 41 | + - Each agent can invoke tools from the ADK toolbox, with toolsets defined per agent type. |
| 42 | +- **Frontend Visualization:** |
| 43 | + - The Angular frontend visualizes multi-agent workflows, showing handoffs, function calls, and responses in the chat UI. |
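The handoff pattern described above (detector → planner → fixer/notifier) can be sketched in plain Python. This is a minimal illustration, not the project's code and not the ADK API: the `Incident` dataclass, the `AGENTS` registry, the `run_workflow` loop, and the toy detection rule are all hypothetical names introduced here. Each agent returns the name of the next agent, or `None` to end the workflow.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical shared state passed from agent to agent."""
    log_line: str
    severity: str = "unknown"
    plan: str = ""
    trail: list = field(default_factory=list)  # agents visited, in order


async def detector(incident):
    # Toy rule standing in for real detection logic (assumption).
    incident.severity = "high" if "ERROR" in incident.log_line else "low"
    return "planner"


async def planner(incident):
    # Chooses the next agent: fixer for high severity, notifier otherwise.
    if incident.severity == "high":
        incident.plan = "restart service"
        return "fixer"
    incident.plan = "notify on-call"
    return "notifier"


async def fixer(incident):
    return None  # terminal agent in this sketch


async def notifier(incident):
    return None  # terminal agent in this sketch


AGENTS = {
    "detector": detector,
    "planner": planner,
    "fixer": fixer,
    "notifier": notifier,
}


async def run_workflow(incident, start="detector"):
    """Follow handoffs until an agent returns None."""
    current = start
    while current is not None:
        incident.trail.append(current)
        current = await AGENTS[current](incident)
    return incident
```

Running `asyncio.run(run_workflow(Incident("ERROR db timeout")))` yields a `trail` of `["detector", "planner", "fixer"]`, which is the kind of handoff record a chat UI could render step by step.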
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +## 4. Google Cloud & AI Services (including Gemini) |
| 48 | + |
| 49 | +- **BigQuery:** |
| 50 | + - Stores logs and incident data. Agents query BigQuery for analytics and context retrieval. |
| 51 | +- **Cloud Logging:** |
| 52 | + - Ingests and manages raw logs. Scripts in `/scripts/` support log injection and management. |
| 53 | +- **Gemini LLM (Google AI Platform):** |
| 54 | + - Backend calls Gemini for advanced log analysis, incident classification, and remediation planning. |
| 55 | + - All LLM calls are backend-only. There is currently no retrieval-augmented generation (RAG) pipeline, embedding generation, or vector similarity search implemented in the codebase. If RAG is implemented in the future, it will follow strict security and privacy guidelines. |
| 56 | +- **ADK Toolbox:** |
| 57 | + - All tools and toolsets are defined for use by agents, ensuring schema compliance and dynamic extensibility. |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## 5. Security & Best Practices |
| 62 | + |
| 63 | +- All API keys and secrets are stored in environment variables. |
| 64 | +- No direct client access to LLM APIs. |
| 65 | +- All communication is over HTTPS. |
| 66 | +- Supabase authentication for all sensitive routes. |
| 67 | +- Input/output sanitization, rate limiting, and audit logging at every step. |
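As an illustration of the first point, a backend can read its secrets from environment variables once at startup and fail fast when any are missing. The sketch below is an assumption about how this might look, not the project's actual loader; the variable names in `REQUIRED_SECRETS` and the `load_secrets` helper are hypothetical.

```python
import os

# Hypothetical variable names; the project's real names are not documented here.
REQUIRED_SECRETS = ("GOOGLE_API_KEY", "SUPABASE_URL", "SUPABASE_KEY")


def load_secrets(env=None, required=REQUIRED_SECRETS):
    """Return the required secrets, raising if any are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report only the names, never the values, so logs stay clean.
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in required}
```

Accepting an optional `env` mapping keeps the function easy to test without touching the real process environment.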
| 68 | + |
| 69 | +--- |
| 70 | + |
| 71 | +## 6. DevOps & Deployment |
| 72 | + |
| 73 | +- **Local Development:** Use `uv` or `hatch` for environment management, run backend with `uvicorn`, frontend with Angular CLI. |
| 74 | +- **Production:** Build Docker image, deploy to cloud (GCP, Azure, etc.), use managed DBs and secure secrets. |
| 75 | +- **Scripts:** `/scripts` for GCP setup, IAM, log injection, etc. |
| 76 | + |
| 77 | +--- |
| 78 | + |
| 79 | +## 7. Example End-to-End Flow |
| 80 | + |
| 81 | +1. User logs in via Supabase (Angular frontend). |
| 82 | +2. User triggers an action (e.g., "Analyze Error Logs"). |
| 83 | +3. Frontend sends authenticated request to FastAPI backend. |
| 84 | +4. Backend authenticates and invokes the main ADK agent. |
| 85 | +5. Agent manager delegates to the appropriate sub-agent. |
| 86 | +6. Sub-agent queries BigQuery, retrieves relevant logs, and may send those logs or summaries to the LLM (Gemini or Azure) for analysis. |
| 87 | +7. Agent manager may hand off to other agents as needed. |
| 88 | +8. Backend streams response to frontend, which visualizes the multi-agent workflow. |
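Step 8 can be sketched as an async generator whose events are serialized one per line (newline-delimited JSON), a format a frontend can parse incrementally. This is only an illustration under assumptions: the event shapes, the `query_logs` tool name, and the `stream_events`, `to_ndjson`, and `collect` helpers are hypothetical, and the actual wire format used by the backend is not specified in this document.

```python
import asyncio
import json


async def stream_events(user_request):
    """Yield hypothetical workflow events in the order they occur."""
    yield {"type": "handoff", "agent": "detector"}
    await asyncio.sleep(0)  # stand-in for real async work (queries, LLM calls)
    yield {"type": "function_call", "agent": "detector", "name": "query_logs"}
    yield {"type": "response", "agent": "detector", "text": f"Analyzed: {user_request}"}


async def to_ndjson(events):
    """Serialize each event as one JSON document per line."""
    async for event in events:
        yield json.dumps(event) + "\n"


async def collect(user_request):
    """Gather the full stream into a list (useful for tests and debugging)."""
    return [line async for line in to_ndjson(stream_events(user_request))]
```

In a FastAPI backend, an async generator like `to_ndjson(...)` could be passed to a streaming response; the three event types mirror what the chat UI is described as showing (handoffs, function calls, and responses).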
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +## 8. Google Tech, Open Source, and Published Content |
| 93 | + |
| 94 | +- **Google Tech:** |
| 95 | + - Deep integration with Google Cloud (BigQuery, Logging, AI Platform, Gemini). |
| 96 | + - Full adoption of Google ADK for agent orchestration and tool management. |
| 97 | +- **Open Source:** |
| 98 | + - Refactored and schema-corrected `tools.yaml` and agent code are suitable for contribution to the ADK open source project. |
| 99 | +- **Published Content:** |
| 100 | + - The project journal (`xREADME.md`) and documentation provide a transparent record of technical decisions, suitable for publication as a case study or blog post. |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## 9. Summary Table |
| 105 | + |
| 106 | +| Layer | Tech/Service | Key Files/Dirs | Google/ADK Usage | |
| 107 | +|------------|----------------------|-------------------------------|----------------------------------| |
| 108 | +| Frontend | Angular, Tailwind | `/frontend/src/app/` | Visualizes multi-agent ADK flows | |
| 109 | +| Backend | FastAPI, ADK, Python | `/main.py`, `/agent_manager/` | ADK async agents, tool orchestration | |
| 110 | +| Data/AI | BigQuery, Gemini | `/mcp-toolbox/tools.yaml` | BigQuery queries, Gemini LLM, ADK toolbox | |
| 111 | +| Auth | Supabase | `/frontend`, `/main.py` | - | |
| 112 | +| DevOps | Docker, uv, hatch | `/Dockerfile`, `/scripts/` | GCP deployment scripts | |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## 10. Visual Diagram |
| 117 | + |
| 118 | +```mermaid |
| 119 | +graph TD |
| 120 | + subgraph Frontend["Frontend (Angular)"] |
| 121 | + A1["User<br/>Browser"] |
| 122 | + A2["Angular App<br/>(SPA)"] |
| 123 | + end |
| 124 | + subgraph Backend["Backend (Python/FastAPI)"] |
| 125 | + B1["API Gateway<br/>(FastAPI/Uvicorn)"] |
| 126 | + B2["Agent Manager"] |
| 127 | + B3["Sub-Agents<br/>(Detector, Planner, Fixer, etc.)"] |
| 128 | + B4["Session & Auth Service"] |
| 129 | + B6["BigQuery/Logging Service"] |
| 130 | + end |
| 131 | + subgraph Cloud["Cloud & Data"] |
| 132 | + C1["Google BigQuery"] |
| 133 | + C2["Google Cloud Logging"] |
| 134 | + C3["Google AI Platform (Gemini)"] |
| 135 | + C4["Supabase<br/>(Auth, DB)"] |
| 136 | + end |
| 137 | + subgraph DevOps |
| 138 | + D1["Docker"] |
| 139 | + D2["CI/CD"] |
| 140 | + end |
| 141 | +
|
| 142 | + A1-->|HTTPS|A2 |
| 143 | + A2-->|REST/WebSocket|B1 |
| 144 | + B1-->|Auth|B4 |
| 145 | + B1-->|Agent Requests|B2 |
| 146 | + B2-->|Delegate|B3 |
| 147 | + B3-->|Data|B6 |
| 148 | + B6-->|Query|C1 |
| 149 | + B6-->|Logs|C2 |
| 150 | + B2-->|LLM Analysis|C3 |
| 151 | + B4-->|User/Session|C4 |
| 152 | + B1-->|Streamed Response|A2 |
| 153 | + D1-->|Containerize|B1 |
| 154 | + D2-->|Deploy|D1 |
| 155 | +``` |
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +**Note:** |
| 160 | +- There is currently no RAG, embedding, or vector similarity service implemented. If one is added in a later version, it will be clearly documented as such. |
| 161 | + |
| 162 | +For more details, see the project journal (`xREADME.md`) and codebase documentation. |