| 1 | +# Architecture Diagrams |
| 2 | + |
| 3 | +This document contains C4 context and container diagrams, plus supporting sequence and deployment diagrams, for the Business Process Agents MVP system. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The C4 model provides a hierarchical view of the system architecture. This document covers its first two levels: |
| 8 | +- **Context**: System context and external dependencies |
| 9 | +- **Container**: High-level technology choices and container interactions |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## C4 Context Diagram |
| 14 | + |
| 15 | +The context diagram shows how the Business Process Agents platform fits into the broader enterprise ecosystem, including external actors and systems. |
| 16 | + |
| 17 | +```mermaid |
| 18 | +C4Context |
| 19 | +title Business Process Agents MVP - System Context |
| 20 | + |
| 21 | +Person(admin, "Platform Admin", "Manages agents, monitors system health, configures deployments") |
| 22 | +Person(developer, "Agent Developer", "Creates and deploys business process agents") |
| 23 | + |
| 24 | +System(bpa, "Business Process Agents Platform", "Orchestrates AI agents for business process automation using Microsoft Agent Framework") |
| 25 | + |
| 26 | +System_Ext(azureai, "Azure AI Foundry", "Provides LLMs (GPT-4, etc.) for agent reasoning") |
| 27 | +System_Ext(servicebus, "Azure Service Bus", "Message queue for input events and DLQ") |
| 28 | +System_Ext(keyvault, "Azure Key Vault", "Stores secrets and connection strings") |
| 29 | +System_Ext(targetapi, "Target Business APIs", "Downstream systems that agents interact with (e.g., Invoice API)") |
| 30 | +System_Ext(identity, "Identity Provider", "OIDC authentication (Keycloak/Entra)") |
| 31 | + |
| 32 | +Rel(admin, bpa, "Monitors and manages", "HTTPS/UI") |
| 33 | +Rel(developer, bpa, "Deploys agents", "API/UI") |
| 34 | + |
| 35 | +Rel(bpa, azureai, "Calls LLM", "HTTPS/OpenAI SDK") |
| 36 | +Rel(bpa, servicebus, "Consumes messages, publishes to DLQ", "AMQP") |
| 37 | +Rel(bpa, keyvault, "Retrieves secrets", "HTTPS") |
| 38 | +Rel(bpa, targetapi, "Invokes business logic", "HTTPS") |
| 39 | +Rel(bpa, identity, "Authenticates users", "OIDC") |
| 40 | + |
| 41 | +Rel(servicebus, bpa, "Triggers agent runs", "Event notification") |
| 42 | +``` |
| 43 | + |
| 44 | +**Key External Dependencies:** |
| 45 | + |
| 46 | +- **Azure AI Foundry**: Hosts the LLMs (e.g., GPT-4) that agents use for reasoning and decision-making via Microsoft Agent Framework (MAF) |
| 47 | +- **Azure Service Bus**: Input queue for business events (invoices, orders, etc.) and dead-letter queue for failed messages |
| 48 | +- **Azure Key Vault**: Secure storage for connection strings, API keys, and certificates |
| 49 | +- **Target Business APIs**: External REST APIs that agents call to perform business actions (e.g., creating invoices, updating records) |
| 50 | +- **Identity Provider**: OIDC provider for admin/developer authentication (Keycloak for development, Entra ID for production) |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## C4 Container Diagram |
| 55 | + |
| 56 | +The container diagram shows the internal structure of the Business Process Agents platform, including key components and their interactions. |
| 57 | + |
| 58 | +```mermaid |
| 59 | +C4Container |
| 60 | +title Business Process Agents MVP - Container View |
| 61 | + |
| 62 | +Person(admin, "Platform Admin", "Manages agents and monitors system") |
| 63 | + |
| 64 | +Container_Boundary(control, "Control Plane (Kubernetes)") { |
| 65 | + Container(api, "Control API", "ASP.NET Core + gRPC", "Manages agents, nodes, runs, and deployments") |
| 66 | + Container(scheduler, "Scheduler", "Hosted Service", "Least-loaded scheduling with placement constraints") |
| 67 | + Container(database, "PostgreSQL", "Relational Database", "Stores agents, versions, deployments, nodes, runs") |
| 68 | + Container(cache, "Redis", "In-Memory Store", "Manages leases, locks, and rate limits") |
| 69 | + Container(otel, "OTel Collector", "Telemetry Hub", "Collects and exports metrics, traces, logs") |
| 70 | + Container(ui, "Admin UI", "Next.js SPA", "Fleet dashboard, runs viewer, agent editor") |
| 71 | +} |
| 72 | + |
| 73 | +Container_Boundary(worker, "Worker Node") { |
| 74 | + Container(runtime, "Node Runtime", ".NET Worker Service", "Pulls leases, executes agents in sandboxes, reports results") |
| 75 | + Container(connectors, "Connectors SDK", ".NET Libraries", "Service Bus input, HTTP output, DLQ handling") |
| 76 | +} |
| 77 | + |
| 78 | +Container_Boundary(observability, "Observability Stack") { |
| 79 | + Container(prometheus, "Prometheus", "Metrics Store", "Stores time-series metrics") |
| 80 | + Container(tempo, "Tempo", "Trace Store", "Stores distributed traces") |
| 81 | + Container(loki, "Loki", "Log Aggregation", "Stores structured logs") |
| 82 | + Container(grafana, "Grafana", "Visualization", "Dashboards for metrics, traces, logs") |
| 83 | +} |
| 84 | + |
| 85 | +System_Ext(servicebus, "Azure Service Bus", "Message queue and DLQ") |
| 86 | +System_Ext(azureai, "Azure AI Foundry", "LLM inference") |
| 87 | +System_Ext(targetapi, "Target Business API", "Downstream systems") |
| 88 | +System_Ext(keycloak, "Keycloak/Entra", "Identity provider") |
| 89 | + |
| 90 | +Rel(admin, ui, "Uses", "HTTPS") |
| 91 | +Rel(ui, api, "Calls", "REST/gRPC") |
| 92 | +Rel(api, scheduler, "Invokes", "In-process") |
| 93 | +Rel(api, database, "Reads/Writes", "SQL") |
| 94 | +Rel(scheduler, cache, "Manages leases", "Redis protocol") |
| 95 | +Rel(scheduler, database, "Queries nodes/runs", "SQL") |
| 96 | + |
| 97 | +Rel(runtime, api, "Registers, heartbeats", "gRPC") |
| 98 | +Rel(api, runtime, "Streams leases", "gRPC") |
| 99 | +Rel(runtime, connectors, "Orchestrates", "In-process") |
| 100 | +Rel(connectors, servicebus, "Receives/Acks/Nacks", "AMQP") |
| 101 | +Rel(connectors, azureai, "Calls LLM via MAF", "HTTPS") |
| 102 | +Rel(connectors, targetapi, "Posts results", "HTTPS") |
| 103 | + |
| 104 | +Rel(runtime, otel, "Sends telemetry", "OTLP") |
| 105 | +Rel(api, otel, "Sends telemetry", "OTLP") |
| 106 | +Rel(otel, prometheus, "Exports metrics", "Prometheus Remote Write") |
| 107 | +Rel(otel, tempo, "Exports traces", "OTLP") |
| 108 | +Rel(otel, loki, "Exports logs", "Loki API") |
| 109 | +Rel(grafana, prometheus, "Queries", "PromQL") |
| 110 | +Rel(grafana, tempo, "Queries", "TraceQL") |
| 111 | +Rel(grafana, loki, "Queries", "LogQL") |
| 112 | + |
| 113 | +Rel(ui, keycloak, "Authenticates", "OIDC") |
| 114 | +Rel(api, keycloak, "Validates tokens", "JWT") |
| 115 | +``` |
| 116 | + |
| 117 | +**Key Containers:** |
| 118 | + |
| 119 | +### Control Plane |
| 120 | +- **Control API**: REST and gRPC endpoints for managing agents, nodes, and runs; integrates with Microsoft Agent Framework SDK |
| 121 | +- **Scheduler**: Selects the least-loaded eligible node for each run, subject to node capacity and placement constraints |
| 122 | +- **PostgreSQL**: Persistent storage for all system state |
| 123 | +- **Redis**: Distributed locks and lease management with TTL |
| 124 | +- **OTel Collector**: Central telemetry aggregation point |
| 125 | +- **Admin UI**: Web interface for operators and developers |
| 126 | + |
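The Scheduler's least-loaded selection with placement constraints could look roughly like the sketch below. It is illustrative only: the `Node` fields (`capacity`, `active_runs`, `labels`), the label-matching form of constraints, and the `select_node` function are assumptions made for this example, not the platform's actual data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a worker node as a scheduler might see it."""
    node_id: str
    capacity: int        # assumed: maximum concurrent runs the node accepts
    active_runs: int     # assumed: runs currently leased to the node
    labels: dict[str, str] = field(default_factory=dict)


def select_node(nodes: list[Node], constraints: dict[str, str]) -> Node | None:
    """Return the eligible node with the lowest load ratio, or None."""
    eligible = [
        n for n in nodes
        # A node is eligible if it has spare capacity...
        if n.active_runs < n.capacity
        # ...and every placement constraint matches one of its labels.
        and all(n.labels.get(k) == v for k, v in constraints.items())
    ]
    if not eligible:
        return None
    # Least-loaded first; tie-break on node_id so the choice is deterministic.
    return min(eligible, key=lambda n: (n.active_runs / n.capacity, n.node_id))
```

Comparing load as a ratio rather than a raw run count keeps larger nodes from being starved when node capacities differ; a real scheduler would likely also weigh factors this sketch ignores.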
| 127 | +### Worker Node |
| 128 | +- **Node Runtime**: Long-running worker service that pulls leases, executes agents via MAF, and reports status |
| 129 | +- **Connectors SDK**: Pluggable input/output adapters (Service Bus, HTTP, DLQ) |
| 130 | + |
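To make the lease-with-TTL idea concrete, here is a minimal in-memory sketch of the semantics a worker relies on when it pulls a lease. The `LeaseTable` class and its method names are hypothetical, and an injected clock stands in for Redis key expiry; in Redis itself, an atomic acquire is commonly expressed as `SET key value NX PX <ttl-ms>`.

```python
import time
from typing import Callable, Dict, Tuple


class LeaseTable:
    """In-memory stand-in for a TTL-based lease store (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # run_id -> (holder node_id, expiry timestamp)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def acquire(self, run_id: str, node_id: str, ttl_seconds: float) -> bool:
        """Grant or renew the lease unless another node holds a live one."""
        now = self._clock()
        current = self._leases.get(run_id)
        if current is not None:
            holder, expires_at = current
            if holder != node_id and expires_at > now:
                return False  # someone else still holds an unexpired lease
        self._leases[run_id] = (node_id, now + ttl_seconds)
        return True

    def release(self, run_id: str, node_id: str) -> bool:
        """Drop the lease, but only if the caller is the recorded holder."""
        current = self._leases.get(run_id)
        if current is not None and current[0] == node_id:
            del self._leases[run_id]
            return True
        return False
```

The TTL is what lets another node take over a run if the original holder crashes without releasing; a live holder would renew by calling `acquire` again before expiry.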
| 131 | +### Observability Stack |
| 132 | +- **Prometheus**: Metrics storage and querying (runs, latency, tokens, cost) |
| 133 | +- **Tempo**: Distributed tracing backend |
| 134 | +- **Loki**: Log aggregation with trace correlation |
| 135 | +- **Grafana**: Unified dashboards for all telemetry |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +## Additional Diagrams |
| 140 | + |
| 141 | +### Sequence Diagram: Agent Run Flow |
| 142 | + |
| 143 | +Shows the end-to-end flow of processing a message through an agent run. |
| 144 | + |
| 145 | +```mermaid |
| 146 | +sequenceDiagram |
| 147 | + participant SB as Azure Service Bus |
| 148 | + participant API as Control API |
| 149 | + participant Sched as Scheduler |
| 150 | + participant Node as Node Runtime |
| 151 | + participant Agent as Agent (MAF) |
| 152 | + participant LLM as Azure AI Foundry |
| 153 | + participant TargetAPI as Target Business API |
| 154 | + |
| 155 | + SB->>API: Queue depth notification |
| 156 | + API->>Sched: Create run request |
| 157 | + Sched->>Sched: Select node (least-loaded) |
| 158 | + Sched->>API: Return lease assignment |
| 159 | + API->>Node: Stream lease (gRPC) |
| 160 | + Node->>Node: Start sandbox process |
| 161 | + Node->>SB: Receive message |
| 162 | + Node->>Agent: Execute with message payload |
| 163 | + Agent->>LLM: LLM reasoning call |
| 164 | + LLM-->>Agent: Response with tool calls |
| 165 | + Agent->>TargetAPI: POST with idempotency key |
| 166 | + TargetAPI-->>Agent: 200 OK |
| 167 | + Agent-->>Node: Execution complete |
| 168 | + Node->>SB: Complete message (ack) |
| 169 | + Node->>API: Report run complete |
| 170 | + API->>Sched: Release lease |
| 171 | +``` |
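The "POST with idempotency key" step in the flow above matters because a message may be processed more than once. One way to get a stable key is to derive it from identifiers that do not change across attempts. The sketch below assumes the key is a hash of an agent version and the message ID; both inputs, the `idempotency_key` helper, and the `FakeTargetApi` class are hypothetical and exist only for illustration.

```python
import hashlib


def idempotency_key(agent_version: str, message_id: str) -> str:
    """Derive a stable key so every retry of the same message sends the same value."""
    raw = f"{agent_version}:{message_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeTargetApi:
    """In-memory stand-in for a downstream API that honours idempotency keys."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}
        self.side_effects = 0  # counts how often the business action really ran

    def post(self, key: str, payload: dict) -> dict:
        if key in self._responses:
            # Replay: return the stored response without a second side effect.
            return self._responses[key]
        self.side_effects += 1
        response = {"status": 200, "echo": payload}
        self._responses[key] = response
        return response
```

With this shape, posting the same payload twice under the same key performs the business action once and returns the first response both times.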
| 172 | + |
| 173 | +### Sequence Diagram: Failure and DLQ Flow |
| 174 | + |
| 175 | +Shows how failures are handled and messages are routed to the dead-letter queue. |
| 176 | + |
| 177 | +```mermaid |
| 178 | +sequenceDiagram |
| 179 | + participant SB as Azure Service Bus |
| 180 | + participant Node as Node Runtime |
| 181 | + participant Agent as Agent (MAF) |
| 182 | + participant TargetAPI as Target Business API |
| 183 | + participant DLQ as Dead Letter Queue |
| 184 | + |
| 185 | + SB->>Node: Receive message |
| 186 | + Node->>Agent: Execute agent run |
| 187 | + Agent->>TargetAPI: POST /api/endpoint |
| 188 | + TargetAPI-->>Agent: 500 Internal Server Error |
| 189 | + Agent-->>Node: Retry 1/3 |
| 190 | + Node->>Agent: Execute agent run |
| 191 | + Agent->>TargetAPI: POST /api/endpoint (retry) |
| 192 | + TargetAPI-->>Agent: 500 Internal Server Error |
| 193 | + Agent-->>Node: Retry 2/3 |
| 194 | + Node->>Agent: Execute agent run |
| 195 | + Agent->>TargetAPI: POST /api/endpoint (retry) |
| 196 | + TargetAPI-->>Agent: 500 Internal Server Error |
| 197 | + Agent-->>Node: Retry 3/3 (failed) |
| 198 | + Node->>SB: Abandon message |
| 199 | + SB->>DLQ: Move to dead-letter queue (once max delivery count is exceeded) |
| 200 | + Node->>API: Report run failed |
| 201 | +``` |
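The failure flow above boils down to a bounded retry loop followed by a dead-letter decision. The sketch below shows that control flow only. The three-attempt limit mirrors the diagram; the `TransientError` type, the `process_with_retry` function, and the returned outcome strings are assumptions for this example, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, Tuple

MAX_ATTEMPTS = 3  # mirrors the three attempts shown in the diagram


class TransientError(Exception):
    """Hypothetical marker for a retryable failure such as an HTTP 500."""


def process_with_retry(
    handler: Callable[[dict], None],
    message: dict,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[str, int]:
    """Run the handler up to max_attempts times; report outcome and attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return ("completed", attempt)   # caller would complete (ack) the message
        except TransientError:
            continue                        # retry until attempts are exhausted
    return ("dead-lettered", max_attempts)  # caller would route the message to the DLQ
```

Only errors marked as transient are retried here; any other exception propagates to the caller, which is one reasonable way to avoid retrying failures that cannot succeed.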
| 202 | + |
| 203 | +--- |
| 204 | + |
| 205 | +## Deployment View |
| 206 | + |
| 207 | +### Local Development (k3d) |
| 208 | + |
| 209 | +```mermaid |
| 210 | +graph TB |
| 211 | + subgraph "k3d Cluster" |
| 212 | + subgraph "Control Plane Namespace" |
| 213 | + API[Control API] |
| 214 | + Sched[Scheduler] |
| 215 | + UI[Admin UI] |
| 216 | + PG[PostgreSQL] |
| 217 | + Redis[Redis Cache] |
| 218 | + OTel[OTel Collector] |
| 219 | + end |
| 220 | + |
| 221 | + subgraph "Worker Namespace" |
| 222 | + Node1[Node Runtime 1] |
| 223 | + Node2[Node Runtime 2] |
| 224 | + end |
| 225 | + |
| 226 | + subgraph "Observability Namespace" |
| 227 | + Prom[Prometheus] |
| 228 | + Tempo[Tempo] |
| 229 | + Loki[Loki] |
| 230 | + Grafana[Grafana] |
| 231 | + end |
| 232 | + end |
| 233 | + |
| 234 | + subgraph "External Services" |
| 235 | + SB[Azure Service Bus] |
| 236 | + AzureAI[Azure AI Foundry] |
| 237 | + KC[Keycloak] |
| 238 | + end |
| 239 | + |
| 240 | + API --> PG |
| 241 | + API --> Redis |
| 242 | + Sched --> Redis |
| 243 | + Node1 --> API |
| 244 | + Node2 --> API |
| 245 | + Node1 --> SB |
| 246 | + Node2 --> SB |
| 247 | + Node1 --> AzureAI |
| 248 | + Node2 --> AzureAI |
| 249 | + API --> OTel |
| 250 | + Node1 --> OTel |
| 251 | + OTel --> Prom |
| 252 | + OTel --> Tempo |
| 253 | + OTel --> Loki |
| 254 | + UI --> KC |
| 255 | +``` |
| 256 | + |
| 257 | +### Production (AKS) |
| 258 | + |
| 259 | +```mermaid |
| 260 | +graph TB |
| 261 | + subgraph "Azure" |
| 262 | + subgraph "AKS Cluster" |
| 263 | + subgraph "Control Plane" |
| 264 | + API[Control API<br/>2 replicas] |
| 265 | + Sched[Scheduler] |
| 266 | + UI[Admin UI] |
| 267 | + end |
| 268 | + |
| 269 | + subgraph "Worker Nodes" |
| 270 | + Node1[Node 1] |
| 271 | + Node2[Node 2] |
| 272 | + NodeN[Node N] |
| 273 | + end |
| 274 | + |
| 275 | + subgraph "Observability" |
| 276 | + OTel[OTel Collector] |
| 277 | + Grafana[Grafana] |
| 278 | + end |
| 279 | + end |
| 280 | + |
| 281 | + PG[Azure Database<br/>for PostgreSQL] |
| 282 | + Redis[Azure Cache<br/>for Redis] |
| 283 | + SB[Azure Service Bus] |
| 284 | + AzureAI[Azure AI Foundry] |
| 285 | + KV[Azure Key Vault] |
| 286 | + Monitor[Azure Monitor] |
| 287 | + Entra[Entra ID] |
| 288 | + end |
| 289 | + |
| 290 | + API --> PG |
| 291 | + API --> Redis |
| 292 | + API --> KV |
| 293 | + Sched --> Redis |
| 294 | + Node1 --> API |
| 295 | + Node2 --> API |
| 296 | + NodeN --> API |
| 297 | + Node1 --> SB |
| 298 | + Node1 --> AzureAI |
| 299 | + API --> OTel |
| 300 | + Node1 --> OTel |
| 301 | + OTel --> Monitor |
| 302 | + UI --> Entra |
| 303 | +``` |
| 304 | + |
| 305 | +--- |
| 306 | + |
| 307 | +## Technology Stack Summary |
| 308 | + |
| 309 | +| Layer | Technologies | |
| 310 | +|-------|-------------| |
| 311 | +| **Control Plane** | ASP.NET Core, gRPC, Microsoft Agent Framework SDK | |
| 312 | +| **Worker Runtime** | .NET Worker Service, Microsoft Agent Framework | |
| 313 | +| **Storage** | PostgreSQL, Redis | |
| 314 | +| **Messaging** | Azure Service Bus, NATS JetStream | |
| 315 | +| **AI/LLM** | Azure AI Foundry (GPT-4, etc.) | |
| 316 | +| **Observability** | OpenTelemetry, Prometheus, Tempo, Loki, Grafana | |
| 317 | +| **UI** | Next.js, React, Tailwind CSS, shadcn/ui | |
| 318 | +| **Auth** | Keycloak (dev), Entra ID (prod), OIDC | |
| 319 | +| **Infrastructure** | Kubernetes (k3d/AKS), Helm, Docker | |
| 320 | +| **Secrets** | Azure Key Vault, External Secrets Operator | |
| 321 | + |
| 322 | +--- |
| 323 | + |
| 324 | +## References |
| 325 | + |
| 326 | +- [System Architecture Document (SAD)](../sad.md) |
| 327 | +- [Microsoft Agent Framework Documentation](https://learn.microsoft.com/en-us/agent-framework/) |
| 328 | +- [C4 Model](https://c4model.com/) |
| 329 | +- [Azure AI Foundry Integration](./AZURE_AI_FOUNDRY_INTEGRATION.md) |