diff --git a/backend/app/api/routes/saved_scripts.py b/backend/app/api/routes/saved_scripts.py
index e5cc97e9..6af0d6de 100644
--- a/backend/app/api/routes/saved_scripts.py
+++ b/backend/app/api/routes/saved_scripts.py
@@ -11,7 +11,7 @@
from app.services.auth_service import AuthService
from app.services.saved_script_service import SavedScriptService
-router = APIRouter(route_class=DishkaRoute)
+router = APIRouter(route_class=DishkaRoute, tags=["scripts"])
@router.post("/scripts", response_model=SavedScriptResponse)
diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md
new file mode 100644
index 00000000..7b97f63c
--- /dev/null
+++ b/docs/architecture/authentication.md
@@ -0,0 +1,145 @@
+# Authentication
+
+The platform uses cookie-based JWT authentication with CSRF protection via the double-submit pattern. This approach
+keeps tokens secure (httpOnly cookies) while enabling CSRF protection for state-changing requests.
+
+## Architecture
+
+```mermaid
+sequenceDiagram
+ participant Browser
+ participant Frontend
+ participant Backend
+ participant MongoDB
+
+ Browser->>Frontend: Login form submit
+ Frontend->>Backend: POST /auth/login
+ Backend->>MongoDB: Verify credentials
+ MongoDB-->>Backend: User record
+ Backend->>Backend: Generate JWT + CSRF token
+ Backend-->>Frontend: Set-Cookie: access_token (httpOnly)
+ Backend-->>Frontend: Set-Cookie: csrf_token
+ Frontend->>Frontend: Store CSRF in memory
+
+ Note over Browser,Backend: Subsequent requests
+ Browser->>Frontend: Click action
+ Frontend->>Backend: POST /api/... + X-CSRF-Token header
+ Backend->>Backend: Validate JWT from cookie
+ Backend->>Backend: Validate CSRF header == cookie
+ Backend-->>Frontend: Response
+```
+
+## Token Flow
+
+Login creates two cookies:
+
+| Cookie | Properties | Purpose |
+|----------------|------------------------------------|---------------------------------|
+| `access_token` | httpOnly, secure, samesite=strict | JWT for authentication |
+| `csrf_token` | secure, samesite=strict (readable) | CSRF double-submit verification |
+
+The `access_token` cookie is httpOnly, so JavaScript cannot read it—this prevents XSS attacks from stealing the token.
+The `csrf_token` cookie is readable by JavaScript so the frontend can include it in request headers.
+
+## Backend Implementation
+
+### Password Hashing
+
+Passwords are hashed using bcrypt via passlib:
+
+```python
+--8<-- "backend/app/core/security.py:23:32"
+```
+
+### JWT Creation
+
+JWTs are signed with HS256 using a secret key from settings:
+
+```python
+--8<-- "backend/app/core/security.py:34:39"
+```
+
+The token payload contains the username in the `sub` claim and an expiration time. Token lifetime is configured via
+`ACCESS_TOKEN_EXPIRE_MINUTES` (default: 30 minutes).
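+
+As a standalone illustration (using PyJWT rather than the project's own helper shown above), a token with these claims
+can be created and verified like this; the secret name is an assumption:
+
+```python
+# Illustrative sketch only; the real helpers live in backend/app/core/security.py.
+from datetime import datetime, timedelta, timezone
+
+import jwt  # PyJWT
+
+SECRET_KEY = "change-me"          # assumption: loaded from settings in the real app
+ACCESS_TOKEN_EXPIRE_MINUTES = 30  # default lifetime mentioned above
+
+
+def create_token(username: str) -> str:
+    expire = datetime.now(timezone.utc) + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
+    return jwt.encode({"sub": username, "exp": expire}, SECRET_KEY, algorithm="HS256")
+
+
+def verify_token(token: str) -> str:
+    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError for bad or expired tokens
+    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
+    return payload["sub"]
+```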
+
+### CSRF Validation
+
+The double-submit pattern requires the CSRF token to be sent in both a cookie and a header. The
+[`validate_csrf_token`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/security.py) dependency
+validates this for all authenticated POST/PUT/DELETE requests:
+
+```python
+--8<-- "backend/app/core/security.py:77:107"
+```
+
+Safe methods (GET, HEAD, OPTIONS) and auth endpoints (login, register, logout) skip CSRF validation.
+
+### Cookie Configuration
+
+Login sets cookies with security best practices:
+
+```python
+--8<-- "backend/app/api/routes/auth.py:94:112"
+```
+
+| Setting | Value | Purpose |
+|------------|--------|---------------------------------------------|
+| `httponly` | true | Prevents JavaScript access (XSS protection) |
+| `secure` | true | HTTPS only |
+| `samesite` | strict | Prevents CSRF via cross-site requests |
+| `path` | / | Cookie sent for all paths |
+
+## Frontend Implementation
+
+### Auth Store
+
+The frontend maintains authentication state in a Svelte store with sessionStorage persistence:
+
+```typescript
+--8<-- "frontend/src/stores/auth.ts:9:17"
+```
+
+The store caches verification results for 30 seconds to reduce server load:
+
+```typescript
+--8<-- "frontend/src/stores/auth.ts:45:47"
+```
+
+### CSRF Injection
+
+The API interceptor automatically adds the CSRF token header to all non-GET requests:
+
+```typescript
+--8<-- "frontend/src/lib/api-interceptors.ts:137:145"
+```
+
+### Session Handling
+
+On 401 responses, the interceptor clears auth state and redirects to login, preserving the original URL for
+post-login redirect:
+
+```typescript
+--8<-- "frontend/src/lib/api-interceptors.ts:64:81"
+```
+
+## Endpoints
+
+The authentication endpoints (login, register, logout, token verification) live in
+[`api/routes/auth.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/auth.py); the login
+handler is shown above under Cookie Configuration.
+
+## Offline-First Behavior
+
+The frontend uses an offline-first approach for auth verification. On network failure, it returns the cached auth state
+rather than immediately logging out. This provides better UX during transient network issues but means server-revoked
+tokens may remain "valid" locally for up to 30 seconds.
+
+Security-critical operations should use `verifyAuth(forceRefresh=true)` to bypass the cache.
+
+## Key Files
+
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------------------|--------------------------------|
+| [`core/security.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/security.py) | JWT, password, CSRF utilities |
+| [`services/auth_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/auth_service.py) | Auth service layer |
+| [`api/routes/auth.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/auth.py) | Auth endpoints |
+| [`stores/auth.ts`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/src/stores/auth.ts) | Frontend auth state |
+| [`api-interceptors.ts`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/src/lib/api-interceptors.ts) | CSRF injection, error handling |
diff --git a/docs/architecture/domain-exceptions.md b/docs/architecture/domain-exceptions.md
index c04b4553..4b10cfb2 100644
--- a/docs/architecture/domain-exceptions.md
+++ b/docs/architecture/domain-exceptions.md
@@ -1,27 +1,35 @@
# Domain exceptions
-This document explains how the backend handles errors using domain exceptions. It covers the exception hierarchy, how services use them, and how the middleware maps them to HTTP responses.
+This document explains how the backend handles errors using domain exceptions. It covers the exception hierarchy, how
+services use them, and how the middleware maps them to HTTP responses.
## Why domain exceptions
-Services used to throw `HTTPException` directly with status codes like 404 or 422. That worked but created tight coupling between business logic and HTTP semantics. A service that throws `HTTPException(status_code=404)` knows it's running behind an HTTP API, which breaks when you want to reuse that service from a CLI tool, a message consumer, or a test harness.
+Services used to throw `HTTPException` directly with status codes like 404 or 422. That worked but created tight
+coupling between business logic and HTTP semantics. A service that throws `HTTPException(status_code=404)` knows it's
+running behind an HTTP API, which breaks when you want to reuse that service from a CLI tool, a message consumer, or a
+test harness.
-Domain exceptions fix this by letting services speak in business terms. A service raises `ExecutionNotFoundError(execution_id)` instead of `HTTPException(404, "Execution not found")`. The exception handler middleware maps domain exceptions to HTTP responses in one place. Services stay transport-agnostic, tests assert on meaningful exception types, and the mapping logic lives where it belongs.
+Domain exceptions fix this by letting services speak in business terms. A service raises
+`ExecutionNotFoundError(execution_id)` instead of `HTTPException(404, "Execution not found")`. The exception handler
+middleware maps domain exceptions to HTTP responses in one place. Services stay transport-agnostic, tests assert on
+meaningful exception types, and the mapping logic lives where it belongs.
## Exception hierarchy
-All domain exceptions inherit from `DomainError`, which lives in `app/domain/exceptions.py`. The base classes map to HTTP status codes:
+All domain exceptions inherit from `DomainError`, which lives in `app/domain/exceptions.py`. The base classes map to
+HTTP status codes:
-| Base class | HTTP status | Use case |
-|------------|-------------|----------|
-| `NotFoundError` | 404 | Entity doesn't exist |
-| `ValidationError` | 422 | Invalid input or state |
-| `ThrottledError` | 429 | Rate limit exceeded |
-| `ConflictError` | 409 | Concurrent modification or duplicate |
-| `UnauthorizedError` | 401 | Missing or invalid credentials |
-| `ForbiddenError` | 403 | Authenticated but not allowed |
-| `InvalidStateError` | 400 | Operation invalid for current state |
-| `InfrastructureError` | 500 | External system failure |
+| Base class | HTTP status | Use case |
+|-----------------------|-------------|--------------------------------------|
+| `NotFoundError` | 404 | Entity doesn't exist |
+| `ValidationError` | 422 | Invalid input or state |
+| `ThrottledError` | 429 | Rate limit exceeded |
+| `ConflictError` | 409 | Concurrent modification or duplicate |
+| `UnauthorizedError` | 401 | Missing or invalid credentials |
+| `ForbiddenError` | 403 | Authenticated but not allowed |
+| `InvalidStateError` | 400 | Operation invalid for current state |
+| `InfrastructureError` | 500 | External system failure |
Each domain module defines specific exceptions that inherit from these bases. The hierarchy looks like this:
@@ -62,64 +70,43 @@ DomainError
Domain exceptions live in their respective domain modules:
-| Module | File | Exceptions |
-|--------|------|------------|
-| Base | `app/domain/exceptions.py` | `DomainError`, `NotFoundError`, `ValidationError`, etc. |
-| Execution | `app/domain/execution/exceptions.py` | `ExecutionNotFoundError`, `RuntimeNotSupportedError`, `EventPublishError` |
-| Saga | `app/domain/saga/exceptions.py` | `SagaNotFoundError`, `SagaAccessDeniedError`, `SagaInvalidStateError`, `SagaCompensationError`, `SagaTimeoutError`, `SagaConcurrencyError` |
-| Notification | `app/domain/notification/exceptions.py` | `NotificationNotFoundError`, `NotificationThrottledError`, `NotificationValidationError` |
-| Saved Script | `app/domain/saved_script/exceptions.py` | `SavedScriptNotFoundError` |
-| Replay | `app/domain/replay/exceptions.py` | `ReplaySessionNotFoundError`, `ReplayOperationError` |
-| User/Auth | `app/domain/user/exceptions.py` | `AuthenticationRequiredError`, `InvalidCredentialsError`, `TokenExpiredError`, `CSRFValidationError`, `AdminAccessRequiredError`, `UserNotFoundError` |
+| Module | File | Exceptions |
+|--------------|-----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Base | `app/domain/exceptions.py` | `DomainError`, `NotFoundError`, `ValidationError`, etc. |
+| Execution | `app/domain/execution/exceptions.py` | `ExecutionNotFoundError`, `RuntimeNotSupportedError`, `EventPublishError` |
+| Saga | `app/domain/saga/exceptions.py` | `SagaNotFoundError`, `SagaAccessDeniedError`, `SagaInvalidStateError`, `SagaCompensationError`, `SagaTimeoutError`, `SagaConcurrencyError` |
+| Notification | `app/domain/notification/exceptions.py` | `NotificationNotFoundError`, `NotificationThrottledError`, `NotificationValidationError` |
+| Saved Script | `app/domain/saved_script/exceptions.py` | `SavedScriptNotFoundError` |
+| Replay | `app/domain/replay/exceptions.py` | `ReplaySessionNotFoundError`, `ReplayOperationError` |
+| User/Auth | `app/domain/user/exceptions.py` | `AuthenticationRequiredError`, `InvalidCredentialsError`, `TokenExpiredError`, `CSRFValidationError`, `AdminAccessRequiredError`, `UserNotFoundError` |
## Rich constructors
-Specific exceptions have constructors that capture context for logging and debugging. Instead of just a message string, they take structured arguments:
+Specific exceptions have constructors that capture context for logging and debugging. Instead of just a message string,
+they take structured arguments:
```python
-class SagaAccessDeniedError(ForbiddenError):
- def __init__(self, saga_id: str, user_id: str) -> None:
- self.saga_id = saga_id
- self.user_id = user_id
- super().__init__(f"Access denied to saga '{saga_id}' for user '{user_id}'")
+--8<-- "backend/app/domain/saga/exceptions.py:11:17"
+```
-class NotificationThrottledError(ThrottledError):
- def __init__(self, user_id: str, limit: int, window_hours: int) -> None:
- self.user_id = user_id
- self.limit = limit
- self.window_hours = window_hours
- super().__init__(f"Rate limit exceeded for user '{user_id}': max {limit} per {window_hours}h")
+```python
+--8<-- "backend/app/domain/notification/exceptions.py:11:18"
```
-This means you can log `exc.saga_id` or `exc.limit` without parsing the message, and tests can assert on specific fields.
+This means you can log `exc.saga_id` or `exc.limit` without parsing the message, and tests can assert on specific
+fields.
## Exception handler
-The middleware in `app/core/exceptions/handlers.py` catches all `DomainError` subclasses and maps them to JSON responses:
+The middleware in `app/core/exceptions/handlers.py` catches all `DomainError` subclasses and maps them to JSON
+responses:
```python
-def configure_exception_handlers(app: FastAPI) -> None:
- @app.exception_handler(DomainError)
- async def domain_error_handler(request: Request, exc: DomainError) -> JSONResponse:
- status_code = _map_to_status_code(exc)
- return JSONResponse(
- status_code=status_code,
- content={"detail": exc.message, "type": type(exc).__name__},
- )
-
-def _map_to_status_code(exc: DomainError) -> int:
- if isinstance(exc, NotFoundError): return 404
- if isinstance(exc, ValidationError): return 422
- if isinstance(exc, ThrottledError): return 429
- if isinstance(exc, ConflictError): return 409
- if isinstance(exc, UnauthorizedError): return 401
- if isinstance(exc, ForbiddenError): return 403
- if isinstance(exc, InvalidStateError): return 400
- if isinstance(exc, InfrastructureError): return 500
- return 500
+--8<-- "backend/app/core/exceptions/handlers.py:17:44"
```
-The response includes the exception type name, so clients can distinguish between `ExecutionNotFoundError` and `SagaNotFoundError` even though both return 404.
+The response includes the exception type name, so clients can distinguish between `ExecutionNotFoundError` and
+`SagaNotFoundError` even though both return 404.
## Using exceptions in services
@@ -145,7 +132,8 @@ async def get_execution(self, execution_id: str) -> DomainExecution:
return execution
```
-The service no longer knows about HTTP. It raises a domain exception that describes what went wrong in business terms. The middleware handles the translation to HTTP.
+The service no longer knows about HTTP. It raises a domain exception that describes what went wrong in business terms.
+The middleware handles the translation to HTTP.
## Testing with domain exceptions
@@ -203,4 +191,5 @@ API routes can still use `HTTPException` for route-level concerns that don't bel
- Authentication checks in route dependencies
- Route-specific access control before calling services
-The general rule: if it's about the business domain, use domain exceptions. If it's about HTTP mechanics at the route level, `HTTPException` is fine.
+The general rule: if it's about the business domain, use domain exceptions. If it's about HTTP mechanics at the route
+level, `HTTPException` is fine.
diff --git a/docs/architecture/event-storage.md b/docs/architecture/event-storage.md
index fd21d31e..ae56c78f 100644
--- a/docs/architecture/event-storage.md
+++ b/docs/architecture/event-storage.md
@@ -8,49 +8,30 @@ with consistent structure.
## EventDocument structure
-`EventDocument` uses a flexible payload pattern:
+`EventDocument` uses a flexible payload pattern. Base fields are stored at document level for efficient indexing, while
+event-specific fields go into the `payload` dict:
```python
-class EventDocument(Document):
- event_id: str # Unique event identifier
- event_type: EventType # Typed event classification
- event_version: str # Schema version
- timestamp: datetime # When event occurred
- aggregate_id: str # Related entity (e.g., execution_id)
- metadata: EventMetadata # Service info, correlation, user context
- payload: dict[str, Any] # Event-specific data (flexible)
- stored_at: datetime # When stored in MongoDB
- ttl_expires_at: datetime # Auto-expiration time
+--8<-- "backend/app/db/docs/event.py:30:45"
```
-**Base fields** (`event_id`, `event_type`, `timestamp`, `aggregate_id`, `metadata`) are stored at document level for
-efficient indexing.
-
-**Event-specific fields** go into `payload` dict, allowing different event types to have different data without schema
-changes.
-
## Storage pattern
-When storing events, base fields stay at top level while everything else goes into payload:
+When storing events, base fields stay at top level while everything else goes into payload. The `_flatten_doc` helper
+reverses this for deserialization:
```python
-_BASE_FIELDS = {"event_id", "event_type", "event_version", "timestamp", "aggregate_id", "metadata"}
-
-data = event.model_dump(exclude={"topic"})
-payload = {k: data.pop(k) for k in list(data) if k not in _BASE_FIELDS}
-doc = EventDocument(**data, payload=payload, stored_at=now, ttl_expires_at=ttl)
+--8<-- "backend/app/events/event_store.py:18:26"
```
-## Query pattern
-
-For typed deserialization, flatten payload inline:
+The `store_event` method applies this pattern:
```python
-d = doc.model_dump(exclude={"id", "revision_id", "stored_at", "ttl_expires_at"})
-flat = {**{k: v for k, v in d.items() if k != "payload"}, **d.get("payload", {})}
-event = schema_registry.deserialize_json(flat)
+--8<-- "backend/app/events/event_store.py:56:64"
```
+## Query pattern
+
For MongoDB queries, access payload fields with dot notation:
```python
@@ -71,9 +52,8 @@ graph TD
ESC --> ES
```
-1. `KafkaEventService.publish_event()` stores to `events` AND publishes to Kafka
-2. `EventStoreConsumer` consumes from Kafka and stores to same `events` collection
-3. Deduplication via unique `event_id` index handles double-writes gracefully
+`KafkaEventService.publish_event()` stores to `events` AND publishes to Kafka. `EventStoreConsumer` consumes from Kafka
+and stores to the same `events` collection. Deduplication via unique `event_id` index handles double-writes gracefully.
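+
+A minimal sketch of how the unique index absorbs a double-write (illustrative only; the real write path goes through
+the Beanie `EventDocument` shown above):
+
+```python
+# Both writers race to insert the same event; the unique event_id index lets
+# the second insert fail quietly instead of duplicating the event.
+from pymongo.errors import DuplicateKeyError
+
+
+async def store_once(events_collection, event_doc: dict) -> None:
+    try:
+        await events_collection.insert_one(event_doc)
+    except DuplicateKeyError:
+        pass  # already stored by the other writer; nothing to do
+```
+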
## Read patterns
@@ -88,45 +68,34 @@ All repositories query the same `events` collection:
## TTL and retention
Events have a configurable TTL (default 90 days). The `ttl_expires_at` field triggers MongoDB's TTL index for automatic
-cleanup:
-
-```python
-ttl_expires_at = datetime.now(timezone.utc) + timedelta(days=self.ttl_days)
-```
-
-For permanent audit requirements, events can be archived to `EventArchiveDocument` before deletion.
+cleanup. For permanent audit requirements, events can be archived to `EventArchiveDocument` before deletion.
## ReplayFilter
`ReplayFilter` provides a unified way to query events across all use cases:
```python
-class ReplayFilter(BaseModel):
- event_ids: list[str] | None = None
- execution_id: str | None = None
- correlation_id: str | None = None
- aggregate_id: str | None = None
- event_types: list[EventType] | None = None
- start_time: datetime | None = None
- end_time: datetime | None = None
- user_id: str | None = None
- service_name: str | None = None
- custom_query: dict[str, Any] | None = None
-
- def to_mongo_query(self) -> dict[str, Any]:
- # Builds MongoDB query from filter fields
+--8<-- "backend/app/domain/replay/models.py:13:31"
+```
+
+The `to_mongo_query()` method builds MongoDB queries from filter fields:
+
+```python
+--8<-- "backend/app/domain/replay/models.py:49:90"
```
All event querying—admin browse, replay preview, event export—uses `ReplayFilter.to_mongo_query()` for consistency.
## Key files
-- `db/docs/event.py` — `EventDocument` and `EventArchiveDocument` definitions
-- `domain/replay/models.py` — `ReplayFilter`, `ReplayConfig`, `ReplaySessionState`
-- `events/event_store.py` — event storage and retrieval operations
-- `db/repositories/replay_repository.py` — replay-specific queries
-- `db/repositories/admin/admin_events_repository.py` — admin dashboard queries
-- `services/kafka_event_service.py` — unified publish (store + Kafka)
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
+| [`db/docs/event.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/docs/event.py) | `EventDocument` and `EventArchiveDocument` definitions |
+| [`domain/replay/models.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/domain/replay/models.py) | `ReplayFilter`, `ReplayConfig`, `ReplaySessionState` |
+| [`events/event_store.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/events/event_store.py) | Event storage and retrieval operations |
+| [`db/repositories/replay_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/replay_repository.py) | Replay-specific queries |
+| [`db/repositories/admin/admin_events_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/admin/admin_events_repository.py) | Admin dashboard queries |
+| [`services/kafka_event_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/kafka_event_service.py) | Unified publish (store + Kafka) |
## Related docs
diff --git a/docs/architecture/execution-queue.md b/docs/architecture/execution-queue.md
new file mode 100644
index 00000000..876ff6df
--- /dev/null
+++ b/docs/architecture/execution-queue.md
@@ -0,0 +1,122 @@
+# Execution Queue
+
+The ExecutionCoordinator manages a priority queue for script executions, allocating CPU and memory resources before
+spawning pods. It consumes `ExecutionRequested` events, validates resource availability, and emits commands to the
+Kubernetes worker via the saga system. Per-user limits and stale timeout handling prevent queue abuse.
+
+## Architecture
+
+```mermaid
+flowchart TB
+ subgraph Kafka
+ REQ[ExecutionRequested Event] --> COORD[ExecutionCoordinator]
+ COORD --> CMD[CreatePodCommand]
+ RESULT[Completed/Failed Events] --> COORD
+ end
+
+ subgraph Coordinator
+ COORD --> QUEUE[QueueManager]
+ COORD --> RESOURCES[ResourceManager]
+ QUEUE --> HEAP[(Priority Heap)]
+ RESOURCES --> POOL[(Resource Pool)]
+ end
+
+ subgraph Scheduling Loop
+ LOOP[Get Next Execution] --> CHECK{Resources Available?}
+ CHECK -->|Yes| ALLOCATE[Allocate Resources]
+ CHECK -->|No| REQUEUE[Requeue Execution]
+ ALLOCATE --> PUBLISH[Publish CreatePodCommand]
+ end
+```
+
+## Queue Priority
+
+Executions enter the queue with one of five priority levels. Lower numeric values are processed first:
+
+```python
+--8<-- "backend/app/services/coordinator/queue_manager.py:14:19"
+```
+
+The queue uses Python's `heapq` module, which efficiently maintains the priority ordering. When resources are
+unavailable, executions are requeued with reduced priority to prevent starvation of lower-priority work.
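+
+A minimal sketch of the underlying heap pattern (not the actual `QueueManager`): `heapq` keeps the smallest tuple at
+the front, so lower priority values pop first, and a monotonically increasing counter breaks ties in FIFO order:
+
+```python
+# Sketch of the heapq priority-queue pattern; the real QueueManager adds
+# per-user limits, stale-entry cleanup, and requeue logic on top of this.
+import heapq
+import itertools
+
+_counter = itertools.count()
+_heap: list[tuple[int, int, str]] = []  # (priority, sequence, execution_id)
+
+
+def enqueue(execution_id: str, priority: int) -> None:
+    heapq.heappush(_heap, (priority, next(_counter), execution_id))
+
+
+def dequeue() -> str | None:
+    if not _heap:
+        return None
+    _, _, execution_id = heapq.heappop(_heap)
+    return execution_id
+
+
+enqueue("exec-low", priority=5)
+enqueue("exec-high", priority=1)
+assert dequeue() == "exec-high"  # lower numeric value is processed first
+```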
+
+## Per-User Limits
+
+The queue enforces per-user execution limits to prevent a single user from monopolizing resources:
+
+```python
+--8<-- "backend/app/services/coordinator/queue_manager.py:42:54"
+```
+
+When a user exceeds their limit, new execution requests are rejected with an error message indicating the limit has been
+reached.
+
+## Stale Timeout
+
+Executions that sit in the queue too long (default 1 hour) are automatically removed by a background cleanup task. This
+prevents abandoned requests from consuming queue space indefinitely:
+
+```python
+--8<-- "backend/app/services/coordinator/queue_manager.py:243:267"
+```
+
+## Resource Allocation
+
+The ResourceManager tracks a pool of CPU, memory, and GPU resources. Each execution requests an allocation based on
+language defaults or explicit requirements:
+
+```python
+--8<-- "backend/app/services/coordinator/resource_manager.py:121:130"
+```
+
+The pool maintains minimum reserve thresholds to ensure the system remains responsive even under heavy load. Allocations
+that would exceed the safe threshold are rejected, and the execution is requeued for later processing.
+
+```python
+--8<-- "backend/app/services/coordinator/resource_manager.py:135:148"
+```
+
+## Scheduling Loop
+
+The coordinator runs a background scheduling loop that continuously pulls executions from the queue and attempts to
+schedule them:
+
+```python
+--8<-- "backend/app/services/coordinator/coordinator.py:307:323"
+```
+
+A semaphore limits concurrent scheduling operations to prevent overwhelming the system during bursts of incoming
+requests.
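+
+A sketch of that pattern with `asyncio.Semaphore` (names here are illustrative, not the coordinator's actual
+attributes):
+
+```python
+# At most MAX_CONCURRENT_SCHEDULING scheduling operations run at the same time.
+import asyncio
+
+MAX_CONCURRENT_SCHEDULING = 10  # matches the default in the configuration table below
+_semaphore = asyncio.Semaphore(MAX_CONCURRENT_SCHEDULING)
+
+
+async def schedule(execution_id: str) -> None:
+    async with _semaphore:
+        await asyncio.sleep(0.1)  # stand-in for resource allocation + publishing the command
+
+
+async def main() -> None:
+    await asyncio.gather(*(schedule(f"exec-{i}") for i in range(50)))
+
+
+asyncio.run(main())
+```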
+
+## Event Flow
+
+The coordinator handles several event types:
+
+1. **ExecutionRequested** - Adds execution to queue, publishes `ExecutionAccepted`
+2. **ExecutionCancelled** - Removes from queue, releases resources if allocated
+3. **ExecutionCompleted** - Releases allocated resources
+4. **ExecutionFailed** - Releases allocated resources
+
+When scheduling succeeds, the coordinator publishes a `CreatePodCommand` to the saga topic, triggering pod creation by
+the Kubernetes worker.
+
+## Configuration
+
+| Parameter | Default | Description |
+|-------------------------------|---------|--------------------------------------|
+| `max_queue_size` | 10000 | Maximum executions in queue |
+| `max_executions_per_user` | 100 | Per-user queue limit |
+| `stale_timeout_seconds` | 3600 | When to discard old executions |
+| `max_concurrent_scheduling` | 10 | Parallel scheduling operations |
+| `scheduling_interval_seconds` | 0.5 | Polling interval when queue is empty |
+| `total_cpu_cores` | 32.0 | Total CPU pool |
+| `total_memory_mb` | 65536 | Total memory pool (64GB) |
+| `overcommit_factor` | 1.2 | Allow 20% resource overcommit |
+
+## Key Files
+
+| File | Purpose |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
+| [`services/coordinator/coordinator.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/coordinator.py) | Main coordinator service |
+| [`services/coordinator/queue_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/queue_manager.py) | Priority queue implementation |
+| [`services/coordinator/resource_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/resource_manager.py) | Resource pool and allocation |
diff --git a/docs/architecture/idempotency.md b/docs/architecture/idempotency.md
new file mode 100644
index 00000000..6b036d68
--- /dev/null
+++ b/docs/architecture/idempotency.md
@@ -0,0 +1,129 @@
+# Idempotency
+
+The platform implements at-least-once event delivery with idempotency protection to prevent duplicate processing. When a
+Kafka message is delivered multiple times (due to retries, rebalances, or failures), the idempotency layer ensures the
+event handler executes only once. Results can be cached for fast duplicate responses.
+
+## Architecture
+
+```mermaid
+flowchart TB
+ subgraph Kafka Consumer
+ MSG[Incoming Event] --> CHECK[Check & Reserve Key]
+ end
+
+ subgraph Idempotency Manager
+ CHECK --> REDIS[(Redis)]
+ REDIS --> FOUND{Key Exists?}
+ FOUND -->|Yes| STATUS{Status?}
+ STATUS -->|Processing| TIMEOUT{Timed Out?}
+ STATUS -->|Completed/Failed| DUP[Return Duplicate]
+ TIMEOUT -->|Yes| RETRY[Allow Retry]
+ TIMEOUT -->|No| WAIT[Block Duplicate]
+ FOUND -->|No| RESERVE[Reserve Key]
+ end
+
+ subgraph Handler Execution
+ RESERVE --> HANDLER[Execute Handler]
+ RETRY --> HANDLER
+ HANDLER -->|Success| COMPLETE[Mark Completed]
+ HANDLER -->|Error| FAIL[Mark Failed]
+ COMPLETE --> CACHE[Cache Result]
+ end
+```
+
+## Key Strategies
+
+The idempotency manager supports three strategies for generating keys from events:
+
+**Event-based** uses the event's unique ID and type. This is the default and works for events where the ID is guaranteed
+unique (like UUIDs generated at publish time).
+
+**Content hash** generates a SHA-256 hash of the event's payload, excluding metadata like timestamps and event IDs. Use
+this when the same logical operation might produce different event IDs but identical content.
+
+**Custom** allows the caller to provide an arbitrary key. Useful when idempotency depends on business logic (e.g., "one
+execution per user per minute").
+
+```python
+--8<-- "backend/app/services/idempotency/idempotency_manager.py:37:57"
+```
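+
+As a rough sketch of the content-hash strategy (field names here are assumptions, not the manager's exact
+implementation):
+
+```python
+# Hash the payload with volatile fields removed so retries of the same logical
+# operation map to the same idempotency key.
+import hashlib
+import json
+
+
+def content_hash_key(event: dict) -> str:
+    payload = {k: v for k, v in event.items() if k not in {"event_id", "timestamp", "metadata"}}
+    canonical = json.dumps(payload, sort_keys=True, default=str)
+    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
+```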
+
+## Status Lifecycle
+
+Each idempotency record transitions through defined states:
+
+```python
+--8<-- "backend/app/domain/idempotency/models.py:11:15"
+```
+
+When an event arrives, the manager checks for an existing key. If none exists, it creates a record in `PROCESSING` state
+and returns control to the handler. On success, the record moves to `COMPLETED`; on error, to `FAILED`. Both terminal
+states block duplicate processing for the TTL duration.
+
+If a key is found in `PROCESSING` state but has exceeded the processing timeout (default 5 minutes), the manager assumes
+the previous processor crashed and allows a retry.
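+
+The decision boils down to something like the following sketch (toy types; the real logic lives in
+`idempotency_manager.py`):
+
+```python
+# Decide whether an incoming event should be processed, given the stored record.
+from dataclasses import dataclass
+from datetime import datetime, timedelta, timezone
+from enum import Enum
+
+
+class Status(str, Enum):
+    PROCESSING = "processing"
+    COMPLETED = "completed"
+    FAILED = "failed"
+
+
+@dataclass
+class Record:
+    status: Status
+    created_at: datetime
+
+
+PROCESSING_TIMEOUT = timedelta(seconds=300)  # default from the configuration table below
+
+
+def should_process(record: Record | None) -> bool:
+    if record is None:
+        return True  # no key yet: reserve it and run the handler
+    if record.status in (Status.COMPLETED, Status.FAILED):
+        return False  # terminal state: block the duplicate
+    # PROCESSING: retry only if the previous processor appears to have crashed
+    return datetime.now(timezone.utc) - record.created_at > PROCESSING_TIMEOUT
+```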
+
+## Middleware Integration
+
+The `IdempotentEventHandler` wraps Kafka event handlers with automatic duplicate detection:
+
+```python
+--8<-- "backend/app/services/idempotency/middleware.py:39:73"
+```
+
+For bulk registration, the `IdempotentConsumerWrapper` wraps all handlers in a dispatcher:
+
+```python
+--8<-- "backend/app/services/idempotency/middleware.py:122:141"
+```
+
+## Redis Storage
+
+Idempotency records are stored in Redis with automatic TTL expiration. The `SET NX EX` command provides atomic
+reservation—if two processes race to claim the same key, only one succeeds:
+
+```python
+--8<-- "backend/app/services/idempotency/redis_repository.py:93:100"
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+|------------------------------|---------------|--------------------------------------|
+| `key_prefix` | `idempotency` | Redis key namespace |
+| `default_ttl_seconds` | `3600` | How long completed keys are retained |
+| `processing_timeout_seconds` | `300` | When to assume a processor crashed |
+| `enable_result_caching` | `true` | Store handler results for duplicates |
+| `max_result_size_bytes` | `1048576` | Maximum cached result size (1MB) |
+
+```python
+--8<-- "backend/app/services/idempotency/idempotency_manager.py:27:34"
+```
+
+## Result Caching
+
+When `enable_result_caching` is true, the manager stores the handler's result JSON alongside the completion status.
+Subsequent duplicates can return the cached result without re-executing the handler. This is useful for idempotent
+queries where the response should be consistent.
+
+Results exceeding `max_result_size_bytes` are silently dropped from the cache, but the idempotency protection still
+applies.
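+
+A sketch of the size guard (names are assumptions):
+
+```python
+# Cache the handler result only if its JSON form fits under the size limit.
+import json
+
+MAX_RESULT_SIZE_BYTES = 1_048_576  # 1MB default from the table above
+
+
+def maybe_cache(result: dict) -> str | None:
+    payload = json.dumps(result)
+    if len(payload.encode("utf-8")) > MAX_RESULT_SIZE_BYTES:
+        return None  # too large: skip the cache; the COMPLETED record still blocks duplicates
+    return payload
+```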
+
+## Metrics
+
+The idempotency system exposes several metrics for monitoring:
+
+- `idempotency_cache_hits` - Key lookups that found an existing record
+- `idempotency_cache_misses` - Key lookups that created new records
+- `idempotency_duplicates_blocked` - Events rejected as duplicates
+- `idempotency_keys_active` - Current number of active keys (updated periodically)
+
+## Key Files
+
+| File | Purpose |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|
+| [`services/idempotency/idempotency_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/idempotency/idempotency_manager.py) | Core idempotency logic |
+| [`services/idempotency/redis_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/idempotency/redis_repository.py) | Redis storage adapter |
+| [`services/idempotency/middleware.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/idempotency/middleware.py) | Handler wrappers and consumer integration |
+| [`domain/idempotency/`](https://github.com/HardMax71/Integr8sCode/tree/main/backend/app/domain/idempotency) | Domain models |
diff --git a/docs/architecture/middleware.md b/docs/architecture/middleware.md
new file mode 100644
index 00000000..e739ff27
--- /dev/null
+++ b/docs/architecture/middleware.md
@@ -0,0 +1,111 @@
+# Middleware
+
+The backend uses a stack of ASGI middleware to handle cross-cutting concerns like rate limiting, request size
+validation, caching, and metrics collection. Middleware runs in order from outermost to innermost, with response
+processing in reverse order.
+
+## Middleware Stack
+
+The middleware is applied in this order (outermost first):
+
+1. **RequestSizeLimitMiddleware** - Rejects oversized requests
+2. **RateLimitMiddleware** - Enforces per-user/per-endpoint limits
+3. **CacheControlMiddleware** - Adds cache headers to responses
+4. **MetricsMiddleware** - Collects HTTP request metrics
+
+## Request Size Limit
+
+Rejects requests exceeding a configurable size limit (default 10MB). This protects against denial-of-service attacks
+from large payloads.
+
+```python
+--8<-- "backend/app/core/middlewares/request_size_limit.py:5:10"
+```
+
+Requests exceeding the limit receive a 413 response:
+
+```json
+{"detail": "Request too large. Maximum size is 10.0MB"}
+```
+
+The middleware checks the `Content-Length` header before reading the body, avoiding wasted processing on oversized
+requests.
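+
+A minimal sketch of that check, assuming a Starlette-style `BaseHTTPMiddleware` (the project's actual middleware is
+shown above):
+
+```python
+# Reject oversized requests based on Content-Length, before the body is read.
+from starlette.middleware.base import BaseHTTPMiddleware
+from starlette.requests import Request
+from starlette.responses import JSONResponse, Response
+
+MAX_BODY_BYTES = 10 * 1024 * 1024  # 10MB default
+
+
+class RequestSizeLimitSketch(BaseHTTPMiddleware):
+    async def dispatch(self, request: Request, call_next) -> Response:
+        content_length = request.headers.get("content-length")
+        if content_length and int(content_length) > MAX_BODY_BYTES:
+            return JSONResponse(
+                status_code=413,
+                content={"detail": "Request too large. Maximum size is 10.0MB"},
+            )
+        return await call_next(request)
+```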
+
+## Rate Limit
+
+The `RateLimitMiddleware` intercepts all HTTP requests and checks them against configured rate limits.
+See [Rate Limiting](rate-limiting.md) for the full algorithm details.
+
+Excluded paths bypass rate limiting:
+
+```python
+--8<-- "backend/app/core/middlewares/rate_limit.py:26:38"
+```
+
+When a request is allowed, rate limit headers are added to the response. When rejected, a 429 response is returned with
+`Retry-After` indicating when to retry.
+
+## Cache Control
+
+Adds appropriate `Cache-Control` headers to GET responses based on endpoint patterns:
+
+```python
+--8<-- "backend/app/core/middlewares/cache.py:7:16"
+```
+
+| Endpoint | Policy | TTL |
+|-----------------------------|-------------------|------------|
+| `/api/v1/k8s-limits` | public | 5 minutes |
+| `/api/v1/example-scripts` | public | 10 minutes |
+| `/api/v1/auth/verify-token` | private, no-cache | - |
+| `/api/v1/notifications` | private, no-cache | - |
+
+Public endpoints also get a `Vary: Accept-Encoding` header for proper proxy caching. Cache headers are only added to
+successful (200) responses.
+
+## Metrics
+
+The `MetricsMiddleware` collects HTTP request telemetry using OpenTelemetry:
+
+```python
+--8<-- "backend/app/core/middlewares/metrics.py:27:45"
+```
+
+The middleware tracks:
+
+- **Request count** by method, path template, and status code
+- **Request duration** histogram
+- **Request/response size** histograms
+- **Active requests** gauge
+
+Path templates use pattern replacement to reduce metric cardinality:
+
+```python
+--8<-- "backend/app/core/middlewares/metrics.py:104:118"
+```
+
+UUIDs, numeric IDs, and MongoDB ObjectIds are replaced with `{id}` to prevent metric explosion.
+
+## System Metrics
+
+In addition to HTTP metrics, the middleware module provides system-level observables:
+
+```python
+--8<-- "backend/app/core/middlewares/metrics.py:169:188"
+```
+
+These expose:
+
+- `system_memory_bytes` - System memory (used, available, percent)
+- `system_cpu_percent` - System CPU utilization
+- `process_metrics` - Process RSS, VMS, CPU, thread count
+
+## Key Files
+
+| File | Purpose |
+|----------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
+| [`core/middlewares/__init__.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/__init__.py) | Middleware exports |
+| [`core/middlewares/rate_limit.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/rate_limit.py) | Rate limiting |
+| [`core/middlewares/cache.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/cache.py) | Cache headers |
+| [`core/middlewares/request_size_limit.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/request_size_limit.py) | Request size validation |
+| [`core/middlewares/metrics.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/metrics.py) | HTTP and system metrics |
diff --git a/docs/architecture/model-conversion.md b/docs/architecture/model-conversion.md
index 83719588..11fd380e 100644
--- a/docs/architecture/model-conversion.md
+++ b/docs/architecture/model-conversion.md
@@ -1,37 +1,47 @@
-# Model Conversion Patterns
+# Model conversion patterns
-This document describes the patterns for converting between domain models, Pydantic schemas, and ODM documents.
+This document describes patterns for converting between domain models, Pydantic schemas, and ODM documents.
-## Core Principles
+## Why these patterns
-1. **Domain models are dataclasses** - pure Python, no framework dependencies
-2. **Pydantic models are for boundaries** - API schemas, ODM documents, Kafka events
-3. **No custom converter methods** - no `to_dict()`, `from_dict()`, `from_response()`, etc.
-4. **Conversion at boundaries** - happens in repositories and services, not in models
+The codebase separates concerns into layers: domain models are pure Python dataclasses with no framework dependencies,
+while Pydantic models handle API schemas, database documents, and Kafka events. Conversion happens at
+boundaries—repositories and services—not inside models.
-## Model Layers
+## Model layers
+```mermaid
+graph TB
+ subgraph "API Layer"
+        API["Pydantic Schemas<br/>app/schemas_pydantic/"]
+ end
+
+ subgraph "Service Layer"
+        SVC["Services<br/>app/services/"]
+ end
+
+ subgraph "Domain Layer"
+        DOM["Dataclasses<br/>app/domain/"]
+ end
+
+ subgraph "Infrastructure Layer"
+        INF["Pydantic/ODM<br/>app/db/docs/<br/>app/infrastructure/kafka/events/"]
+ end
+
+ API <--> SVC
+ SVC <--> DOM
+ SVC <--> INF
```
-┌─────────────────────────────────────────────────────────────┐
-│ API Layer (Pydantic schemas) │
-│ app/schemas_pydantic/ │
-├─────────────────────────────────────────────────────────────┤
-│ Service Layer │
-│ app/services/ │
-├─────────────────────────────────────────────────────────────┤
-│ Domain Layer (dataclasses) │
-│ app/domain/ │
-├─────────────────────────────────────────────────────────────┤
-│ Infrastructure Layer (Pydantic/ODM) │
-│ app/db/docs/, app/infrastructure/kafka/events/ │
-└─────────────────────────────────────────────────────────────┘
-```
-## Conversion Patterns
+API routes receive and return Pydantic schemas. Services orchestrate business logic using domain dataclasses.
+Repositories translate between domain objects and infrastructure (MongoDB documents, Kafka events). Each layer speaks
+its own language; conversion bridges them.
+
+## Conversion patterns
-### Dataclass to Dict
+### Dataclass to dict
-Use `asdict()` with dict comprehension for enum conversion and None filtering:
+Use `asdict()` with a dict comprehension for enum conversion and optional None filtering:
```python
from dataclasses import asdict
@@ -42,27 +52,15 @@ update_dict = {
for k, v in asdict(domain_obj).items()
if v is not None
}
-
-# Without None filtering (keep all values)
-data = {
- k: (v.value if hasattr(v, "value") else v)
- for k, v in asdict(domain_obj).items()
-}
```
-### Pydantic to Dict
+### Pydantic to dict
Use `model_dump()` directly:
```python
-# Exclude None values
-data = pydantic_obj.model_dump(exclude_none=True)
-
-# Include all values
-data = pydantic_obj.model_dump()
-
-# JSON-compatible (datetimes as ISO strings)
-data = pydantic_obj.model_dump(mode="json")
+data = pydantic_obj.model_dump(exclude_none=True) # Skip None values
+data = pydantic_obj.model_dump(mode="json") # JSON-compatible output
```
### Dict to Pydantic
@@ -70,10 +68,7 @@ data = pydantic_obj.model_dump(mode="json")
Use `model_validate()` or constructor unpacking:
```python
-# From dict
obj = SomeModel.model_validate(data)
-
-# With unpacking
obj = SomeModel(**data)
```
@@ -84,18 +79,15 @@ Use `model_validate()` when models have `from_attributes=True`:
```python
class User(BaseModel):
model_config = ConfigDict(from_attributes=True)
- ...
-# Convert between compatible Pydantic models
user = User.model_validate(user_response)
```
-### Dict to Dataclass
+### Dict to dataclass
-Use constructor unpacking:
+Use constructor unpacking, handling nested objects explicitly:
```python
-# Direct unpacking
domain_obj = DomainModel(**data)
# With nested conversion
@@ -107,14 +99,15 @@ domain_obj = DomainModel(
)
```
-## Examples
+## Repository examples
+
+Repositories are the primary conversion boundary. They translate between domain objects and database documents.
-### Repository: Saving Domain to Document
+### Saving domain to document
```python
async def store_event(self, event: Event) -> str:
data = asdict(event)
- # Convert nested dataclass with enum handling
data["metadata"] = {
k: (v.value if hasattr(v, "value") else v)
for k, v in asdict(event.metadata).items()
@@ -123,7 +116,7 @@ async def store_event(self, event: Event) -> str:
await doc.insert()
```
-### Repository: Loading Document to Domain
+### Loading document to domain
```python
async def get_event(self, event_id: str) -> Event | None:
@@ -138,7 +131,7 @@ async def get_event(self, event_id: str) -> Event | None:
)
```
-### Repository: Updating with Typed Input
+### Updating with typed input
```python
async def update_session(self, session_id: str, updates: SessionUpdate) -> bool:
@@ -156,60 +149,26 @@ async def update_session(self, session_id: str, updates: SessionUpdate) -> bool:
return True
```
-### Service: Converting Between Pydantic Models
-
-```python
-# In API route
-user = User.model_validate(current_user)
-
-# In service converting Kafka metadata to domain
-domain_metadata = DomainEventMetadata(**avro_metadata.model_dump())
-```
-
-## Anti-Patterns
+## Anti-patterns
-### Don't: Custom Converter Methods
+Avoid approaches that scatter conversion logic or couple layers incorrectly.
-```python
-# BAD - adds unnecessary abstraction
-class MyModel:
- def to_dict(self) -> dict:
- return {...}
-
- @classmethod
- def from_dict(cls, data: dict) -> "MyModel":
- return cls(...)
-```
+| Anti-pattern | Why it's bad |
+|----------------------------------|-----------------------------------------------------------|
+| Manual field-by-field conversion | Verbose, error-prone, breaks when fields change |
+| Pydantic in domain layer | Couples domain to framework; domain should be pure Python |
+| Conversion logic in models | Scatters boundary logic; keep it in repositories/services |
-### Don't: Pydantic in Domain Layer
-
-```python
-# BAD - domain should be framework-agnostic
-from pydantic import BaseModel
-
-class DomainEntity(BaseModel): # Wrong!
- ...
-```
-
-### Don't: Manual Field-by-Field Conversion
-
-```python
-# BAD - verbose and error-prone
-def from_response(cls, resp):
- return cls(
- field1=resp.field1,
- field2=resp.field2,
- field3=resp.field3,
- ...
- )
-```
+Thin wrappers that delegate to `model_dump()` with specific options are fine. For example, `BaseEvent.to_dict()` applies
+`by_alias=True, mode="json"` consistently across all events. Methods with additional behavior like filtering private
+keys (`to_public_dict()`) are also acceptable—the anti-pattern is manually listing fields.
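+
+For example, a thin wrapper like this is fine (sketch; the real `BaseEvent` lives in the Kafka events module):
+
+```python
+# Delegates to model_dump() with fixed options instead of listing fields manually.
+from pydantic import BaseModel
+
+
+class BaseEvent(BaseModel):
+    def to_dict(self) -> dict:
+        return self.model_dump(by_alias=True, mode="json")
+```
+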
-## Summary
+## Quick reference
-| From | To | Method |
-|------|-----|--------|
-| Dataclass | Dict | `{k: (v.value if hasattr(v, "value") else v) for k, v in asdict(obj).items()}` |
-| Pydantic | Dict | `obj.model_dump()` |
-| Dict | Pydantic | `Model.model_validate(data)` or `Model(**data)` |
-| Pydantic | Pydantic | `TargetModel.model_validate(source)` |
-| Dict | Dataclass | `DataclassModel(**data)` |
+| From | To | Method |
+|-----------|-----------|--------------------------------------------------------------------------------|
+| Dataclass | Dict | `{k: (v.value if hasattr(v, "value") else v) for k, v in asdict(obj).items()}` |
+| Pydantic | Dict | `obj.model_dump()` |
+| Dict | Pydantic | `Model.model_validate(data)` or `Model(**data)` |
+| Pydantic | Pydantic | `TargetModel.model_validate(source)` |
+| Dict | Dataclass | `DataclassModel(**data)` |
diff --git a/docs/architecture/rate-limiting.md b/docs/architecture/rate-limiting.md
new file mode 100644
index 00000000..9d16fbbe
--- /dev/null
+++ b/docs/architecture/rate-limiting.md
@@ -0,0 +1,151 @@
+# Rate Limiting
+
+The platform uses Redis-backed rate limiting with per-user and per-endpoint controls. Two algorithms are
+available—sliding window for precise time-based limits and token bucket for bursty workloads. Authenticated users are
+tracked by user ID; anonymous requests fall back to IP-based limiting.
+
+## Architecture
+
+```mermaid
+flowchart TB
+ subgraph Request Flow
+ REQ[Incoming Request] --> MW[RateLimitMiddleware]
+ MW --> AUTH{Authenticated?}
+ AUTH -->|Yes| UID[User ID]
+ AUTH -->|No| IP[IP Address]
+ UID --> CHECK[Check Rate Limit]
+ IP --> CHECK
+ end
+
+ subgraph Rate Limit Service
+ CHECK --> CONFIG[Load Config from Redis]
+ CONFIG --> MATCH[Match Endpoint Rule]
+ MATCH --> ALGO{Algorithm}
+ ALGO -->|Sliding Window| SW[ZSET Counter]
+ ALGO -->|Token Bucket| TB[Token State]
+ SW --> RESULT[RateLimitStatus]
+ TB --> RESULT
+ end
+
+ subgraph Response
+ RESULT --> ALLOWED{Allowed?}
+ ALLOWED -->|Yes| HEADERS[Add Rate Limit Headers]
+ ALLOWED -->|No| REJECT[429 Too Many Requests]
+ HEADERS --> APP[Application]
+ end
+```
+
+## Algorithms
+
+The rate limiter supports two algorithms, selectable per rule.
+
+**Sliding window** tracks requests in a Redis sorted set, with timestamps as scores. Each request adds an entry; stale
+entries outside the window are pruned. This provides precise limiting but uses more memory for high-traffic endpoints.
+
+**Token bucket** maintains a bucket of tokens that refill at a constant rate. Each request consumes one token. When
+empty, requests are rejected until tokens refill. The `burst_multiplier` controls how many extra tokens can accumulate
+beyond the base limit, allowing controlled bursts.
+
+```python
+--8<-- "backend/app/domain/rate_limit/rate_limit_models.py:11:14"
+```
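+
+A minimal in-memory token bucket sketch (the production version keeps this state as JSON in Redis, shown under Redis
+Storage below):
+
+```python
+# Tokens refill continuously at `rate`; each request consumes one token.
+import time
+from dataclasses import dataclass
+
+
+@dataclass
+class TokenBucket:
+    rate: float        # tokens added per second, e.g. limit / window
+    capacity: float    # limit * burst_multiplier
+    tokens: float
+    last_refill: float
+
+    def allow(self) -> bool:
+        now = time.monotonic()
+        elapsed = now - self.last_refill
+        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
+        self.last_refill = now
+        if self.tokens >= 1.0:
+            self.tokens -= 1.0
+            return True
+        return False
+
+
+bucket = TokenBucket(rate=10 / 60, capacity=20.0, tokens=20.0, last_refill=time.monotonic())
+print(bucket.allow())  # True while tokens remain; False once the bucket is empty
+```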
+
+## Default Rules
+
+The platform ships with default rate limits organized by endpoint group. Higher priority rules match first:
+
+| Pattern | Group | Limit | Window | Priority |
+|----------------------|-----------|---------|--------|----------|
+| `^/api/v1/execute` | execution | 10 req | 60s | 10 |
+| `^/api/v1/auth/.*` | auth | 20 req | 60s | 7 |
+| `^/api/v1/admin/.*` | admin | 100 req | 60s | 5 |
+| `^/api/v1/events/.*` | sse | 5 req | 60s | 3 |
+| `^/api/v1/ws` | websocket | 5 req | 60s | 3 |
+| `^/api/v1/.*` | api | 60 req | 60s | 1 |
+
+Execution endpoints have the strictest limits since they spawn Kubernetes pods. The catch-all API rule (priority 1)
+applies to any endpoint not matching a more specific pattern.
+
+## Middleware Integration
+
+The `RateLimitMiddleware` intercepts all HTTP requests, extracts the user identifier, and checks against the configured
+limits:
+
+```python
+--8<-- "backend/app/core/middlewares/rate_limit.py:15:38"
+```
+
+For authenticated requests, the middleware uses the user ID from the request state. Anonymous requests are identified by
+client IP address:
+
+```python
+--8<-- "backend/app/core/middlewares/rate_limit.py:97:101"
+```
+
+## Response Headers
+
+Every response includes rate limit headers so clients can implement backoff logic:
+
+| Header | Description |
+|-------------------------|------------------------------------------------------|
+| `X-RateLimit-Limit` | Maximum requests allowed in the window |
+| `X-RateLimit-Remaining` | Requests remaining in current window |
+| `X-RateLimit-Reset` | Unix timestamp when the window resets |
+| `Retry-After` | Seconds to wait before retrying (429 responses only) |
+
+When a request is rejected, the middleware returns a 429 response with these headers plus a JSON body:
+
+```json
+{
+ "detail": "Rate limit exceeded",
+ "retry_after": 45,
+ "reset_at": "2024-01-15T10:30:00+00:00"
+}
+```
+
+## Per-User Overrides
+
+Administrators can customize limits for specific users through the admin API. User overrides support:
+
+- **Bypass**: Completely disable rate limiting for the user
+- **Global multiplier**: Scale all limits up or down (e.g., 2.0 doubles the limit)
+- **Custom rules**: Add user-specific rules that take priority over defaults
+
+```python
+--8<-- "backend/app/domain/rate_limit/rate_limit_models.py:42:51"
+```
+
+## Redis Storage
+
+Rate limit state is stored in Redis with automatic TTL expiration. The sliding window algorithm uses sorted sets:
+
+```python
+--8<-- "backend/app/services/rate_limit_service.py:315:331"
+```
+
+Token bucket state is stored as JSON with the current token count and last refill time:
+
+```python
+--8<-- "backend/app/services/rate_limit_service.py:366:378"
+```
+
+Configuration is cached in Redis for 5 minutes to reduce database load while allowing dynamic updates.
+
+## Configuration
+
+Rate limiting is controlled by environment variables:
+
+| Variable | Default | Description |
+|---------------------------|--------------|---------------------------------------|
+| `RATE_LIMIT_ENABLED` | `true` | Enable/disable rate limiting globally |
+| `RATE_LIMIT_REDIS_PREFIX` | `ratelimit:` | Redis key prefix for isolation |
+
+The system gracefully degrades when Redis is unavailable—requests are allowed through rather than failing closed.
+
+## Key Files
+
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|
+| [`services/rate_limit_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/rate_limit_service.py) | Rate limit algorithms and Redis operations |
+| [`core/middlewares/rate_limit.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/middlewares/rate_limit.py) | ASGI middleware for request interception |
+| [`domain/rate_limit/`](https://github.com/HardMax71/Integr8sCode/tree/main/backend/app/domain/rate_limit) | Domain models and default configuration |
diff --git a/docs/architecture/runtime-registry.md b/docs/architecture/runtime-registry.md
new file mode 100644
index 00000000..ba66b62b
--- /dev/null
+++ b/docs/architecture/runtime-registry.md
@@ -0,0 +1,99 @@
+# Runtime Registry
+
+The runtime registry defines how each programming language is executed inside Kubernetes pods. It maps language/version
+pairs to Docker images, file extensions, and interpreter commands. Adding a new language or version is a matter of
+extending the specification dictionary.
+
+## Supported Languages
+
+| Language | Versions | Image Template | File Extension |
+|----------|---------------------------------|---------------------------|----------------|
+| Python | 3.7, 3.8, 3.9, 3.10, 3.11, 3.12 | `python:{version}-slim` | `.py` |
+| Node.js | 18, 20, 22 | `node:{version}-alpine` | `.js` |
+| Ruby | 3.1, 3.2, 3.3 | `ruby:{version}-alpine` | `.rb` |
+| Go | 1.20, 1.21, 1.22 | `golang:{version}-alpine` | `.go` |
+| Bash | 5.1, 5.2, 5.3 | `bash:{version}` | `.sh` |
+
+## Language Specification
+
+Each language is defined by a `LanguageSpec` dictionary containing the available versions, Docker image template, file
+extension, and interpreter command:
+
+```python
+--8<-- "backend/app/runtime_registry.py:12:17"
+```
+
+The full specification for all languages:
+
+```python
+--8<-- "backend/app/runtime_registry.py:19:50"
+```
+
+## Runtime Configuration
+
+The registry generates a `RuntimeConfig` for each language/version combination. This contains everything needed to run a
+script in a pod:
+
+```python
+--8<-- "backend/app/runtime_registry.py:6:10"
+```
+
+- **image**: The full Docker image reference (e.g., `python:3.11-slim`)
+- **file_name**: The script filename mounted at `/scripts/` (e.g., `main.py`)
+- **command**: The command to execute (e.g., `["python", "/scripts/main.py"]`)
+
+## Adding a New Language
+
+To add support for a new programming language:
+
+1. Add an entry to `LANGUAGE_SPECS` with versions, image template, file extension, and interpreter
+2. Add an example script to `EXAMPLE_SCRIPTS` demonstrating version-specific features
+3. The `_make_runtime_configs()` function automatically generates runtime configs
+
+For example, to add Rust support:
+
+```python
+"rust": {
+ "versions": ["1.75", "1.76", "1.77"],
+ "image_tpl": "rust:{version}-slim",
+ "file_ext": "rs",
+    # "&&" is a shell operator, so a compiled language needs a shell wrapper here
+    "interpreter": ["sh", "-c", "rustc -o /tmp/main {file} && /tmp/main"],
+}
+```
+
+The image template uses `{version}` as a placeholder, which gets replaced with each version number when generating the
+registry.
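+
+A sketch of that expansion (the actual logic is `_make_runtime_configs()`; this just illustrates the template
+substitution):
+
+```python
+# Expand a language spec into per-version runtime configs.
+spec = {
+    "versions": ["3.11", "3.12"],
+    "image_tpl": "python:{version}-slim",
+    "file_ext": "py",
+    "interpreter": ["python"],
+}
+
+for version in spec["versions"]:
+    image = spec["image_tpl"].format(version=version)
+    file_name = f"main.{spec['file_ext']}"
+    command = [*spec["interpreter"], f"/scripts/{file_name}"]
+    print(version, image, command)
+    # 3.11 python:3.11-slim ['python', '/scripts/main.py']
+```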
+
+## Example Scripts
+
+Each language includes an example script that demonstrates both universal features and version-specific syntax. These
+scripts are shown in the frontend editor as templates:
+
+```python
+--8<-- "backend/app/runtime_registry.py:52:78"
+```
+
+The example scripts intentionally use features that may not work on older versions, helping users understand version
+compatibility: for instance, Python's match statement (3.10+), Node's `Promise.withResolvers()` (22+), and Go's
+`clear()` function (1.21+).
+
+## API Endpoint
+
+The `/api/v1/languages` endpoint returns the available runtimes:
+
+```json
+{
+ "python": {"versions": ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"], "file_ext": "py"},
+ "node": {"versions": ["18", "20", "22"], "file_ext": "js"},
+ "ruby": {"versions": ["3.1", "3.2", "3.3"], "file_ext": "rb"},
+ "go": {"versions": ["1.20", "1.21", "1.22"], "file_ext": "go"},
+ "bash": {"versions": ["5.1", "5.2", "5.3"], "file_ext": "sh"}
+}
+```
+
+## Key Files
+
+| File | Purpose |
+|----------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|
+| [`runtime_registry.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/runtime_registry.py) | Language specifications and runtime config generation |
+| [`api/routes/languages.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/languages.py) | API endpoint for available languages |
diff --git a/docs/architecture/user-settings-events.md b/docs/architecture/user-settings-events.md
index 3b319f25..53ff8dbb 100644
--- a/docs/architecture/user-settings-events.md
+++ b/docs/architecture/user-settings-events.md
@@ -1,147 +1,75 @@
# User settings events
-This document explains how user settings are stored, updated, and reconstructed using event sourcing with a unified event type and TypeAdapter-based merging.
+This document explains how user settings are stored, updated, and reconstructed using event sourcing with a unified
+event type and TypeAdapter-based merging.
## Unified event approach
-All user settings changes emit a single `USER_SETTINGS_UPDATED` event type. There are no specialized events for theme, notifications, or editor settings. This eliminates branching in both publishing and consuming code.
+All user settings changes emit a single `USER_SETTINGS_UPDATED` event type. There are no specialized events for theme,
+notifications, or editor settings. This eliminates branching in both publishing and consuming code.
```python
-class UserSettingsUpdatedEvent(BaseEvent):
- event_type: Literal[EventType.USER_SETTINGS_UPDATED] = EventType.USER_SETTINGS_UPDATED
- topic: ClassVar[KafkaTopic] = KafkaTopic.USER_SETTINGS_EVENTS
- user_id: str
- changed_fields: list[str]
- changes: dict[str, str | int | bool | list | dict | None]
- reason: str | None = None
+--8<-- "backend/app/infrastructure/kafka/events/user.py:72:86"
```
-The `changed_fields` list identifies which settings changed. The `changes` dict contains the new values in JSON-serializable form (enums as strings, nested objects as dicts).
-
-## Event payload structure
-
-When updating settings, the service publishes:
-
-```python
-payload = {
- "user_id": "user_123",
- "changed_fields": ["theme", "notifications"],
- "changes": {
- "theme": "dark",
- "notifications": {
- "execution_completed": True,
- "channels": ["email", "in_app"]
- }
- },
- "reason": "User updated preferences"
-}
-```
-
-No old values are tracked. If needed, previous state can be reconstructed by replaying events up to a specific timestamp.
+The `changed_fields` list identifies which settings changed. Typed fields (`theme`, `notifications`, `editor`, etc.)
+contain the new values in Avro-compatible form.
## TypeAdapter pattern
The service uses Pydantic's `TypeAdapter` for dict-based operations without reflection or branching:
```python
-from pydantic import TypeAdapter
-
-_settings_adapter = TypeAdapter(DomainUserSettings)
-_update_adapter = TypeAdapter(DomainUserSettingsUpdate)
+--8<-- "backend/app/services/user_settings_service.py:22:24"
```
### Updating settings
-```python
-async def update_user_settings(self, user_id: str, updates: DomainUserSettingsUpdate) -> DomainUserSettings:
- current = await self.get_user_settings(user_id)
-
- # Get only fields that were explicitly set
- changes = _update_adapter.dump_python(updates, exclude_none=True)
- if not changes:
- return current
-
- # Merge via dict unpacking
- current_dict = _settings_adapter.dump_python(current)
- merged = {**current_dict, **changes}
- merged["version"] = (current.version or 0) + 1
- merged["updated_at"] = datetime.now(timezone.utc)
-
- # Reconstruct with nested dataclass conversion
- new_settings = _settings_adapter.validate_python(merged)
-
- # Publish with JSON-serializable payload
- changes_json = _update_adapter.dump_python(updates, exclude_none=True, mode="json")
- await self._publish_settings_event(user_id, changes_json, reason)
+The `update_user_settings` method merges changes into current settings, publishes an event, and manages snapshots:
- return new_settings
+```python
+--8<-- "backend/app/services/user_settings_service.py:88:120"
```
### Applying events
-```python
-def _apply_event(self, settings: DomainUserSettings, event: DomainSettingsEvent) -> DomainUserSettings:
- changes = event.payload.get("changes", {})
- if not changes:
- return settings
-
- current_dict = _settings_adapter.dump_python(settings)
- merged = {**current_dict, **changes}
- merged["updated_at"] = event.timestamp
+When reconstructing settings from events, `_apply_event` merges each event's changes:
- return _settings_adapter.validate_python(merged)
+```python
+--8<-- "backend/app/services/user_settings_service.py:243:254"
```
-The `validate_python` call handles nested dict-to-dataclass conversion, enum parsing, and type coercion automatically. See [Pydantic Dataclasses](pydantic-dataclasses.md) for details.
+The `validate_python` call handles nested dict-to-dataclass conversion, enum parsing, and type coercion automatically.
+See [Pydantic Dataclasses](pydantic-dataclasses.md) for details.
## Settings reconstruction
-User settings are rebuilt from a snapshot plus events:
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│ get_user_settings(user_id) │
-├─────────────────────────────────────────────────────────────┤
-│ 1. Check cache → return if hit │
-│ 2. Load snapshot from DB (if exists) │
-│ 3. Query events since snapshot.updated_at │
-│ 4. Apply each event via _apply_event() │
-│ 5. Cache result, return │
-└─────────────────────────────────────────────────────────────┘
-```
-
-Snapshots are created automatically when event count exceeds threshold:
-
-```python
-if (await self.repository.count_events_since_snapshot(user_id)) >= 10:
- await self.repository.create_snapshot(new_settings)
+```mermaid
+graph TB
+ A[get_user_settings] --> B{Cache hit?}
+ B -->|Yes| C[Return cached]
+ B -->|No| D[Load snapshot]
+ D --> E[Query events since snapshot]
+ E --> F[Apply each event]
+ F --> G[Cache result]
+ G --> C
```
-This bounds reconstruction cost while preserving full event history for auditing.
+Snapshots are created automatically once the number of events since the last snapshot reaches a threshold (10 events).
+This bounds reconstruction cost while preserving full event history for auditing.
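+
+A sketch of the snapshot policy, assuming a repository exposing the two methods previously documented for this service
+(`count_events_since_snapshot`, `create_snapshot`):
+
+```python
+# Sketch of the snapshot policy; repository method names follow the earlier
+# documented interface and may differ slightly from the current code.
+SNAPSHOT_THRESHOLD = 10
+
+
+async def maybe_snapshot(repository, user_id: str, new_settings) -> None:
+    # Snapshot once enough events have accumulated since the last one, so
+    # reconstruction never replays more than roughly SNAPSHOT_THRESHOLD events.
+    if await repository.count_events_since_snapshot(user_id) >= SNAPSHOT_THRESHOLD:
+        await repository.create_snapshot(new_settings)
+```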
## Cache layer
Settings are cached with TTL to avoid repeated reconstruction:
```python
-self._cache: TTLCache[str, DomainUserSettings] = TTLCache(
- maxsize=1000,
- ttl=timedelta(minutes=5).total_seconds(),
-)
+--8<-- "backend/app/services/user_settings_service.py:34:40"
```
Cache invalidation happens via event bus subscription:
```python
-async def initialize(self, event_bus_manager: EventBusManager) -> None:
- bus = await event_bus_manager.get_event_bus()
-
- async def _handle(evt: EventBusEvent) -> None:
- uid = evt.payload.get("user_id")
- if uid:
- await self.invalidate_cache(str(uid))
-
- await bus.subscribe("user.settings.updated*", _handle)
+--8<-- "backend/app/services/user_settings_service.py:58:68"
```
After each update, the service publishes to the event bus, triggering cache invalidation across instances.
@@ -151,29 +79,14 @@ After each update, the service publishes to the event bus, triggering cache inva
The `get_settings_history` method returns a list of changes extracted from events:
```python
-async def get_settings_history(self, user_id: str, limit: int = 50) -> List[DomainSettingsHistoryEntry]:
- events = await self._get_settings_events(user_id, limit=limit)
- history = []
- for event in events:
- changed_fields = event.payload.get("changed_fields", [])
- changes = event.payload.get("changes", {})
- for field in changed_fields:
- history.append(
- DomainSettingsHistoryEntry(
- timestamp=event.timestamp,
- event_type=event.event_type,
- field=f"/{field}",
- new_value=changes.get(field),
- reason=event.payload.get("reason"),
- )
- )
- return history
+--8<-- "backend/app/services/user_settings_service.py:171:189"
```
## Key files
-- `services/user_settings_service.py` — settings service with caching and event sourcing
-- `domain/user/settings_models.py` — `DomainUserSettings`, `DomainUserSettingsUpdate` dataclasses
-- `infrastructure/kafka/events/user.py` — `UserSettingsUpdatedEvent` definition
-- `db/repositories/user_settings_repository.py` — snapshot and event queries
-- `domain/enums/events.py` — `EventType.USER_SETTINGS_UPDATED`
+| File | Purpose |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
+| [`services/user_settings_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/user_settings_service.py) | Settings service with caching and event sourcing |
+| [`domain/user/settings_models.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/domain/user/settings_models.py) | `DomainUserSettings`, `DomainUserSettingsUpdate` dataclasses |
+| [`infrastructure/kafka/events/user.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/infrastructure/kafka/events/user.py) | `UserSettingsUpdatedEvent` definition |
+| [`db/repositories/user_settings_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/user_settings_repository.py) | Snapshot and event queries |
diff --git a/docs/components/saved-scripts.md b/docs/components/saved-scripts.md
new file mode 100644
index 00000000..80b2c785
--- /dev/null
+++ b/docs/components/saved-scripts.md
@@ -0,0 +1,53 @@
+# Saved Scripts
+
+Users can save scripts to their account for later reuse. Scripts are stored in MongoDB and associated with the user who
+created them. Each script includes the code, language, version, and an optional description.
+
+## Data Model
+
+Each saved script contains:
+
+```python
+--8<-- "backend/app/schemas_pydantic/saved_script.py:45:55"
+```
+
+Scripts are scoped to individual users—a user can only access their own saved scripts.
+
+## API Endpoints
+
+The saved-scripts router exposes create, list, retrieve, update, and delete endpoints, all scoped to the authenticated
+user.
+
+## Service Layer
+
+The `SavedScriptService` handles business logic with comprehensive logging:
+
+```python
+--8<-- "backend/app/services/saved_script_service.py:17:39"
+```
+
+All operations log the user ID, script ID, and relevant metadata for auditing.
+
+## Storage
+
+Scripts are stored in the `saved_scripts` MongoDB collection with a compound index on `(user_id, script_id)` for
+efficient per-user queries.
+
+The repository enforces user isolation—queries always filter by `user_id` to prevent cross-user access.
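+
+A minimal sketch of what a user-scoped lookup looks like, assuming an async MongoDB driver such as Motor; collection
+and field names follow the conventions above rather than the exact repository code:
+
+```python
+# Illustrative user-scoped lookup (async driver such as Motor assumed).
+async def get_saved_script(db, user_id: str, script_id: str) -> dict | None:
+    # Always filter by user_id so one user can never read another user's script,
+    # even with a guessed script_id.
+    return await db.saved_scripts.find_one({"user_id": user_id, "script_id": script_id})
+```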
+
+## Frontend Integration
+
+The frontend displays saved scripts in a dropdown, allowing users to:
+
+1. Select a saved script to load into the editor
+2. Save the current editor content as a new script
+3. Update an existing script with new content
+4. Delete scripts they no longer need
+
+When loading a script, the frontend sets both the code content and the language/version selectors to match the saved values.
+
+## Key Files
+
+| File | Purpose |
+|------|---------|
+| [`services/saved_script_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/saved_script_service.py) | Business logic |
+| [`api/routes/saved_scripts.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/saved_scripts.py) | API endpoints |
+| [`schemas_pydantic/saved_script.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/schemas_pydantic/saved_script.py) | Request/response models |
+| [`db/repositories/saved_script_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/saved_script_repository.py) | MongoDB operations |
diff --git a/docs/index.md b/docs/index.md
index 67ad0a58..721d932f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -3,82 +3,53 @@
[GitHub :material-github:](https://github.com/HardMax71/Integr8sCode){ .md-button }
[Live Demo :material-play:](https://app.integr8scode.cc/){ .md-button .md-button--primary }
-## Quick start
+Run scripts in Python, Node.js, Ruby, Go, or Bash in isolated Kubernetes pods with real-time output streaming, resource
+limits, and full audit trails.
-### Deployment
+## Quick start
```bash
-# 1. Clone the repo
git clone https://github.com/HardMax71/Integr8sCode.git
cd Integr8sCode
-
-# 2. Start containers
-docker-compose up --build
-```
-
-Access points:
-
-| Service | URL |
-|---------|-----|
-| Frontend | `https://127.0.0.1:5001/` |
-| Backend API | `https://127.0.0.1:443/` |
-| Grafana | `http://127.0.0.1:3000` (admin/admin123) |
-
-### Verify installation
-
-Check if the backend is running:
-
-```bash
-curl -k https://127.0.0.1/api/v1/k8s-limits
+./deploy.sh dev
```
-Example response:
+| Service | URL | Credentials |
+|-------------|-----------------------------|-------------------|
+| Frontend | `https://localhost:5001` | user / user123 |
+| Backend API | `https://localhost:443` | — |
+| Grafana | `http://localhost:3000` | admin / admin123 |
-```json
-{
- "cpu_limit": "1000m",
- "memory_limit": "128Mi",
- "timeout_seconds": 5
-}
-```
-
-### Enable Kubernetes metrics
-
-If CPU and Memory metrics show as `null`, enable the metrics server:
+Verify the backend is running:
```bash
-kubectl create -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
+curl -k https://localhost/api/v1/health/live
```
-Verify with:
+!!! note "Kubernetes metrics"
+ If CPU/memory metrics show as `null`, enable the metrics server:
+ `kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`
-```bash
-kubectl top node
-```
+## Core features
-Example output:
+Every script runs in its own Kubernetes pod with complete isolation. Resource limits are configurable per execution
+(defaults: 1000m CPU, 128Mi memory, 300s timeout).
-```
-NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
-docker-desktop 267m 3% 4732Mi 29%
-```
+The platform supports multiple languages and versions:
-## Core features
+| Language | Versions |
+|----------|--------------------------------|
+| Python | 3.7, 3.8, 3.9, 3.10, 3.11, 3.12|
+| Node.js | 18, 20, 22 |
+| Ruby | 3.1, 3.2, 3.3 |
+| Go | 1.20, 1.21, 1.22 |
+| Bash | 5.1, 5.2, 5.3 |
-- Every script runs in its own Kubernetes pod, fully isolated from others
-- Resource limits keep things in check: 1000m CPU, 128Mi memory, configurable timeouts
-- Multiple Python versions supported (3.9, 3.10, 3.11, 3.12)
-- Real-time updates via Server-Sent Events so you see what's happening as it happens
-- Full audit trail through Kafka event streams
-- Failed events get retried automatically via dead letter queue
+Execution output streams in real-time via Server-Sent Events. All events flow through Kafka for full audit trails, with
+automatic retries via dead letter queue for failed processing.
## Architecture
-The platform has three main parts:
-
-- A Svelte frontend that users interact with
-- FastAPI backend handling the heavy lifting, backed by MongoDB, Kafka, and Redis
-- Kubernetes cluster where each script runs in its own pod with Cilium network policies
+Svelte frontend → FastAPI backend (MongoDB, Kafka, Redis) → Kubernetes pods with Cilium network policies.
```mermaid
flowchart TB
@@ -99,10 +70,12 @@ For detailed architecture diagrams, see the [Architecture](architecture/overview
## Security
-- Pods can't make external network calls (Cilium deny-all policy)
-- Everything runs as non-root with dropped capabilities
-- Read-only root filesystem on containers
-- No service account access to Kubernetes API
+| Control | Implementation |
+|--------------------------|---------------------------------------------|
+| Network isolation | Cilium deny-all egress policy |
+| Non-root execution | Dropped capabilities, no privilege escalation|
+| Filesystem | Read-only root filesystem |
+| Kubernetes API | No service account token mounted |
## Documentation
@@ -134,9 +107,9 @@ For detailed architecture diagrams, see the [Architecture](architecture/overview
-## Sample test
+## Sample code
-Verify your installation by running this Python 3.10+ code:
+Try this Python 3.10+ example in the editor:
```python
from typing import TypeGuard
diff --git a/docs/operations/admin-api.md b/docs/operations/admin-api.md
new file mode 100644
index 00000000..e36cc87b
--- /dev/null
+++ b/docs/operations/admin-api.md
@@ -0,0 +1,41 @@
+# Admin API
+
+The admin API provides endpoints for user management, system settings, and event browsing. All endpoints require
+authentication with an admin-role user. The API is organized into three routers: users, settings, and events.
+
+## Authentication
+
+All admin endpoints require the `admin_user` dependency, which validates that the current user has the `admin` role.
+Requests from non-admin users receive a 403 Forbidden response.
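+
+A minimal sketch of what such a dependency looks like in FastAPI. The names below are hypothetical; the project's
+actual `admin_user` dependency builds on the cookie/JWT authentication described in the Authentication doc:
+
+```python
+# Hypothetical sketch of an admin-only dependency; not the project's actual code.
+from dataclasses import dataclass
+
+from fastapi import Depends, HTTPException, status
+
+
+@dataclass
+class User:
+    username: str
+    role: str
+
+
+async def get_current_user() -> User:
+    # Stand-in for the real JWT-cookie authentication dependency.
+    ...
+
+
+async def admin_user(user: User = Depends(get_current_user)) -> User:
+    if user.role != "admin":
+        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Admin access required")
+    return user
+```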
+
+See [Authentication](../architecture/authentication.md) for details on JWT tokens, CSRF protection, and login flow.
+
+## User Management
+
+The `/api/v1/admin/users` router provides full CRUD operations for user accounts, including listing with
+pagination/filtering, creating users, updating profiles, resetting passwords, and managing per-user rate limits.
+
+
+
+## System Settings
+
+The `/api/v1/admin/settings` router manages global system configuration including execution limits, security settings,
+and monitoring parameters.
+
+
+
+## Event Management
+
+The `/api/v1/admin/events` router provides event browsing, export, and replay capabilities. Events can be filtered by
+type, time range, user, or correlation ID. Export supports CSV and JSON formats.
+
+
+
+## Key Files
+
+| File | Purpose |
+|--------------------------------------------------------------------------------------------------------------------------------|---------------------------|
+| [`api/routes/admin/users.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/admin/users.py) | User management endpoints |
+| [`api/routes/admin/settings.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/admin/settings.py) | System settings endpoints |
+| [`api/routes/admin/events.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/admin/events.py) | Event browsing and replay |
+| [`services/admin/`](https://github.com/HardMax71/Integr8sCode/tree/main/backend/app/services/admin) | Admin service layer |
diff --git a/docs/operations/deployment.md b/docs/operations/deployment.md
index 82890546..e326e618 100644
--- a/docs/operations/deployment.md
+++ b/docs/operations/deployment.md
@@ -4,21 +4,35 @@ Integr8sCode supports two deployment modes: local development using Docker Compo
Kubernetes using Helm. Both modes share the same container images and configuration patterns, so what works locally
translates directly to production.
+## Architecture
+
+```mermaid
+flowchart TB
+ Script[deploy.sh] --> Local & Prod
+
+ subgraph Images["Container Images"]
+ Base[Dockerfile.base] --> Backend[Backend Image]
+ Base --> Workers[Worker Images]
+ end
+
+ subgraph Local["Local Development"]
+ DC[docker-compose.yaml] --> Containers
+ end
+
+ subgraph Prod["Kubernetes"]
+ Helm[Helm Chart] --> Pods
+ end
+
+ Images --> Containers
+ Images --> Pods
+```
+
## Deployment script
-The unified `deploy.sh` script in the repository root handles both modes. Running it without arguments shows available
-commands.
+The unified [`deploy.sh`](https://github.com/HardMax71/Integr8sCode/blob/main/deploy.sh) script handles both modes:
```bash
-./deploy.sh dev # Start local development stack
-./deploy.sh dev --build # Rebuild images and start
-./deploy.sh down # Stop local stack
-./deploy.sh check # Run quality checks (ruff, mypy, bandit)
-./deploy.sh test # Run full test suite with coverage
-./deploy.sh prod # Deploy to Kubernetes with Helm
-./deploy.sh prod --dry-run # Validate Helm templates without applying
-./deploy.sh status # Show running services
-./deploy.sh logs backend # Tail logs for a specific service
+--8<-- "deploy.sh:6:18"
```
The script abstracts away the differences between environments. For local development it orchestrates Docker Compose,
@@ -59,43 +73,42 @@ trigger Uvicorn to restart automatically. The frontend runs its own dev server w
### Docker build strategy
-The backend uses a multi-stage build with a shared base image to keep startup fast. All Python dependencies are
-installed at build time, so containers start in seconds rather than waiting for package downloads.
+The backend uses a multi-stage build with a shared base image to keep startup fast:
-```
-Dockerfile.base Dockerfile (backend) Dockerfile.* (workers)
-┌──────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
-│ python:3.12-slim │ │ FROM base │ │ FROM base │
-│ system deps │────▶│ COPY app, workers │ │ COPY app, workers │
-│ uv sync --locked │ │ entrypoint.sh │ │ CMD ["python", ...] │
-│ ENV PATH=.venv │ └──────────────────────┘ └──────────────────────┘
-└──────────────────┘
+```mermaid
+flowchart LR
+ subgraph Base["Dockerfile.base"]
+ B1[python:3.12-slim]
+ B2[system deps]
+ B3[uv sync --locked]
+ end
+
+ subgraph Services["Service Images"]
+ S1[Backend]
+ S2[Workers]
+ end
+
+ Base --> S1 & S2
```
-The base image (`Dockerfile.base`) installs all production dependencies using `uv sync --locked --no-dev`. The
-`--locked` flag ensures the lockfile is respected exactly, and `--no-dev` skips development tools like ruff and mypy
-that aren't needed at runtime. The key optimization is setting `PATH="/app/.venv/bin:$PATH"` so Python and all
-installed packages are available directly without needing `uv run` at startup.
+The base image installs all production dependencies:
-Each service image extends the base and copies only application code. Since dependencies rarely change compared to
-code, Docker's layer caching means most builds only rebuild the thin application layer. First builds take longer
-because they install all packages, but subsequent builds are fast.
+```dockerfile
+--8<-- "backend/Dockerfile.base"
+```
+
+Each service image extends the base and copies only application code. Since dependencies rarely change, Docker's layer
+caching means most builds only rebuild the thin application layer.
-For local development, the compose file mounts source directories into the container:
+For local development, the compose file mounts source directories:
```yaml
-volumes:
- - ./backend/app:/app/app
- - ./backend/workers:/app/workers
- - ./backend/scripts:/app/scripts
+--8<-- "docker-compose.yaml:95:99"
```
-This selective mounting preserves the container's `.venv` directory (with all installed packages) while allowing live
-code changes. The mounted directories overlay the baked-in copies, so edits take effect immediately. Gunicorn watches
-for file changes and reloads workers automatically.
-
-The design means `git clone` followed by `docker compose up` just works. No local Python environment needed, no named
-volumes for caching, no waiting for package downloads. Dependencies live in the image, code comes from the mount.
+This preserves the container's `.venv` while allowing live code changes. Gunicorn watches for file changes and reloads
+automatically. The design means `git clone` followed by `docker compose up` just works—no local Python environment
+needed.
To stop everything and clean up volumes:
@@ -114,7 +127,8 @@ The `test` command runs the full integration and unit test suite:
This builds images, starts services, waits for the backend health endpoint using curl's built-in retry mechanism, runs
pytest with coverage reporting, then tears down the stack. The curl retry approach is cleaner than shell loops and
-avoids issues with Docker Compose's `--wait` flag (which fails on init containers that exit after completion). Key services define healthchecks in `docker-compose.yaml`:
+avoids issues with Docker Compose's `--wait` flag (which fails on init containers that exit after completion). Key
+services define healthchecks in `docker-compose.yaml`:
| Service | Healthcheck |
|-----------------|-----------------------------------------------|
@@ -141,32 +155,22 @@ registry and update the image references in your values file.
### Chart structure
-The Helm chart organizes templates by function.
+The Helm chart organizes templates by function:
-```
-helm/integr8scode/
-├── Chart.yaml # Chart metadata and dependencies
-├── values.yaml # Default configuration
-├── values-prod.yaml # Production overrides
-├── templates/
-│ ├── _helpers.tpl # Template functions
-│ ├── NOTES.txt # Post-install message
-│ ├── namespace.yaml
-│ ├── rbac/ # ServiceAccount, Role, RoleBinding
-│ ├── secrets/ # Kubeconfig and Kafka JAAS
-│ ├── configmaps/ # Environment variables
-│ ├── infrastructure/ # Zookeeper, Kafka, Schema Registry, Jaeger
-│ ├── app/ # Backend and Frontend deployments
-│ ├── workers/ # All seven worker deployments
-│ └── jobs/ # Kafka topic init and user seed
-└── charts/ # Downloaded sub-charts (Redis, MongoDB)
-```
+| Directory | Contents |
+|-----------------------------|-------------------------------------------|
+| `templates/rbac/` | ServiceAccount, Role, RoleBinding |
+| `templates/secrets/` | Kubeconfig and Kafka JAAS |
+| `templates/configmaps/` | Environment variables |
+| `templates/infrastructure/` | Zookeeper, Kafka, Schema Registry, Jaeger |
+| `templates/app/` | Backend and Frontend deployments |
+| `templates/workers/` | All seven worker deployments |
+| `templates/jobs/` | Kafka topic init and user seed |
+| `charts/` | Bitnami sub-charts (Redis, MongoDB) |
-The chart uses Bitnami sub-charts for Redis and MongoDB since they handle persistence, health checks, and configuration
-well. Kafka uses custom templates instead of the Bitnami chart because Confluent images require a specific workaround
-for Kubernetes environment variables. Kubernetes automatically creates environment variables like
-`KAFKA_PORT=tcp://10.0.0.1:29092` for services, which conflicts with Confluent's expectation of a numeric port. The
-templates include an `unset KAFKA_PORT` command in the container startup to avoid this collision.
+The chart uses Bitnami sub-charts for Redis and MongoDB. Kafka uses custom templates because Confluent images require
+unsetting Kubernetes auto-generated environment variables like `KAFKA_PORT=tcp://...` that conflict with expected
+numeric values.
### Running a deployment
@@ -204,8 +208,11 @@ the cluster.
### Configuration
-The `values.yaml` file contains all configurable options with comments explaining each setting. Key sections include
-global settings, image references, resource limits, and infrastructure configuration.
+The `values.yaml` file contains all configurable options. Key sections:
+
+```yaml
+--8<-- "helm/integr8scode/values.yaml:10:19"
+```
Environment variables shared across services live in the `env` section and get rendered into a ConfigMap.
Service-specific overrides go in their respective sections. For example, to increase backend replicas and memory:
@@ -306,68 +313,29 @@ your storage class's reclaim policy.
## Troubleshooting
-A few issues come up regularly during deployment.
-
-### Kafka topic errors
+| Issue | Cause | Solution |
+|-----------------------|-----------------------------------|---------------------------------------------------|
+| Unknown topic errors | kafka-init failed or wrong prefix | Check `kubectl logs job/integr8scode-kafka-init` |
+| Confluent port errors | K8s auto-generated `KAFKA_PORT` | Ensure `unset KAFKA_PORT` in container startup |
+| ImagePullBackOff | Images not in cluster | Use ghcr.io images or import with K3s |
+| MongoDB auth errors | Password mismatch | Verify secret matches `values-prod.yaml` |
+| OOMKilled workers | Resource limits too low | Increase `workers.common.resources.limits.memory` |
-If workers log errors about unknown topics, the kafka-init job may have failed or topics were created without the
-expected prefix. Check the job logs and verify topics exist with the correct names.
+### Kafka topic debugging
```bash
kubectl logs -n integr8scode job/integr8scode-kafka-init
kubectl exec -n integr8scode integr8scode-kafka-0 -- kafka-topics --list --bootstrap-server localhost:29092
```
-Topics should be prefixed (e.g., `prefexecution_events` not `execution_events`). If they're missing the prefix, the
-`KAFKA_TOPIC_PREFIX` setting wasn't applied during topic creation.
+Topics should be prefixed (e.g., `prefexecution_events` not `execution_events`).
-### Port conflicts with Confluent images
-
-Confluent containers may fail to start with errors about invalid port formats. This happens when Kubernetes environment
-variables like `KAFKA_PORT=tcp://...` override the expected numeric values. The chart templates include
-`unset KAFKA_PORT` and similar commands, but if you're customizing the deployment, ensure these remain in place.
-
-### Image pull failures
-
-If pods stay in ImagePullBackOff, the images aren't available to the cluster. For K3s, the deploy script imports images
-automatically. For other distributions, use the pre-built images from GitHub Container Registry (see below) or push to
-your own registry and update `values.yaml` with the correct repository and tag.
-
-```yaml
-images:
- backend:
- repository: your-registry.com/integr8scode-backend
- tag: v1.0.0
-global:
- imagePullPolicy: Always
-```
-
-### MongoDB authentication
-
-When `mongodb.auth.enabled` is true (the default in values-prod.yaml), all connections must authenticate. The chart
-constructs the MongoDB URL with credentials from the values file. If you're seeing authentication errors, verify the
-password is set correctly and matches what MongoDB was initialized with.
+### MongoDB password verification
```bash
kubectl get secret -n integr8scode integr8scode-mongodb -o jsonpath='{.data.mongodb-root-password}' | base64 -d
```
-### Resource constraints
-
-Workers may get OOMKilled or throttled if resource limits are too low for your workload. The default values are
-conservative to work on small clusters. For production, increase limits based on observed usage.
-
-```yaml
-workers:
- common:
- resources:
- limits:
- memory: "1Gi"
- cpu: "1000m"
-```
-
-Monitor resource usage with kubectl top or your cluster's metrics solution to right-size the limits.
-
## Pre-built images
For production deployments, you can skip the local build step entirely by using pre-built images from GitHub Container
@@ -419,5 +387,15 @@ The `--local` flag forces a local build even when using `values-prod.yaml`.
The GitHub Actions workflow in `.github/workflows/docker.yml` handles image building and publishing. On every push to
main, it builds the base, backend, and frontend images, scans them with Trivy for vulnerabilities, and pushes to
-ghcr.io.
-Pull requests build and scan but don't push, ensuring only tested code reaches the registry.
+ghcr.io. Pull requests build and scan but don't push, ensuring only tested code reaches the registry.
+
+## Key files
+
+| File | Purpose |
+|--------------------------------------------------------------------------------------------------------------------------------|-----------------------------|
+| [`deploy.sh`](https://github.com/HardMax71/Integr8sCode/blob/main/deploy.sh) | Unified deployment script |
+| [`docker-compose.yaml`](https://github.com/HardMax71/Integr8sCode/blob/main/docker-compose.yaml) | Local development stack |
+| [`backend/Dockerfile.base`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/Dockerfile.base) | Shared base image with deps |
+| [`helm/integr8scode/values.yaml`](https://github.com/HardMax71/Integr8sCode/blob/main/helm/integr8scode/values.yaml) | Default Helm configuration |
+| [`helm/integr8scode/values-prod.yaml`](https://github.com/HardMax71/Integr8sCode/blob/main/helm/integr8scode/values-prod.yaml) | Production overrides |
+| [`.github/workflows/docker.yml`](https://github.com/HardMax71/Integr8sCode/blob/main/.github/workflows/docker.yml) | CI/CD image build pipeline |
diff --git a/docs/operations/grafana-integration.md b/docs/operations/grafana-integration.md
new file mode 100644
index 00000000..93c0637f
--- /dev/null
+++ b/docs/operations/grafana-integration.md
@@ -0,0 +1,120 @@
+# Grafana Integration
+
+The platform accepts Grafana alert webhooks and converts them into in-app notifications. This allows operators to
+receive Grafana alerts directly in the application UI without leaving the platform.
+
+## Webhook Endpoint
+
+Configure Grafana to send webhooks to `POST /api/v1/alerts/grafana`. A test endpoint is available to verify connectivity.
+
+
+
+## Webhook Payload
+
+The endpoint expects Grafana's standard webhook format:
+
+```python
+--8<-- "backend/app/schemas_pydantic/grafana.py:8:22"
+```
+
+Example payload:
+
+```json
+{
+ "status": "firing",
+ "receiver": "integr8scode",
+ "alerts": [
+ {
+ "status": "firing",
+ "labels": {
+ "alertname": "HighMemoryUsage",
+ "severity": "warning",
+ "instance": "backend:8000"
+ },
+ "annotations": {
+ "summary": "Memory usage above 80%",
+ "description": "Backend instance memory usage is 85%"
+ }
+ }
+ ],
+ "commonLabels": {
+ "env": "production"
+ }
+}
+```
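+
+To verify connectivity, a payload like the one above can be posted directly to the endpoint. This is a hedged sketch
+using `requests`; adjust the URL, TLS verification, and any auth headers to your deployment:
+
+```python
+# Post a sample firing alert to the webhook endpoint to verify the wiring.
+import requests
+
+payload = {
+    "status": "firing",
+    "receiver": "integr8scode",
+    "alerts": [
+        {
+            "status": "firing",
+            "labels": {"alertname": "HighMemoryUsage", "severity": "warning"},
+            "annotations": {"summary": "Memory usage above 80%"},
+        }
+    ],
+    "commonLabels": {"env": "production"},
+}
+
+resp = requests.post(
+    "https://localhost/api/v1/alerts/grafana",  # adjust to your backend URL
+    json=payload,
+    verify=False,  # local dev uses self-signed certificates
+)
+print(resp.status_code, resp.json())
+```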
+
+## Severity Mapping
+
+Grafana severity labels are mapped to notification severity levels:
+
+```python
+--8<-- "backend/app/services/grafana_alert_processor.py:14:19"
+```
+
+| Grafana Severity | Notification Severity |
+|------------------|-----------------------|
+| `critical` | HIGH |
+| `error` | HIGH |
+| `warning` | MEDIUM |
+| `info` | LOW |
+
+Resolved alerts (status `ok` or `resolved`) are always mapped to LOW severity regardless of the original severity label.
+
+## Processing Flow
+
+The `GrafanaAlertProcessor` processes each alert in the webhook:
+
+1. Extract severity from alert labels or common labels
+2. Map severity to notification level
+3. Extract title from `alertname` label or `title` annotation
+4. Build message from `summary` and `description` annotations
+5. Create system notification with metadata
+
+```python
+--8<-- "backend/app/services/grafana_alert_processor.py:73:102"
+```
+
+## Notification Content
+
+The processor builds notification content as follows:
+
+- **Title**: `labels.alertname` or `annotations.title` or "Grafana Alert"
+- **Message**: `annotations.summary` and `annotations.description` joined by newlines
+- **Tags**: `["external_alert", "grafana", "entity:external_alert"]`
+- **Metadata**: Alert labels, common labels, and status
+
+## Response Format
+
+The endpoint returns processing status:
+
+```json
+{
+ "message": "Webhook received and processed",
+ "alerts_received": 3,
+ "alerts_processed": 3,
+ "errors": []
+}
+```
+
+If any alerts fail to process, the error messages are included in the `errors` array but the endpoint still returns 200
+for successfully processed alerts.
+
+## Grafana Configuration
+
+To configure Grafana to send alerts:
+
+1. Navigate to **Alerting > Contact points**
+2. Create a new contact point with type **Webhook**
+3. Set URL to `https://your-domain/api/v1/alerts/grafana`
+4. For authenticated environments, configure appropriate headers
+
+The webhook URL should be accessible from your Grafana instance. If using network policies, ensure Grafana can reach the
+backend service.
+
+## Key Files
+
+| File | Purpose |
+|----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
+| [`services/grafana_alert_processor.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/grafana_alert_processor.py) | Alert processing logic |
+| [`api/routes/grafana_alerts.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/grafana_alerts.py) | Webhook endpoint |
+| [`schemas_pydantic/grafana.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/schemas_pydantic/grafana.py) | Request/response models |
diff --git a/docs/operations/logging.md b/docs/operations/logging.md
index b1c652d4..64470f3a 100644
--- a/docs/operations/logging.md
+++ b/docs/operations/logging.md
@@ -1,12 +1,38 @@
# Logging
-This backend uses structured JSON logging with automatic correlation IDs, trace context injection, and sensitive data sanitization. The goal is logs that are both secure against injection attacks and easy to query in aggregation systems like Elasticsearch or Loki.
+This backend uses structured JSON logging with automatic correlation IDs, trace context injection, and sensitive data
+sanitization. The goal is logs that are both secure against injection attacks and easy to query in aggregation systems
+like Elasticsearch or Loki.
+
+## Architecture
+
+```mermaid
+flowchart LR
+ Code[Application Code] --> Logger
+ Logger --> CF[CorrelationFilter]
+ CF --> TF[TracingFilter]
+ TF --> JF[JSONFormatter]
+ JF --> Output[JSON stdout]
+```
## How it's wired
-The logger is created once during application startup via dependency injection. The `setup_logger` function in `app/core/logging.py` configures a JSON formatter and attaches filters for correlation IDs and OpenTelemetry trace context. Every log line comes out as a JSON object with timestamp, level, logger name, message, and whatever structured fields you added. Workers and background services use the same setup, so log format is consistent across the entire system.
+The logger is created once during application startup via dependency injection. The
+[`setup_logger`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/logging.py) function configures a
+JSON formatter and attaches filters for correlation IDs and OpenTelemetry trace context:
+
+```python
+--8<-- "backend/app/core/logging.py:110:147"
+```
-The JSON formatter does two things beyond basic formatting. First, it injects context that would be tedious to pass manually - the correlation ID from the current request, the trace and span IDs from OpenTelemetry, and request metadata like method and path. Second, it sanitizes sensitive data by pattern-matching things like API keys, JWT tokens, and database URLs, replacing them with redaction placeholders. This sanitization applies to both the log message and exception tracebacks.
+The JSON formatter does two things beyond basic formatting. First, it injects context that would be tedious to pass
+manually—the correlation ID from the current request, the trace and span IDs from OpenTelemetry, and request metadata
+like method and path. Second, it sanitizes sensitive data by pattern-matching things like API keys, JWT tokens, and
+database URLs:
+
+```python
+--8<-- "backend/app/core/logging.py:35:59"
+```
## Structured logging
@@ -42,16 +68,43 @@ logger.warning(f"Processing event {event_id}")
The fix is to keep user data out of the message string entirely. When you put it in `extra`, the JSON formatter escapes special characters, and the malicious content becomes a harmless string value rather than a log line injection.
-The codebase treats these as user-controlled and keeps them in `extra`: path parameters like execution_id or saga_id, query parameters, request body fields, Kafka message content, database results derived from user input, and exception messages (which often contain user data).
+The codebase treats these as user-controlled and keeps them in `extra`: path parameters like execution_id or saga_id,
+query parameters, request body fields, Kafka message content, database results derived from user input, and exception
+messages (which often contain user data).
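+
+For example, a handler logging an execution keeps the identifier out of the message string entirely. This is a minimal
+stdlib sketch; the application's logger from `setup_logger` adds the JSON formatting and sanitization:
+
+```python
+# Safe pattern: static message, user-controlled value as a structured field.
+# The JSON formatter escapes the value, so a crafted ID cannot forge log lines.
+import logging
+
+logger = logging.getLogger(__name__)  # stand-in for the DI-provided logger
+
+execution_id = "abc123\nlevel=ERROR forged line"  # imagine this came from a path parameter
+
+logger.warning("Processing execution", extra={"execution_id": execution_id})
+```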
## What gets logged
-Correlation and trace IDs are injected automatically by filters. The correlation ID follows a request through all services - it's set from incoming headers or generated for new requests. The trace and span IDs come from OpenTelemetry and link logs to distributed traces in Jaeger or Tempo. You don't need to pass these explicitly; they appear in every log line from code running in that request context.
+Correlation and trace IDs are injected automatically by filters:
+
+| Field | Source | Purpose |
+|------------------|---------------------------------|-----------------------------------|
+| `correlation_id` | Request header or generated | Track request across services |
+| `trace_id` | OpenTelemetry | Link to distributed traces |
+| `span_id` | OpenTelemetry | Link to specific span |
+| `request_method` | HTTP request | GET, POST, etc. |
+| `request_path` | HTTP request | API endpoint path |
+| `client_host` | HTTP request | Client IP address |
-For domain-specific context, developers add fields to `extra` based on what operation they're logging. An execution service method might include `execution_id`, `user_id`, `language`, and `status`. A replay session logs `session_id`, `replayed_events`, `failed_events`, and `duration_seconds`. A saga operation includes `saga_id` and `user_id`. The pattern is consistent: the message says what happened, `extra` says to what and by whom.
+For domain-specific context, developers add fields to `extra` based on what operation they're logging. The pattern is
+consistent: the message says what happened, `extra` says to what and by whom.
## Practical use
-When something goes wrong, start by filtering logs by correlation_id to see everything that happened during that request. If you need to correlate with traces, use the trace_id to jump to Jaeger. If you're investigating a specific execution or saga, filter by those IDs - they're in the structured fields, not buried in message text.
+When something goes wrong, start by filtering logs by `correlation_id` to see everything that happened during that
+request. If you need to correlate with traces, use the `trace_id` to jump to Jaeger.
+
+| Log Level | Use case |
+|-----------|-------------------------------------------------------------|
+| DEBUG | Detailed diagnostics (noisy, local debugging only) |
+| INFO | Normal operations (started, completed, processed) |
+| WARNING | Recoverable issues |
+| ERROR | Failures requiring attention |
+
+The log level is controlled by the `LOG_LEVEL` environment variable.
+
+## Key files
-The log level is controlled by the `LOG_LEVEL` environment variable. In production it's typically INFO, which captures normal operations (started, completed, processed) and problems (warnings for recoverable issues, errors for failures). DEBUG adds detailed diagnostic info and is usually too noisy for production but useful when investigating specific issues locally.
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------|--------------------------------------|
+| [`core/logging.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/logging.py) | Logger setup, filters, JSON formatter|
+| [`core/correlation.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/correlation.py) | Correlation ID middleware |
diff --git a/docs/operations/metrics-reference.md b/docs/operations/metrics-reference.md
new file mode 100644
index 00000000..ab0f58fe
--- /dev/null
+++ b/docs/operations/metrics-reference.md
@@ -0,0 +1,168 @@
+# Metrics Reference
+
+The platform exports metrics via OpenTelemetry to an OTLP-compatible collector (Jaeger, Prometheus, etc.). Each service
+component has its own metrics class, and all metrics follow a consistent naming pattern: `{domain}.{metric}.{type}`.
+
+## Architecture
+
+Metrics are collected using the OpenTelemetry SDK and exported every 10 seconds to the configured OTLP endpoint:
+
+```python
+--8<-- "backend/app/core/metrics/base.py:13:19"
+```
+
+When `ENABLE_TRACING` is false or no OTLP endpoint is configured, the system uses a no-op meter provider to avoid
+unnecessary overhead.
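+
+For illustration, defining and recording a metric in this naming scheme with the OpenTelemetry Python API looks like
+the following; the project wraps this in per-domain metrics classes under `core/metrics/`:
+
+```python
+# Illustrative use of the OpenTelemetry metrics API with the naming scheme above.
+from opentelemetry import metrics
+
+meter = metrics.get_meter("integr8scode.execution")
+
+executions_total = meter.create_counter(
+    "script.executions.total",
+    description="Total executions",
+)
+execution_duration = meter.create_histogram(
+    "script.execution.duration",
+    unit="s",
+    description="Execution time (seconds)",
+)
+
+executions_total.add(1, {"status": "completed", "lang_and_version": "python-3.12"})
+execution_duration.record(1.42, {"lang_and_version": "python-3.12"})
+```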
+
+## Metric Categories
+
+### Execution Metrics
+
+Track script execution performance and resource usage.
+
+| Metric | Type | Labels | Description |
+|-----------------------------|---------------|--------------------------|------------------------------|
+| `script.executions.total` | Counter | status, lang_and_version | Total executions |
+| `script.execution.duration` | Histogram | lang_and_version | Execution time (seconds) |
+| `script.executions.active` | UpDownCounter | - | Currently running executions |
+| `script.memory.usage` | Histogram | lang_and_version | Memory per execution (MiB) |
+| `script.cpu.utilization` | Histogram | - | CPU usage (millicores) |
+| `script.errors.total` | Counter | error_type | Errors by type |
+| `execution.queue.depth` | UpDownCounter | - | Queued executions |
+| `execution.queue.wait_time` | Histogram | lang_and_version | Queue wait time (seconds) |
+
+### Coordinator Metrics
+
+Track scheduling and resource allocation.
+
+| Metric | Type | Labels | Description |
+|------------------------------------------|---------------|---------------------|---------------------------|
+| `coordinator.processing.time` | Histogram | - | Event processing time |
+| `coordinator.scheduling.duration` | Histogram | - | Scheduling time |
+| `coordinator.executions.active` | UpDownCounter | - | Active managed executions |
+| `coordinator.queue.wait_time` | Histogram | priority, queue | Queue wait by priority |
+| `coordinator.executions.scheduled.total` | Counter | status | Scheduled executions |
+| `coordinator.rate_limited.total` | Counter | limit_type, user_id | Rate limited requests |
+| `coordinator.resource.allocations.total` | Counter | resource_type | Resource allocations |
+| `coordinator.resource.utilization` | UpDownCounter | resource_type | Current utilization |
+| `coordinator.scheduling.decisions.total` | Counter | decision, reason | Scheduling decisions |
+
+### Rate Limit Metrics
+
+Track rate-limiting behavior.
+
+| Metric | Type | Labels | Description |
+|----------------------------------|-----------|------------------------------------|---------------------------|
+| `rate_limit.requests.total` | Counter | authenticated, endpoint, algorithm | Total checks |
+| `rate_limit.allowed.total` | Counter | group, priority, multiplier | Allowed requests |
+| `rate_limit.rejected.total` | Counter | group, priority, multiplier | Rejected requests |
+| `rate_limit.bypass.total` | Counter | endpoint | Bypassed checks |
+| `rate_limit.check.duration` | Histogram | endpoint, authenticated | Check duration (ms) |
+| `rate_limit.redis.duration` | Histogram | operation | Redis operation time (ms) |
+| `rate_limit.remaining` | Histogram | - | Remaining requests |
+| `rate_limit.quota.usage` | Histogram | - | Quota usage (%) |
+| `rate_limit.token_bucket.tokens` | Histogram | endpoint | Current tokens |
+
+### Event Metrics
+
+Track Kafka event processing.
+
+| Metric | Type | Labels | Description |
+|------------------------------|---------------|------------------------|-------------------|
+| `events.produced.total` | Counter | event_type, topic | Events published |
+| `events.consumed.total` | Counter | event_type, topic | Events consumed |
+| `events.processing.duration` | Histogram | event_type | Processing time |
+| `events.errors.total` | Counter | event_type, error_type | Processing errors |
+| `events.lag` | UpDownCounter | topic, partition | Consumer lag |
+
+### Database Metrics
+
+Track MongoDB operations.
+
+| Metric | Type | Labels | Description |
+|-------------------------------|---------------|-----------------------|--------------------|
+| `database.operations.total` | Counter | operation, collection | Total operations |
+| `database.operation.duration` | Histogram | operation, collection | Operation time |
+| `database.errors.total` | Counter | operation, error_type | Database errors |
+| `database.connections.active` | UpDownCounter | - | Active connections |
+
+### Connection Metrics
+
+Track SSE and WebSocket connections.
+
+| Metric | Type | Labels | Description |
+|-----------------------------|---------------|--------|--------------------------|
+| `connections.active` | UpDownCounter | type | Active connections |
+| `connections.total` | Counter | type | Total connections opened |
+| `connections.duration` | Histogram | type | Connection duration |
+| `connections.messages.sent` | Counter | type | Messages sent |
+
+### Health Metrics
+
+Track service health.
+
+| Metric | Type | Labels | Description |
+|------------------------------|---------------|-----------------|--------------------------|
+| `health.checks.total` | Counter | service, status | Health check results |
+| `health.check.duration` | Histogram | service | Check duration |
+| `health.dependencies.status` | UpDownCounter | dependency | Dependency status (1=up) |
+
+### Notification Metrics
+
+Track notification delivery.
+
+| Metric | Type | Labels | Description |
+|-----------------------------------|-----------|---------------|----------------------|
+| `notifications.sent.total` | Counter | type, channel | Notifications sent |
+| `notifications.failed.total` | Counter | type, error | Failed notifications |
+| `notifications.delivery.duration` | Histogram | channel | Delivery time |
+
+### Dead Letter Queue Metrics
+
+Track DLQ operations.
+
+| Metric | Type | Labels | Description |
+|----------------------|---------------|---------------|----------------------|
+| `dlq.messages.total` | Counter | topic, reason | Messages sent to DLQ |
+| `dlq.retries.total` | Counter | topic | Retry attempts |
+| `dlq.size` | UpDownCounter | topic | Current DLQ size |
+
+## Configuration
+
+Metrics are configured via environment variables:
+
+| Variable | Default | Description |
+|-------------------------------|------------------------|-------------------------|
+| `ENABLE_TRACING` | `true` | Enable metrics/tracing |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | - | OTLP collector endpoint |
+| `TRACING_SERVICE_NAME` | `integr8scode-backend` | Service name in traces |
+
+## Prometheus Queries
+
+Example PromQL queries for common dashboards:
+
+```promql
+# Execution success rate (last 5 minutes)
+sum(rate(script_executions_total{status="completed"}[5m])) /
+sum(rate(script_executions_total[5m]))
+
+# P99 execution duration by language
+histogram_quantile(0.99, sum(rate(script_execution_duration_bucket[5m])) by (le, lang_and_version))
+
+# Rate limit rejection rate
+sum(rate(rate_limit_rejected_total[5m])) /
+sum(rate(rate_limit_requests_total[5m]))
+
+# Queue depth trend
+avg_over_time(execution_queue_depth[1h])
+```
+
+## Key Files
+
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|
+| [`core/metrics/base.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/metrics/base.py) | Base metrics class and configuration |
+| [`core/metrics/execution.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/metrics/execution.py) | Execution metrics |
+| [`core/metrics/coordinator.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/metrics/coordinator.py) | Coordinator metrics |
+| [`core/metrics/rate_limit.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/core/metrics/rate_limit.py) | Rate limit metrics |
+| [`core/metrics/`](https://github.com/HardMax71/Integr8sCode/tree/main/backend/app/core/metrics) | All metrics modules |
diff --git a/docs/operations/nginx-configuration.md b/docs/operations/nginx-configuration.md
index ad606668..6ff6edb1 100644
--- a/docs/operations/nginx-configuration.md
+++ b/docs/operations/nginx-configuration.md
@@ -1,7 +1,7 @@
-# Nginx Configuration
+# Nginx configuration
-The frontend uses Nginx as a reverse proxy and static file server. This document explains the configuration in
-`frontend/nginx.conf.template`.
+The frontend uses Nginx as a reverse proxy and static file server. The configuration lives in
+[`frontend/nginx.conf.template`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/nginx.conf.template).
## Architecture
@@ -12,22 +12,15 @@ flowchart LR
Nginx -->|"static files"| Static["Static files"]
```
-Nginx serves two purposes:
+Nginx serves two purposes: static file server for the Svelte frontend build, and reverse proxy for API requests to the
+backend.
-1. **Static file server** for the Svelte frontend build
-2. **Reverse proxy** for API requests to the backend
+## Configuration breakdown
-## Configuration Breakdown
-
-### Server Block
+### Server block
```nginx
-server {
- listen 5001;
- server_name _;
- root /usr/share/nginx/html;
- index index.html;
-}
+--8<-- "frontend/nginx.conf.template:1:6"
```
| Directive | Purpose |
@@ -39,31 +32,16 @@ server {
### Compression
```nginx
-gzip on;
-gzip_vary on;
-gzip_min_length 1024;
-gzip_types text/plain text/css text/xml text/javascript
- application/javascript application/xml+rss
- application/json application/x-font-ttf
- font/opentype image/svg+xml image/x-icon;
+--8<-- "frontend/nginx.conf.template:8:13"
```
Gzip compression reduces bandwidth for text-based assets. Binary files (images, fonts) are excluded as they're already
compressed.
-### API Proxy
+### API proxy
```nginx
-location /api/ {
- proxy_pass https://backend:443;
- proxy_ssl_verify off;
- proxy_set_header Host $host;
- proxy_set_header X-Real-IP $remote_addr;
- proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
- proxy_set_header X-Forwarded-Proto $scheme;
- proxy_pass_request_headers on;
- proxy_set_header Cookie $http_cookie;
-}
+--8<-- "frontend/nginx.conf.template:15:25"
```
| Directive | Purpose |
@@ -79,19 +57,7 @@ location /api/ {
SSE endpoints require special handling to prevent buffering:
```nginx
-location ~ ^/api/v1/events/ {
- proxy_pass https://backend:443;
- proxy_ssl_verify off;
-
- # SSE-specific settings
- proxy_set_header Connection '';
- proxy_http_version 1.1;
- proxy_buffering off;
- proxy_cache off;
- proxy_read_timeout 86400s;
- proxy_send_timeout 86400s;
- proxy_set_header X-Accel-Buffering no;
-}
+--8<-- "frontend/nginx.conf.template:27:54"
```
| Directive | Purpose |
@@ -104,61 +70,23 @@ location ~ ^/api/v1/events/ {
Without these settings, SSE events would be buffered and delivered in batches instead of real-time.
-### Static Asset Caching
+### Static asset caching
```nginx
-# Immutable assets (hashed filenames)
-location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
- expires 1y;
- add_header Cache-Control "public, immutable";
-}
-
-# Build directory
-location /build/ {
- expires 1y;
- add_header Cache-Control "public, max-age=31536000, immutable";
-}
-
-# HTML (never cache)
-location ~* \.html$ {
- expires -1;
- add_header Cache-Control "no-store, no-cache, must-revalidate";
-}
+--8<-- "frontend/nginx.conf.template:57:75"
```
Svelte build outputs hashed filenames (`app.abc123.js`), making them safe to cache indefinitely. HTML files must never
be cached to ensure users get the latest asset references.
-### Security Headers
+### Security headers
```nginx
-location / {
- add_header Content-Security-Policy "...";
- add_header X-Frame-Options "SAMEORIGIN";
- add_header X-Content-Type-Options "nosniff";
- add_header Referrer-Policy "strict-origin-when-cross-origin";
- add_header Permissions-Policy "geolocation=(), microphone=(), camera=()";
- try_files $uri $uri/ /index.html;
-}
+--8<-- "frontend/nginx.conf.template:77:84"
```
#### Content Security Policy
-```nginx
-Content-Security-Policy "
- default-src 'self';
- script-src 'self' 'unsafe-inline';
- style-src 'self' 'unsafe-inline';
- img-src 'self' data: blob:;
- font-src 'self' data:;
- object-src 'none';
- base-uri 'self';
- form-action 'self';
- frame-ancestors 'none';
- connect-src 'self';
-"
-```
-
| Directive | Value | Purpose |
|-------------------|--------------------------|-------------------------------------|
| `default-src` | `'self'` | Fallback for unspecified directives |
@@ -172,7 +100,7 @@ Content-Security-Policy "
The `data:` source is required for the Monaco editor's inline SVG icons.
-#### Other Security Headers
+#### Other security headers
| Header | Value | Purpose |
|--------------------------|-----------------------------------|--------------------------------|
@@ -181,25 +109,18 @@ The `data:` source is required for the Monaco editor's inline SVG icons.
| `Referrer-Policy` | `strict-origin-when-cross-origin` | Limit referrer leakage |
| `Permissions-Policy` | Deny geolocation, mic, camera | Disable unused APIs |
-### SPA Routing
+### SPA routing
-```nginx
-try_files $uri $uri/ /index.html;
-```
-
-This directive enables client-side routing. When a URL like `/editor` is requested directly, Nginx serves `index.html`
-and lets the Svelte router handle the path.
+The `try_files $uri $uri/ /index.html` directive enables client-side routing. When a URL like `/editor` is requested
+directly, Nginx serves `index.html` and lets the Svelte router handle the path.
## Deployment
The nginx configuration uses environment variable substitution via the official nginx Docker image's built-in `envsubst`
-feature. The template file is processed at container startup, allowing the same image to work in different environments.
+feature:
```dockerfile
-# frontend/Dockerfile.prod
-FROM nginx:alpine
-COPY --from=builder /app/public /usr/share/nginx/html
-COPY nginx.conf.template /etc/nginx/templates/default.conf.template
+--8<-- "frontend/Dockerfile.prod:12:21"
```
The nginx image automatically processes files in `/etc/nginx/templates/*.template` and outputs the result to
@@ -207,8 +128,6 @@ The nginx image automatically processes files in `/etc/nginx/templates/*.templat
### Environment variables
-The template uses `${VARIABLE_NAME}` syntax for substitution. Currently, only `BACKEND_URL` is templated:
-
| Variable | Purpose | Example |
|---------------|-----------------------------------|-----------------------|
| `BACKEND_URL` | Backend service URL for API proxy | `https://backend:443` |
@@ -227,23 +146,9 @@ kubectl rollout restart deployment/frontend -n integr8scode
## Troubleshooting
-### SSE connections dropping
-
-Check `proxy_read_timeout`. Default is 60s which will close idle SSE connections.
-
-### CSP blocking resources
-
-Check browser console for CSP violation reports. Add the blocked source to the appropriate directive.
-
-### 502 Bad Gateway
-
-Backend service is unreachable. Verify:
-
-```bash
-kubectl get svc backend -n integr8scode
-kubectl logs -n integr8scode deployment/frontend
-```
-
-### Assets not updating
-
-Clear browser cache or add cache-busting query parameters. Verify HTML files have `no-cache` headers.
+| Issue | Cause | Solution |
+|--------------------------|----------------------------------|-------------------------------------------|
+| SSE connections dropping | Default 60s `proxy_read_timeout` | Verify 86400s timeout is set |
+| CSP blocking resources | Missing source in directive | Check browser console, add blocked source |
+| 502 Bad Gateway | Backend unreachable | `kubectl get svc backend -n integr8scode` |
+| Assets not updating | Browser cache | Clear cache or verify `no-cache` on HTML |
diff --git a/docs/operations/notification-types.md b/docs/operations/notification-types.md
index d1dd638d..60b6ff28 100644
--- a/docs/operations/notification-types.md
+++ b/docs/operations/notification-types.md
@@ -1,41 +1,94 @@
# Notifications
-Notifications are producer-driven with minimal core fields. Types and legacy levels have been removed.
+Notifications are producer-driven with minimal core fields. The notification system supports multiple channels (in-app,
+webhook, Slack) with throttling, retries, and user subscription preferences.
-## Core fields
-
-- `subject`: short title
-- `body`: text content
-- `channel`: in_app | webhook | slack
-- `severity`: low | medium | high | urgent
-- `tags`: list of strings, e.g. `['execution','failed']`, `['external_alert','grafana']`
-- `status`: pending | sending | delivered | failed | skipped | read | clicked
+## Architecture
-## Tag conventions
+```mermaid
+flowchart LR
+ Event[Execution Event] --> NS[NotificationService]
+ NS --> DB[(MongoDB)]
+ NS --> SSE[SSE Bus]
+ NS --> Webhook[Webhook]
+ NS --> Slack[Slack]
+ SSE --> Browser
+```
-Producers should include small, structured tags to enable filtering, UI actions, and correlation (replacing old related fields).
+## Core fields
-**Category tags** indicate what the notification is about: `execution` for code executions, `external_alert` and `grafana` for notifications from Grafana Alerting.
+| Field | Description |
+|------------|---------------------------------------------------------------------------|
+| `subject` | Short title |
+| `body` | Text content |
+| `channel` | `in_app`, `webhook`, or `slack` |
+| `severity` | `low`, `medium`, `high`, or `urgent` |
+| `tags` | List of strings for filtering, e.g. `["execution", "failed"]` |
+| `status` | `pending`, `sending`, `delivered`, `failed`, `skipped`, `read`, `clicked` |
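+
+These fields map onto a document model along these lines (a minimal Pydantic-style sketch; the field and enum names
+here are illustrative, the real models live in `db/docs/notification.py`):
+
+```python
+from enum import Enum
+
+from pydantic import BaseModel, Field
+
+
+class NotificationChannel(str, Enum):
+    IN_APP = "in_app"
+    WEBHOOK = "webhook"
+    SLACK = "slack"
+
+
+class NotificationSeverity(str, Enum):
+    LOW = "low"
+    MEDIUM = "medium"
+    HIGH = "high"
+    URGENT = "urgent"
+
+
+class Notification(BaseModel):
+    subject: str                                   # short title
+    body: str                                      # text content
+    channel: NotificationChannel
+    severity: NotificationSeverity = NotificationSeverity.MEDIUM
+    tags: list[str] = Field(default_factory=list)  # e.g. ["execution", "failed"]
+    status: str = "pending"                        # pending -> sending -> delivered/failed/...
+```
+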
-**Entity tags** specify the type: `entity:execution`, `entity:external_alert`.
+## Tag conventions
-**Reference tags** link to specific resources: `exec:` references a specific execution (used by UI to provide "View result"). For external alerts, include relevant context in `metadata` rather than unstable IDs in tags.
+Producers include structured tags for filtering, UI actions, and correlation.
-**Outcome tags** describe what happened: `completed`, `failed`, `timeout`, `warning`, `error`, `success`.
+| Tag type | Purpose | Examples |
+|-----------|--------------------------------|----------------------------------|
+| Category | What the notification is about | `execution`, `external_alert` |
+| Entity | Entity type | `entity:execution` |
+| Reference | Link to a specific resource (drives "View result" in the UI) | `exec:<execution_id>` |
+| Outcome | What happened | `completed`, `failed`, `timeout` |
## Examples
Execution completed:
+
```json
["execution", "completed", "entity:execution", "exec:2c1b...e8"]
```
Execution failed:
+
```json
["execution", "failed", "entity:execution", "exec:2c1b...e8"]
```
Grafana alert:
+
```json
["external_alert", "grafana", "entity:external_alert"]
```
+
+## Throttling
+
+The service throttles notifications per user per severity window:
+
+```python
+--8<-- "backend/app/services/notification_service.py:76:101"
+```
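+
+The throttle is essentially a count-per-window check keyed by user and severity. As a mental model (a simplified
+sketch; the limits and data structures below are assumptions, not the service's actual code):
+
+```python
+import time
+from collections import defaultdict
+
+# Illustrative budgets: max notifications per user per severity inside a sliding window.
+WINDOW_SECONDS = {"low": 3600, "medium": 900, "high": 300, "urgent": 60}
+MAX_PER_WINDOW = {"low": 5, "medium": 10, "high": 20, "urgent": 50}
+
+_sent: dict[tuple[str, str], list[float]] = defaultdict(list)
+
+
+def should_throttle(user_id: str, severity: str) -> bool:
+    """Return True when the (user, severity) pair has exhausted its window budget."""
+    now = time.monotonic()
+    window = WINDOW_SECONDS[severity]
+    recent = [t for t in _sent[(user_id, severity)] if now - t < window]
+    _sent[(user_id, severity)] = recent
+    if len(recent) >= MAX_PER_WINDOW[severity]:
+        return True
+    recent.append(now)
+    return False
+```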
+
+## Channel handlers
+
+Notifications route to handlers based on channel:
+
+```python
+--8<-- "backend/app/services/notification_service.py:156:160"
+```
+
+In-app notifications publish to the SSE bus for real-time delivery. Webhook and Slack channels use HTTP POST with retry
+logic.
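+
+One way to picture the routing is a channel-to-handler map (a sketch; the handler names and signatures are
+assumptions, not the service's API):
+
+```python
+from typing import Awaitable, Callable
+
+# Hypothetical handler type: takes a notification dict and delivers it on one channel.
+Handler = Callable[[dict], Awaitable[None]]
+
+
+async def deliver_in_app(notification: dict) -> None:
+    ...  # publish to the SSE bus for connected browsers
+
+
+async def deliver_webhook(notification: dict) -> None:
+    ...  # HTTP POST to the configured webhook URL, with retries
+
+
+async def deliver_slack(notification: dict) -> None:
+    ...  # HTTP POST to the Slack webhook, with retries
+
+
+HANDLERS: dict[str, Handler] = {
+    "in_app": deliver_in_app,
+    "webhook": deliver_webhook,
+    "slack": deliver_slack,
+}
+
+
+async def route(notification: dict) -> None:
+    await HANDLERS[notification["channel"]](notification)
+```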
+
+## Subscription filtering
+
+Users configure subscriptions per channel with severity and tag filters. The `_should_skip_notification` method checks
+these before delivery:
+
+```python
+--8<-- "backend/app/services/notification_service.py:801:823"
+```
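+
+The check boils down to three questions: is the channel subscription enabled, does the severity clear the user's
+threshold, and do the tags pass the filter? A simplified sketch (assuming a plain-dict subscription shape, which is an
+illustration rather than the stored model):
+
+```python
+SEVERITY_ORDER = ["low", "medium", "high", "urgent"]
+
+
+def should_skip(notification: dict, subscription: dict | None) -> bool:
+    """Skip delivery when the user's channel subscription filters the notification out."""
+    if subscription is None or not subscription.get("enabled", True):
+        return True
+    # Severity floor: a subscription with min_severity="high" drops low/medium notifications.
+    min_severity = subscription.get("min_severity", "low")
+    if SEVERITY_ORDER.index(notification["severity"]) < SEVERITY_ORDER.index(min_severity):
+        return True
+    # Tag filter: when the subscription lists tags, at least one must match.
+    wanted = set(subscription.get("include_tags", []))
+    if wanted and wanted.isdisjoint(notification["tags"]):
+        return True
+    return False
+```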
+
+## Key files
+
+| File | Purpose |
+|------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|
+| [`services/notification_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/notification_service.py) | Notification delivery and logic |
+| [`db/docs/notification.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/docs/notification.py) | MongoDB document models |
+| [`db/repositories/notification_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/notification_repository.py) | Database operations |
diff --git a/docs/reference/api-reference.md b/docs/reference/api-reference.md
index 593d7874..2a9b1d58 100644
--- a/docs/reference/api-reference.md
+++ b/docs/reference/api-reference.md
@@ -1,3 +1,3 @@
# API reference
-
+
diff --git a/docs/reference/openapi.json b/docs/reference/openapi.json
index 4e543876..0bad05b1 100644
--- a/docs/reference/openapi.json
+++ b/docs/reference/openapi.json
@@ -619,6 +619,9 @@
},
"/api/v1/scripts": {
"get": {
+ "tags": [
+ "scripts"
+ ],
"summary": "List Saved Scripts",
"operationId": "list_saved_scripts_api_v1_scripts_get",
"responses": {
@@ -639,6 +642,9 @@
}
},
"post": {
+ "tags": [
+ "scripts"
+ ],
"summary": "Create Saved Script",
"operationId": "create_saved_script_api_v1_scripts_post",
"requestBody": {
@@ -677,6 +683,9 @@
},
"/api/v1/scripts/{script_id}": {
"get": {
+ "tags": [
+ "scripts"
+ ],
"summary": "Get Saved Script",
"operationId": "get_saved_script_api_v1_scripts__script_id__get",
"parameters": [
@@ -714,6 +723,9 @@
}
},
"put": {
+ "tags": [
+ "scripts"
+ ],
"summary": "Update Saved Script",
"operationId": "update_saved_script_api_v1_scripts__script_id__put",
"parameters": [
@@ -761,6 +773,9 @@
}
},
"delete": {
+ "tags": [
+ "scripts"
+ ],
"summary": "Delete Saved Script",
"operationId": "delete_saved_script_api_v1_scripts__script_id__delete",
"parameters": [
@@ -6136,9 +6151,6 @@
"user_updated",
"user_deleted",
"user_settings_updated",
- "user_theme_changed",
- "user_notification_settings_updated",
- "user_editor_settings_updated",
"notification_created",
"notification_sent",
"notification_delivered",
@@ -7550,6 +7562,20 @@
},
"ReplayFilter": {
"properties": {
+ "event_ids": {
+ "anyOf": [
+ {
+ "items": {
+ "type": "string"
+ },
+ "type": "array"
+ },
+ {
+ "type": "null"
+ }
+ ],
+ "title": "Event Ids"
+ },
"execution_id": {
"anyOf": [
{
@@ -7561,6 +7587,28 @@
],
"title": "Execution Id"
},
+ "correlation_id": {
+ "anyOf": [
+ {
+ "type": "string"
+ },
+ {
+ "type": "null"
+ }
+ ],
+ "title": "Correlation Id"
+ },
+ "aggregate_id": {
+ "anyOf": [
+ {
+ "type": "string"
+ },
+ {
+ "type": "null"
+ }
+ ],
+ "title": "Aggregate Id"
+ },
"event_types": {
"anyOf": [
{
@@ -7575,6 +7623,20 @@
],
"title": "Event Types"
},
+ "exclude_event_types": {
+ "anyOf": [
+ {
+ "items": {
+ "$ref": "#/components/schemas/EventType"
+ },
+ "type": "array"
+ },
+ {
+ "type": "null"
+ }
+ ],
+ "title": "Exclude Event Types"
+ },
"start_time": {
"anyOf": [
{
@@ -7631,20 +7693,6 @@
}
],
"title": "Custom Query"
- },
- "exclude_event_types": {
- "anyOf": [
- {
- "items": {
- "$ref": "#/components/schemas/EventType"
- },
- "type": "array"
- },
- {
- "type": "null"
- }
- ],
- "title": "Exclude Event Types"
}
},
"type": "object",
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index c728ce00..34047eae 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -132,7 +132,12 @@
}
/* Improve table appearance */
+.md-typeset__table {
+ width: 100%;
+}
+
.md-typeset table:not([class]) {
+ display: table;
font-size: 0.85rem;
}
@@ -164,3 +169,8 @@
.md-footer-meta {
display: none;
}
+
+/* Hide swagger-ui filter input box while keeping filtering active */
+.operation-filter-input {
+ display: none !important;
+}
diff --git a/docs/testing/frontend-testing.md b/docs/testing/frontend-testing.md
index 406eb313..43aefcd4 100644
--- a/docs/testing/frontend-testing.md
+++ b/docs/testing/frontend-testing.md
@@ -1,6 +1,9 @@
# Frontend testing
-The frontend uses Vitest for unit and integration tests, with Playwright for end-to-end scenarios. Tests live alongside the source code in `__tests__` directories, following the same structure as the components they verify. The setup uses jsdom for DOM simulation and @testing-library/svelte for component rendering, giving you a realistic browser-like environment without the overhead of spinning up actual browsers for every test run.
+The frontend uses [Vitest](https://vitest.dev/) for unit and integration tests, with
+[Playwright](https://playwright.dev/) for end-to-end scenarios. Tests live alongside the source code in `__tests__`
+directories, following the same structure as the components they verify. The setup uses jsdom for DOM simulation and
+@testing-library/svelte for component rendering.
## Quick start
@@ -88,60 +91,20 @@ E2E tests run in Playwright against the real application. They exercise full use
## Configuration
-Vitest configuration lives in `vitest.config.ts`:
+Vitest configuration lives in [`vitest.config.ts`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/vitest.config.ts):
```typescript
-export default defineConfig({
- plugins: [svelte({ compilerOptions: { runes: true } }), svelteTesting()],
- test: {
- environment: 'jsdom',
- setupFiles: ['./vitest.setup.ts'],
- include: ['src/**/*.{test,spec}.{js,ts}'],
- globals: true,
- coverage: {
- provider: 'v8',
- include: ['src/**/*.{ts,svelte}'],
- exclude: ['src/lib/api/**', 'src/**/*.test.ts'],
- },
- },
-});
+--8<-- "frontend/vitest.config.ts:5:27"
```
-The setup file (`vitest.setup.ts`) provides browser API mocks that jsdom lacks:
+The setup file [`vitest.setup.ts`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/vitest.setup.ts)
+provides browser API mocks that jsdom lacks (localStorage, sessionStorage, matchMedia, ResizeObserver,
+IntersectionObserver).
-```typescript
-// localStorage and sessionStorage mocks
-vi.stubGlobal('localStorage', localStorageMock);
-vi.stubGlobal('sessionStorage', sessionStorageMock);
-
-// matchMedia for theme detection
-vi.stubGlobal('matchMedia', vi.fn().mockImplementation(query => ({
- matches: false,
- media: query,
- addEventListener: vi.fn(),
- removeEventListener: vi.fn(),
-})));
-
-// ResizeObserver and IntersectionObserver for layout-dependent components
-vi.stubGlobal('ResizeObserver', vi.fn().mockImplementation(() => ({
- observe: vi.fn(),
- unobserve: vi.fn(),
- disconnect: vi.fn(),
-})));
-```
-
-Playwright configuration in `playwright.config.ts` sets up browser testing:
+Playwright configuration in [`playwright.config.ts`](https://github.com/HardMax71/Integr8sCode/blob/main/frontend/playwright.config.ts):
```typescript
-export default defineConfig({
- testDir: './e2e',
- timeout: 10000,
- use: {
- baseURL: 'https://localhost:5001',
- screenshot: 'only-on-failure',
- trace: 'on',
- },
-});
+--8<-- "frontend/playwright.config.ts:3:25"
```
## Writing component tests
diff --git a/docs/testing/kafka-test-stability.md b/docs/testing/kafka-test-stability.md
index a7effd75..24c89b2f 100644
--- a/docs/testing/kafka-test-stability.md
+++ b/docs/testing/kafka-test-stability.md
@@ -2,51 +2,33 @@
## The problem
-When running tests in parallel (e.g., with `pytest-xdist`), you might encounter sporadic crashes with messages like:
+When running tests in parallel with `pytest-xdist`, you might encounter sporadic crashes:
```text
Fatal Python error: Aborted
```
-The stack trace typically points to `confluent_kafka` operations, often during producer initialization in fixtures or test setup. This isn't a bug in the application code - it's a known race condition in the underlying `librdkafka` C library.
+The stack trace typically points to `confluent_kafka` operations during producer initialization. This isn't a bug in
+the application code—it's a known race condition in the underlying `librdkafka` C library.
## Why it happens
-The `confluent-kafka-python` library is a thin wrapper around `librdkafka`, a high-performance C library. When multiple Python processes or threads try to create Kafka `Producer` instances simultaneously, they can trigger a race condition in `librdkafka`'s internal initialization routines.
-
-This manifests as:
-
-- Random `SIGABRT` signals during test runs
-- Crashes in `rd_kafka_broker_destroy_final` or similar internal functions
-- Flaky CI failures that pass on retry
-
-The issue is particularly common in CI environments where tests run in parallel across multiple workers.
+The `confluent-kafka-python` library wraps `librdkafka`, a high-performance C library. When multiple processes or
+threads create Kafka `Producer` instances simultaneously, they can trigger a race condition in `librdkafka`'s internal
+initialization. This manifests as random `SIGABRT` signals, crashes in `rd_kafka_broker_destroy_final`, or flaky CI
+failures that pass on retry.
## The fix
-The solution is to serialize `Producer` initialization using a global threading lock. This prevents multiple threads from entering `librdkafka`'s initialization code simultaneously.
-
-In `app/events/core/producer.py`:
+Serialize `Producer` initialization using a global threading lock. In
+[`app/events/core/producer.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/events/core/producer.py):
```python
-import threading
-
-# Global lock to serialize Producer initialization (workaround for librdkafka race condition)
-# See: https://github.com/confluentinc/confluent-kafka-python/issues/1797
-_producer_init_lock = threading.Lock()
-
-class UnifiedProducer:
- async def start(self) -> None:
- # ... config setup ...
-
- # Serialize Producer initialization to prevent librdkafka race condition
- with _producer_init_lock:
- self._producer = Producer(producer_config)
-
- # ... rest of startup ...
+--8<-- "backend/app/events/core/producer.py:22:24"
```
-The lock is process-global, so all `UnifiedProducer` instances in the same process will serialize their initialization. This adds negligible overhead in production (producers are typically created once at startup) while eliminating the race condition in tests.
+The lock is process-global, so all `UnifiedProducer` instances serialize their initialization. This adds negligible
+overhead in production (producers are created once at startup) while eliminating the race condition in tests.
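+
+In use, the lock wraps only the `Producer(...)` constructor call; the rest of startup runs outside it. Roughly (a
+simplified sketch of the pattern; the real class in `app/events/core/producer.py` carries more configuration and
+lifecycle handling):
+
+```python
+import threading
+
+from confluent_kafka import Producer
+
+# Module-level lock: one per process, shared by every producer instance.
+_producer_init_lock = threading.Lock()
+
+
+class UnifiedProducer:
+    def __init__(self, config: dict) -> None:
+        self._config = config
+        self._producer: Producer | None = None
+
+    async def start(self) -> None:
+        # Only the Producer() constructor needs serializing against the librdkafka race.
+        with _producer_init_lock:
+            self._producer = Producer(self._config)
+```
+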
## Related issues
diff --git a/docs/testing/load-testing.md b/docs/testing/load-testing.md
index 615c786a..6bc91427 100644
--- a/docs/testing/load-testing.md
+++ b/docs/testing/load-testing.md
@@ -1,27 +1,21 @@
# Load testing
-This suite provides two complementary stress tools. Monkey tests fuzz endpoints with random methods, bodies, and parameters to test durability and correctness under garbage requests (with and without auth/CSRF). User load tests authenticate as normal users and exercise the user-accessible API with realistic flows, avoiding admin routes.
+The load testing suite provides two complementary stress tools. Monkey tests fuzz endpoints with random methods, bodies,
+and parameters to test durability under garbage requests. User load tests authenticate as normal users and exercise the
+API with realistic flows.
## Quick start
-Ensure the backend is running locally at `https://localhost:443` (self-signed cert is fine). Run the CLI with default settings for a ~3 minute test:
+With the backend running at `https://localhost:443`:
```bash
python -m tests.load.cli --mode both --clients 30 --concurrency 10 --duration 180
```
-Available options:
+Options: `--base-url` (target URL), `--mode` (monkey/user/both), `--clients` (virtual clients), `--concurrency`
+(parallel tasks), `--duration` (seconds), `--output` (report directory).
-| Option | Description |
-|--------|-------------|
-| `--base-url` | Target URL (default: `https://localhost:443`) |
-| `--mode` | `monkey`, `user`, or `both` |
-| `--clients` | Number of virtual clients |
-| `--concurrency` | Parallel tasks |
-| `--duration` | Test duration in seconds (default: 180) |
-| `--output` | Output directory (default: `tests/load/out`) |
-
-Reports are saved as JSON under `backend/tests/load/out/report__.json`.
+Reports are saved as JSON under `backend/tests/load/out/report__.json`.
## Property-based fuzz tests
@@ -54,12 +48,14 @@ Tests included:
## What it tests
-The suite covers authentication (register/login with cookies and CSRF tokens), executions (submit scripts, stream SSE or poll results, fetch events), saved scripts (CRUD operations), user settings (get/update), and notifications (list/mark-all-read).
-
-## Metrics
+The suite covers authentication (register/login with cookies and CSRF tokens), executions (submit scripts, stream SSE
+or poll results, fetch events), saved scripts (CRUD operations), user settings (get/update), and notifications
+(list/mark-all-read).
-Each endpoint reports count, error count, status code distribution, latency percentiles (p50/p90/p99), and bytes received. Global metrics include total requests, total errors, exception types, and runtime.
+Each endpoint reports count, error count, status code distribution, latency percentiles (p50/p90/p99), and bytes
+received. Global metrics include total requests, total errors, exception types, and runtime.
-## Notes
+For write endpoints, the client sends `X-CSRF-Token` from the `csrf_token` cookie (double-submit pattern). SSE streams
+are sampled with a short read window (~8-10s) to avoid lingering connections.
-For write endpoints, the client sends `X-CSRF-Token` from the `csrf_token` cookie (double-submit pattern). SSE streams are sampled with a short read window (~8-10s) to avoid lingering connections.
+See [`tests/load/`](https://github.com/HardMax71/Integr8sCode/tree/main/backend/tests/load) for implementation.
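+
+In client terms, the double-submit looks roughly like this (a sketch using `httpx`; the exact paths and payload shapes
+are assumptions for illustration, not the test client's real code):
+
+```python
+import httpx
+
+# verify=False because the local backend serves a self-signed certificate.
+client = httpx.Client(base_url="https://localhost:443", verify=False)
+
+# Login sets both cookies: access_token (httpOnly) and csrf_token (readable).
+client.post("/api/v1/auth/login", json={"username": "user", "password": "pass"})
+
+# For write endpoints, echo the csrf_token cookie back in the X-CSRF-Token header.
+csrf = client.cookies.get("csrf_token", "")
+resp = client.post(
+    "/api/v1/scripts",
+    json={"name": "demo", "script": "print('hi')"},
+    headers={"X-CSRF-Token": csrf},
+)
+print(resp.status_code)
+```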
diff --git a/mkdocs.yml b/mkdocs.yml
index 07d57116..c4c622af 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -11,6 +11,8 @@ validation:
theme:
name: material
+ logo: assets/images/logo.png
+ favicon: assets/images/logo.png
palette:
# Light mode
- media: "(prefers-color-scheme: light)"
@@ -79,7 +81,10 @@ markdown_extensions:
- pymdownx.keys
- pymdownx.mark
- pymdownx.smartsymbols
- - pymdownx.snippets
+ - pymdownx.snippets:
+ base_path:
+ - .
+ - docs
- pymdownx.superfences:
custom_fences:
- name: mermaid
@@ -103,6 +108,12 @@ nav:
- Architecture:
- Overview: architecture/overview.md
- Services: architecture/services-overview.md
+ - Authentication: architecture/authentication.md
+ - Rate Limiting: architecture/rate-limiting.md
+ - Idempotency: architecture/idempotency.md
+ - Execution Queue: architecture/execution-queue.md
+ - Runtime Registry: architecture/runtime-registry.md
+ - Middleware: architecture/middleware.md
- Domain Exceptions: architecture/domain-exceptions.md
- Pydantic Dataclasses: architecture/pydantic-dataclasses.md
- Model Conversion: architecture/model-conversion.md
@@ -127,6 +138,7 @@ nav:
- Resource Allocation: components/saga/resource-allocation.md
- Dead Letter Queue: components/dead-letter-queue.md
- Schema Manager: components/schema-manager.md
+ - Saved Scripts: components/saved-scripts.md
- Operations:
- Deployment: operations/deployment.md
@@ -135,9 +147,12 @@ nav:
- Logging: operations/logging.md
- Tracing: operations/tracing.md
- Metrics:
+ - Reference: operations/metrics-reference.md
- Context Variables: operations/metrics-contextvars.md
- CPU Time Measurement: operations/cpu-time-measurement.md
- Notifications: operations/notification-types.md
+ - Grafana Integration: operations/grafana-integration.md
+ - Admin API: operations/admin-api.md
- Network Isolation: security/policies.md