|
1 | 1 | # Design & Decisions
|
2 | 2 |
|
3 |
| -This document aims to help fill in gaps of any decision that are made around the design of the ToolHive Operator. |
| 3 | +This document captures architectural decisions and design patterns for the ToolHive Operator. |
4 | 4 |
|
5 |
| -## CRD Attribute vs `PodTemplateSpec` |
| 5 | +## Operator Design Principles |
| 6 | + |
| 7 | +### CRD Attribute vs `PodTemplateSpec` |
6 | 8 |
|
7 | 9 | When building operators, the choice between a `podTemplateSpec` and a dedicated CRD attribute is often disputed. For the ToolHive Operator, we follow a defined rule of thumb.
|
8 | 10 |
|
9 |
| -### Use Dedicated CRD Attributes For: |
| 11 | +#### Use Dedicated CRD Attributes For: |
10 | 12 | - **Business logic** that affects your operator's behavior
|
11 |
| -- **Validation requirements** (ranges, formats, constraints) |
| 13 | +- **Validation requirements** (ranges, formats, constraints) |
12 | 14 | - **Cross-resource coordination** (affects Services, ConfigMaps, etc.)
|
13 | 15 | - **Operator decision making** (triggers different reconciliation paths)
|
14 | 16 |
|
15 |
| -```yaml |
16 |
| -spec: |
17 |
| - version: "13.4" # Affects operator logic |
18 |
| - replicas: 3 # Affects scaling behavior |
19 |
| - backupSchedule: "0 2 * * *" # Needs validation |
20 |
| -``` |
21 |
| -
|
22 |
| -### Use PodTemplateSpec For: |
| 17 | +#### Use PodTemplateSpec For: |
23 | 18 | - **Infrastructure concerns** (node selection, resources, affinity)
|
24 |
| -- **Sidecar containers** |
| 19 | +- **Sidecar containers** |
25 | 20 | - **Standard Kubernetes pod configuration**
|
26 | 21 | - **Things a cluster admin would typically configure**
|
27 | 22 |
|
28 |
| -```yaml |
29 |
| -spec: |
30 |
| - podTemplate: |
31 |
| - spec: |
32 |
| - nodeSelector: |
33 |
| - disktype: ssd |
34 |
| - containers: |
35 |
| - - name: sidecar |
36 |
| - image: monitoring:latest |
37 |
| -``` |
38 |
| -
|
39 |
| -## Quick Decision Test: |
| 23 | +#### Quick Decision Test: |
40 | 24 | 1. **"Does this affect my operator's reconciliation logic?"** -> Dedicated attribute
|
41 |
| -2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec |
| 25 | +2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec |
42 | 26 | 3. **"Do I need to validate this beyond basic Kubernetes validation?"** -> Dedicated attribute
|
43 | 27 |
|
44 |
| -This gives you a clean API for core functionality while maintaining flexibility for infrastructure concerns. |
| 28 | +## MCPRegistry Architecture Decisions |
| 29 | + |
| 30 | +### Status Management Design |
| 31 | + |
| 32 | +**Decision**: Use batched status updates via the StatusCollector pattern instead of individual field updates. |
| 33 | + |
| 34 | +**Rationale**: |
| 35 | +- Prevents race conditions between multiple status updates |
| 36 | +- Reduces API server load with fewer update calls |
| 37 | +- Ensures consistent status across reconciliation cycles |
| 38 | +- Handles resource version conflicts gracefully |
| 39 | + |
| 40 | +**Implementation**: StatusCollector interface collects all changes and applies them atomically. |
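The batching idea can be illustrated with a minimal sketch. Only the name `StatusCollector` comes from this document; the `RegistryStatus` fields, the setter methods, and the `update` callback (standing in for a single status write to the API server) are assumptions made for illustration, not the operator's actual API.

```go
package main

import "fmt"

// RegistryStatus is a hypothetical, trimmed-down status object.
type RegistryStatus struct {
	Phase       string
	Message     string
	ServerCount int
}

// StatusCollector queues status mutations and applies them in one pass.
// The method set here is illustrative, not the operator's real interface.
type StatusCollector struct {
	changes []func(*RegistryStatus)
}

func (c *StatusCollector) SetPhase(phase string) {
	c.changes = append(c.changes, func(s *RegistryStatus) { s.Phase = phase })
}

func (c *StatusCollector) SetMessage(msg string) {
	c.changes = append(c.changes, func(s *RegistryStatus) { s.Message = msg })
}

func (c *StatusCollector) SetServerCount(n int) {
	c.changes = append(c.changes, func(s *RegistryStatus) { s.ServerCount = n })
}

// Apply runs every queued change against a copy of the current status and
// calls update at most once. It reports whether an update was issued.
func (c *StatusCollector) Apply(current RegistryStatus, update func(RegistryStatus) error) (bool, error) {
	next := current
	for _, change := range c.changes {
		change(&next)
	}
	if next == current {
		c.changes = nil
		return false, nil // nothing changed, so skip the write entirely
	}
	if err := update(next); err != nil {
		return false, err // keep queued changes so the caller can retry
	}
	c.changes = nil
	return true, nil
}

func main() {
	calls := 0
	stored := RegistryStatus{Phase: "Pending"}

	c := &StatusCollector{}
	c.SetPhase("Ready")
	c.SetMessage("sync complete")
	c.SetServerCount(12)

	// Three field changes result in a single update call.
	_, err := c.Apply(stored, func(s RegistryStatus) error {
		calls++
		stored = s
		return nil
	})
	fmt.Println(stored.Phase, stored.ServerCount, calls, err)
}
```

In a real reconciler the callback would presumably wrap one status write, which is where resource version conflicts would surface and be retried.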
| 41 | + |
| 42 | +### Sync Operation Design |
| 43 | + |
| 44 | +**Decision**: Separate sync decision logic from sync execution with clear interfaces. |
| 45 | + |
| 46 | +**Rationale**: |
| 47 | +- Testability: Mock sync decisions independently from execution |
| 48 | +- Flexibility: Different sync strategies without changing core logic |
| 49 | +- Maintainability: Clear separation of concerns |
| 50 | + |
| 51 | +**Key Patterns**: |
| 52 | +- Idempotent operations for safe retry |
| 53 | +- Manual vs automatic sync distinction |
| 54 | +- Data preservation on failures |
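As a rough sketch of this separation, the decision can be a pure function and the execution can sit behind a small interface. Every name below (`State`, `Decide`, `Fetcher`, `Execute`, the reason constants) is hypothetical and chosen for illustration; the operator's real interfaces may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// SyncReason explains why a sync should or should not run (hypothetical values).
type SyncReason string

const (
	ReasonManual        SyncReason = "manual"
	ReasonSourceChanged SyncReason = "source-changed"
	ReasonUpToDate      SyncReason = "up-to-date"
)

// State is the minimal input the decision needs (hypothetical fields).
type State struct {
	ManualRequested bool
	SourceHash      string
	LastSyncedHash  string
}

// Decide performs no I/O, so it can be unit-tested without mocks.
func Decide(s State) (bool, SyncReason) {
	switch {
	case s.ManualRequested:
		return true, ReasonManual
	case s.SourceHash != s.LastSyncedHash:
		return true, ReasonSourceChanged
	default:
		return false, ReasonUpToDate
	}
}

// Fetcher abstracts sync execution so strategies can be swapped or mocked.
type Fetcher interface {
	Fetch() (data string, hash string, err error)
}

// Registry holds the last successfully synced data.
type Registry struct {
	Data           string
	LastSyncedHash string
}

// Execute replaces the registry data only on success. On failure the previous
// data is left untouched. Repeating it with the same source yields the same
// state, which makes retries safe.
func Execute(r *Registry, f Fetcher) error {
	data, hash, err := f.Fetch()
	if err != nil {
		return fmt.Errorf("sync failed, keeping previous data: %w", err)
	}
	r.Data, r.LastSyncedHash = data, hash
	return nil
}

type failingFetcher struct{}

func (failingFetcher) Fetch() (string, string, error) {
	return "", "", errors.New("source unreachable")
}

type staticFetcher struct{ data, hash string }

func (f staticFetcher) Fetch() (string, string, error) { return f.data, f.hash, nil }

func main() {
	r := &Registry{Data: "v1", LastSyncedHash: "abc"}
	state := State{SourceHash: "def", LastSyncedHash: r.LastSyncedHash}
	if ok, why := Decide(state); ok {
		err := Execute(r, failingFetcher{})
		fmt.Println(why, err, r.Data) // r.Data is still "v1" after the failure
	}
}
```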
| 55 | + |
| 56 | +### Storage Architecture |
| 57 | + |
| 58 | +**Decision**: Abstract storage via a StorageManager interface, with ConfigMap as the default implementation. |
| 59 | + |
| 60 | +**Rationale**: |
| 61 | +- Future flexibility: Easy addition of new storage backends (OCI, databases) |
| 62 | +- Testability: Mock storage for unit tests |
| 63 | +- Consistency: Single interface for all storage operations |
| 64 | + |
| 65 | +**Current Implementation**: ConfigMap-based with owner references for automatic cleanup. |
| 66 | + |
| 67 | +### Registry API Service Pattern |
| 68 | + |
| 69 | +**Decision**: Deploy an individual API service per MCPRegistry rather than a shared service. |
| 70 | + |
| 71 | +**Rationale**: |
| 72 | +- **Isolation**: Each registry has independent lifecycle and scaling |
| 73 | +- **Security**: Per-registry access control possible |
| 74 | +- **Reliability**: Failure of one registry doesn't affect others |
| 75 | +- **Lifecycle Management**: Automatic cleanup via owner references |
| 76 | + |
| 77 | +**Trade-offs**: Higher resource consumption in exchange for better isolation and security. |
| 78 | + |
| 79 | +### Error Handling Strategy |
| 80 | + |
| 81 | +**Decision**: Structured error types with progressive retry backoff. |
| 82 | + |
| 83 | +**Rationale**: |
| 84 | +- Different error types need different handling strategies |
| 85 | +- Progressive backoff prevents thundering herd problems |
| 86 | +- Structured errors enable better observability |
| 87 | + |
| 88 | +**Implementation**: 5-minute initial retry interval, exponential backoff up to a cap, and a manual sync that bypasses the backoff. |
| 89 | + |
| 90 | +### Performance Design Decisions |
| 91 | + |
| 92 | +#### Resource Optimization |
| 93 | +- **Status Updates**: Batched to reduce API calls (implemented) |
| 94 | +- **Source Fetching**: Planned caching to avoid repeated downloads |
| 95 | +- **API Deployment**: Lazy creation only when needed (implemented) |
| 96 | + |
| 97 | +#### Memory Management |
| 98 | +- **Git Operations**: Shallow clones to minimize disk usage (implemented) |
| 99 | +- **Large Registries**: Stream processing planned for future |
| 100 | +- **Status Objects**: Efficient field-level updates (implemented) |
| 101 | + |
| 102 | +### Security Architecture |
| 103 | + |
| 104 | +#### Permission Model |
| 105 | +Minimal required permissions, following the principle of least privilege: |
| 106 | +- ConfigMaps: For storage management |
| 107 | +- Services/Deployments: For API service management |
| 108 | +- MCPRegistry: For status updates |
| 109 | + |
| 110 | +#### Network Security |
| 111 | +Optional network policies for registry API access control in security-sensitive environments. |