Commit dd77f43

dmartinol and Copilot authored
Docs for MCP registry CRD (#2027)
* docs for MCP registry CRD
  Signed-off-by: Daniele Martinoli <[email protected]>
* Update cmd/thv-operator/REGISTRY.md
  Co-authored-by: Copilot <[email protected]>
* Update cmd/thv-operator/REGISTRY.md
  Co-authored-by: Copilot <[email protected]>
* updated docs from latest changes
  Signed-off-by: Daniele Martinoli <[email protected]>
* integrated comments
  Signed-off-by: Daniele Martinoli <[email protected]>
* rebased
  Signed-off-by: Daniele Martinoli <[email protected]>

---------

Signed-off-by: Daniele Martinoli <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 5019227 commit dd77f43

File tree

5 files changed

+723
-30
lines changed

README.md

Lines changed: 20 additions & 0 deletions
@@ -59,6 +59,26 @@ ToolHive is available as a GUI desktop app, CLI, and Kubernetes Operator.
5959
</tr>
6060
</table>
6161

62+
## Kubernetes Operator
63+
64+
ToolHive includes a Kubernetes Operator for enterprise and production deployments.
65+
66+
### Features
67+
68+
- **MCPServer CRD**: Deploy and manage MCP servers as Kubernetes resources
69+
- **MCPRegistry CRD** *(Experimental)*: Centralized registry management with automated sync
70+
- **Secure isolation**: Container-based server execution with permission profiles
71+
- **Protocol proxying**: Stdio servers exposed via HTTP/SSE networking protocols
72+
- **Service discovery**: Automatic service creation and DNS integration
73+
74+
### Documentation
75+
76+
- [Operator Guide](cmd/thv-operator/README.md) - Complete operator documentation
77+
- [MCPRegistry Reference](cmd/thv-operator/REGISTRY.md) - Registry management (experimental)
78+
- [CRD API Reference](docs/operator/crd-api.md) - Auto-generated API documentation
79+
- [Deployment Guide](docs/kind/deploying-toolhive-operator.md) - Step-by-step installation
80+
- [Examples](examples/operator/) - Sample configurations
81+
6282
## Quick links
6383

6484
- 📚 [Documentation](https://docs.stacklok.com/toolhive/)

cmd/thv-operator/CLAUDE.md

Lines changed: 56 additions & 0 deletions
@@ -13,6 +13,62 @@ After modifying the CRDs, the following needs to be run:
1313

1414
When committing a change that modifies CRDs, it is important to bump the chart version as described in the [CLAUDE.md](../../deploy/charts/operator-crds/CLAUDE.md#bumping-crd-chart) doc for the CRD Helm Chart.
1515

16+
## MCPRegistry CRD (Experimental)
17+
18+
The MCPRegistry CRD enables centralized management of MCP server registries. Requires `operator.features.experimental=true`.
19+
20+
### Key Components
21+
22+
- **CRD**: `api/v1alpha1/mcpregistry_types.go`
23+
- **Controller**: `controllers/mcpregistry_controller.go`
24+
- **Status**: `pkg/mcpregistrystatus/`
25+
- **Sync**: `pkg/sync/`
26+
- **Sources**: `pkg/sources/`
27+
- **API**: `pkg/registryapi/`
28+
29+
### Development Patterns
30+
31+
#### Status Collector Pattern
32+
33+
Always use StatusCollector for batched updates:
34+
35+
```go
// ✅ Good: Collect all changes, apply once
statusCollector := mcpregistrystatus.NewCollector(mcpRegistry)
statusCollector.SetPhase(mcpv1alpha1.MCPRegistryPhaseReady)
statusCollector.Apply(ctx, r.Client)

// ❌ Bad: Multiple individual updates cause conflicts
r.Status().Update(ctx, mcpRegistry)
```
44+
45+
#### Error Handling
46+
47+
Always set status before returning errors:
48+
49+
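```go
if err := validateSource(); err != nil {
	// Record the failure in status before returning.
	statusCollector.SetSyncStatus(mcpv1alpha1.SyncPhaseFailed, err.Error(), ...)
	// Note: controller-runtime ignores the Result when the returned error is
	// non-nil and requeues with its own backoff instead.
	return ctrl.Result{RequeueAfter: time.Minute * 5}, err
}
```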
55+
56+
#### Source Handler Interface
57+
58+
```go
type SourceHandler interface {
	FetchRegistryData(ctx context.Context, source MCPRegistrySource) (*RegistryData, error)
	ValidateSource(ctx context.Context, source MCPRegistrySource) error
	CalculateHash(ctx context.Context, source MCPRegistrySource) (string, error)
}
```
65+
66+
### Testing Patterns
67+
68+
- **Unit Tests**: Use mocks for external dependencies
69+
- **Integration Tests**: Use envtest framework
70+
- **E2E Tests**: Not yet written for MCPRegistry; use Chainsaw when adding them
71+
1672
## OpenTelemetry (OTEL) Stack for Testing
1773

1874
When you have been asked to stand up an OTEL stack to test ToolHive's integration inside Kubernetes, you will need to perform the following tasks inside the cluster that you have been instructed to use.

cmd/thv-operator/DESIGN.md

Lines changed: 94 additions & 27 deletions
@@ -1,44 +1,111 @@
11
# Design & Decisions
22

3-
This document aims to help fill in gaps of any decision that are made around the design of the ToolHive Operator.
3+
This document captures architectural decisions and design patterns for the ToolHive Operator.
44

5-
## CRD Attribute vs `PodTemplateSpec`
5+
## Operator Design Principles
6+
7+
### CRD Attribute vs `PodTemplateSpec`
68

79
When building operators, the choice between a `podTemplateSpec` and a dedicated CRD attribute is often debated. For the ToolHive Operator we follow a defined rule of thumb.
810

9-
### Use Dedicated CRD Attributes For:
11+
#### Use Dedicated CRD Attributes For:
1012
- **Business logic** that affects your operator's behavior
11-
- **Validation requirements** (ranges, formats, constraints)
13+
- **Validation requirements** (ranges, formats, constraints)
1214
- **Cross-resource coordination** (affects Services, ConfigMaps, etc.)
1315
- **Operator decision making** (triggers different reconciliation paths)
1416

15-
```yaml
16-
spec:
17-
version: "13.4" # Affects operator logic
18-
replicas: 3 # Affects scaling behavior
19-
backupSchedule: "0 2 * * *" # Needs validation
20-
```
21-
22-
### Use PodTemplateSpec For:
17+
#### Use PodTemplateSpec For:
2318
- **Infrastructure concerns** (node selection, resources, affinity)
24-
- **Sidecar containers**
19+
- **Sidecar containers**
2520
- **Standard Kubernetes pod configuration**
2621
- **Things a cluster admin would typically configure**
2722

28-
```yaml
29-
spec:
30-
podTemplate:
31-
spec:
32-
nodeSelector:
33-
disktype: ssd
34-
containers:
35-
- name: sidecar
36-
image: monitoring:latest
37-
```
38-
39-
## Quick Decision Test:
23+
#### Quick Decision Test:
4024
1. **"Does this affect my operator's reconciliation logic?"** -> Dedicated attribute
41-
2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec
25+
2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec
4226
3. **"Do I need to validate this beyond basic Kubernetes validation?"** -> Dedicated attribute
4327

44-
This gives you a clean API for core functionality while maintaining flexibility for infrastructure concerns.
28+
## MCPRegistry Architecture Decisions
29+
30+
### Status Management Design
31+
32+
**Decision**: Use batched status updates via the StatusCollector pattern instead of individual field updates.
33+
34+
**Rationale**:
35+
- Prevents race conditions between multiple status updates
36+
- Reduces API server load with fewer update calls
37+
- Ensures consistent status across reconciliation cycles
38+
- Handles resource version conflicts gracefully
39+
40+
**Implementation**: StatusCollector interface collects all changes and applies them atomically.
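The batching idea can be sketched in a few lines. This is an illustrative model only, not the operator's `mcpregistrystatus` package: `RegistryStatus`, `collector`, and the `write` callback are hypothetical names, and the callback stands in for a single status update against the API server.

```go
package main

import "fmt"

// RegistryStatus is a simplified stand-in for the MCPRegistry status (assumption).
type RegistryStatus struct {
	Phase   string
	Message string
}

// collector queues status mutations so they can be written in one call.
type collector struct {
	changes []func(*RegistryStatus)
}

// SetPhase queues a phase change without touching the API server.
func (c *collector) SetPhase(phase string) {
	c.changes = append(c.changes, func(s *RegistryStatus) { s.Phase = phase })
}

// SetMessage queues a message change without touching the API server.
func (c *collector) SetMessage(msg string) {
	c.changes = append(c.changes, func(s *RegistryStatus) { s.Message = msg })
}

// Apply runs all queued mutations on a copy of the current status and calls
// write at most once. It reports whether a write was performed.
func (c *collector) Apply(current RegistryStatus, write func(RegistryStatus) error) (bool, error) {
	next := current
	for _, change := range c.changes {
		change(&next)
	}
	if next == current {
		c.changes = nil
		return false, nil // nothing changed, so skip the update call
	}
	if err := write(next); err != nil {
		return false, err // keep queued changes so the caller can retry
	}
	c.changes = nil
	return true, nil
}

func main() {
	c := &collector{}
	c.SetPhase("Ready")
	c.SetMessage("sync complete")

	writes := 0
	wrote, err := c.Apply(RegistryStatus{Phase: "Pending"}, func(s RegistryStatus) error {
		writes++
		fmt.Printf("writing status: %+v\n", s)
		return nil
	})
	fmt.Println(wrote, err, writes)
}
```

Two setter calls result in one write, which is the property the rationale above relies on to reduce API server load and resource version conflicts.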
41+
42+
### Sync Operation Design
43+
44+
**Decision**: Separate sync decision logic from sync execution with clear interfaces.
45+
46+
**Rationale**:
47+
- Testability: Mock sync decisions independently from execution
48+
- Flexibility: Different sync strategies without changing core logic
49+
- Maintainability: Clear separation of concerns
50+
51+
**Key Patterns**:
52+
- Idempotent operations for safe retry
53+
- Manual vs automatic sync distinction
54+
- Data preservation on failures
55+
56+
### Storage Architecture
57+
58+
**Decision**: Abstract storage behind a StorageManager interface, with ConfigMap as the default implementation.
59+
60+
**Rationale**:
61+
- Future flexibility: Easy addition of new storage backends (OCI, databases)
62+
- Testability: Mock storage for unit tests
63+
- Consistency: Single interface for all storage operations
64+
65+
**Current Implementation**: ConfigMap-based with owner references for automatic cleanup.
66+
67+
### Registry API Service Pattern
68+
69+
**Decision**: Deploy an individual API service per MCPRegistry rather than a shared service.
70+
71+
**Rationale**:
72+
- **Isolation**: Each registry has independent lifecycle and scaling
73+
- **Security**: Per-registry access control possible
74+
- **Reliability**: Failure of one registry doesn't affect others
75+
- **Lifecycle Management**: Automatic cleanup via owner references
76+
77+
**Trade-offs**: Consumes more resources in exchange for better isolation and security.
78+
79+
### Error Handling Strategy
80+
81+
**Decision**: Structured error types with progressive retry backoff.
82+
83+
**Rationale**:
84+
- Different error types need different handling strategies
85+
- Progressive backoff prevents thundering herd problems
86+
- Structured errors enable better observability
87+
88+
**Implementation**: 5m initial retry, exponential backoff with cap, manual sync bypass.
89+
90+
### Performance Design Decisions
91+
92+
#### Resource Optimization
93+
- **Status Updates**: Batched to reduce API calls (implemented)
94+
- **Source Fetching**: Caching to avoid repeated downloads (planned)
95+
- **API Deployment**: Lazy creation only when needed (implemented)
96+
97+
#### Memory Management
98+
- **Git Operations**: Shallow clones to minimize disk usage (implemented)
99+
- **Large Registries**: Stream processing (planned)
100+
- **Status Objects**: Efficient field-level updates (implemented)
101+
102+
### Security Architecture
103+
104+
#### Permission Model
105+
The operator requests only the minimal required permissions, following the principle of least privilege:
106+
- ConfigMaps: For storage management
107+
- Services/Deployments: For API service management
108+
- MCPRegistry: For status updates
109+
110+
#### Network Security
111+
Optional network policies for registry API access control in security-sensitive environments.

cmd/thv-operator/README.md

Lines changed: 63 additions & 3 deletions
@@ -1,18 +1,34 @@
11
# ToolHive Kubernetes Operator
22

3-
The ToolHive Kubernetes Operator manages MCP (Model Context Protocol) servers in Kubernetes clusters. It allows you to define MCP servers as Kubernetes resources and automates their deployment and management.
3+
The ToolHive Kubernetes Operator manages MCP (Model Context Protocol) servers and registries in Kubernetes clusters. It allows you to define MCP servers and registries as Kubernetes resources and automates their deployment and management.
44

55
This operator is built using [Kubebuilder](https://book.kubebuilder.io/), a framework for building Kubernetes APIs using Custom Resource Definitions (CRDs).
66

77
## Overview
88

9-
The operator introduces a new Custom Resource Definition (CRD) called `MCPServer` that represents an MCP server in Kubernetes. When you create an `MCPServer` resource, the operator automatically:
9+
The operator introduces two main Custom Resource Definitions (CRDs):
10+
11+
### MCPServer
12+
Represents an MCP server in Kubernetes. When you create an `MCPServer` resource, the operator automatically:
1013

1114
1. Creates a Deployment to run the MCP server
1215
2. Sets up a Service to expose the MCP server
1316
3. Configures the appropriate permissions and settings
1417
4. Manages the lifecycle of the MCP server
1518

19+
### MCPRegistry (Experimental)
20+
21+
> ⚠️ **Experimental Feature**: MCPRegistry requires `ENABLE_EXPERIMENTAL_FEATURES=true`
22+
23+
Represents an MCP server registry in Kubernetes. When you create an `MCPRegistry` resource, the operator automatically:
24+
25+
1. Synchronizes registry data from various sources (ConfigMap, Git)
26+
2. Deploys a Registry API service for server discovery
27+
3. Provides content filtering and image validation
28+
4. Manages automatic and manual synchronization policies
29+
30+
For detailed MCPRegistry documentation, see [REGISTRY.md](REGISTRY.md).
31+
1632
```mermaid
1733
---
1834
config:
@@ -107,7 +123,11 @@ helm upgrade -i toolhive-operator-crds oci://ghcr.io/stacklok/toolhive/toolhive-
107123
2. Install the operator:
108124

109125
```bash
126+
# Standard installation
110127
helm upgrade -i <release_name> oci://ghcr.io/stacklok/toolhive/toolhive-operator --version=<version> -n toolhive-system --create-namespace
128+
129+
# OR with experimental features (for MCPRegistry support)
130+
helm upgrade -i <release_name> oci://ghcr.io/stacklok/toolhive/toolhive-operator --version=<version> -n toolhive-system --create-namespace --set operator.features.experimental=true
111131
```
112132

113133
## Usage
@@ -236,9 +256,49 @@ permissionProfile:
236256
237257
The ConfigMap should contain a JSON permission profile.
238258
259+
### Creating an MCP Registry (Experimental)
260+
261+
> ⚠️ **Requires**: `operator.features.experimental=true`
262+
263+
First, create a ConfigMap containing ToolHive registry data. You must create this ConfigMap yourself; the operator does not manage it:
264+
265+
```bash
# Create ConfigMap from existing registry data
kubectl create configmap my-registry-data --from-file registry.json=pkg/registry/data/registry.json -n toolhive-system

# Or create from your own registry file
kubectl create configmap my-registry-data --from-file registry.json=/path/to/your/registry.json -n toolhive-system
```
272+
273+
Then create the MCPRegistry resource that references the ConfigMap:
274+
275+
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPRegistry
metadata:
  name: my-registry
  namespace: toolhive-system
spec:
  displayName: "My MCP Registry"
  source:
    type: configmap
    configmap:
      name: my-registry-data  # References the user-created ConfigMap
      key: registry.json      # Key in ConfigMap (default: "registry.json")
  syncPolicy:
    interval: "1h"
  filter:
    tags:
      include: ["production"]
      exclude: ["experimental"]
```
295+
296+
For complete MCPRegistry examples and documentation, see [REGISTRY.md](REGISTRY.md).
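To give a feel for what the `filter.tags` block in the manifest might mean in practice, here is a small sketch of include/exclude matching. The semantics are an assumption (exclusions win, and an empty include list admits everything); the authoritative rules are in REGISTRY.md. The `tagFilter` type and `allows` method are hypothetical names.

```go
package main

import "fmt"

// tagFilter mirrors the include/exclude lists from the manifest (hypothetical type).
type tagFilter struct {
	Include []string
	Exclude []string
}

// contains reports whether list holds value.
func contains(list []string, value string) bool {
	for _, item := range list {
		if item == value {
			return true
		}
	}
	return false
}

// allows reports whether a server with the given tags passes the filter.
// Assumed semantics: any excluded tag rejects the server; otherwise the
// server needs at least one included tag, unless Include is empty.
func (f tagFilter) allows(tags []string) bool {
	for _, tag := range tags {
		if contains(f.Exclude, tag) {
			return false
		}
	}
	if len(f.Include) == 0 {
		return true
	}
	for _, tag := range tags {
		if contains(f.Include, tag) {
			return true
		}
	}
	return false
}

func main() {
	filter := tagFilter{
		Include: []string{"production"},
		Exclude: []string{"experimental"},
	}
	fmt.Println(filter.allows([]string{"production"}))
	fmt.Println(filter.allows([]string{"production", "experimental"}))
	fmt.Println(filter.allows([]string{"staging"}))
}
```

Under these assumed rules, only the first server in `main` passes: the second carries an excluded tag and the third lacks an included one.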
297+
239298
## Examples
240299

241-
See the `examples/operator/mcp-servers/` directory for example MCPServer resources.
300+
- **MCPServer examples**: `examples/operator/mcp-servers/` directory
301+
- **MCPRegistry examples**: `examples/operator/mcp-registries/` directory
242302

243303
## Development
244304
