@amirejaz amirejaz commented Nov 25, 2025

Add Feature Flags for Controller Groups and Optional Virtual MCP CRD Installation

Summary

This PR implements feature flags to enable/disable controller groups in the ToolHive operator and adds the ability to skip installing Virtual MCP CRDs. This allows users to deploy a minimal operator configuration when they only need core MCP server management features.

Related Issues

Changes

1. Feature Flags Implementation

Added three environment variable flags to control controller groups:

  • ENABLE_SERVER (default: true)

    • Controls: MCPServer, MCPExternalAuthConfig, MCPRemoteProxy, and ToolConfig controllers
    • When disabled: Skips all server-related controllers and field indexing
  • ENABLE_REGISTRY (default: true)

    • Controls: MCPRegistry controller
    • When disabled: Skips MCPRegistry controller
    • Also affects MCPServer image validation mode (registry-enforcing vs always-allow)
  • ENABLE_AGGREGATION (default: true)

    • Controls: VirtualMCPServer, MCPGroup controllers and their webhooks
    • Dependency: Requires ENABLE_SERVER=true (validated with warning log)
    • When disabled: Skips Virtual MCP aggregation features

2. Optional Virtual MCP CRD Installation

Added crds.install.virtualMCP option to the toolhive-operator-crds Helm chart:

  • When set to false: Skips installing VirtualMCPServer and VirtualMCPCompositeToolDefinition CRDs
  • Saves approximately 54KB of CRDs
  • Users should also set ENABLE_AGGREGATION=false in the operator to prevent controller errors

3. Code Refactoring

Refactored setupControllersAndWebhooks function into smaller, focused functions:

  • setupServerControllers() - Sets up server-related controllers
  • setupRegistryController() - Sets up registry controller
  • setupAggregationControllers() - Sets up aggregation controllers and webhooks

This improves maintainability and makes the code easier to test.

4. Image Validation Logic

Fixed image validation mode selection:

  • When ENABLE_REGISTRY=true: MCPServer uses ImageValidationRegistryEnforcing
  • When ENABLE_REGISTRY=false: MCPServer uses ImageValidationAlwaysAllow
  • Previously, this was set incorrectly after controller setup
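
The mode selection above can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the ImageValidation type, its string values, and the imageValidationFor helper are hypothetical, and only the two constant names come from the description.

```go
package main

import "fmt"

// ImageValidation is a hypothetical stand-in for the operator's
// validation-mode type; the string values are illustrative only.
type ImageValidation string

const (
	ImageValidationAlwaysAllow       ImageValidation = "AlwaysAllow"
	ImageValidationRegistryEnforcing ImageValidation = "RegistryEnforcing"
)

// imageValidationFor (hypothetical helper) derives the mode from the
// registry feature flag. Computing it before any controller is built
// means the MCPServer controller is constructed with the final value.
func imageValidationFor(enableRegistry bool) ImageValidation {
	if enableRegistry {
		return ImageValidationRegistryEnforcing
	}
	return ImageValidationAlwaysAllow
}

func main() {
	fmt.Println(imageValidationFor(true))
	fmt.Println(imageValidationFor(false))
}
```

Deriving the mode from the flag in one place, before controller setup, avoids the ordering problem noted in the last bullet.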

Usage Examples

Minimal Deployment (Core Server Management Only)

# Skip Virtual MCP CRDs
helm upgrade -i toolhive-operator-crds oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds \
  --set crds.install.virtualMCP=false

# Disable registry and aggregation features
helm upgrade -i toolhive-operator oci://ghcr.io/stacklok/toolhive/toolhive-operator \
  -n toolhive-system --create-namespace \
  --set operator.env.ENABLE_REGISTRY=false \
  --set operator.env.ENABLE_AGGREGATION=false

Server and Registry Deployment (No Aggregation)

# Enable only server and registry features
helm upgrade -i toolhive-operator oci://ghcr.io/stacklok/toolhive/toolhive-operator \
  -n toolhive-system --create-namespace \
  --set operator.env.ENABLE_AGGREGATION=false

Backward Compatibility

  • All feature flags default to true when not set
  • Existing deployments continue to work without any changes
  • CRD installation defaults to true (installs all CRDs)

Documentation

  • Updated operator-crds/README.md with feature flags documentation
  • Updated operator/values.yaml with feature flag comments
  • Added usage examples and dependency information

Testing

  • ✅ Code compiles successfully
  • ✅ All linting checks pass
  • ✅ Functionality verified with feature flags enabled/disabled

@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Nov 25, 2025
Collaborator

@JAORMX JAORMX left a comment

Let's also add an environment variable to disable the controllers if this is enabled.

@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed size/S Small PR: 100-299 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Nov 25, 2025
codecov bot commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 0% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.38%. Comparing base (c73c5ef) to head (07a8538).

Files with missing lines    Patch %    Lines
cmd/thv-operator/main.go    0.00%      53 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2729      +/-   ##
==========================================
- Coverage   56.48%   56.38%   -0.11%     
==========================================
  Files         319      319              
  Lines       30943    30992      +49     
==========================================
- Hits        17479    17475       -4     
- Misses      11960    12012      +52     
- Partials     1504     1505       +1     

@amirejaz amirejaz requested a review from JAORMX November 25, 2025 17:44
Collaborator

@JAORMX JAORMX left a comment

@amirejaz what about the virtual MCP controllers? Shouldn't those be handled via a boolean as well? I was thinking the Helm boolean would set the ENABLE_VMCP (or similar) environment variable in the main operator deployment and thus allow us to toggle easily. I'm not sure we should expose environment variables like this; instead, we should use Helm flags with proper documentation.

Contributor

jhrozek commented Nov 26, 2025

@dmartinol FYI you were interested in reviewing this on our call

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Nov 26, 2025
# -- Install Virtual MCP CRDs (VirtualMCPServer and VirtualMCPCompositeToolDefinition).
# Users who only need core MCP server management can set this to false to skip
# installing Virtual MCP aggregation features (saves ~54KB of CRDs).
virtualMCP: true
Collaborator

pls add the same implementation to disable the registry and server CRDs according to the specs at #2564

Contributor Author

Good point, will update the PR.

Contributor Author

done

Collaborator

@dmartinol dmartinol left a comment

my concern is that there are mixed CRDs and controllers that seem not to be managed, at least not as defined in the #2564
and I fear we should also anticipate the removal of unneeded controllers from the platform, as requested in #2662 so that we can reduce the number of controllers to handle by these flags

Collaborator

I think these files are generated, so renaming them is not enough; you should update the operator-manifests task.
anyway, these

Contributor Author

updated operator-manifest

Collaborator

not a file to be edited IMO, ask @ChrisJBurns

Collaborator

You shouldn't need to modify this file unless you're providing information that you want generated onto the README.md. In this case it may be ok. Otherwise if it's only on the readme.md it will get overridden by the gotmpl

helm uninstall <release_name>
```

### Skipping Virtual MCP CRDs
Collaborator

the scope should be extended to disable registry CRD

Collaborator

@dmartinol dmartinol left a comment

I assume @jhrozek will also share his thoughts here.
Once we agree on the design, pls also update the ITs to re-run the existing tests with the other features disabled

id: git-check
run: |
- git diff --exit-code deploy/charts/operator-crds/crds || echo "crd-changes=true" >> $GITHUB_OUTPUT
+ git diff --exit-code deploy/charts/operator-crds/crds deploy/charts/operator-crds/crds-server deploy/charts/operator-crds/crds-registry deploy/charts/operator-crds/crds-virtualmcp || echo "crd-changes=true" >> $GITHUB_OUTPUT
Collaborator

what about using a glob pattern like:
... deploy/charts/operator-crds/crds* ...?

Contributor Author

done

@@ -0,0 +1,5 @@
{{- if .Values.crds.install.server }}
Collaborator

IIRC, external auth can also be used with virtual MCPs, so this should be in OR with .Values.crds.install.virtualMCP

Contributor Author

Done

id: git-check
run: |
- git diff --exit-code deploy/charts/operator-crds/crds || echo "crd-changes=true" >> $GITHUB_OUTPUT
+ git diff --exit-code deploy/charts/operator-crds/crds deploy/charts/operator-crds/crds-server deploy/charts/operator-crds/crds-registry deploy/charts/operator-crds/crds-virtualmcp || echo "crd-changes=true" >> $GITHUB_OUTPUT
Collaborator

BTW: why aren't we keeping all the original CRDs in the same folder instead? This would simplify all the involved steps (of course, ITs should list all the individual files)

Contributor Author

If we keep all of them under crds/, Helm automatically installs all the CRDs. I moved all of them into one folder, crd-files.

Collaborator

this is what I meant: the same folder but different from the one used as the chart template

- CRDDirectoryPaths: []string{filepath.Join("..", "..", "..", "..", "deploy", "charts", "operator-crds", "crds")},
+ CRDDirectoryPaths: []string{
+ filepath.Join("..", "..", "..", "..", "deploy", "charts", "operator-crds", "crds-server"),
+ filepath.Join("..", "..", "..", "..", "deploy", "charts", "operator-crds", "crds-virtualmcp"),
Collaborator

I see MCPGroup is needed here because the association is bottom-up, from server to group: the server controller should take this into account and avoid, e.g., using validateGroupRef

testEnv = &envtest.Environment{
- CRDDirectoryPaths: []string{filepath.Join("..", "..", "..", "..", "deploy", "charts", "operator-crds", "crds")},
+ CRDDirectoryPaths: []string{
+ filepath.Join("..", "..", "..", "..", "deploy", "charts", "operator-crds", "crds-server"),
Collaborator

Even if this is logically correct, I imagine the tests would work even w/o the server CRD.
Anyway, I imagine that the vMCP controllers should fail to start if the server functionality is disabled

// Check feature flags
enableServer := isFeatureEnabled("ENABLE_SERVER", true)
enableRegistry := isFeatureEnabled("ENABLE_REGISTRY", true)
enableAggregation := isFeatureEnabled("ENABLE_AGGREGATION", true)
Collaborator

I think enableAggregation also depends on enableServer

Contributor

I think this is going to become quite brittle quite quickly. Could we have a simple dependency Go map where we express the dependencies?

Another possible simplification would be to not implement ENABLE_SERVER for now but only ENABLE_REGISTRY and ENABLE_AGGREGATION and treat ENABLE_SERVER as a dependency.

@dmartinol did you have a use case for ENABLE_REGISTRY only? I know it was in the original issue but I wasn't sure if it was just for completeness

Collaborator

> @dmartinol did you have a use case for ENABLE_REGISTRY only? I know it was in the original issue but I wasn't sure if it was just for completeness

I imagine it would be for those wanting to deploy an "official" registry in k8s without caring about the toolhive tools. The same could be probably achieved via an ad-hoc deployment that someone tracked in the design document as:

> In Kubernetes via a Helm chart (Optional, if we have enough time)

(I'm not sure this is tracked for the MVP, anyway)

Contributor Author

> I think this is going to become quite brittle quite quickly. Could we have a simple dependency Go map where we express the dependencies?

Moved the dependencies to a Go map.
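
For readers following the thread, here is a minimal sketch of what a dependency map could look like. It is an illustration only: featureDependencies and resolveFeatures are hypothetical names, and forcing a dependent feature off is an assumption (the PR description only mentions a warning log when ENABLE_AGGREGATION is set without ENABLE_SERVER).

```go
package main

import (
	"fmt"
	"sort"
)

// featureDependencies (hypothetical) maps a feature flag to the flags it
// requires. The single entry reflects the dependency stated in the PR.
var featureDependencies = map[string][]string{
	"ENABLE_AGGREGATION": {"ENABLE_SERVER"},
}

// resolveFeatures returns a copy of requested in which every feature with a
// disabled dependency is switched off, plus the sorted names of the features
// it had to switch off. Assumption: unmet dependencies disable the feature.
func resolveFeatures(requested map[string]bool) (map[string]bool, []string) {
	resolved := make(map[string]bool, len(requested))
	for name, on := range requested {
		resolved[name] = on
	}
	var disabled []string
	// Repeat until nothing changes so transitive dependencies are honoured.
	for changed := true; changed; {
		changed = false
		for name, deps := range featureDependencies {
			if !resolved[name] {
				continue
			}
			for _, dep := range deps {
				if !resolved[dep] {
					resolved[name] = false
					disabled = append(disabled, name)
					changed = true
					break
				}
			}
		}
	}
	sort.Strings(disabled)
	return resolved, disabled
}

func main() {
	resolved, disabled := resolveFeatures(map[string]bool{
		"ENABLE_SERVER":      false,
		"ENABLE_REGISTRY":    true,
		"ENABLE_AGGREGATION": true,
	})
	fmt.Println(resolved["ENABLE_AGGREGATION"], disabled)
}
```

Keeping the relationships in data means a new flag only needs a new map entry, not another hand-written if statement in main.go.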

Contributor

@jhrozek jhrozek left a comment

I need to digest the changes better; submitting a first pass.

--set crds.install.virtualMCP=false
```

This saves approximately 54KB of CRDs and is useful for deployments that don't require Virtual MCP aggregation features.
Contributor

Let's not include the size, it screams "AI generated".

Contributor Author

removed


// setupServerControllers sets up server-related controllers (MCPServer, MCPExternalAuthConfig, MCPRemoteProxy, ToolConfig)
func setupServerControllers(mgr ctrl.Manager, enableRegistry bool) error {
// Create a shared platform detector for all controllers
Contributor

At least the comment is a bit misleading. I wonder if there is a reason not to pass the detector down from the caller, IIRC it already uses sync.Once?

Contributor Author

passed platform detector from caller

if !found {
return defaultValue
}
return strings.EqualFold(value, "true")
Contributor

strconv.ParseBool ?

Contributor Author

Done

value: {{ .Values.operator.features.server | quote }}
- name: ENABLE_REGISTRY
value: {{ .Values.operator.features.registry | quote }}
- name: ENABLE_AGGREGATION
Contributor

Could we just call this ENABLE_VMCP ?

Contributor Author

Done

Contributor

I'm not sure if this move is correct, MCPGroup is usable even without vMCP, moving it here would break the use-case.

Collaborator

what use case is it?

# Move server CRDs
- cmd: |
mv deploy/charts/operator-crds/crds/toolhive.stacklok.dev_mcpservers.yaml deploy/charts/operator-crds/crds-server/ 2>/dev/null || true
mv deploy/charts/operator-crds/crds/toolhive.stacklok.dev_mcpexternalauthconfigs.yaml deploy/charts/operator-crds/crds-server/ 2>/dev/null || true
Contributor

Why are we ignoring errors?

@amirejaz amirejaz force-pushed the operator-crds-optional-virtualmcp branch from 2ceca6b to 215e33a Compare November 28, 2025 01:52
@amirejaz amirejaz force-pushed the operator-crds-optional-virtualmcp branch from 215e33a to d2c3613 Compare November 28, 2025 02:00
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Nov 28, 2025
@amirejaz amirejaz marked this pull request as ready for review November 28, 2025 15:24
helm uninstall <release_name>
```

### Skipping CRDs
Collaborator

This is great but maybe a summary table could be more effective.
Can be fixed in another PR if we think it's needed.

# Users who only need server management without registry features can set this to false
# to skip installing the registry CRD.
registry: true
# -- Install Virtual MCP CRDs (VirtualMCPServer and VirtualMCPCompositeToolDefinition).
Collaborator

and MCPGroup

- get
- list
- watch
{{- if not .Values.operator.testMode }}
Collaborator

could you explain where this operator.testMode comes from now?

Contributor Author

operator.testMode comes from our chart-testing setup and is used in CI to disable operator RBAC/resources to avoid Helm ownership conflicts during ct install.

Collaborator

@dmartinol dmartinol left a comment

/lgtm leaving final approval to @jhrozek
Thanks!

Development

Successfully merging this pull request may close these issues.

  • Add optional installation flag for Virtual MCP CRDs in operator-crds Helm chart
  • Provide feature flags for enabling/disabling controllers