
Conversation

@JAORMX (Collaborator) commented Oct 10, 2025

Summary

This proposal extends the ToolHive Kubernetes operator to support proxying remote MCP servers with OAuth authentication, bringing the `thv run <remote-url>` CLI capability to Kubernetes.

Problem

Organizations currently have no way to deploy centralized proxies for remote MCP servers in Kubernetes, and therefore lack visibility, policy enforcement, and audit logs for employee interactions with external SaaS MCP services (e.g., Salesforce MCP, GitHub MCP).

Proposed Solution

  • MCPRemoteProxy CRD: Deploy stateless proxies for remote MCP servers
  • MCPExternalAuthConfig CRD: Optional token exchange configuration
  • JWT validation and identity extraction for policy enforcement
  • Reuses existing authz, audit, and tool filtering middleware

Seeking Feedback On

  1. Overall architecture and CRD design
  2. Security model and token handling approach
  3. Integration with existing middleware
  4. Edge cases or concerns not covered

🤖 Generated with Claude Code

The proxy is stateless and horizontally scalable, which lets enterprises gain centralized observability and policy enforcement over external MCP usage without requiring changes to remote services or storing user credentials.

Co-Authored-By: Claude <[email protected]>

codecov bot commented Oct 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.84%. Comparing base (81e6666) to head (1a50fe9).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2151   +/-   ##
=======================================
  Coverage   48.84%   48.84%           
=======================================
  Files         242      242           
  Lines       30722    30722           
=======================================
  Hits        15005    15005           
  Misses      14595    14595           
  Partials     1122     1122           


@ChrisJBurns (Collaborator) commented:

@claude Can you review the proposal made in this PR? Put your Kubernetes and security hat on for this one


claude bot commented Oct 10, 2025

Claude finished @ChrisJBurns's task


Remote MCP Proxy Proposal Review

I've completed a thorough review of the proposal from both Kubernetes and security perspectives. Overall, this is a well-architected proposal that builds intelligently on existing ToolHive infrastructure. Below is my detailed feedback.


✅ Strengths

Architecture & Design

  1. Excellent reuse of existing middleware - Leveraging auth, authz, audit, and tool filtering middleware is the right approach
  2. Stateless design - No token storage, horizontally scalable, follows cloud-native principles
  3. Clear CRD separation - MCPRemoteProxy vs MCPExternalAuthConfig provides good separation of concerns
  4. Defense in depth - Both proxy and remote MCP validate tokens independently
  5. Follows operator patterns - Consistent with existing MCPServer and MCPRegistry CRDs

Security Model

  1. JWT signature validation via JWKS is the right approach for token trust
  2. Token introspection fallback already implemented in pkg/auth/token.go
  3. No credential storage - Tokens flow through without persistence
  4. TLS enforcement mentioned for all connections
  5. Multi-tenancy via stateless architecture - Per-request validation ensures isolation

🔴 Critical Issues

1. Missing CRD Type Definition for MCPExternalAuthConfig

Severity: High

The proposal references MCPExternalAuthConfig CRD extensively (lines 223-237, 258) but doesn't define its structure.

Required:

// In cmd/thv-operator/api/v1alpha1/mcpexternalauthconfig_types.go
type MCPExternalAuthConfigSpec struct {
    // Type selects the auth mechanism; currently "tokenExchange".
    Type string `json:"type"`

    // TokenExchange holds OAuth 2.0 token exchange (RFC 8693) settings.
    TokenExchange *TokenExchangeConfig `json:"tokenExchange,omitempty"`
}

type TokenExchangeConfig struct {
    TokenURL        string    `json:"token_url"`
    ClientID        string    `json:"client_id"`
    ClientSecretRef SecretRef `json:"client_secret_ref"`
    Audience        string    `json:"audience,omitempty"`
    Scope           string    `json:"scope,omitempty"`
}

Questions:

  • Should this be namespace-scoped or cluster-scoped?
  • Should it support cross-namespace references? (I'd recommend NO for security)
  • What status fields are needed? (e.g., connection health, last exchange time)
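For illustration, an instance of such a CRD might look like this. This is a sketch assuming the field names from the Go types above; the Secret name and the shape of `client_secret_ref` (name/key) are hypothetical:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPExternalAuthConfig
metadata:
  name: salesforce-token-exchange
spec:
  type: tokenExchange
  tokenExchange:
    token_url: https://auth.company.com/oauth2/token
    client_id: toolhive-proxy
    client_secret_ref:
      name: token-exchange-creds   # hypothetical Secret in the same namespace
      key: client-secret
    audience: https://mcp.salesforce.com
```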

2. OIDC Configuration Ambiguity

Severity: High | Location: Lines 172-180, 250-255

The proposal shows two different uses of oidcConfig:

  • Base case: Validates tokens from remote MCP IDP (line 176: issuer: https://login.salesforce.com)
  • Token exchange: Validates company tokens (line 253: issuer: https://auth.company.com)

Problem: This overloading creates confusion. When externalAuthConfigRef is present, whose tokens should oidcConfig validate?

Recommendation:

# Option A: Make it explicit with two separate fields
spec:
  remoteURL: https://mcp.salesforce.com
  
  # For validating incoming tokens from employees/clients
  incomingAuth:
    oidcConfig: {...}
  
  # Optional: For exchanging tokens with remote service
  externalAuthConfigRef:
    name: salesforce-token-exchange

Or document clearly: "When externalAuthConfigRef is present, oidcConfig validates incoming tokens. When absent, oidcConfig validates tokens destined for the remote MCP."

3. Token Exchange Middleware Registration Missing

Severity: Medium | Location: Line 610

The proposal states: "⚠️ Needs registration in middleware factory (1-line fix)"

This needs to be addressed before implementation. Where exactly should this be registered? Looking at the existing auth middleware pattern in pkg/auth/middleware.go, you'll need:

  1. A middleware type constant in pkg/auth/tokenexchange/middleware.go
  2. Registration in the middleware factory (likely in pkg/transport or pkg/runner)
  3. Proper ordering guarantees (token exchange MUST run after auth)

4. RBAC Permissions Undefined

Severity: Medium

What Kubernetes RBAC permissions does the MCPRemoteProxy controller need? Looking at mcpserver_controller.go:141-150, you'll need kubebuilder RBAC markers like:

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpexternalauthconfigs,verbs=get;list;watch

⚠️ Security Concerns

1. JWKS Endpoint Trust

Location: Lines 468-472

Concern: If the remote IDP's JWKS endpoint is compromised, the proxy will accept forged tokens.

Mitigations (already mentioned but emphasize):

  • ✅ Certificate validation (mentioned line 477)
  • ✅ Periodic refresh with caching (line 537)
  • ⚠️ Missing: JWKS endpoint pinning or allowlist for production environments
  • ⚠️ Missing: Monitoring/alerting for JWKS changes

Recommendation: Add CRD field for JWKS certificate pinning:

oidcConfig:
  inline:
    issuer: https://login.salesforce.com
    jwksUrl: https://login.salesforce.com/.well-known/jwks.json
    jwksCertFingerprint: "sha256:abc123..."  # Optional pinning

2. Token Replay Prevention

Severity: Medium | Location: Architecture diagrams

Observation: The proposal doesn't mention token replay protection (e.g., jti claim tracking, nonce validation).

Assessment: Probably acceptable because:

  • Tokens are validated by both proxy and remote MCP (defense in depth)
  • Short-lived tokens (typical OAuth2 practice)
  • HTTPS prevents network-level replay

Recommendation: Document this explicitly in "Security Considerations" and recommend short token TTLs (e.g., 5-15 minutes).

3. Authorization Policy Context

Location: Lines 354-383

Excellent: Cedar policies can reference principal.email, principal.groups extracted from validated JWT.

Question: How are groups extracted? Different IDPs put groups in different claims:

  • Some use groups (array)
  • Some use roles (array)
  • Some use cognito:groups (AWS Cognito)
  • Some embed in nested structures

Recommendation: Add configurable claim mapping to OIDCConfig:

oidcConfig:
  inline:
    issuer: https://auth.company.com
    claimMappings:
      email: "email"
      groups: "cognito:groups"  # Or "roles", etc.
      name: "preferred_username"
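A minimal sketch of how such a mapping could be applied when extracting identity from decoded JWT claims. The helper and its signature are illustrative, not ToolHive's actual code; it tolerates the common JSON encoding where array claims decode to `[]any`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractGroups reads the claim named by mapping["groups"] from a decoded
// claims map and returns its string elements, ignoring non-string entries.
func extractGroups(claims map[string]any, mapping map[string]string) []string {
	raw, ok := claims[mapping["groups"]].([]any)
	if !ok {
		return nil
	}
	groups := make([]string, 0, len(raw))
	for _, g := range raw {
		if s, ok := g.(string); ok {
			groups = append(groups, s)
		}
	}
	return groups
}

func main() {
	payload := `{"email":"alice@example.com","cognito:groups":["admins","dev"]}`
	var claims map[string]any
	json.Unmarshal([]byte(payload), &claims)
	fmt.Println(extractGroups(claims, map[string]string{"groups": "cognito:groups"}))
	// [admins dev]
}
```

With this shape, switching an IDP from `cognito:groups` to `roles` is a CRD change, not a code change.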

4. Network Policy Not Enforced

Location: Line 479

"Kubernetes NetworkPolicies to restrict egress" is mentioned but not in the CRD design.

Recommendation: Consider adding to MCPRemoteProxySpec:

type MCPRemoteProxySpec struct {
    // ...
    
    // NetworkPolicy defines network access controls
    // +optional
    NetworkPolicy *NetworkPolicyConfig `json:"networkPolicy,omitempty"`
}

type NetworkPolicyConfig struct {
    // Enabled controls whether NetworkPolicy is created
    Enabled bool `json:"enabled,omitempty"`
    
    // AllowedEgress defines allowed egress destinations
    AllowedEgress []NetworkPolicyEgressRule `json:"allowedEgress,omitempty"`
}

🟡 Design Recommendations

1. Status Field Design

Location: Lines 304-309

The proposed status is minimal. Recommend adding:

type MCPRemoteProxyStatus struct {
    Phase       MCPRemoteProxyPhase `json:"phase,omitempty"`
    URL         string              `json:"url,omitempty"`
    ExternalURL string              `json:"externalURL,omitempty"`
    Conditions  []metav1.Condition  `json:"conditions,omitempty"`
    
    // New fields:
    LastHealthCheck    *metav1.Time `json:"lastHealthCheck,omitempty"`
    RemoteReachable    bool         `json:"remoteReachable,omitempty"`
    TokenExchangeReady bool         `json:"tokenExchangeReady,omitempty"` // If configured
    ObservedGeneration int64        `json:"observedGeneration,omitempty"`
}

Conditions should include:

  • RemoteAvailable - Can proxy reach remote MCP?
  • AuthConfigured - Is OIDC/token exchange properly configured?
  • Ready - Overall readiness
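For example, a healthy proxy's status might render as follows. This is a sketch using the fields and condition types proposed above; reasons and the service URL format are illustrative:

```yaml
status:
  phase: Ready
  url: http://salesforce-mcp-proxy.default.svc.cluster.local:8080
  remoteReachable: true
  tokenExchangeReady: true
  observedGeneration: 3
  conditions:
  - type: RemoteAvailable
    status: "True"
    reason: RemoteHealthy
  - type: AuthConfigured
    status: "True"
    reason: OIDCValidated
  - type: Ready
    status: "True"
    reason: Reconciled
```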

2. Consistent with DESIGN.md Principles

Location: cmd/thv-operator/DESIGN.md

Your proposal correctly uses dedicated CRD attributes for business logic (remoteURL, oidcConfig, authzConfig) and would defer to PodTemplateSpec for infrastructure concerns. ✅

However: The proposal doesn't show PodTemplateSpec support in MCPRemoteProxySpec. Recommend adding:

type MCPRemoteProxySpec struct {
    // ...
    
    // PodTemplateSpec for infrastructure customization (node selection, resources, etc.)
    // +optional
    PodTemplateSpec *corev1.PodTemplateSpec `json:"podTemplateSpec,omitempty"`
}

3. Validation Webhooks

Not mentioned in proposal

You'll need admission webhooks for:

  1. Validating MCPRemoteProxy:

    • Ensure remoteURL is valid HTTPS
    • Ensure externalAuthConfigRef exists in same namespace (if specified)
    • Ensure oidcConfig OR externalAuthConfigRef is present (not neither)
    • Validate port ranges
  2. Validating MCPExternalAuthConfig:

    • Ensure client_secret_ref secret exists
    • Validate token_url is valid HTTPS
  3. Defaulting: Set default port (8080), default replicas (1), etc.

Example from existing operator:

// In cmd/thv-operator/api/v1alpha1/mcpremoteproxy_webhook.go
func (r *MCPRemoteProxy) ValidateCreate() error {
    if r.Spec.OIDCConfig == nil && r.Spec.ExternalAuthConfigRef == nil {
        return fmt.Errorf("must specify either oidcConfig or externalAuthConfigRef")
    }
    if u, err := url.Parse(r.Spec.RemoteURL); err != nil || u.Scheme != "https" {
        return fmt.Errorf("remoteURL must be a valid HTTPS URL")
    }
    // ...
}

4. Observability

Location: Lines 534-546 (Performance Considerations)

Good: Mentions JWKS caching and introspection caching.

Missing: Metrics to expose. Recommend:

# Request metrics
toolhive_remote_proxy_requests_total{proxy="salesforce-mcp",status="success|failure"}
toolhive_remote_proxy_request_duration_seconds{proxy="salesforce-mcp"}

# Token metrics
toolhive_remote_proxy_token_validations_total{proxy="salesforce-mcp",result="valid|invalid|error"}
toolhive_remote_proxy_token_exchanges_total{proxy="salesforce-mcp",result="success|failure"}

# Cache metrics
toolhive_remote_proxy_jwks_cache_hits_total{proxy="salesforce-mcp"}
toolhive_remote_proxy_jwks_refreshes_total{proxy="salesforce-mcp"}

🟢 Kubernetes-Specific Feedback

1. Controller Implementation Path

Based on existing patterns, you'll create:

cmd/thv-operator/
├── api/v1alpha1/
│   ├── mcpremoteproxy_types.go (NEW)
│   ├── mcpexternalauthconfig_types.go (NEW)
│   ├── mcpremoteproxy_webhook.go (NEW)
│   └── mcpexternalauthconfig_webhook.go (NEW)
└── controllers/
    ├── mcpremoteproxy_controller.go (NEW)
    └── mcpremoteproxy_runconfig.go (NEW)

2. Resource Ownership

The Deployment and Service created by MCPRemoteProxy should have owner references:

if err := controllerutil.SetControllerReference(proxy, deployment, r.Scheme); err != nil {
    return err
}

This ensures automatic cleanup when proxy is deleted.

3. Proxy Deployment Strategy

The proposal doesn't specify. Recommend:

Deployment.Spec.Strategy = appsv1.DeploymentStrategy{
    Type: appsv1.RollingUpdateDeploymentStrategyType,
    RollingUpdate: &appsv1.RollingUpdateDeployment{
        MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 0},
        MaxSurge:       &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
    },
}

This ensures zero-downtime updates.

4. Service Account for Proxy

Will the proxy pods need a service account with specific permissions? If externalAuthConfigRef reads secrets, you'll need:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mcpremoteproxy-secrets-reader  # illustrative name
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
  resourceNames: ["token-exchange-creds"]  # Specific secrets only
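The Role only takes effect once bound to the proxy pods' ServiceAccount. A hedged companion sketch (both names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mcpremoteproxy-secrets-reader  # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mcpremoteproxy-secrets-reader
subjects:
- kind: ServiceAccount
  name: salesforce-mcp-proxy  # illustrative proxy ServiceAccount
```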

📋 Edge Cases & Concerns

1. Token Expiration Mid-Request

What happens if a token expires while a long-running MCP tool call is executing?

Recommendation: Document that clients should ensure tokens have sufficient TTL for expected operation duration.

2. Remote MCP Authentication Changes

If the remote MCP changes its OIDC configuration (new issuer, rotated keys), the proxy will start rejecting tokens.

Mitigation: JWKS refresh handles key rotation. But issuer changes require proxy reconfiguration.

3. Cross-Namespace References

The proposal doesn't address whether externalAuthConfigRef can reference resources in other namespaces.

Recommendation: Enforce same-namespace only (consistent with ToolConfigRef in mcpserver_types.go:477-481):

// The referenced MCPExternalAuthConfig must exist in the same namespace.
// Cross-namespace references are not supported for security and isolation reasons.

4. Multiple Proxies to Same Remote MCP

Can multiple MCPRemoteProxy resources proxy to the same remote URL with different configs?

Use case: Different teams with different authz policies for same external service.

Recommendation: Yes, allow this. Each proxy is independent with its own policies.

5. Graceful Degradation

If token exchange is configured but the exchange endpoint is down, should the proxy:

  • A) Fail closed (reject all requests) ✅ Recommended
  • B) Fall through to direct token forwarding ❌ Security risk

Recommendation: Fail closed. Document this behavior.


🎯 Summary Recommendations

Must Fix Before Implementation

  1. ✅ Define MCPExternalAuthConfig CRD types
  2. ✅ Clarify oidcConfig usage in token exchange vs direct auth scenarios
  3. ✅ Register token exchange middleware in factory
  4. ✅ Define RBAC permissions for controller
  5. ✅ Add validation webhooks

Should Consider

  1. ✅ Add JWKS certificate pinning support
  2. ✅ Document token replay prevention strategy
  3. ✅ Add configurable claim mapping for groups/roles
  4. ✅ Enhance status fields with health checks
  5. ✅ Add NetworkPolicy support in CRD
  6. ✅ Define prometheus metrics
  7. ✅ Add PodTemplateSpec support for consistency

Nice to Have

  1. ✅ Add examples for common IDPs (Okta, Auth0, Keycloak)
  2. ✅ Document migration path from CLI to operator
  3. ✅ E2E test scenarios with mock remote MCP

✨ Final Assessment

This is a strong proposal that demonstrates a deep understanding of both the MCP protocol and Kubernetes operator patterns. The architecture is sound, the security model is reasonable, and the reuse of existing middleware is excellent.

The main gaps are around CRD completeness and operational details (RBAC, webhooks, observability). These are standard implementation concerns, not fundamental design flaws.

Recommendation: Proceed with implementation after addressing the critical issues above.


Great work on this proposal! The level of detail and the diagrams make it very clear what you're building. Looking forward to seeing this feature in ToolHive.



JAORMX commented Oct 10, 2025

Note that the external auth config CRD is defined in #2150.
