Health gate in nginx config generation blocks routing to reachable servers #496

@WPrintz

Summary

When servers are registered and enabled, nginx_service.py only generates active nginx location blocks for servers that pass the internal Python health check. Servers that fail the health check get commented-out location blocks, making them permanently unreachable via the MCP gateway proxy -- even when nginx itself can successfully proxy to them.

This creates a Catch-22 in certain deployment environments (e.g., ECS Service Connect with HTTP-type Cloud Map services): the Python health checker cannot resolve internal service hostnames via the system DNS resolver, so servers are marked unhealthy, so nginx never gets their routes, so they can never receive traffic -- even though nginx can resolve and proxy to those same hostnames via the Envoy sidecar.

Observed Behavior

  1. Register and enable MCP servers via the API (mix of internal and external)
  2. Health checker runs: external servers pass, internal servers fail with [Errno -2] Name or service not known
  3. generate_config_async() generates nginx config with only healthy servers as active location blocks; unhealthy servers are commented out
  4. Requests to /{server}/mcp for internal servers return 405 Method Not Allowed (no matching location block)
  5. Requests to healthy servers return 200 OK
currenttime/mcp:          HTTP 405  (internal, health check fails)
mcpgw/mcp:                HTTP 405  (internal, health check fails)
realserverfaketools/mcp:  HTTP 405  (internal, health check fails)
cloudflare-docs/mcp:      HTTP 200  (external, health check passes)

Root Cause

1. Health gate in config generation (nginx_service.py)

# nginx_service.py, generate_config_async()
for path, server_info in servers.items():
    proxy_pass_url = server_info.get("proxy_pass_url")
    if proxy_pass_url:
        health_status = health_service.server_health_status.get(path, HealthStatus.UNKNOWN)

        if HealthStatus.is_healthy(health_status):
            # Active location block -- server is routable
            transport_blocks = self._generate_transport_location_blocks(path, server_info)
            location_blocks.extend(transport_blocks)
        else:
            # Commented-out location block -- server is NOT routable
            commented_block = f"""
#    location {path}/ {{
#        # Service currently unhealthy (status: {health_status})
#        ...
#    }}"""
            location_blocks.append(commented_block)

The health status acts as a routing gate rather than being purely informational. Servers that fail the health check are excluded from nginx routing entirely.

2. DNS resolution mismatch between nginx and Python

The health checker uses Python's httpx/aiohttp, which resolves DNS via the system resolver. In environments where DNS resolution is handled by a sidecar proxy (e.g., ECS Service Connect with Envoy), the system resolver cannot resolve internal service hostnames -- but nginx can, because it goes through the sidecar.

This means the health checker's view of reachability does not match nginx's actual ability to proxy traffic.
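The mismatch can be checked in isolation with a minimal sketch. It assumes the health checker's HTTP client ultimately resolves names through the operating system resolver, as `socket.getaddrinfo` does. The helper name `system_resolver_can_resolve` is hypothetical and not part of the registry code.

```python
import socket


def system_resolver_can_resolve(hostname: str, port: int = 80) -> bool:
    """Return True if the OS resolver can resolve `hostname`.

    Hypothetical helper for illustration; it only tests name resolution,
    not whether the service behind the name is reachable.
    """
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Same error family as "[Errno -2] Name or service not known".
        return False
```

Run inside the registry container against an internal service hostname, this would be expected to return False in the affected environments, even though nginx proxies to the same name successfully.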

3. Registration does not trigger nginx config regeneration

register_server() in server_service.py creates the database record and indexes it for search, but does not call generate_config() and reload_nginx(). Both update_server() and toggle_service() do trigger regeneration. This is a gap in the code path: a newly registered server requires a separate toggle or service restart before nginx picks it up.

4. PID file race condition on container startup

During container startup, FastAPI generates the nginx config and calls reload_nginx() before the entrypoint script has started nginx. The reload fails with:

[error] 186#186: invalid PID number "" in "/run/nginx.pid"

This is harmless since nginx picks up the config on its first start, but it produces misleading error logs.

Impact

  • Any server that fails the health check is completely unreachable via the MCP gateway proxy, regardless of whether nginx could successfully proxy to it
  • The 405 response is confusing -- it points to a disallowed HTTP method and gives no indication that the route was withheld because the server is unhealthy
  • In environments where the Python health checker cannot reach internal services (but nginx can), all internal servers are permanently blocked
  • Restarting the container does not help -- the same health check failure occurs on every startup

Proposed Options

The health gate serves a purpose: in production environments with many servers, you may not want to route traffic to a server that's genuinely down. But the current implementation conflates "health checker can't reach the server" with "nginx can't reach the server," which are not the same thing in all deployment environments.

We'd like feedback from maintainers and the community on which approach best balances safety and usability across the range of deployment environments this project supports.

Option A: Fix the health checker, keep the gate

Add a fallback health check path that goes through nginx itself (e.g., curl localhost/{server}/mcp) rather than directly hitting the upstream. If the direct Python check fails but the nginx-proxied check succeeds, mark the server as healthy.

Pros:

  • Preserves the safety of the health gate -- genuinely down servers are still excluded
  • Accurate health status in all environments where nginx can proxy but Python cannot resolve DNS
  • No behavior change for existing deployments where direct health checks work

Cons:

  • More complex to implement (two-tier health check)
  • Nginx must be running before the fallback check can work (interacts with the startup race condition)
  • Health check now depends on nginx config already containing the location block (circular dependency unless the first check always generates the block)
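The two-tier check in Option A could be sketched as follows. This is only an illustration of the fallback logic: the function name `is_healthy_with_fallback` and the two injected probes are assumptions, and the real probes would need to issue the direct and nginx-proxied requests described above.

```python
import asyncio
from typing import Awaitable, Callable

# A probe is any zero-argument coroutine function returning True when healthy.
Check = Callable[[], Awaitable[bool]]


async def is_healthy_with_fallback(direct_check: Check, proxied_check: Check) -> bool:
    """Try the direct probe first; fall back to the nginx-proxied probe.

    Hypothetical sketch: exceptions from either probe (for example a DNS
    failure in the Python process) are treated as a failed probe.
    """
    try:
        if await direct_check():
            return True
    except Exception:
        pass  # fall through to the proxied probe
    try:
        return await proxied_check()
    except Exception:
        return False
```

Note that this sketch does not resolve the circular dependency listed in the cons: the proxied probe can only succeed if nginx already has an active location block for the server.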

Option B: Configurable bypass via environment variable

Add an environment variable (e.g., NGINX_ROUTE_UNHEALTHY_SERVERS=true) that, when set, generates active location blocks for all enabled servers regardless of health status. Default to current behavior (gated).

Pros:

  • No behavior change for existing deployments (opt-in)
  • Simple to implement
  • Lets operators in affected environments (ECS Service Connect, Kubernetes with sidecar proxies, etc.) fix their deployment without upstream code changes

Cons:

  • Adds a configuration knob that users need to discover and understand
  • Doesn't fix the underlying mismatch -- just works around it
  • Operators may not realize they need this flag until they hit the issue
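A minimal sketch of the Option B gate, assuming the variable name suggested above. The helper `should_generate_active_block` and the set of accepted truthy values are hypothetical choices, not existing registry behavior.

```python
import os
from typing import Mapping, Optional

# Assumed set of values that enable the bypass (case-insensitive).
_TRUTHY = {"1", "true", "yes", "on"}


def should_generate_active_block(
    is_healthy: bool, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Decide whether a server gets an active nginx location block.

    Healthy servers are always routed. Unhealthy servers are routed only
    when NGINX_ROUTE_UNHEALTHY_SERVERS is set to a truthy value, so the
    default matches the current gated behavior.
    """
    env = os.environ if env is None else env
    raw = env.get("NGINX_ROUTE_UNHEALTHY_SERVERS", "")
    bypass = raw.strip().lower() in _TRUTHY
    return is_healthy or bypass
```

In generate_config_async(), the existing `HealthStatus.is_healthy(health_status)` condition could be wrapped with a call like this one.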

Option C: Replace commented blocks with 503 error-returning blocks

Instead of commenting out the location block for unhealthy servers, generate an active block that returns 503 Service Unavailable with a JSON body explaining the server is unhealthy.

location /currenttime/ {
    # Service currently unhealthy (status: unhealthy)
    default_type application/json;
    return 503 '{"error": "service_unavailable", "message": "Server is registered but currently unhealthy", "server": "currenttime"}';
}
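A block like the one above could be produced by a small generator in place of the commented-out template. This is a sketch under assumptions: the function name `unhealthy_location_block` is hypothetical, and the path is restricted to a conservative character set so that nothing needing escaping inside the nginx single-quoted string (quotes, `$`, backslashes) can reach the config.

```python
import json
import re

# Assumed constraint: a single path segment of letters, digits, "_" or "-".
_SAFE_PATH = re.compile(r"/[A-Za-z0-9_-]+")


def unhealthy_location_block(path: str, health_status: str) -> str:
    """Build an active location block that returns 503 with a JSON body."""
    if not _SAFE_PATH.fullmatch(path):
        raise ValueError(f"unsupported server path: {path!r}")
    body = json.dumps(
        {
            "error": "service_unavailable",
            "message": "Server is registered but currently unhealthy",
            "server": path.lstrip("/"),
        }
    )
    return (
        f"location {path}/ {{\n"
        f"    # Service currently unhealthy (status: {health_status})\n"
        f"    default_type application/json;\n"
        f"    return 503 '{body}';\n"
        f"}}"
    )
```

The `health_status` value is interpolated into an nginx comment, so this sketch assumes it is a short single-line string such as `unhealthy`.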

Pros:

  • Route always exists -- no confusing 405
  • Clients get an informative, structured error with the actual reason
  • No traffic reaches a dead upstream (preserves the safety intent of the gate)
  • No configuration flags needed

Cons:

  • Still blocks routing to servers that nginx could actually reach (the DNS mismatch problem remains)
  • Requires nginx reload when health status changes (already happens today)

Option D: Remove the health gate entirely

Always generate active location blocks for all enabled servers. Health status becomes purely informational (displayed in the UI, available via API) but does not control routing.

for path, server_info in servers.items():
    proxy_pass_url = server_info.get("proxy_pass_url")
    if proxy_pass_url:
        transport_blocks = self._generate_transport_location_blocks(path, server_info)
        location_blocks.extend(transport_blocks)

If a server is truly down, nginx returns 502 Bad Gateway -- the correct HTTP semantics for an unreachable upstream.

Pros:

  • Simplest implementation
  • Eliminates the DNS mismatch problem entirely
  • 502 is semantically correct for an unreachable upstream (vs 405 which is misleading)
  • Reduces nginx reload frequency (no reload needed on health status transitions)
  • Health status remains visible in the UI for operators

Cons:

  • Clients may receive 502 errors for genuinely down servers (instead of being shielded from them)
  • Changes behavior for all existing deployments
  • In environments with many registered-but-broken servers, nginx config will contain location blocks that always 502

Additional Fix: Add nginx regen to register_server()

Regardless of which option is chosen for the health gate, register_server() should trigger nginx config regeneration, consistent with update_server() and toggle_service():

async def register_server(self, server_info: Dict[str, Any]) -> bool:
    result = await self._repo.create(server_info)
    if result:
        # ... existing search indexing ...

        # Regenerate nginx config if server is enabled
        if await self._repo.get_state(server_info["path"]):
            from ..core.nginx_service import nginx_service
            enabled_servers = {
                service_path: await self.get_server_info(service_path)
                for service_path in await self.get_enabled_services()
            }
            nginx_service.generate_config(enabled_servers)
            nginx_service.reload_nginx()

    return result

Additional Fix: Suppress reload when nginx is not running

In reload_nginx(), check if the PID file exists and is non-empty before attempting reload, to avoid the misleading error log on container startup.
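A guard along these lines could run at the top of reload_nginx(). The helper name `nginx_pid_available` is hypothetical, and the default path is taken from the error message quoted earlier.

```python
from pathlib import Path


def nginx_pid_available(pid_file: str = "/run/nginx.pid") -> bool:
    """Return True if the PID file exists and holds a positive integer.

    Hypothetical helper: a missing, unreadable, empty, or non-numeric
    PID file is taken to mean nginx has not started yet.
    """
    try:
        content = Path(pid_file).read_text().strip()
    except OSError:
        return False
    if not (content.isascii() and content.isdigit()):
        return False
    return int(content) > 0
```

When this returns False, the reload could be skipped with an informational log message, since nginx reads the generated config on its first start anyway.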

Reproduction Steps

  1. Deploy MCP Gateway Registry in an environment where internal service DNS is handled by a sidecar proxy (e.g., ECS Fargate with Service Connect, or Kubernetes with a service mesh)
  2. Register an internal MCP server with a sidecar-resolved hostname
  3. Enable the server via POST /api/servers/toggle
  4. Attempt to proxy through the gateway: POST /{server}/mcp
  5. Observe 405 Method Not Allowed (or check the generated nginx config and see the location block is commented out)
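To confirm step 5 from the generated config rather than from the HTTP response, a short script can list the location blocks that were commented out. This is a sketch: the function name `commented_out_locations` is hypothetical, and it assumes the commented blocks follow the `#    location {path}/ {` template shown in the Root Cause section. The location of the generated config file depends on the deployment and is not assumed here.

```python
import re
from typing import List

# Matches lines such as "#    location /currenttime/ {".
_COMMENTED_LOCATION = re.compile(r"^\s*#\s*location\s+(\S+)\s*\{", re.MULTILINE)


def commented_out_locations(nginx_config: str) -> List[str]:
    """Return the paths of all commented-out location blocks in the config text."""
    return _COMMENTED_LOCATION.findall(nginx_config)
```

Any internal server path that appears in the returned list was excluded from routing by the health gate.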

Affected Files

File | Issue
registry/core/nginx_service.py | Health gate in generate_config_async() excludes unhealthy servers from routing
registry/services/server_service.py | register_server() missing nginx config regeneration
registry/core/nginx_service.py | reload_nginx() produces misleading error on startup
