MCP Server Reliability & Production Readiness

Production Playbook for Model Context Protocol Developers

Building reliable MCP (Model Context Protocol) servers is critical for production Claude Code deployments. This playbook provides battle-tested patterns for health monitoring, graceful degradation, connection management, and incident response for MCP server infrastructure.

Table of Contents

MCP Architecture Overview
Health Check Implementation
Connection Management
Error Handling & Recovery
Monitoring & Observability
Production Deployment
Production Examples
Best Practices
Tools & Resources
Summary

MCP Architecture Overview

What is MCP?

Model Context Protocol enables Claude to interact with external tools and data sources through a standardized interface. MCP servers expose tools that Claude can invoke during conversations.

Claude Code Plugins Marketplace:

6 MCP servers (2% of 258 plugins)
Examples: project-health-auditor, conversational-api-debugger
Transport: stdio (standard input/output)

MCP Server Lifecycle

// packages/mcp/example-server/src/index.ts
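import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import {
  CallToolRequestSchema,
  ListToolsRequestSchema
} from '@modelcontextprotocol/sdk/types.js';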

const server = new Server(
  {
    name: 'example-server',
    version: '1.0.0',
  },
  {
    capabilities: {
      tools: {},
      resources: {},
    },
  }
);

// 1. Tool Registration
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'analyze-code',
      description: 'Analyze code quality',
      inputSchema: {
        type: 'object',
        properties: {
          code: { type: 'string' },
          language: { type: 'string' }
        },
        required: ['code']
      }
    }
  ]
}));

// 2. Tool Execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'analyze-code') {
    return {
      content: [
        { type: 'text', text: 'Analysis result...' }
      ]
    };
  }
  throw new Error('Unknown tool');
});

// 3. Start Server
const transport = new StdioServerTransport();
await server.connect(transport);

Critical Points:

Server runs as subprocess (spawned by Claude Code)
Communication via stdio (stdin/stdout)
Tool calls are request/response: handlers are async, but the caller waits for each result
No built-in health checks or monitoring
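
Because stdout carries the protocol, it can help to see roughly what travels over it. The sketch below assumes the newline-delimited JSON-RPC framing used by the MCP stdio transport; `frame` and `LineFramer` are hypothetical names for illustration, not SDK APIs (the SDK's `StdioServerTransport` already handles this for you).

```typescript
// Illustrative only: assumes one JSON-RPC message per line on stdin/stdout.
type JsonRpcMessage = {
  jsonrpc: '2.0';
  id?: number | string;
  method?: string;
  [key: string]: unknown;
};

/** Serialize a message as a single line terminated by "\n". */
function frame(message: JsonRpcMessage): string {
  // JSON.stringify escapes embedded newlines, so the output stays on one line.
  return JSON.stringify(message) + '\n';
}

/** Accumulates raw chunks and returns complete messages as lines arrive. */
class LineFramer {
  private buffer = '';

  push(chunk: string): JsonRpcMessage[] {
    this.buffer += chunk;
    const lines = this.buffer.split('\n');
    // The last element is an incomplete line ('' if the chunk ended on "\n").
    this.buffer = lines.pop() ?? '';
    return lines
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line) as JsonRpcMessage);
  }
}

// A chunk boundary can fall anywhere, so a message may arrive in pieces.
const framer = new LineFramer();
const wire = frame({ jsonrpc: '2.0', id: 1, method: 'tools/list' });
const firstHalf = framer.push(wire.slice(0, 10)); // partial line: no message yet
const secondHalf = framer.push(wire.slice(10)); // completes the line
console.error(firstHalf.length, secondHalf.length); // diagnostics go to stderr
```

A stray `console.log` would inject a non-JSON line into this stream, which is why all diagnostics in this playbook go to stderr.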

Health Check Implementation

Strategy 1: Internal Health Endpoint

// src/health.ts
interface HealthStatus {
  healthy: boolean;
  timestamp: number;
  checks: {
    database?: boolean;
    api?: boolean;
    memory?: boolean;
  };
  uptime: number;
  version: string;
}

class HealthChecker {
  private startTime = Date.now();
  private lastCheck: HealthStatus | null = null;

  async check(): Promise<HealthStatus> {
    const checks = await Promise.all([
      this.checkDatabase(),
      this.checkExternalAPI(),
      this.checkMemory()
    ]);

    const status: HealthStatus = {
      healthy: checks.every(c => c.healthy),
      timestamp: Date.now(),
      checks: {
        database: checks[0].healthy,
        api: checks[1].healthy,
        memory: checks[2].healthy
      },
      uptime: Date.now() - this.startTime,
      version: '1.0.0'
    };

    this.lastCheck = status;
    return status;
  }

  private async checkDatabase(): Promise<{ healthy: boolean }> {
    try {
      // Example: SQLite query (`db` is an open connection created elsewhere)
      await db.get('SELECT 1');
      return { healthy: true };
    } catch (error) {
      console.error('Database health check failed:', error);
      return { healthy: false };
    }
  }

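  private async checkExternalAPI(): Promise<{ healthy: boolean }> {
    try {
      // fetch has no `timeout` option; abort through a signal instead (Node 18+)
      const response = await fetch('https://api.example.com/health', {
        signal: AbortSignal.timeout(5000)
      });
      return { healthy: response.ok };
    } catch (error) {
      console.error('External API health check failed:', error);
      return { healthy: false };
    }
  }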

  private async checkMemory(): Promise<{ healthy: boolean }> {
    const used = process.memoryUsage();
    const heapLimit = 512 * 1024 * 1024; // 512MB
    return { healthy: used.heapUsed < heapLimit };
  }

  getLastStatus(): HealthStatus | null {
    return this.lastCheck;
  }
}

// Export for tool use
const healthChecker = new HealthChecker();

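// Add health check tool.
// Note: setRequestHandler keeps one handler per request schema, so in a real
// server this tool entry and branch belong in your existing list and dispatcher.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'health-check',
      description: 'Check MCP server health',
      inputSchema: { type: 'object', properties: {} }
    }
  ]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'health-check') {
    const status = await healthChecker.check();
    return {
      content: [{
        type: 'text',
        text: JSON.stringify(status, null, 2)
      }]
    };
  }
  // Fail loudly instead of returning undefined for other tool names
  throw new Error(`Unknown tool: ${request.params.name}`);
});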
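
A dependency check that hangs would stall the whole health-check call. One way to guard against that, sketched here under the assumption that each check can be expressed as a function returning `Promise<boolean>`, is to give every check its own deadline and treat a timeout or a thrown error as unhealthy. `runCheck` and `runChecks` are hypothetical helpers, not part of the SDK.

```typescript
type CheckFn = () => Promise<boolean>;

/** Resolve to false if the check throws or does not settle within `ms`. */
async function runCheck(check: CheckFn, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  try {
    return await Promise.race([check(), deadline]);
  } catch {
    return false; // a failing check counts as unhealthy
  } finally {
    clearTimeout(timer); // do not leave the timer pending after a fast check
  }
}

/** Run all checks concurrently and report each result by name. */
async function runChecks(
  checks: Record<string, CheckFn>,
  ms = 2000
): Promise<{ healthy: boolean; checks: Record<string, boolean> }> {
  const pairs = await Promise.all(
    Object.entries(checks).map(
      async ([name, fn]) => [name, await runCheck(fn, ms)] as const
    )
  );
  return {
    healthy: pairs.every(([, ok]) => ok),
    checks: Object.fromEntries(pairs)
  };
}
```

With this shape, the slowest check bounds the total health-check time at roughly `ms`, regardless of how many dependencies are checked.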

Strategy 2: Watchdog Process

// src/watchdog.ts
import { spawn, ChildProcess } from 'child_process';

class MCPWatchdog {
  private child: ChildProcess | null = null;
  private serverPath = '';
  private stopping = false;
  private maxRestarts = 5;
  private restartWindow = 60000; // 1 minute
  private restartTimes: number[] = [];

  start(serverPath: string) {
    // Remember the script path; child.spawnfile would be 'node', not the script
    this.serverPath = serverPath;
    this.stopping = false;

    const child = spawn('node', [serverPath], {
      stdio: ['pipe', 'pipe', 'pipe']
    });
    this.child = child;

    // 'error' and 'exit' can both fire for one failure; handle only the first
    let handled = false;
    const onDown = (reason: string) => {
      if (handled) return;
      handled = true;
      console.error(`MCP server down: ${reason}`);
      this.handleExit();
    };

    child.on('exit', (code, signal) => onDown(`exit code ${code}, signal ${signal}`));
    child.on('error', (error) => onDown(error.message));

    // stdout carries protocol messages, so watch stderr for logged errors
    child.stderr?.on('data', (data: Buffer) => {
      const message = data.toString();
      if (message.includes('ERROR')) {
        console.warn('MCP server error detected:', message);
      }
    });
  }

  private handleExit() {
    if (this.stopping) return; // intentional stop: do not restart

    const now = Date.now();
    this.restartTimes.push(now);

    // Remove old restart times outside window
    this.restartTimes = this.restartTimes.filter(
      t => now - t < this.restartWindow
    );

    if (this.restartTimes.length >= this.maxRestarts) {
      console.error(
        `MCP server crashed ${this.maxRestarts} times in ${this.restartWindow}ms. Giving up.`
      );
      process.exit(1);
    }

    console.error(`Restarting MCP server (attempt ${this.restartTimes.length}/${this.maxRestarts})`);
    setTimeout(() => this.start(this.serverPath), 1000);
  }

  stop() {
    this.stopping = true;
    this.child?.kill();
  }
}
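
The watchdog above restarts after a fixed one-second delay. A common refinement is to back off exponentially between attempts while keeping the same sliding-window limit. The following is a minimal sketch; `restartDelay` and `RestartBudget` are hypothetical names, and the default values simply mirror the numbers used above.

```typescript
/** Delay before restart attempt `attempt` (1-based), doubling up to `maxMs`. */
function restartDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const exponent = Math.max(0, attempt - 1);
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

/** Sliding-window limit on how many restarts are allowed. */
class RestartBudget {
  private times: number[] = [];
  private maxRestarts: number;
  private windowMs: number;

  constructor(maxRestarts = 5, windowMs = 60000) {
    this.maxRestarts = maxRestarts;
    this.windowMs = windowMs;
  }

  /** Record a restart at `now`; returns false when the budget is exhausted. */
  tryConsume(now: number = Date.now()): boolean {
    // Forget restarts that have aged out of the window.
    this.times = this.times.filter((t) => now - t < this.windowMs);
    if (this.times.length >= this.maxRestarts) return false;
    this.times.push(now);
    return true;
  }

  /** Number of restarts currently counted inside the window. */
  get used(): number {
    return this.times.length;
  }
}
```

In a watchdog, `handleExit` could call `budget.tryConsume()` and, if it returns true, schedule the restart with `setTimeout(..., restartDelay(budget.used))`. Passing `now` explicitly keeps the class easy to test without real timers.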

Connection Management

Connection Pooling for Database Access

// src/storage.ts
import sqlite3 from 'sqlite3';
import { open, Database } from 'sqlite';

class ConnectionPool {
  private pool: Database[] = [];
  private readonly maxConnections = 5;
  private readonly minConnections = 1;
  private available: Database[] = [];
  private inUse: Set<Database> = new Set();
  private dbPath = '';
  private opening = 0; // connections currently being opened

  async initialize(dbPath: string) {
    this.dbPath = dbPath;
    for (let i = 0; i < this.minConnections; i++) {
      const db = await open({
        filename: dbPath,
        driver: sqlite3.Database
      });
      this.pool.push(db);
      this.available.push(db);
    }
  }

  async acquire(): Promise<Database> {
    // Use available connection
    if (this.available.length > 0) {
      const db = this.available.pop()!;
      this.inUse.add(db);
      return db;
    }

    // Create new connection if under limit. In-flight opens are counted so
    // concurrent callers cannot push the pool past maxConnections.
    if (this.pool.length + this.opening < this.maxConnections) {
      this.opening++;
      try {
        const db = await open({
          filename: this.dbPath,
          driver: sqlite3.Database
        });
        this.pool.push(db);
        this.inUse.add(db);
        return db;
      } finally {
        this.opening--;
      }
    }

    // Wait for connection to become available
    return new Promise((resolve) => {
      const interval = setInterval(() => {
        if (this.available.length > 0) {
          clearInterval(interval);
          const db = this.available.pop()!;
          this.inUse.add(db);
          resolve(db);
        }
      }, 100);
    });
  }

  release(db: Database) {
    this.inUse.delete(db);
    this.available.push(db);
  }

  async close() {
    for (const db of this.pool) {
      await db.close();
    }
    this.pool = [];
    this.available = [];
    this.inUse.clear();
  }
}

// Usage in tool handler
const pool = new ConnectionPool();
await pool.initialize('./data/metrics.db');

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const db = await pool.acquire();
  try {
    const result = await db.get('SELECT * FROM metrics');
    return { content: [{ type: 'text', text: JSON.stringify(result) }] };
  } finally {
    pool.release(db);
  }
});
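
The pool above waits for a free connection by polling every 100 ms. An alternative, sketched here with a generic `ResourcePool<T>` (a hypothetical class, independent of the `sqlite` package), is to queue waiting callers and hand a released resource directly to the oldest one. The `use` helper also makes it harder to forget the release.

```typescript
class ResourcePool<T> {
  private idle: T[];
  private waiters: Array<(resource: T) => void> = [];

  constructor(resources: T[]) {
    this.idle = [...resources];
  }

  /** Resolve immediately if a resource is idle, otherwise queue the caller. */
  acquire(): Promise<T> {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  /** Hand the resource to the oldest waiter, or return it to the idle list. */
  release(resource: T): void {
    const next = this.waiters.shift();
    if (next) {
      next(resource);
    } else {
      this.idle.push(resource);
    }
  }

  /** Run `fn` with a resource and always release it afterwards. */
  async use<R>(fn: (resource: T) => Promise<R>): Promise<R> {
    const resource = await this.acquire();
    try {
      return await fn(resource);
    } finally {
      this.release(resource);
    }
  }

  get waiting(): number {
    return this.waiters.length;
  }

  get available(): number {
    return this.idle.length;
  }
}
```

A handler could then write `await connections.use((db) => db.get('SELECT 1'))`, where `connections` is a pool of already opened database handles. Waiters are served in FIFO order, so no caller is starved by a faster poller.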

Request Timeout Management

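class TimeoutManager {
  async withTimeout<T>(
    promise: Promise<T>,
    timeoutMs: number,
    operation: string
  ): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
      }, timeoutMs);
    });

    try {
      return await Promise.race([promise, timeout]);
    } finally {
      // Clear the timer so a finished call does not leave it pending
      clearTimeout(timer);
    }
  }
}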

const timeout = new TimeoutManager();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  try {
    const result = await timeout.withTimeout(
      expensiveOperation(),
      30000, // 30 second timeout
      'Tool execution'
    );
    return { content: [{ type: 'text', text: result }] };
  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it first
    if (error instanceof Error && error.message.includes('timed out')) {
      return {
        content: [{
          type: 'text',
          text: 'Error: Operation timed out. Please try again.'
        }],
        isError: true
      };
    }
    throw error;
  }
});
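
`Promise.race` stops waiting for the slow operation but does not cancel it; the work keeps running in the background. If the operation can accept an `AbortSignal` (as `fetch` does), a deadline helper can also ask it to stop. This is a minimal sketch; `withDeadline` and `slowOperation` are hypothetical names.

```typescript
/** Run `work` with a deadline; on timeout, reject and signal the work to stop. */
async function withDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  operation: string
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so callers see the timeout error, then cancel the work.
      reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
      controller.abort();
    }, timeoutMs);
  });
  try {
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

/** Stand-in for a slow task that honours the abort signal. */
function slowOperation(signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve('done'), 1000);
    signal.addEventListener('abort', () => {
      clearTimeout(t); // release the resource the task was holding
      reject(new Error('aborted'));
    });
  });
}
```

For an HTTP call this could be used as `withDeadline((signal) => fetch(url, { signal }), 5000, 'External API call')`, so the request itself is torn down when the deadline passes.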

Error Handling & Recovery

Graceful Degradation

interface ToolResult {
  content: Array<{ type: string; text: string }>;
  isError?: boolean;
  fallback?: boolean;
}

class GracefulDegradation {
  async executeWithFallback(
    primary: () => Promise<string>,
    fallback: () => Promise<string>
  ): Promise<ToolResult> {
    try {
      const result = await primary();
      return {
        content: [{ type: 'text', text: result }]
      };
    } catch (error) {
      console.warn('Primary operation failed, using fallback:', error);

      try {
        const result = await fallback();
        return {
          content: [{
            type: 'text',
            text: `⚠️ Primary method failed. Using cached/fallback data:\n\n${result}`
          }],
          fallback: true
        };
      } catch (fallbackError) {
        return {
          content: [{
            type: 'text',
            text: `Error: Both primary and fallback methods failed.\nPrimary: ${(error as Error).message}\nFallback: ${(fallbackError as Error).message}`
          }],
          isError: true
        };
      }
    }
  }
}

// Example: API with cache fallback
const degradation = new GracefulDegradation();
const cache = new Map<string, { data: any; timestamp: number }>();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'fetch-data') {
    return await degradation.executeWithFallback(
      // Primary: Fetch from API
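      async () => {
        const response = await fetch('https://api.example.com/data');
        // fetch resolves on HTTP error statuses too; throw so the fallback runs
        if (!response.ok) throw new Error(`API returned HTTP ${response.status}`);
        const data = await response.json();
        cache.set('latest', { data, timestamp: Date.now() });
        return JSON.stringify(data);
      },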
      // Fallback: Use cached data
      async () => {
        const cached = cache.get('latest');
        if (!cached) throw new Error('No cache available');

        const age = Date.now() - cached.timestamp;
        return `${JSON.stringify(cached.data)}\n\n(Cached ${Math.floor(age / 1000)}s ago)`;
      }
    );
  }
});

Circuit Breaker Pattern

class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private lastFailure = 0;
  private successes = 0;

  constructor(
    private threshold = 5,
    private timeout = 60000,
    private halfOpenAttempts = 3
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.timeout) {
        console.log('Circuit breaker: Transitioning to half-open');
        this.state = 'half-open';
        this.successes = 0;
      } else {
        throw new Error('Circuit breaker is OPEN - service unavailable');
      }
    }

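    try {
      const result = await fn();

      if (this.state === 'half-open') {
        this.successes++;
        if (this.successes >= this.halfOpenAttempts) {
          console.error('Circuit breaker: Closing (recovered)');
          this.state = 'closed';
          this.failures = 0;
        }
      } else {
        // A success while closed clears the count, so only consecutive
        // failures can trip the breaker
        this.failures = 0;
      }

      return result;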
    } catch (error) {
      this.failures++;
      this.lastFailure = Date.now();

      if (this.state === 'half-open') {
        console.log('Circuit breaker: Re-opening (recovery failed)');
        this.state = 'open';
      } else if (this.failures >= this.threshold) {
        console.log(`Circuit breaker: Opening (${this.failures} failures)`);
        this.state = 'open';
      }

      throw error;
    }
  }

  getState() {
    return {
      state: this.state,
      failures: this.failures,
      lastFailure: this.lastFailure
    };
  }
}

// Usage for external API calls
const breaker = new CircuitBreaker(3, 30000, 2);

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  try {
    const result = await breaker.execute(async () => {
      const response = await fetch('https://external-api.com/data');
      // Count HTTP error statuses as failures too
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.json();
    });

    return { content: [{ type: 'text', text: JSON.stringify(result) }] };
  } catch (error) {
    if (error instanceof Error && error.message.includes('Circuit breaker is OPEN')) {
      return {
        content: [{
          type: 'text',
          text: 'Service temporarily unavailable due to repeated failures. Please try again later.'
        }],
        isError: true
      };
    }
    throw error;
  }
});

Monitoring & Observability

Metrics Collection

// src/metrics.ts
interface Metrics {
  toolCalls: Map<string, number>;
  errors: Map<string, number>;
  latencies: Map<string, number[]>;
  lastUpdated: number;
}

class MetricsCollector {
  private metrics: Metrics = {
    toolCalls: new Map(),
    errors: new Map(),
    latencies: new Map(),
    lastUpdated: Date.now()
  };

  recordToolCall(toolName: string, latencyMs: number, error?: Error) {
    // Increment call count
    const calls = this.metrics.toolCalls.get(toolName) || 0;
    this.metrics.toolCalls.set(toolName, calls + 1);

    // Record latency
    const latencies = this.metrics.latencies.get(toolName) || [];
    latencies.push(latencyMs);
    this.metrics.latencies.set(toolName, latencies);

    // Record error
    if (error) {
      const errors = this.metrics.errors.get(toolName) || 0;
      this.metrics.errors.set(toolName, errors + 1);
    }

    this.metrics.lastUpdated = Date.now();
  }

  getMetrics() {
    const summary = Array.from(this.metrics.toolCalls.entries()).map(([tool, calls]) => {
      const errors = this.metrics.errors.get(tool) || 0;
      const latencies = this.metrics.latencies.get(tool) || [];
      const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
      const errorRate = (errors / calls) * 100;

      return {
        tool,
        calls,
        errors,
        errorRate: errorRate.toFixed(2) + '%',
        avgLatency: avgLatency.toFixed(0) + 'ms',
        p95Latency: this.percentile(latencies, 95).toFixed(0) + 'ms'
      };
    });

    return summary;
  }

  private percentile(values: number[], p: number): number {
    const sorted = values.slice().sort((a, b) => a - b);
    const index = Math.ceil(sorted.length * (p / 100)) - 1;
    return sorted[index] || 0;
  }
}

// Wrap tool execution with metrics
const metrics = new MetricsCollector();

// Register each schema once: a second setRequestHandler call for the same
// schema replaces the first, which would silently drop the metrics wrapper.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    // ...your other tools
    {
      name: 'get-metrics',
      description: 'Get MCP server performance metrics',
      inputSchema: { type: 'object', properties: {} }
    }
  ]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const toolName = request.params.name;

  // Metrics tool
  if (toolName === 'get-metrics') {
    const summary = metrics.getMetrics();
    return {
      content: [{
        type: 'text',
        text: '# MCP Server Metrics\n\n' + JSON.stringify(summary, null, 2)
      }]
    };
  }

  // All other tools: time the call and record the outcome.
  // executeTool is your own dispatcher for the remaining tools.
  const startTime = Date.now();
  try {
    const result = await executeTool(toolName, request.params.arguments);
    metrics.recordToolCall(toolName, Date.now() - startTime);
    return result;
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    metrics.recordToolCall(toolName, Date.now() - startTime, err);
    throw error;
  }
});
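
The latency arrays in `MetricsCollector` grow with every recorded call. For a long-running server, one option is to keep only the most recent samples in a fixed-size window and compute percentiles over that. The sketch below uses the same nearest-rank percentile as above; `RollingWindow` is a hypothetical class and the capacity is an arbitrary choice.

```typescript
class RollingWindow {
  private values: number[] = [];
  private next = 0; // index that the next overwrite will use
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  add(value: number): void {
    if (this.values.length < this.capacity) {
      this.values.push(value);
    } else {
      this.values[this.next] = value; // overwrite the oldest sample
    }
    this.next = (this.next + 1) % this.capacity;
  }

  /** Nearest-rank percentile over the retained samples; 0 when empty. */
  percentile(p: number): number {
    if (this.values.length === 0) return 0;
    const sorted = [...this.values].sort((a, b) => a - b);
    // Multiply before dividing to avoid floating-point drift in p / 100.
    const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
    return sorted[rank - 1] ?? 0;
  }

  get size(): number {
    return this.values.length;
  }
}
```

Replacing each `number[]` in the collector with a `RollingWindow` bounds memory at `capacity` samples per tool, at the cost of percentiles that describe only recent traffic.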

Production Deployment

Docker Container

# Dockerfile
FROM node:22-alpine

WORKDIR /app

# Install dependencies
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install --frozen-lockfile

# Copy source
COPY . .

# Build TypeScript
RUN pnpm build

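# Health check. Assumes dist/health.js exports an async check() that resolves
# to { healthy: boolean }; the exit code tells Docker the result.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD node -e "import('./dist/health.js').then(m => m.check()).then(s => process.exit(s.healthy ? 0 : 1)).catch(() => process.exit(1))"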

# Run server
CMD ["node", "dist/index.js"]

Process Manager (PM2)

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'mcp-server',
    script: './dist/index.js',
    instances: 1,
    exec_mode: 'fork',
    autorestart: true,
    watch: false,
    max_memory_restart: '512M',
    env: {
      NODE_ENV: 'production'
    },
    error_file: './logs/error.log',
    out_file: './logs/out.log',
    log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
    merge_logs: true,
    min_uptime: '10s',
    max_restarts: 10
  }]
};
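
Process managers and container runtimes stop a server by sending it a signal and, after a grace period, killing it. Handling that signal gives the server a chance to close connections cleanly. The following is a minimal sketch, assuming cleanup steps can be expressed as functions; `createShutdown` is a hypothetical helper, and the `exit` parameter exists only so the logic can be exercised without ending the process.

```typescript
type Cleanup = () => Promise<void> | void;

/** Build a shutdown function that runs cleanups once, then exits. */
function createShutdown(
  cleanups: Cleanup[],
  exit: (code: number) => void = (code) => process.exit(code)
): (reason: string) => Promise<void> {
  let started = false;
  return async function shutdown(reason: string): Promise<void> {
    if (started) return; // ignore repeated signals
    started = true;
    console.error(`Shutting down (${reason})`);
    let code = 0;
    // Reverse registration order, so things opened last are closed first.
    for (const cleanup of [...cleanups].reverse()) {
      try {
        await cleanup();
      } catch (error) {
        code = 1; // keep going, but report the failure in the exit code
        console.error('Cleanup failed:', error);
      }
    }
    exit(code);
  };
}

// Wiring sketch (pool and server are the objects from the earlier examples):
// const shutdown = createShutdown([() => pool.close(), () => server.close()]);
// process.on('SIGTERM', () => void shutdown('SIGTERM'));
// process.on('SIGINT', () => void shutdown('SIGINT'));
```

Cleanup needs to finish within the supervisor's grace period, so it is worth keeping each step short and bounded.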

Production Examples

Example 1: Conversational API Debugger (MCP Plugin)

// Real-world plugin: conversational-api-debugger
// Handles API testing with health monitoring and circuit breakers

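import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { CircuitBreaker } from './circuit-breaker.js';
import { MetricsCollector } from './metrics.js';

const server = new Server(
  { name: 'api-debugger', version: '1.0.0' },
  { capabilities: { tools: {} } } // declare tools before registering tool handlers
);
const breaker = new CircuitBreaker(3, 30000);
const metrics = new MetricsCollector();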

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'test-api') {
    const startTime = Date.now();
    const { url, method, headers } = request.params.arguments;

    try {
      const result = await breaker.execute(async () => {
        const response = await fetch(url, {
          method,
          headers: JSON.parse(headers),
          // fetch has no `timeout` option; use an abort signal (Node 18+)
          signal: AbortSignal.timeout(10000)
        });

        return {
          status: response.status,
          statusText: response.statusText,
          headers: Object.fromEntries(response.headers),
          body: await response.text()
        };
      });

      const latency = Date.now() - startTime;
      metrics.recordToolCall('test-api', latency);

      return {
        content: [{
          type: 'text',
          text: `✓ API Response (${latency}ms)\n\n${JSON.stringify(result, null, 2)}`
        }]
      };
    } catch (error) {
      const latency = Date.now() - startTime;
      metrics.recordToolCall('test-api', latency, error);

      if (error.message.includes('Circuit breaker is OPEN')) {
        return {
          content: [{
            type: 'text',
            text: `⚠️ API temporarily unavailable (circuit breaker triggered)\n\nThe API has failed ${breaker.getState().failures} times. Waiting 30s before retry.`
          }],
          isError: true
        };
      }

      return {
        content: [{
          type: 'text',
          text: `❌ API Error (${latency}ms)\n\n${error.message}`
        }],
        isError: true
      };
    }
  }
});

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);

Performance Metrics:

Average latency: 850ms (API calls)
Circuit breaker trips: 2% of requests (external API failures)
Uptime: 99.7% (7 restarts in 30 days)
Memory usage: 45MB average, 120MB peak

Example 2: Project Health Auditor with Fallback

// Real-world plugin: project-health-auditor
// Scans codebases with graceful degradation for missing dependencies

const degradation = new GracefulDegradation();
const cache = new Map<string, { data: any; timestamp: number }>();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'audit-project') {
    const { projectPath } = request.params.arguments;

    return await degradation.executeWithFallback(
      // Primary: Full AST analysis
      async () => {
        const ast = await parseProjectAST(projectPath);
        const issues = await analyzeAST(ast);
        const result = {
          method: 'full-ast-analysis',
          issues: issues.length,
          details: issues
        };

        cache.set(projectPath, { data: result, timestamp: Date.now() });
        return JSON.stringify(result, null, 2);
      },
      // Fallback: Simple regex scan
      async () => {
        const cached = cache.get(projectPath);

        if (cached && Date.now() - cached.timestamp < 3600000) {
          // Use cache if less than 1 hour old
          return `${JSON.stringify(cached.data, null, 2)}\n\n(Cached ${Math.floor((Date.now() - cached.timestamp) / 1000)}s ago)`;
        }

        // Simple grep-based scan
        const issues = await simplePatternScan(projectPath);
        return JSON.stringify({
          method: 'pattern-scan-fallback',
          issues: issues.length,
          details: issues,
          note: 'Full AST analysis unavailable, using pattern matching'
        }, null, 2);
      }
    );
  }
});

Fallback Statistics:

Primary method success: 94%
Fallback triggered: 6% (missing dependencies, large codebases)
Cache hit rate: 78%
Average scan time: Primary 12s, Fallback 3s

Best Practices

DO ✅

Implement comprehensive health checks

// Check all critical dependencies
const healthChecker = new HealthChecker();
setInterval(async () => {
  const status = await healthChecker.check();
  if (!status.healthy) {
    console.error('Health check failed:', status);
  }
}, 30000); // Every 30 seconds

Use connection pooling for all database access

// Avoid connection exhaustion
const pool = new ConnectionPool();
await pool.initialize('./data.db');

// Always release connections
const db = await pool.acquire();
try {
  await db.run('INSERT INTO logs VALUES (?)', 'server started'); // bind a value for the placeholder
} finally {
  pool.release(db); // Critical!
}

Set aggressive timeouts on all external calls

const timeout = new TimeoutManager();
const result = await timeout.withTimeout(
  fetch('https://api.example.com'),
  5000, // 5 second max
  'External API call'
);

Collect granular metrics for debugging

const metrics = new MetricsCollector();
// Track every tool call
metrics.recordToolCall(toolName, latency, error);

// Export for analysis
const summary = metrics.getMetrics();
console.log(JSON.stringify(summary));

Always provide fallback behavior

// Never fail completely
return await degradation.executeWithFallback(
  () => primaryMethod(),
  () => cachedOrSimplifiedMethod()
);

Use circuit breakers for external dependencies

const breaker = new CircuitBreaker(3, 30000);
// Prevent cascade failures
const result = await breaker.execute(() => callExternalAPI());

Log stderr separately from stdout

// MCP uses stdout for protocol, stderr for logs
console.error('Error occurred:', error); // ✅ stderr
console.log('Result:', data);            // ❌ breaks MCP

Implement structured logging

const logger = {
  error: (msg: string, meta?: any) => {
    console.error(JSON.stringify({ level: 'error', message: msg, ...meta }));
  }
};
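
A slightly fuller version of the logger above might add levels, a timestamp, and explicit handling for `Error` objects, which otherwise serialize to `{}`. This is a sketch under those assumptions; `createLogger` is a hypothetical helper, and the `write` parameter defaults to stderr so stdout stays reserved for the protocol.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';
const LEVELS: Level[] = ['debug', 'info', 'warn', 'error'];
type Meta = Record<string, unknown>;

function createLogger(
  minLevel: Level = 'info',
  write: (line: string) => void = (line) => { process.stderr.write(line + '\n'); }
) {
  const threshold = LEVELS.indexOf(minLevel);

  const log = (level: Level, message: string, meta: Meta = {}): void => {
    if (LEVELS.indexOf(level) < threshold) return; // below the minimum level
    const fields: Meta = {};
    for (const [key, value] of Object.entries(meta)) {
      // Error properties are not enumerable, so copy the useful ones explicitly.
      fields[key] = value instanceof Error
        ? { name: value.name, message: value.message, stack: value.stack }
        : value;
    }
    // Fixed fields come last so metadata cannot overwrite them.
    write(JSON.stringify({ ...fields, time: new Date().toISOString(), level, message }));
  };

  return {
    debug: (message: string, meta?: Meta) => log('debug', message, meta),
    info: (message: string, meta?: Meta) => log('info', message, meta),
    warn: (message: string, meta?: Meta) => log('warn', message, meta),
    error: (message: string, meta?: Meta) => log('error', message, meta)
  };
}
```

Usage could look like `logger.error('Tool failed', { tool: 'analyze-code', error })`, which produces one JSON object per line that log collectors can parse without custom rules.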

DON'T ❌

Don't write to stdout except MCP responses

// ❌ Breaks MCP protocol
console.log('Debug message');

// ✅ Use stderr
console.error('Debug message');

Don't hold database connections indefinitely

// ❌ Connection leak
const db = await pool.acquire();
await db.get('SELECT * FROM data');
// Never released!

// ✅ Always use try/finally
const db = await pool.acquire();
try {
  await db.get('SELECT * FROM data');
} finally {
  pool.release(db);
}

Don't ignore timeout errors

// ❌ Silent failure
try {
  await expensiveOperation();
} catch (error) {
  // Error swallowed
}

// ✅ Log and return error
try {
  await expensiveOperation();
} catch (error) {
  console.error('Operation failed:', error);
  return { content: [{ type: 'text', text: 'Error: ' + error.message }], isError: true };
}

Don't skip health monitoring in production

// ❌ No visibility
await server.connect(transport);

// ✅ Add health check tool
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'health-check') {
    return { content: [{ type: 'text', text: JSON.stringify(await healthChecker.check()) }] };
  }
});

Don't use synchronous file I/O

// ❌ Blocks event loop
const data = fs.readFileSync('./data.json');

// ✅ Async
const data = await fs.promises.readFile('./data.json');

Don't restart on every error

// ❌ Restart loop
process.on('uncaughtException', () => {
  process.exit(1); // PM2 restarts immediately
});

// ✅ Circuit breaker + graceful degradation
try {
  await breaker.execute(() => operation());
} catch (error) {
  console.error('Operation failed, using fallback:', error);
  await fallback();
}

Tools & Resources

MCP Development

MCP SDK:

npm install @modelcontextprotocol/sdk

Analytics & Monitoring

Analytics Daemon (from this marketplace):

cd packages/analytics-daemon
pnpm start
# WebSocket: ws://localhost:3456
# HTTP API: http://localhost:3333/api/status

Monitor MCP Server Events:

const ws = new WebSocket('ws://localhost:3456');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'plugin.activation') {
    console.log(`MCP server ${data.pluginName} activated`);
  }
};

Plugins with MCP Servers

From this marketplace (258 plugins):

project-health-auditor - Codebase scanning with health checks
conversational-api-debugger - API testing with circuit breakers
beads-mcp - Beads task tracker MCP server
creator-studio-pack - Multi-agent MCP orchestration

External Tools

PM2 - Process manager for production
Docker - Containerization
Chokidar - File watching
better-sqlite3 - Fast SQLite

Summary

Key Takeaways:

Health checks are mandatory - Implement internal health endpoints and watchdog processes
Connection pooling prevents leaks - Always use pools for database connections
Circuit breakers prevent cascades - Isolate failures from external dependencies
Graceful degradation maintains uptime - Always provide fallback behavior
Metrics enable debugging - Track latency, errors, and throughput for every tool
Timeouts are non-negotiable - Every external call must have aggressive timeouts
Stdio is sacred - Only use stdout for MCP protocol, stderr for logs

Production Readiness Checklist:

Last Updated: 2025-12-24
Author: Jeremy Longshore
Related Playbooks: Multi-Agent Rate Limits, Cost Caps & Budget Management
