diff --git a/development/backend/plugins.mdx b/development/backend/plugins.mdx
index e1dbdf4..58e76d1 100644
--- a/development/backend/plugins.mdx
+++ b/development/backend/plugins.mdx
@@ -313,8 +313,8 @@ The `databaseExtension` property allows your plugin to:
#### How Plugin Database Tables Work
**Security Architecture:**
-- **Phase 1 (Trusted)**: Core migrations run first (static, secure)
-- **Phase 2 (Untrusted)**: Plugin tables created dynamically (sandboxed)
+- **Stage 1 (Trusted)**: Core migrations run first (static, secure)
+- **Stage 2 (Untrusted)**: Plugin tables created dynamically (sandboxed)
- **Clear Separation**: Plugin tables cannot interfere with core database structure
**Dynamic Table Creation:**
@@ -421,7 +421,7 @@ The database initialization follows a strict security-first approach:
```
┌─────────────────────────────────────────┐
-│ Phase 1: Core System (Trusted) │
+│ Stage 1: Core System (Trusted) │
├─────────────────────────────────────────┤
│ 1. Apply core migrations │
│ 2. Create core tables │
@@ -430,7 +430,7 @@ The database initialization follows a strict security-first approach:
│
▼ Security Boundary
┌─────────────────────────────────────────┐
-│ Phase 2: Plugin System (Sandboxed) │
+│ Stage 2: Plugin System (Sandboxed) │
├─────────────────────────────────────────┤
│ 1. Generate CREATE TABLE SQL │
│ 2. Drop existing plugin tables │
diff --git a/development/backend/satellite/commands.mdx b/development/backend/satellite/commands.mdx
index 0608d69..c8cdeaf 100644
--- a/development/backend/satellite/commands.mdx
+++ b/development/backend/satellite/commands.mdx
@@ -32,7 +32,7 @@ The system supports 5 command types defined in the `command_type` enum:
| `spawn` | Start MCP server process | Launch HTTP proxy or stdio process |
| `kill` | Stop MCP server process | Terminate process gracefully |
| `restart` | Restart MCP server | Stop and start process |
-| `health_check` | Verify server health | Call tools/list to check connectivity |
+| `health_check` | Verify server health and validate credentials | Check connectivity or validate OAuth tokens |
### Configure Commands
@@ -74,6 +74,30 @@ interface CommandPayload {
}
```
+## Status Changes Triggered by Commands
+
+Commands trigger installation status changes through satellite event emission:
+
+| Command | Status Before | Status After | When |
+|---------|--------------|--------------|------|
+| `configure` (install) | N/A | `provisioning` → `command_received` → `connecting` | Installation creation flow |
+| `configure` (update) | `online` | `restarting` → `online` | Configuration change applied |
+| `configure` (delete) | Any | Process terminated | Installation removal |
+| `health_check` (credential) | `online` | `requires_reauth` | OAuth token invalid |
+| `restart` | `online` | `restarting` → `online` | Manual restart requested |
+
+**Status Lifecycle on Installation**:
+1. Backend creates installation → status=`provisioning`
+2. Backend sends `configure` command → status=`command_received`
+3. Satellite connects to server → status=`connecting`
+4. Satellite discovers tools → status=`discovering_tools`
+5. Satellite syncs tools to backend → status=`syncing_tools`
+6. Process complete → status=`online`
+
+For complete status transition documentation, see [Backend Events - Status Values](/development/backend/satellite/events#mcp-server-status_changed).
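+
+For illustration, a minimal sketch of the satellite-side emissions that drive this lifecycle — the payload shape follows the `mcp.server.status_changed` event documented in Backend Events, while the helper and wiring here are hypothetical:
+
+```typescript
+// Sketch only — eventBus stands in for the satellite's EventBus singleton
+declare const eventBus: { emit(type: string, data: unknown): void };
+
+function reportStatus(
+  installation_id: string,
+  team_id: string,
+  status: string,
+  status_message?: string
+): void {
+  eventBus.emit('mcp.server.status_changed', {
+    installation_id,
+    team_id,
+    status,
+    status_message,
+    timestamp: new Date().toISOString(),
+  });
+}
+
+// Install flow: backend sets 'provisioning', then each satellite step emits the next status
+reportStatus('inst_123', 'team_xyz', 'command_received');
+reportStatus('inst_123', 'team_xyz', 'connecting');
+reportStatus('inst_123', 'team_xyz', 'discovering_tools');
+reportStatus('inst_123', 'team_xyz', 'syncing_tools');
+reportStatus('inst_123', 'team_xyz', 'online', 'Server connected successfully');
+```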
+
+---
+
## Command Event Types
All `configure` commands include an `event` field in the payload for tracking and logging:
@@ -168,6 +192,14 @@ await satelliteCommandService.notifyMcpRecovery(
**Payload**: `event: 'mcp_recovery'`
+**Status Flow**:
+- Triggered by health check detecting offline installation
+- Sets status to `connecting`
+- Satellite rediscovers tools
+- Status progresses: `offline` → `connecting` → `discovering_tools` → `online`
+
+For complete recovery system documentation, see [Backend Communication - Auto-Recovery](/development/backend/satellite/communication#auto-recovery-system).
+
## Critical Pattern
**ALWAYS use the correct convenience method**:
@@ -247,9 +279,22 @@ When satellites receive commands:
3. Execute spawn sequence
**For `health_check` commands**:
-1. Call tools/list on target server
-2. Verify response
-3. Report health status
+1. Check `payload.check_type` field:
+ - `connectivity` (default): Call tools/list to verify server responds
+ - `credential_validation`: Validate OAuth tokens for installation
+2. Execute appropriate validation
+3. Report health status via `mcp.server.status_changed` event:
+ - `online` - Health check passed
+ - `requires_reauth` - OAuth token expired/revoked
+ - `error` - Validation failed with error
+
+**Credential Validation Flow**:
+- Backend cron job sends `health_check` command with `check_type: 'credential_validation'`
+- Satellite validates OAuth token (performs token refresh test)
+- Emits status event based on validation result
+- Backend updates `mcpServerInstallations.status` and `last_credential_check_at`
+
+For satellite-side credential validation implementation, see [Satellite OAuth Authentication](/development/satellite/oauth-authentication).
## Example Usage
diff --git a/development/backend/satellite/communication.mdx b/development/backend/satellite/communication.mdx
index 3eefcce..a6a9dde 100644
--- a/development/backend/satellite/communication.mdx
+++ b/development/backend/satellite/communication.mdx
@@ -106,20 +106,20 @@ The system uses three distinct communication patterns:
### Security Architecture
-The satellite pairing process implements a secure **two-phase JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication).
+The satellite pairing process implements a secure **two-step JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication).
-**Phase 1: Token Generation**
+**Step 1: Token Generation**
- Administrators generate temporary registration tokens through admin APIs
- Scope-specific tokens (global vs team) with cryptographic signatures
- Token management endpoints for generation, listing, and revocation
-**Phase 2: Satellite Registration**
+**Step 2: Satellite Registration**
- Satellites authenticate using `Authorization: Bearer deploystack_satellite_*` headers
- Backend validates JWT tokens with single-use consumption
- Permanent API keys issued after successful token validation
- Token consumed to prevent replay attacks
-**Breaking Change**: As of Phase 3 implementation, all new satellite registrations require valid registration tokens. The open registration system has been secured.
+**Note**: All new satellite registrations require valid registration tokens. The open registration system has been secured.
### Registration Middleware
@@ -261,6 +261,153 @@ Configuration respects team boundaries and isolation:
- Team-defined security policies
- Internal resource access settings
+## Frontend API Endpoints
+
+The backend provides REST and SSE endpoints for frontend access to installation status, logs, and requests.
+
+### Status & Monitoring Endpoints
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/status`**
+- Returns current installation status, status message, and last update timestamp
+- Used by frontend for real-time status badges and progress indicators
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs`**
+- Returns paginated server logs (stderr output, connection errors)
+- Query params: `limit`, `offset` for pagination
+- Limited to 100 lines per installation (enforced by cleanup cron job)
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests`**
+- Returns paginated request logs (tool execution history)
+- Includes request params, duration, success status
+- Response data included if `request_logging_enabled=true`
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/{requestId}`**
+- Returns detailed request log for specific execution
+- Includes full request/response payloads when available
+
+### Settings Management
+
+**PATCH `/api/teams/{teamId}/mcp/installations/{installationId}/settings`**
+- Updates installation settings (stored in `mcpServerInstallations.settings` jsonb column)
+- Settings distributed to satellites via config endpoint
+- Current settings:
+ - `request_logging_enabled` (boolean) - Controls capture of tool responses
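+
+For example, a hypothetical frontend call to disable response capture (the exact request body wrapper may differ from this sketch):
+
+```typescript
+// teamId and installationId are assumed to be in scope
+await fetch(`/api/teams/${teamId}/mcp/installations/${installationId}/settings`, {
+  method: 'PATCH',
+  headers: { 'Content-Type': 'application/json' },
+  body: JSON.stringify({ request_logging_enabled: false }),
+});
+```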
+
+### Real-Time Streaming (SSE)
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs/stream`**
+- Server-Sent Events endpoint for real-time log streaming
+- Frontend subscribes for live stderr output
+- Auto-reconnects on connection loss
+
+**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/stream`**
+- Server-Sent Events endpoint for real-time request log streaming
+- Frontend subscribes for live tool execution updates
+- Includes duration, status, and optionally response data
+
+**SSE vs REST Comparison**:
+| Feature | REST Endpoints | SSE Endpoints |
+|---------|---------------|---------------|
+| Use Case | Historical data, pagination | Real-time updates |
+| Connection | Request/response | Persistent connection |
+| Data Flow | Pull (client requests) | Push (server sends) |
+| Frontend Usage | Initial load, manual refresh | Live monitoring |
+
+**SSE Controller Implementation**: `services/backend/src/controllers/mcp/sse.controller.ts`
+
+**Routes Implementation**: `services/backend/src/routes/api/teams/mcp/installations.routes.ts`
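+
+As an illustration, a hypothetical consumer combining both patterns — the endpoint paths come from above, `EventSource` is the standard browser API, and the log field names follow the documented event payloads:
+
+```typescript
+async function monitorLogs(teamId: string, installationId: string): Promise<void> {
+  const base = `/api/teams/${teamId}/mcp/installations/${installationId}`;
+
+  // Initial load: paginated historical logs via REST
+  const history = await fetch(`${base}/logs?limit=50&offset=0`).then((r) => r.json());
+  console.log('history', history);
+
+  // Live monitoring: persistent SSE connection (auto-reconnects on loss)
+  const stream = new EventSource(`${base}/logs/stream`);
+  stream.onmessage = (event) => {
+    const log = JSON.parse(event.data);
+    console.log(`[${log.level}] ${log.message}`);
+  };
+}
+```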
+
+---
+
+## Health Check & Recovery Systems
+
+### Cumulative Health Check System
+
+**Purpose**: Template-level health aggregation across all installations of an MCP server.
+
+**McpHealthCheckService** (`services/backend/src/services/mcp-health-check.service.ts`):
+- Aggregates health status from all installations of each MCP server template
+- Updates `mcpServers.health_status` based on installation health
+- Provides template-level health visibility in admin dashboard
+
+**Cron Job**: `mcp-health-check` runs every 3 minutes
+- Implementation: `services/backend/src/jobs/mcp-health-check.job.ts`
+- Checks all MCP server templates
+- Updates template health status for admin visibility
+
+### Credential Validation System
+
+**Purpose**: Per-installation OAuth token validation to detect expired/revoked credentials.
+
+**McpCredentialValidationWorker** (`services/backend/src/workers/mcp-credential-validation.worker.ts`):
+- Validates OAuth tokens for each installation
+- Sends `health_check` command to satellite with `check_type: 'credential_validation'`
+- Satellite performs OAuth validation and reports status
+
+**Cron Job**: `mcp-credential-validation` runs every 1 minute
+- Implementation: `services/backend/src/jobs/mcp-credential-validation.job.ts`
+- Validates installations on 15-minute rotation
+- Triggers `requires_reauth` status on validation failure
+
+**Health Check Command Payload**:
+```json
+{
+ "commandType": "health_check",
+ "priority": "immediate",
+ "payload": {
+ "check_type": "credential_validation",
+ "installation_id": "inst_123",
+ "team_id": "team_xyz"
+ }
+}
+```
+
+Satellite validates credentials and emits `mcp.server.status_changed` with status:
+- `online` - Credentials valid
+- `requires_reauth` - OAuth token expired/revoked
+- `error` - Validation failed with error
+
+### Auto-Recovery System
+
+**Recovery Trigger**:
+- Health check system detects offline installations
+- Backend calls `notifyMcpRecovery(installation_id, team_id)`
+- Sends command to satellite: Set status=`connecting`, rediscover tools
+- Status progression: `offline` → `connecting` → `discovering_tools` → `online`
+
+**Tool Execution Recovery**:
+- Satellite detects recovery during tool execution (offline server responds)
+- Emits immediate status change event (doesn't wait for health check)
+- Triggers asynchronous re-discovery
+
+For satellite-side recovery implementation, see [Satellite Recovery System](/development/satellite/recovery-system).
+
+---
+
+## Background Cron Jobs
+
+The backend runs three MCP-related cron jobs for maintenance and monitoring:
+
+**cleanup-mcp-server-logs**:
+- **Schedule**: Every 10 minutes
+- **Purpose**: Enforce 100-line limit per installation in `mcpServerLogs` table
+- **Action**: Deletes oldest logs beyond 100-line limit
+- **Implementation**: `services/backend/src/jobs/cleanup-mcp-server-logs.job.ts`
+
+**mcp-health-check**:
+- **Schedule**: Every 3 minutes
+- **Purpose**: Template-level health aggregation
+- **Action**: Updates `mcpServers.health_status` column
+- **Implementation**: `services/backend/src/jobs/mcp-health-check.job.ts`
+
+**mcp-credential-validation**:
+- **Schedule**: Every 1 minute
+- **Purpose**: Detect expired/revoked OAuth tokens
+- **Action**: Sends `health_check` commands to satellites
+- **Implementation**: `services/backend/src/jobs/mcp-credential-validation.job.ts`
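+
+A minimal sketch of how these schedules could be registered, assuming a node-cron-style scheduler — the actual wiring lives in the job files listed above:
+
+```typescript
+import cron from 'node-cron';
+
+// Placeholder job bodies — real implementations live in src/jobs/*.job.ts
+const cleanupMcpServerLogs = async () => { /* enforce 100-line limit */ };
+const runMcpHealthCheck = async () => { /* aggregate template health */ };
+const runMcpCredentialValidation = async () => { /* send health_check commands */ };
+
+cron.schedule('*/10 * * * *', cleanupMcpServerLogs);     // every 10 minutes
+cron.schedule('*/3 * * * *', runMcpHealthCheck);         // every 3 minutes
+cron.schedule('* * * * *', runMcpCredentialValidation);  // every 1 minute
+```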
+
+---
+
## Database Schema Integration
### Core Table Structure
@@ -298,6 +445,37 @@ The satellite system integrates with existing DeployStack schema through 5 speci
- Alert generation and notification triggers
- Historical health trend analysis
+### New Columns Added (Status & Health Tracking System)
+
+**mcpServerInstallations** table:
+- `status` (text) - Current installation status (11 possible values)
+- `status_message` (text, nullable) - Human-readable status context or error details
+- `status_updated_at` (timestamp) - Last status change timestamp
+- `last_health_check_at` (timestamp, nullable) - Last health check execution time
+- `last_credential_check_at` (timestamp, nullable) - Last credential validation time
+- `settings` (jsonb, nullable) - Generic settings object (e.g., `request_logging_enabled`)
+
+**mcpServers** table:
+- `health_status` (text, nullable) - Template-level aggregated health status
+- `last_health_check_at` (timestamp, nullable) - Last template health check time
+- `health_check_error` (text, nullable) - Last health check error message
+
+**mcpServerLogs** table:
+- Stores batched stderr logs from satellites
+- 100-line limit per installation (enforced by cleanup cron job)
+- Fields: `installation_id`, `team_id`, `log_level`, `message`, `timestamp`
+
+**mcpRequestLogs** table:
+- Stores batched tool execution logs
+- `tool_response` (jsonb, nullable) - MCP server response data
+- Privacy control: Only captured when `request_logging_enabled=true`
+- Fields: `installation_id`, `team_id`, `tool_name`, `request_params`, `tool_response`, `duration_ms`, `success`, `error_message`, `timestamp`
+
+**mcpToolMetadata** table:
+- Stores discovered tools with token counts
+- Used for hierarchical router token savings calculations
+- Fields: `installation_id`, `server_slug`, `tool_name`, `description`, `input_schema`, `token_count`, `discovered_at`
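+
+A sketch of the new `mcpServerInstallations` columns in Drizzle syntax — the pg-core dialect and column modifiers are assumptions; see `services/backend/src/db/schema.ts` for the authoritative definitions:
+
+```typescript
+import { pgTable, text, timestamp, jsonb } from 'drizzle-orm/pg-core';
+
+// Status/health tracking columns only; pre-existing columns omitted
+export const mcpServerInstallations = pgTable('mcpServerInstallations', {
+  id: text('id').primaryKey(),
+  status: text('status').notNull(),                         // 11 possible values
+  status_message: text('status_message'),                   // nullable
+  status_updated_at: timestamp('status_updated_at').notNull(),
+  last_health_check_at: timestamp('last_health_check_at'),  // nullable
+  last_credential_check_at: timestamp('last_credential_check_at'),
+  settings: jsonb('settings'),                              // e.g. request_logging_enabled
+});
+```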
+
### Team Isolation in Data Model
All satellite data respects team boundaries:
diff --git a/development/backend/satellite/events.mdx b/development/backend/satellite/events.mdx
index 7524a68..d77b0e4 100644
--- a/development/backend/satellite/events.mdx
+++ b/development/backend/satellite/events.mdx
@@ -197,18 +197,23 @@ Updates `mcpServerInstallations` table when server status changes during install
**Optional Fields**: `status_message` (string, human-readable context or error details)
-**Status Values**:
+**Status Values** (11 total):
- `provisioning` - Installation created, waiting for satellite
- `command_received` - Satellite acknowledged install command
- `connecting` - Satellite connecting to MCP server
- `discovering_tools` - Tool discovery in progress
- `syncing_tools` - Sending discovered tools to backend
- `online` - Server healthy and responding
+- `restarting` - Configuration changed, server restarting
- `offline` - Server unreachable
- `error` - Connection failed with specific error
- `requires_reauth` - OAuth token expired/revoked
- `permanently_failed` - Process crashed 3+ times in 5 minutes
+**Handler Implementation**: `services/backend/src/events/handlers/mcp/status-changed.handler.ts`
+
+For satellite-side status detection logic and lifecycle flows, see [Satellite Status Tracking](/development/satellite/status-tracking).
+
**Emission Points**:
- Success path: After successful tool discovery → status='online'
- Failure path: On connection errors → status='offline', 'error', or 'requires_reauth'
@@ -225,6 +230,48 @@ Inserts record into `satelliteUsageLogs` for analytics and audit trails.
**Optional Fields**: `error_message` (string, only present when success=false)
+### Logging Events
+
+#### mcp.server.logs
+
+Inserts batched stderr output from MCP servers into `mcpServerLogs` table for debugging and monitoring.
+
+**Business Logic**: Captures stderr output, connection errors, and process lifecycle events. Limited to 100 lines per installation via cleanup cron job.
+
+**Required Fields** (snake_case): `installation_id`, `team_id`, `logs` (array of log entries)
+
+**Handler Implementation**: `services/backend/src/events/handlers/mcp/server-logs.handler.ts`
+
+Event batching strategy (3-second interval, max 20 per batch) is documented in [Satellite Event Emission](/development/satellite/event-emission).
+
+#### mcp.request.logs
+
+Inserts batched tool execution logs into `mcpRequestLogs` table with full request/response data for audit trails.
+
+**Business Logic**: Captures tool execution with request parameters, response data, duration, and success status. Privacy controlled via `mcpServerInstallations.settings.request_logging_enabled`.
+
+**Required Fields** (snake_case): `installation_id`, `team_id`, `tool_name`, `request_params`, `duration_ms`, `success`
+
+**Optional Fields**: `tool_response` (jsonb), `error_message` (string)
+
+**Handler Implementation**: `services/backend/src/events/handlers/mcp/request-logs.handler.ts`
+
+**Database Storage**: `mcpRequestLogs.tool_response` column stores MCP server responses when request logging is enabled.
+
+### Tool Discovery Events
+
+#### mcp.tools.discovered
+
+Updates `mcpToolMetadata` table with discovered tools, token counts, and tool schemas from MCP servers.
+
+**Business Logic**: Stores tool metadata for team visibility, hierarchical router token savings calculations, and frontend tool catalog display.
+
+**Required Fields** (snake_case): `installation_id`, `team_id`, `server_slug`, `tool_count`, `total_tokens`, `tools` (array)
+
+**Handler Implementation**: `services/backend/src/events/handlers/mcp/tools-discovered.handler.ts`
+
+For satellite-side tool discovery implementation, see [Satellite Tool Discovery](/development/satellite/tool-discovery).
+
## Creating New Event Handlers
### Handler Template
@@ -339,6 +386,9 @@ Events route to existing business tables based on their purpose:
| `mcp.server.crashed` | `satelliteProcesses` | Update status='failed', log error details |
| `mcp.server.status_changed` | `mcpServerInstallations` | Update status, status_message, status_updated_at |
| `mcp.tool.executed` | `satelliteUsageLogs` | Insert usage record with metrics |
+| `mcp.server.logs` | `mcpServerLogs` | Insert batched stderr logs (100-line limit) |
+| `mcp.request.logs` | `mcpRequestLogs` | Insert tool execution logs with request/response |
+| `mcp.tools.discovered` | `mcpToolMetadata` | Update tool metadata with token counts |
### Transaction Strategy
diff --git a/development/satellite/architecture.mdx b/development/satellite/architecture.mdx
index 9407649..7df1c60 100644
--- a/development/satellite/architecture.mdx
+++ b/development/satellite/architecture.mdx
@@ -284,31 +284,36 @@ For complete implementation details, see [Backend Polling Implementation](/devel
### Real-Time Event System
-**Event Emission with Batching:**
-```
-Satellite Operations EventBus Backend
- │ │ │
- │─── mcp.server.started ──▶│ │
- │─── mcp.tool.executed ───▶│ [Queue] │
- │─── mcp.client.connected ─▶│ │
- │ [Every 3 seconds] │
- │ │ │
- │ │─── POST /events ───▶│
- │ │◀─── 200 OK ─────────│
-```
-
-**Event Features:**
-- **Immediate Emission**: Events emitted when actions occur (not delayed by 30s heartbeat)
-- **Automatic Batching**: Events collected for 3 seconds, then sent as single batch (max 100 events)
-- **Memory Management**: In-memory queue (10,000 event limit) with overflow protection
-- **Graceful Error Handling**: 429 exponential backoff, 400 drops invalid events, 500/network errors retry
-- **10 Event Types**: Server lifecycle, client connections, tool discovery, configuration updates
+The satellite emits typed events for status changes, logs, and tool metadata. Events enable real-time monitoring without polling.
**Difference from Heartbeat:**
- **Heartbeat** (every 30s): Aggregate metrics, system health, resource usage
-- **Events** (immediate): Point-in-time occurrences, user actions, precise timestamps
+- **Events** (immediate): Point-in-time status updates, precise timestamps
+
+See [Event Emission](/development/satellite/event-emission) for complete event types, payloads, and batching configuration.
+
+### Status Tracking System
+
+The satellite tracks MCP server installation health through an 11-state status system that drives tool availability and automatic recovery.
+
+**Status Values:**
+- Installation lifecycle: `provisioning`, `command_received`, `connecting`, `discovering_tools`, `syncing_tools`
+- Healthy state: `online` (tools available)
+- Configuration changes: `restarting`
+- Failure states: `offline`, `error`, `requires_reauth`, `permanently_failed`
+
+**Status Integration:**
+- **Tool Filtering**: Tools from non-online servers hidden from discovery
+- **Auto-Recovery**: Offline servers auto-recover when responsive
+- **Event Emission**: Status changes emitted immediately to backend
+
+See [Status Tracking](/development/satellite/status-tracking) for complete status lifecycle and transitions.
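+
+For intuition, a minimal sketch of the filtering rule — the names here are hypothetical; the real logic lives in the tool discovery manager:
+
+```typescript
+// Only tools from healthy servers are exposed to clients
+function visibleTools(
+  tools: Array<{ toolPath: string; installationId: string }>,
+  statusOf: (installationId: string) => string
+): Array<{ toolPath: string; installationId: string }> {
+  return tools.filter((tool) => statusOf(tool.installationId) === 'online');
+}
+```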
+
+### Log Capture System
+
+The satellite captures and batches two types of logs for debugging and monitoring: **server logs** (stderr output) and **request logs** (tool execution with full request/response data).
-For complete event system documentation, see [Event System](/development/satellite/event-system).
+See [Log Capture](/development/satellite/log-capture) for buffering implementation, batching configuration, backend storage limits, and privacy controls.
## Security Architecture
@@ -437,14 +442,14 @@ For testing the hierarchical router (tool discovery and execution), see [Hierarc
## Implementation Status
-The satellite service has completed **Phase 1: MCP Transport Implementation** and **Phase 4: Backend Integration**. Current implementation provides:
+The satellite service has completed the MCP transport implementation and backend integration. The current implementation provides:
-**Phase 1 - MCP Transport Layer:**
+**MCP Transport Layer:**
- **Complete MCP Transport Layer**: SSE, SSE Messaging, Streamable HTTP
- **Session Management**: Cryptographically secure with automatic cleanup
- **JSON-RPC 2.0 Compliance**: Full protocol support with error handling
-**Phase 4 - Backend Integration:**
+**Backend Integration:**
- **Command Polling Service**: Adaptive polling with three modes (normal/immediate/error)
- **Dynamic Configuration Management**: Replaces hardcoded MCP server configurations
- **Command Processing**: HTTP MCP server management (spawn/kill/restart/health_check)
diff --git a/development/satellite/backend-communication.mdx b/development/satellite/backend-communication.mdx
index 1f09bcf..f2d9018 100644
--- a/development/satellite/backend-communication.mdx
+++ b/development/satellite/backend-communication.mdx
@@ -157,11 +157,11 @@ For detailed event system documentation, see [Event System](/development/satelli
- Performance metrics collection
**Terminate Process:**
-- Graceful shutdown with SIGTERM
-- Force kill with SIGKILL after timeout
- Resource cleanup and deallocation
- Final status report to Backend
+See [Process Management - Graceful Termination](/development/satellite/process-management#graceful-termination) for SIGTERM/SIGKILL shutdown details.
+
## Internal Architecture
### Five Core Components
@@ -286,7 +286,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions.
### Authentication Flow
-**Registration Phase:**
+**Registration:**
1. Admin generates JWT registration token via backend API
2. Satellite includes token in Authorization header during registration
3. Backend validates token signature, scope, and expiration
@@ -295,7 +295,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions.
For detailed token validation process, see [Registration Security](/development/backend/satellite-communication#satellite-pairing-process).
-**Operational Phase:**
+**Ongoing Operations:**
1. All requests include `Authorization: Bearer {api_key}`
2. Backend validates API key and satellite scope
3. Team context extracted from satellite registration
@@ -375,6 +375,70 @@ server.log.info({
4. Add comprehensive monitoring and alerting
5. End-to-end testing and performance validation
+## Events vs Heartbeat
+
+The satellite communicates status and metrics through two distinct channels:
+
+**Events (Immediate):**
+- Emitted when actions occur (not delayed by heartbeat interval)
+- Point-in-time status updates with precise timestamps
+- Batched automatically (3-second interval, max 20 per batch)
+- Types: Status changes, logs, tool metadata, lifecycle events
+
+**Heartbeat (Periodic, every 30s):**
+- Aggregate metrics and system health
+- Resource usage statistics
+- Overall satellite status
+
+See [Event Emission](/development/satellite/event-emission) for complete event types and batching strategy.
+
+## Health Check Command
+
+The backend sends `health_check` commands for credential validation:
+
+**Command Structure:**
+```typescript
+{
+ commandType: 'health_check',
+ priority: 'immediate',
+ payload: {
+ check_type: 'credential_validation',
+ installation_id: string,
+ team_id: string
+ }
+}
+```
+
+**Satellite Action:**
+- Calls `tools/list` on MCP server with credentials
+- Detects auth errors (401, 403)
+- Emits `requires_reauth` status if validation fails
+
+See [Commands](/development/satellite/commands) for complete command reference.
+
+## Recovery Commands
+
+When offline servers recover, backend sends recovery commands:
+
+**Command Structure:**
+```typescript
+{
+ commandType: 'configure',
+ priority: 'high',
+ payload: {
+ event: 'mcp_recovery',
+ installation_id: string,
+ team_id: string
+ }
+}
+```
+
+**Satellite Action:**
+- Triggers re-discovery for the recovered server
+- Status progresses: `offline` → `connecting` → `discovering_tools` → `online`
+
+See [Recovery System](/development/satellite/recovery-system) for automatic recovery logic.
+
The satellite communication system is designed for enterprise deployment with complete team isolation, resource management, and audit logging while maintaining the developer experience that defines the DeployStack platform.
diff --git a/development/satellite/commands.mdx b/development/satellite/commands.mdx
index ff428df..cb328e1 100644
--- a/development/satellite/commands.mdx
+++ b/development/satellite/commands.mdx
@@ -115,6 +115,44 @@ Each satellite command contains:
4. Restart affected components
5. Verify system integrity post-update
+### health_check
+
+**Purpose**: Validates MCP server credentials and connectivity
+
+**Priority**: `immediate`
+
+**Triggered By**:
+- Backend credential validation cron (every 1 minute)
+- Manual credential testing
+- OAuth token expiration detection
+
+**Payload Structure**:
+```json
+{
+ "check_type": "credential_validation",
+ "installation_id": "installation-uuid",
+ "team_id": "team-uuid"
+}
+```
+
+**Satellite Actions**:
+1. Find MCP server configuration by installation_id
+2. Skip stdio servers (no HTTP credentials to validate)
+3. Build HTTP request with configured credentials (headers, query params)
+4. Call `tools/list` with 15-second timeout
+5. Detect authentication errors:
+ - HTTP 401/403 responses
+ - Error messages containing "auth", "unauthorized", "forbidden"
+6. Emit status event:
+ - On auth failure → `requires_reauth` status
+ - On success → credentials valid (no status change)
+
+**Error Detection Patterns**:
+- HTTP status codes: 401, 403
+- Response body keywords: "auth", "unauthorized", "forbidden", "token", "credentials"
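+
+A minimal sketch of the detection rule described above — the helper name and exact keyword matching are illustrative:
+
+```typescript
+const AUTH_KEYWORDS = ['auth', 'unauthorized', 'forbidden', 'token', 'credentials'];
+
+// True when a tools/list response indicates an authentication failure
+function isAuthError(httpStatus: number, responseBody: string): boolean {
+  if (httpStatus === 401 || httpStatus === 403) return true;
+  const body = responseBody.toLowerCase();
+  return AUTH_KEYWORDS.some((keyword) => body.includes(keyword));
+}
+```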
+
+See [Status Tracking](/development/satellite/status-tracking) for credential validation status flow.
+
## Command Lifecycle
### Creation
diff --git a/development/satellite/event-emission.mdx b/development/satellite/event-emission.mdx
new file mode 100644
index 0000000..46ae8e9
--- /dev/null
+++ b/development/satellite/event-emission.mdx
@@ -0,0 +1,427 @@
+---
+title: Event Emission
+description: Events emitted by the satellite to communicate with the backend
+---
+
+# Event Emission
+
+The satellite communicates with the backend through a centralized EventBus that emits typed events. These events enable real-time status updates, log streaming, and tool metadata synchronization without polling.
+
+## Overview
+
+The satellite emits events for:
+- **Status Changes**: Real-time installation status updates
+- **Server Logs**: Batched stderr output from MCP servers
+- **Request Logs**: Batched tool execution logs with request/response data
+- **Tool Metadata**: Tool discovery results with token counts
+- **Process Lifecycle**: Server start, crash, restart, permanent failure events
+
+All events are processed by the backend's event handler system and trigger database updates, SSE broadcasts to the frontend, and health monitoring actions.
+
+## Event System Architecture
+
+```
+Satellite Component (ProcessManager, McpServerWrapper, DiscoveryManager)
+ ↓
+EventBus.emit(eventType, eventData)
+ ↓
+Backend Polling Service (sends event batches every 3 seconds)
+ ↓
+Backend Event Handlers (process events, update database)
+ ↓
+Frontend SSE Streams (real-time updates to users)
+```
+
+## Event Types Reference
+
+### mcp.server.status_changed
+
+**Purpose:** Update installation status in real-time
+
+**Emitted by:**
+- ProcessManager (connecting, online, crashed, permanently_failed)
+- McpServerWrapper (offline, error, requires_reauth on tool execution failures)
+- RemoteToolDiscoveryManager (connecting, online, offline, error, requires_reauth)
+
+For complete status transition triggers and lifecycle flows, see [Status Tracking](/development/satellite/status-tracking).
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ status: 'provisioning' | 'command_received' | 'connecting' | 'discovering_tools'
+ | 'syncing_tools' | 'online' | 'restarting' | 'offline' | 'error'
+ | 'requires_reauth' | 'permanently_failed';
+ status_message?: string;
+ timestamp: string; // ISO 8601
+}
+```
+
+**Example:**
+```typescript
+eventBus.emit('mcp.server.status_changed', {
+ installation_id: 'inst_abc123',
+ team_id: 'team_xyz',
+ status: 'online',
+ status_message: 'Server connected successfully',
+ timestamp: '2025-01-15T10:30:00.000Z'
+});
+```
+
+**Backend Action:** Updates `mcpServerInstallations.status` and broadcasts via SSE
+
+---
+
+### mcp.server.logs
+
+**Purpose:** Stream server logs (stderr, connection errors, startup messages) to backend
+
+**Emitted by:**
+- ProcessManager (batched stderr output from stdio MCP servers)
+
+**Batching Strategy:**
+- **Interval**: 3 seconds after first log entry
+- **Max Size**: 20 logs per batch (forces immediate flush)
+- **Grouping**: By `installation_id + team_id`
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ logs: Array<{
+ level: 'info' | 'warn' | 'error' | 'debug';
+ message: string;
+    metadata?: Record<string, unknown>;
+ timestamp: string; // ISO 8601
+ }>;
+}
+```
+
+**Example:**
+```typescript
+eventBus.emit('mcp.server.logs', {
+ installation_id: 'inst_abc123',
+ team_id: 'team_xyz',
+ logs: [
+ {
+ level: 'error',
+ message: 'Connection refused to http://localhost:3568/sse',
+ metadata: { error_code: 'ECONNREFUSED' },
+ timestamp: '2025-01-15T10:30:00.000Z'
+ },
+ {
+ level: 'info',
+ message: 'Retrying connection in 2 seconds...',
+ timestamp: '2025-01-15T10:30:02.000Z'
+ }
+ ]
+});
+```
+
+**Backend Action:** Inserts logs into `mcpServerLogs` table, enforces 100-line limit per installation
+
+---
+
+### mcp.request.logs
+
+**Purpose:** Stream tool execution logs with full request/response data
+
+**Emitted by:**
+- McpServerWrapper (batched tool call logs)
+
+**Batching Strategy:**
+- **Interval**: 3 seconds after first request
+- **Max Size**: 20 requests per batch
+- **Grouping**: By `installation_id + team_id`
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ requests: Array<{
+ user_id?: string;
+ tool_name: string;
+    tool_params: Record<string, unknown>;
+ tool_response?: unknown; // Full MCP server response
+ response_time_ms: number;
+ success: boolean;
+ error_message?: string;
+ timestamp: string; // ISO 8601
+ }>;
+}
+```
+
+**Example:**
+```typescript
+eventBus.emit('mcp.request.logs', {
+ installation_id: 'inst_abc123',
+ team_id: 'team_xyz',
+ requests: [
+ {
+ user_id: 'user_xyz',
+ tool_name: 'github:list-repos',
+ tool_params: { owner: 'deploystackio' },
+ tool_response: { repos: ['deploystack', 'mcp-server'], total: 2 },
+ response_time_ms: 234,
+ success: true,
+ timestamp: '2025-01-15T10:30:00.000Z'
+ }
+ ]
+});
+```
+
+**Backend Action:** Inserts requests into `mcpRequestLogs` table, enforces 100-line limit
+
+**Privacy Note:** Only emitted if `settings.request_logging_enabled !== false`
+
+---
+
+### mcp.tools.discovered
+
+**Purpose:** Synchronize discovered tools and metadata to backend
+
+**Emitted by:**
+- UnifiedToolDiscoveryManager (after tool discovery completes)
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ tools: Array<{
+ tool_path: string; // e.g., "github:list-repos"
+ name: string;
+ description?: string;
+ inputSchema: unknown;
+ token_count: number; // Estimated token usage
+ }>;
+ timestamp: string; // ISO 8601
+}
+```
+
+**Example:**
+```typescript
+eventBus.emit('mcp.tools.discovered', {
+ installation_id: 'inst_abc123',
+ team_id: 'team_xyz',
+ tools: [
+ {
+ tool_path: 'github:list-repos',
+ name: 'list-repos',
+ description: 'List all repositories for an owner',
+ inputSchema: { type: 'object', properties: { owner: { type: 'string' } } },
+ token_count: 42
+ }
+ ],
+ timestamp: '2025-01-15T10:30:00.000Z'
+});
+```
+
+**Backend Action:** Updates `mcpToolMetadata` table with discovered tools and metadata
+
+---
+
+### Process Lifecycle Events
+
+These events track stdio MCP server process state:
+
+#### mcp.server.started
+
+**Emitted when:** Stdio process successfully spawned
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ process_id: string;
+ timestamp: string;
+}
+```
+
+#### mcp.server.crashed
+
+**Emitted when:** Stdio process terminates unexpectedly
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ process_id: string;
+ exit_code: number | null;
+ signal: string | null;
+ crash_count: number; // Crashes within 5-minute window
+ timestamp: string;
+}
+```
+
+#### mcp.server.restarted
+
+**Emitted when:** Stdio process automatically restarted after crash
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ process_id: string;
+ restart_count: number;
+ timestamp: string;
+}
+```
+
+#### mcp.server.permanently_failed
+
+**Emitted when:** Stdio process crashes 3 times within 5 minutes
+
+**Payload:**
+```typescript
+{
+ installation_id: string;
+ team_id: string;
+ process_id: string;
+ crash_count: number; // Always 3
+ message: string; // "Process crashed 3 times in 5 minutes"
+ timestamp: string;
+}
+```
+
+**Backend Action:** Sets installation status to `permanently_failed`, requires manual restart
+
+---
+
+## Event Batching Strategy
+
+### Why Batching?
+
+Batching reduces:
+- Backend API calls (20 logs = 1 API call instead of 20)
+- Database transactions (bulk insert instead of individual inserts)
+- Network overhead (fewer HTTP requests)
+- Backend processing load (batch operations are more efficient)
+
+### Batching Configuration
+
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| Batch Interval | 3 seconds | Balance between real-time feel and efficiency |
+| Max Batch Size | 20 entries | Prevent large payloads, force timely emission |
+| Grouping Key | `installation_id + team_id` | Separate batches per installation |
+
+### Batching Implementation
+
+Log batching implementation details are in [Log Capture - Buffering Implementation](/development/satellite/log-capture#buffering-implementation) for both server logs and request logs.
+
+## EventBus Usage
+
+### Emitting Events
+
+```typescript
+import { EventBus } from './events/event-bus';
+
+// EventBus is a singleton
+const eventBus = EventBus.getInstance();
+
+// Emit with type safety
+eventBus.emit('mcp.server.status_changed', {
+ installation_id: 'inst_123',
+ team_id: 'team_456',
+ status: 'online',
+ timestamp: new Date().toISOString()
+});
+```
+
+### Event Registry
+
+All event types are defined in the event registry:
+
+```typescript
+// services/satellite/src/events/registry.ts
+
+export type EventType =
+ | 'mcp.server.status_changed'
+ | 'mcp.server.logs'
+ | 'mcp.request.logs'
+ | 'mcp.tools.discovered'
+ | 'mcp.server.started'
+ | 'mcp.server.crashed'
+ | 'mcp.server.restarted'
+ | 'mcp.server.permanently_failed'
+ // ... 13 total event types
+ ;
+
+export interface EventDataMap {
+ 'mcp.server.status_changed': { /* payload */ };
+ 'mcp.server.logs': { /* payload */ };
+ // ... type definitions for all events
+}
+```
+
+## Backend Event Handlers
+
+Each event type has a dedicated backend handler:
+
+**Status Changed:**
+```typescript
+// services/backend/src/events/handlers/mcp/status-changed.handler.ts
+// Updates mcpServerInstallations.status
+```
+
+**Server Logs:**
+```typescript
+// services/backend/src/events/handlers/mcp/server-logs.handler.ts
+// Inserts into mcpServerLogs table
+```
+
+**Request Logs:**
+```typescript
+// services/backend/src/events/handlers/mcp/request-logs.handler.ts
+// Inserts into mcpRequestLogs table (if logging enabled)
+```
+
+**Tools Discovered:**
+```typescript
+// services/backend/src/events/handlers/mcp/tools-discovered.handler.ts
+// Updates mcpToolMetadata table with metadata
+```
+
+## Integration Points
+
+**Process Manager:**
+- Emits server logs (stderr batching)
+- Emits lifecycle events (started, crashed, restarted, permanently_failed)
+- Emits status changes (connecting, online, permanently_failed)
+
+**MCP Server Wrapper:**
+- Emits request logs (tool execution batching)
+- Emits status changes (offline, error, requires_reauth on failures)
+- Emits status changes (connecting, online on recovery)
+
+**Tool Discovery Managers:**
+- Emit status changes (connecting, discovering_tools, online, offline, error)
+- Trigger tool metadata emission via UnifiedToolDiscoveryManager
+
+**Unified Tool Discovery Manager:**
+- Emits `mcp.tools.discovered` after successful discovery
+- Coordinates status callbacks from discovery managers
+
+## Implementation Components
+
+The event emission system consists of several integrated components:
+- Backend event handler system
+- Satellite status event emission
+- Server and request log batching
+- Tool metadata event emission
+- Stdio permanently_failed event
+- Tool execution failure status events
+
+## Related Documentation
+
+- [Status Tracking](/development/satellite/status-tracking) - Status values and lifecycle
+- [Log Capture](/development/satellite/log-capture) - Logging system details
+- [Process Management](/development/satellite/process-management) - Lifecycle events
+- [Tool Discovery](/development/satellite/tool-discovery) - Tool metadata events
diff --git a/development/satellite/hierarchical-router.mdx b/development/satellite/hierarchical-router.mdx
index 5267e6c..9a11617 100644
--- a/development/satellite/hierarchical-router.mdx
+++ b/development/satellite/hierarchical-router.mdx
@@ -384,66 +384,9 @@ Satellite → Client
## Format Conversion
-### External vs Internal Formats
+The satellite converts between user-facing format (`serverName:toolName`) and internal routing format (`serverName-toolName`) transparently during tool discovery and execution.
-The satellite uses different tool path formats for different purposes:
-
-**External Format (User-Facing): `serverName:toolName`**
-
-Used in:
-- `discover_mcp_tools` responses
-- `execute_mcp_tool` requests
-- Any client-facing communication
-
-Examples:
-- `github:create_issue`
-- `figma:get_file`
-- `postgres:query`
-
-Why colon?
-- Standard separator in URIs and paths
-- Clean, readable format
-- Industry convention (npm packages, docker images)
-
-**Internal Format (Routing): `serverName-toolName`**
-
-Used in:
-- Unified tool cache keys
-- Tool discovery manager
-- Process routing
-- Internal lookups
-
-Examples:
-- `github-create_issue`
-- `figma-get_file`
-- `postgres-query`
-
-Why dash?
-- Existing codebase convention
-- Backward compatibility
-- All existing code uses dash format
-
-### Conversion Logic
-
-```typescript
-// In handleExecuteTool()
-const toolPath = "github:create_issue"; // From client
-
-// Parse external format
-const [serverSlug, toolName] = toolPath.split(':');
-
-// Convert to internal format
-const namespacedToolName = `${serverSlug}-${toolName}`;
-// Result: "github-create_issue"
-
-// Look up in cache
-const cachedTool = toolDiscoveryManager.getTool(namespacedToolName);
-
-// Route to actual MCP server
-await executeToolCall(namespacedToolName, toolArguments);
-```
-
-The conversion is transparent to both clients and actual MCP servers - it's purely a satellite internal concern.
+See [Tool Discovery - Namespacing Strategy](/development/satellite/tool-discovery#namespacing-strategy) for complete details on naming conventions and format conversion logic.
## Search Implementation
@@ -586,9 +529,17 @@ Both meta-tools are implemented and production-ready:
- Fast search performance
- Easy to monitor and debug
+## Status-Based Tool Filtering
+
+The hierarchical router integrates with status tracking to hide tools from unavailable servers and to return clear error messages when a client invokes a tool on an unavailable server.
+
+See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete filtering logic, execution blocking rules, and status values.
+
## Related Documentation
- [Tool Discovery Implementation](/development/satellite/tool-discovery) - Internal tool caching and discovery
+- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status
+- [Recovery System](/development/satellite/recovery-system) - How offline servers auto-recover
- [MCP Transport Protocols](/development/satellite/mcp-transport) - How clients connect
- [Process Management](/development/satellite/process-management) - stdio server lifecycle
- [Architecture Overview](/development/satellite/architecture) - Complete satellite design
diff --git a/development/satellite/index.mdx b/development/satellite/index.mdx
index f99b41e..1315a32 100644
--- a/development/satellite/index.mdx
+++ b/development/satellite/index.mdx
@@ -214,7 +214,7 @@ npm run release # Release management
## Implemented Features
-### Phase 2: MCP Server Process Management
+### MCP Server Process Management
- **Process Lifecycle**: Spawn, monitor, auto-restart (max 3), and terminate MCP servers
- **stdio Communication**: Full JSON-RPC 2.0 protocol over stdin/stdout
- **HTTP Proxy**: Reverse proxy for external MCP server endpoints working
@@ -223,20 +223,20 @@ npm run release # Release management
- **Tool Discovery**: Automatic tool caching from both HTTP and stdio servers
- **Team-Grouped Heartbeat**: processes_by_team reporting every 30 seconds
-### Phase 3: Team Isolation
+### Team Isolation
- **nsjail Sandboxing**: Complete process isolation with built-in resource limits
- **Namespace Isolation**: PID, mount, UTS, IPC namespaces per team
- **Filesystem Isolation**: Team-specific read-only and writable directories
- **Credential Management**: Secure environment injection via nsjail
-### Phase 4: Backend Integration
+### Backend Integration
- **HTTP Polling**: Outbound communication with DeployStack Backend
- **Configuration Sync**: Dynamic configuration updates from Backend
- **Status Reporting**: Real-time satellite health and usage metrics
- **Command Processing**: Execute Backend commands with acknowledgment
-- **Event System**: Real-time event emission with automatic batching (10 event types)
+- **Event System**: Real-time event emission with automatic batching (13 event types)
-### Phase 5: Enterprise Features
+### Enterprise Features
- **OAuth 2.1 Authentication**: Resource server with token introspection
- **Audit Logging**: Complete audit trails for compliance
- **Multi-Region Support**: Global satellite deployment
diff --git a/development/satellite/log-capture.mdx b/development/satellite/log-capture.mdx
new file mode 100644
index 0000000..6a01d4a
--- /dev/null
+++ b/development/satellite/log-capture.mdx
@@ -0,0 +1,451 @@
+---
+title: Log Capture
+description: Server and request logging system in the satellite
+---
+
+# Log Capture
+
+The satellite captures and batches two types of logs for each MCP server installation: **server logs** (stderr output, connection errors, startup messages) and **request logs** (tool execution with full request/response data).
+
+## Overview
+
+Log capture serves three purposes:
+
+- **Debugging**: Developers can see stderr output and tool execution details
+- **Monitoring**: Server health and tool usage are tracked in real-time
+- **Audit Trail**: A complete record of tool calls with parameters and responses
+
+Both log types use the same batching strategy (3-second interval, max 20 per batch) to optimize backend API calls and database writes.
+
+## Server Logs
+
+Server logs capture stderr output and connection events from MCP servers, particularly useful for debugging stdio-based servers.
+
+### What Gets Logged
+
+**Stdio Servers:**
+- stderr output from the MCP server process
+- Connection errors (handshake failures)
+- Process spawn errors
+- Crash information
+
+**HTTP/SSE Servers:**
+- Connection errors (ECONNREFUSED, ETIMEDOUT)
+- HTTP error responses (4xx, 5xx)
+- OAuth authentication failures
+- Network timeouts
+
+### Log Levels
+
+| Level | Usage |
+|-------|-------|
+| `info` | Normal operations (connection established, tool discovery started) |
+| `warn` | Non-critical issues (retry attempts, temporary failures) |
+| `error` | Critical errors (connection refused, auth failures, crashes) |
+| `debug` | Detailed diagnostic information (handshake details, raw responses) |
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/process/manager.ts
+
+interface BufferedLogEntry {
+ installation_id: string;
+ team_id: string;
+ level: 'info' | 'warn' | 'error' | 'debug';
+ message: string;
+  metadata?: Record<string, unknown>;
+ timestamp: string;
+}
+
+class ProcessManager {
+ private logBuffer: BufferedLogEntry[] = [];
+ private logFlushTimeout: NodeJS.Timeout | null = null;
+ private readonly LOG_BATCH_INTERVAL_MS = 3000;
+ private readonly LOG_BATCH_MAX_SIZE = 20;
+
+ // Called when stderr receives data
+ private handleStderrData(processInfo: ProcessInfo, data: Buffer) {
+ const message = data.toString().trim();
+
+ this.bufferLogEntry({
+ installation_id: processInfo.config.installation_id,
+ team_id: processInfo.config.team_id,
+ level: this.inferLogLevel(message), // 'error' if contains "error", etc.
+ message,
+ metadata: { process_id: processInfo.processId },
+ timestamp: new Date().toISOString()
+ });
+ }
+
+ private bufferLogEntry(entry: BufferedLogEntry) {
+ this.logBuffer.push(entry);
+
+ // Force immediate flush if buffer full
+ if (this.logBuffer.length >= this.LOG_BATCH_MAX_SIZE) {
+ this.flushLogBuffer();
+ } else {
+ this.scheduleLogFlush(); // Flush after 3 seconds
+ }
+ }
+
+ private scheduleLogFlush() {
+ if (this.logFlushTimeout) return; // Already scheduled
+
+ this.logFlushTimeout = setTimeout(() => {
+ this.flushLogBuffer();
+ }, this.LOG_BATCH_INTERVAL_MS);
+ }
+
+ private flushLogBuffer() {
+ if (this.logBuffer.length === 0) return;
+
+ // Group by installation
+    const groupedLogs = new Map<string, BufferedLogEntry[]>();
+ for (const entry of this.logBuffer) {
+ const key = `${entry.installation_id}:${entry.team_id}`;
+ if (!groupedLogs.has(key)) {
+ groupedLogs.set(key, []);
+ }
+ groupedLogs.get(key)!.push(entry);
+ }
+
+ // Emit one event per installation
+ for (const [key, logs] of groupedLogs.entries()) {
+ this.eventBus?.emit('mcp.server.logs', {
+ installation_id: logs[0].installation_id,
+ team_id: logs[0].team_id,
+ logs: logs.map(log => ({
+ level: log.level,
+ message: log.message,
+ metadata: log.metadata,
+ timestamp: log.timestamp
+ }))
+ });
+ }
+
+ // Clear buffer
+ this.logBuffer = [];
+ this.logFlushTimeout = null;
+ }
+}
+```
+
+### Example Server Logs
+
+```json
+{
+ "installation_id": "inst_abc123",
+ "team_id": "team_xyz",
+ "logs": [
+ {
+ "level": "info",
+ "message": "MCP server starting on port 3568",
+ "timestamp": "2025-01-15T10:30:00.000Z"
+ },
+ {
+ "level": "error",
+ "message": "Connection refused: ECONNREFUSED",
+ "metadata": { "error_code": "ECONNREFUSED" },
+ "timestamp": "2025-01-15T10:30:05.000Z"
+ },
+ {
+ "level": "warn",
+ "message": "Retrying connection in 2 seconds...",
+ "timestamp": "2025-01-15T10:30:07.000Z"
+ }
+ ]
+}
+```
+
+## Request Logs
+
+Request logs capture tool execution with full request parameters and server responses, providing complete visibility into MCP tool usage.
+
+### What Gets Logged
+
+For each tool execution:
+- Tool name (e.g., `github:list-repos`)
+- Input parameters sent to tool
+- **Full response from MCP server** (when request logging is enabled)
+- Response time in milliseconds
+- Success/failure status
+- Error message (if failed)
+- User ID (who called the tool)
+- Timestamp
+
+### Privacy Control
+
+Request logging can be disabled per-installation via settings:
+
+```typescript
+// Installation settings
+{
+ "request_logging_enabled": false
+}
+```
+
+When disabled:
+- No request logs are buffered or emitted
+- Tool execution still works normally
+- Server logs (stderr) still captured
+- Useful for privacy-sensitive tools (internal APIs, credentials, PII)
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface BufferedRequestEntry {
+ installation_id: string;
+ team_id: string;
+ user_id?: string;
+ tool_name: string;
+  tool_params: Record<string, unknown>;
+ tool_response?: unknown; // Full MCP server response
+ response_time_ms: number;
+ success: boolean;
+ error_message?: string;
+ timestamp: string;
+}
+
+class McpServerWrapper {
+ private requestLogBuffer: BufferedRequestEntry[] = [];
+ private requestLogFlushTimeout: NodeJS.Timeout | null = null;
+ private readonly REQUEST_LOG_BATCH_INTERVAL_MS = 3000;
+ private readonly REQUEST_LOG_BATCH_MAX_SIZE = 20;
+
+ async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+ const startTime = Date.now();
+ let result: unknown;
+ let success = false;
+ let errorMessage: string | undefined;
+
+ try {
+ result = await this.executeToolCall(toolPath, toolArguments);
+ success = true;
+    } catch (error) {
+      errorMessage = error instanceof Error ? error.message : 'Unknown error';
+      throw error; // finally below still records the log before this propagates
+    } finally {
+ const responseTimeMs = Date.now() - startTime;
+
+      // 'config' is the installation config for this server, held by the wrapper
+      // Check if logging is enabled (default: true)
+      const loggingEnabled = config?.settings?.request_logging_enabled !== false;
+
+ // Buffer request log if installation context exists and logging enabled
+ if ((config?.installation_id && config?.team_id) && loggingEnabled) {
+ this.bufferRequestLogEntry({
+ installation_id: config.installation_id,
+ team_id: config.team_id,
+ user_id: config.user_id,
+ tool_name: toolPath,
+          tool_params: toolArguments as Record<string, unknown>,
+ tool_response: result, // Captured response
+ response_time_ms: responseTimeMs,
+ success,
+ error_message: errorMessage,
+ timestamp: new Date().toISOString()
+ });
+ }
+ }
+
+ return result;
+ }
+
+ private bufferRequestLogEntry(entry: BufferedRequestEntry) {
+ this.requestLogBuffer.push(entry);
+
+ // Force flush if buffer full
+ if (this.requestLogBuffer.length >= this.REQUEST_LOG_BATCH_MAX_SIZE) {
+ this.flushRequestLogBuffer();
+ } else {
+ this.scheduleRequestLogFlush();
+ }
+ }
+
+ private flushRequestLogBuffer() {
+ if (this.requestLogBuffer.length === 0) return;
+
+ // Group by installation
+ const grouped = this.groupRequestsByInstallation(this.requestLogBuffer);
+
+ // Emit one event per installation
+ for (const [key, requests] of grouped.entries()) {
+ this.eventBus?.emit('mcp.request.logs', {
+ installation_id: requests[0].installation_id,
+ team_id: requests[0].team_id,
+ requests: requests.map(req => ({
+ user_id: req.user_id,
+ tool_name: req.tool_name,
+ tool_params: req.tool_params,
+ tool_response: req.tool_response, // Include response
+ response_time_ms: req.response_time_ms,
+ success: req.success,
+ error_message: req.error_message,
+ timestamp: req.timestamp
+ }))
+ });
+ }
+
+ // Clear buffer
+ this.requestLogBuffer = [];
+ this.requestLogFlushTimeout = null;
+ }
+}
+```
+
+### Example Request Logs
+
+```json
+{
+ "installation_id": "inst_abc123",
+ "team_id": "team_xyz",
+ "requests": [
+ {
+ "user_id": "user_xyz",
+ "tool_name": "github:list-repos",
+ "tool_params": {
+ "owner": "deploystackio"
+ },
+ "tool_response": {
+ "repos": ["deploystack", "mcp-server"],
+ "total": 2
+ },
+ "response_time_ms": 234,
+ "success": true,
+ "timestamp": "2025-01-15T10:30:00.000Z"
+ },
+ {
+ "user_id": "user_xyz",
+ "tool_name": "slack:send-message",
+ "tool_params": {
+ "channel": "#general",
+ "text": "Deploy complete"
+ },
+ "response_time_ms": 456,
+ "success": false,
+ "error_message": "Channel not found",
+ "timestamp": "2025-01-15T10:30:05.000Z"
+ }
+ ]
+}
+```
+
+## Batching Configuration
+
+Both server logs and request logs use the same batching strategy. See [Event Emission - Batching Configuration](/development/satellite/event-emission#batching-configuration) for configuration parameters and rationale.
+
+### Batching Flow
+
+```
+Log/Request occurs
+ ↓
+Buffer entry in memory
+ ↓
+ ├─ Buffer size < 20?
+ │ ↓
+ │ Schedule flush after 3 seconds
+ │
+ └─ Buffer size >= 20?
+ ↓
+ Flush immediately (force)
+ ↓
+Group entries by installation
+ ↓
+Emit one event per installation
+ ↓
+Backend receives batched logs
+ ↓
+Bulk insert into database
+```
+
+## Backend Storage
+
+### Server Logs Table
+
+```sql
+CREATE TABLE mcpServerLogs (
+ id TEXT PRIMARY KEY,
+ installation_id TEXT NOT NULL,
+ level TEXT NOT NULL, -- 'info'|'warn'|'error'|'debug'
+ message TEXT NOT NULL,
+ metadata JSONB,
+ created_at TIMESTAMP NOT NULL,
+ FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id)
+);
+```
+
+### Request Logs Table
+
+```sql
+CREATE TABLE mcpRequestLogs (
+ id TEXT PRIMARY KEY,
+ installation_id TEXT NOT NULL,
+ user_id TEXT,
+ tool_name TEXT NOT NULL,
+ tool_params JSONB NOT NULL,
+ tool_response JSONB, -- Full response from MCP server
+ response_time_ms INTEGER NOT NULL,
+ success BOOLEAN NOT NULL,
+ error_message TEXT,
+ created_at TIMESTAMP NOT NULL,
+ FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id),
+ FOREIGN KEY (user_id) REFERENCES authUser(id)
+);
+```
+
+### Cleanup Job
+
+A backend cron job enforces a 100-line limit per installation for both tables:
+
+```sql
+-- Runs every 10 minutes; keeps the most recent 100 rows per installation
+-- (the same statement runs against mcpRequestLogs)
+DELETE FROM mcpServerLogs
+WHERE installation_id = :installation_id
+  AND id NOT IN (
+    SELECT id FROM mcpServerLogs
+    WHERE installation_id = :installation_id
+    ORDER BY created_at DESC
+    LIMIT 100
+  );
+```
+
+This prevents unbounded table growth while maintaining recent debugging history.
+
+## Buffer Management
+
+### Memory Usage
+
+**Server Logs:**
+- Maximum ~20 entries in buffer before flush
+- Each entry: ~200 bytes average (message + metadata)
+- Max buffer size: ~4 KB per ProcessManager instance
+
+**Request Logs:**
+- Maximum ~20 entries in buffer before flush
+- Each entry: Variable (depends on params/response size)
+- Typically: 500 bytes - 5 KB per entry
+- Max buffer size: ~10-100 KB per McpServerWrapper instance
+
+### Cleanup on Shutdown
+
+Both buffer managers flush remaining logs on cleanup:
+
+```typescript
+// ProcessManager cleanup
+cleanup() {
+ this.flushLogBuffer(); // Flush any buffered logs
+ clearTimeout(this.logFlushTimeout);
+}
+
+// McpServerWrapper cleanup
+cleanup() {
+ this.flushRequestLogBuffer(); // Flush any buffered requests
+ clearTimeout(this.requestLogFlushTimeout);
+}
+```
+
+## Implementation Components
+
+The log capture system consists of several integrated components:
+- Server and request log batching implementation
+- Request logging toggle and tool response capture
+- Backend log tables and event handlers
+- 100-line cleanup job
+
+## Related Documentation
+
+- [Event Emission](/development/satellite/event-emission) - Log event types and payloads
+- [Process Management](/development/satellite/process-management) - Server log buffering
+- [Status Tracking](/development/satellite/status-tracking) - How logs relate to status
diff --git a/development/satellite/mcp-server-token-injection.mdx b/development/satellite/mcp-server-token-injection.mdx
index f1cb108..75cfd8a 100644
--- a/development/satellite/mcp-server-token-injection.mdx
+++ b/development/satellite/mcp-server-token-injection.mdx
@@ -341,7 +341,7 @@ private isCacheValid(cachedAt: number, expiresAt: string | null): boolean {
async handleHttpToolCall(serverName: string, originalToolName: string, args: unknown) {
const config = this.serverConfigs.get(serverName);
- // Phase 10: OAuth token injection for HTTP/SSE MCP servers
+ // OAuth token injection for HTTP/SSE MCP servers
let headers: Record<string, string> = {};
// Add regular headers from config (API keys, custom headers, etc.)
@@ -447,7 +447,7 @@ async handleHttpToolCall(serverName: string, originalToolName: string, args: unk
```typescript
// From remote-tool-discovery-manager.ts:376-440
async discoverServerTools(serverName: string, config: ServerConfig) {
- // Phase 10: OAuth token injection for tool discovery
+ // OAuth token injection for tool discovery
let headers: Record<string, string> = {};
// Add regular headers from config (API keys, custom headers, etc.)
diff --git a/development/satellite/process-management.mdx b/development/satellite/process-management.mdx
index 34c974e..6a10fc3 100644
--- a/development/satellite/process-management.mdx
+++ b/development/satellite/process-management.mdx
@@ -147,17 +147,17 @@ All communication uses newline-delimited JSON following JSON-RPC 2.0 specificati
### Graceful Termination
-Process termination follows a two-phase graceful shutdown approach to ensure clean process exit and proper resource cleanup.
+Process termination follows a two-step graceful shutdown approach to ensure clean process exit and proper resource cleanup.
-#### Termination Phases
+#### Termination Steps
-**Phase 1: SIGTERM (Graceful Shutdown)**
+**Step 1: SIGTERM (Graceful Shutdown)**
- Send SIGTERM signal to the process
- Process has 10 seconds (default timeout) to shut down gracefully
- Process can complete in-flight operations and cleanup resources
- Wait for process to exit voluntarily
-**Phase 2: SIGKILL (Force Termination)**
+**Step 2: SIGKILL (Force Termination)**
- If process doesn't exit within timeout period
- Send SIGKILL signal to force immediate termination
- Guaranteed process termination (cannot be caught or ignored)
@@ -408,36 +408,11 @@ The ProcessManager emits events for monitoring and integration:
## Event Emission
-The ProcessManager emits real-time events to the Backend for operational visibility and audit trails. These events are batched every 3 seconds and sent via the Event System.
+The ProcessManager emits real-time lifecycle events (started, crashed, restarted, permanently_failed) to the Backend for operational visibility and audit trails.
-### Lifecycle Events
+ProcessManager internal events (processSpawned, processTerminated) are for satellite-internal coordination. Event System events (mcp.server.started, etc.) are sent to Backend for external visibility.
-**mcp.server.started**
-- Emitted after successful spawn and handshake completion
-- Includes: server_id, process_id, spawn_duration_ms, tool_count
-- Provides immediate visibility into new MCP server availability
-
-**mcp.server.crashed**
-- Emitted on unexpected process exit with non-zero code
-- Includes: exit_code, signal, uptime_seconds, crash_count, will_restart
-- Enables real-time alerting for process failures
-
-**mcp.server.restarted**
-- Emitted after successful automatic restart
-- Includes: old_process_id, new_process_id, restart_reason, attempt_number
-- Tracks restart attempts for reliability monitoring
-
-**mcp.server.permanently_failed**
-- Emitted when restart limit (3 attempts) is exceeded
-- Includes: total_crashes, last_error, failed_at timestamp
-- Critical alert requiring manual intervention
-
-**Event vs Internal Events:**
-- ProcessManager internal events (processSpawned, processTerminated, etc.) are for satellite-internal coordination
-- Event System events (mcp.server.started, etc.) are sent to Backend for external visibility
-- Both work together: Internal events trigger state changes, Event System events provide audit trail
-
-For complete event system documentation and all event types, see [Event System](/development/satellite/event-system).
+See [Event Emission - Process Lifecycle Events](/development/satellite/event-emission#event-types-reference) for complete event types, payloads, and batching configuration.
## Team Isolation
@@ -531,10 +506,51 @@ LOG_LEVEL=debug npm run dev
- Enabled by default (MCP servers need external connectivity)
- Can be disabled for higher security requirements
+## Status Events
+
+Process lifecycle changes emit status events to the backend for real-time monitoring:
+
+**Status Event Emission:**
+- `connecting` - When process spawn starts
+- `online` - After successful handshake and tool discovery
+- `permanently_failed` - When process crashes 3 times in 5 minutes
+
+See [Event Emission](/development/satellite/event-emission) for complete event types and payloads.
+
+## Log Buffering
+
+Process stderr output is buffered and batched before emission:
+
+**Buffering Strategy:**
+- Batch interval: 3 seconds after first log
+- Max batch size: 20 logs (forces immediate flush)
+- Grouping: By installation_id + team_id
+
+**Log Levels:**
+- Inferred from message content (`error` if the message contains "error", etc.); see the sketch below
+- Metadata includes process_id for debugging
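+
+A possible inference heuristic (hypothetical helper; the actual rules may differ):
+
+```typescript
+function inferLogLevel(message: string): 'error' | 'warn' | 'info' {
+  const lower = message.toLowerCase();
+  if (lower.includes('error')) return 'error';
+  if (lower.includes('warn')) return 'warn';
+  return 'info'; // Default for plain stderr output
+}
+```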
+
+See [Log Capture](/development/satellite/log-capture) for buffer management details.
+
+## Configuration Restart Flow
+
+When configuration is updated (env vars, args, headers, query params), the satellite applies it as follows (sketched in code after the steps):
+
+1. Backend sets installation status to `restarting`
+2. Backend sends `configure` command to satellite
+3. Satellite receives command and stops old process
+4. Satellite clears tool cache for installation
+5. Satellite spawns new process with updated configuration
+6. Status progresses: `restarting` → `connecting` → `discovering_tools` → `online`
+
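+A minimal sketch of the satellite side of this flow (steps 3-5; `stopProcess` and the exact method names are assumptions):
+
+```typescript
+async function applyConfigUpdate(config: McpServerConfig): Promise<void> {
+  await processManager.stopProcess(config.installation_id);      // Step 3: stop old process
+  toolDiscoveryManager.removeToolsForServer(config.server_slug); // Step 4: clear tool cache
+  await processManager.startProcess(config);                     // Step 5: spawn with new config
+  // Step 6: status events are emitted as the new process connects and discovery completes
+}
+```
+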
+See [Status Tracking](/development/satellite/status-tracking) for configuration update status transitions.
+
## Related Documentation
- [Satellite Architecture Design](/development/satellite/architecture) - Overall system architecture
- [Idle Process Management](/development/satellite/idle-process-management) - Automatic termination and respawning of idle processes
- [Tool Discovery Implementation](/development/satellite/tool-discovery) - How tools are discovered from processes
-- [Team Isolation Implementation](/development/satellite/team-isolation) - Team-based access control
+- [Event Emission](/development/satellite/event-emission) - Process lifecycle events
+- [Log Capture](/development/satellite/log-capture) - stderr log buffering
+- [Status Tracking](/development/satellite/status-tracking) - Process status management
- [Backend Communication](/development/satellite/backend-communication) - Integration with Backend commands
diff --git a/development/satellite/recovery-system.mdx b/development/satellite/recovery-system.mdx
new file mode 100644
index 0000000..bf8123d
--- /dev/null
+++ b/development/satellite/recovery-system.mdx
@@ -0,0 +1,371 @@
+---
+title: Recovery System
+description: Automatic recovery and failure handling for MCP servers
+---
+
+# Recovery System
+
+The satellite automatically detects and recovers from MCP server failures without manual intervention. Recovery works for HTTP/SSE servers (network failures) and stdio servers (process crashes).
+
+## Overview
+
+The recovery system handles **HTTP/SSE servers** (network failures, server downtime, connection timeouts) and **stdio servers** (process crashes, auto-restarted until a third crash within 5 minutes).
+
+Recovery is fully automatic for recoverable failures. Permanent failures (3+ crashes, OAuth token expired) require manual action.
+
+## Recovery Detection
+
+### Tool Execution Recovery
+
+When a tool is executed on a server that was previously offline/error, recovery is detected automatically:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+ const serverSlug = toolPath.split(':')[0];
+ const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug);
+ const wasOfflineOrError = statusEntry && ['offline', 'error'].includes(statusEntry.status);
+
+ // Execute tool with retry logic
+ const result = await this.executeHttpToolCallWithRetry(...);
+
+ // If execution succeeded but server was offline/error → RECOVERY DETECTED
+ if (wasOfflineOrError) {
+ this.handleServerRecovery(serverSlug, config);
+ }
+
+ return result;
+}
+```
+
+### Health Check Recovery
+
+Backend health checks periodically test offline servers. When they respond again:
+
+```
+Backend health check runs (every 3 minutes)
+ ↓
+Offline template now responds
+ ↓
+Backend sets installations to 'connecting'
+ ↓
+Backend sends 'configure' command with event='mcp_recovery'
+ ↓
+Satellite receives command and triggers re-discovery
+ ↓
+Status progresses: connecting → discovering_tools → online
+```
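+
+A hypothetical satellite-side handler for this command (only the `event` field is documented; the other payload fields and the handler name are assumptions):
+
+```typescript
+async function handleConfigureCommand(payload: CommandPayload): Promise<void> {
+  if (payload.event === 'mcp_recovery') {
+    // Re-discovery drives the status progression shown above
+    await toolDiscoveryManager.remoteToolManager?.discoverServerTools(
+      payload.server_slug,
+      payload.config
+    );
+  }
+}
+```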
+
+## Retry Logic (HTTP/SSE)
+
+Before marking a server as offline, the satellite retries tool execution with exponential backoff:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface RetryConfig {
+ maxRetries: 3;
+ backoffMs: [500, 1000, 2000]; // Exponential: 500ms, 1s, 2s
+}
+
+async executeHttpToolCallWithRetry(
+ serverConfig: McpServerConfig,
+ toolName: string,
+ args: unknown
+): Promise<unknown> {
+ let lastError: Error;
+
+ for (let attempt = 1; attempt <= 3; attempt++) {
+ try {
+ const response = await this.executeHttpToolCall(serverConfig, toolName, args);
+ return response; // Success - no retry needed
+ } catch (error) {
+ lastError = error as Error;
+
+ // Non-retryable errors (auth failures) → fail immediately
+ if (this.isNonRetryableError(lastError)) {
+ throw lastError;
+ }
+
+ // Retryable errors (connection refused) → backoff and retry
+ if (attempt < 3) {
+ const backoffMs = [500, 1000, 2000][attempt - 1];
+ await new Promise(resolve => setTimeout(resolve, backoffMs));
+ }
+ }
+ }
+
+ // All retries exhausted → throw last error
+ throw lastError;
+}
+
+private isNonRetryableError(error: Error): boolean {
+ const msg = error.message.toLowerCase();
+ return msg.includes('401') || msg.includes('403') ||
+ msg.includes('unauthorized') || msg.includes('forbidden') ||
+ msg.includes('oauth') || msg.includes('authorization required');
+}
+```
+
+### Retryable vs Non-Retryable Errors
+
+| Error Type | Action | Reason |
+|------------|--------|--------|
+| ECONNREFUSED | **Retry** | Server may be restarting |
+| ETIMEDOUT | **Retry** | Network hiccup, may recover |
+| ENOTFOUND | **Retry** | DNS issue, may be temporary |
+| fetch failed | **Retry** | Network error, transient |
+| 401 Unauthorized | **No retry** | Token expired, retrying won't help |
+| 403 Forbidden | **No retry** | Access denied, retrying won't help |
+| OAuth errors | **No retry** | Auth issue, needs user action |
+
+## Recovery Flow
+
+When servers recover from failure, the satellite updates status and triggers re-discovery asynchronously without blocking tool execution responses.
+
+See [Status Tracking - Status Lifecycle](/development/satellite/status-tracking#status-lifecycle) for complete recovery flow diagrams including successful recovery, failed recovery, and status transitions.
+
+## Automatic Re-Discovery
+
+When recovery is detected, tools are refreshed from the server without blocking the user:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+private async handleServerRecovery(
+ serverSlug: string,
+ config: McpServerConfig
+): Promise<void> {
+ // Prevent duplicate recovery attempts
+ if (this.recoveryInProgress.has(serverSlug)) {
+ return; // Already recovering
+ }
+
+ this.recoveryInProgress.add(serverSlug);
+
+ try {
+ this.logger.info({ serverSlug }, 'Server recovered - triggering re-discovery');
+
+ // Emit status change to backend
+ this.eventBus?.emit('mcp.server.status_changed', {
+ installation_id: config.installation_id,
+ team_id: config.team_id,
+ status: 'connecting',
+ status_message: 'Server recovered, re-discovering tools',
+ timestamp: new Date().toISOString()
+ });
+
+ // Trigger re-discovery asynchronously (doesn't block tool response)
+ await this.toolDiscoveryManager?.remoteToolManager?.discoverServerTools(serverSlug);
+
+ this.logger.info({ serverSlug }, 'Tool re-discovery successful after recovery');
+ } catch (error) {
+ // Re-discovery failed (non-fatal, tool response still returned)
+ this.logger.error({ serverSlug, error }, 'Tool re-discovery failed after recovery');
+ } finally {
+ this.recoveryInProgress.delete(serverSlug);
+ }
+}
+```
+
+### Why Asynchronous Re-Discovery?
+
+**User Experience:**
+- Tool execution result returned immediately
+- User doesn't wait for tool discovery (can take 1-5 seconds)
+- If re-discovery fails, user already got their result
+
+**Reliability:**
+- Tool response isn't blocked by discovery errors
+- Discovery failure doesn't affect user's current request
+- Recovery can be retried later
+
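+Concretely, the recovery call in `handleExecuteTool` is not awaited, so the tool result returns first (a condensed sketch of the code shown earlier):
+
+```typescript
+const result = await this.executeHttpToolCallWithRetry(serverConfig, toolName, args);
+
+if (wasOfflineOrError) {
+  // Intentionally not awaited: re-discovery must not delay the tool response
+  void this.handleServerRecovery(serverSlug, config);
+}
+
+return result;
+```
+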
+## Tool Preservation
+
+When re-discovery fails, tools are NOT removed from cache:
+
+```typescript
+// services/satellite/src/services/remote-tool-discovery-manager.ts
+
+async rediscoverServerTools(serverSlug: string): Promise<void> {
+ try {
+ // Attempt discovery
+ const newTools = await this.fetchToolsFromServer(serverSlug);
+
+ // Discovery succeeded → remove old tools and add new ones
+ this.removeToolsForServer(serverSlug);
+ this.addTools(newTools);
+
+ this.statusCallback?.(serverSlug, 'online');
+ } catch (error) {
+ // Discovery failed → keep old tools in cache
+ // Tools remain available for future attempts
+ this.statusCallback?.(serverSlug, 'error', (error as Error).message);
+ }
+}
+```
+
+**Why preserve tools on failure?**
+- User can still see what tools are available
+- Tools may work if server recovers later
+- Better UX than empty tool list
+- Discovery can be retried without losing tool metadata
+
+## Stdio Process Recovery
+
+Stdio servers auto-restart after crashes; a third crash within 5 minutes marks the server permanently failed:
+
+```typescript
+// services/satellite/src/process/manager.ts
+
+async handleProcessExit(processInfo: ProcessInfo, exitCode: number) {
+ const now = Date.now();
+ const fiveMinutesAgo = now - 5 * 60 * 1000;
+
+ // Track crashes in 5-minute window
+ processInfo.crashHistory = processInfo.crashHistory.filter(t => t > fiveMinutesAgo);
+ processInfo.crashHistory.push(now);
+
+ const crashCount = processInfo.crashHistory.length;
+
+ if (crashCount >= 3) {
+ // Permanent failure - emit status event
+ this.eventBus?.emit('mcp.server.permanently_failed', {
+ installation_id: processInfo.config.installation_id,
+ team_id: processInfo.config.team_id,
+ process_id: processInfo.processId,
+ crash_count: crashCount,
+ message: `Process crashed ${crashCount} times in 5 minutes`,
+ timestamp: new Date().toISOString()
+ });
+
+ // Also emit status_changed for database update
+ this.eventBus?.emit('mcp.server.status_changed', {
+ installation_id: processInfo.config.installation_id,
+ team_id: processInfo.config.team_id,
+ status: 'permanently_failed',
+ status_message: `Process crashed ${crashCount} times in 5 minutes. Manual restart required.`,
+ timestamp: new Date().toISOString()
+ });
+
+ return; // No auto-restart
+ }
+
+ // Auto-restart (crash count < 3)
+ this.logger.info({ processId: processInfo.processId, crashCount }, 'Auto-restarting crashed process');
+ await this.startProcess(processInfo.config);
+}
+```
+
+### Stdio Recovery Timeline
+
+```
+Process crashes (crash #1)
+ ↓
+Auto-restart immediately
+ ↓
+Process crashes again (crash #2, within 5 min)
+ ↓
+Auto-restart immediately
+ ↓
+Process crashes again (crash #3, within 5 min)
+ ↓
+Status → 'permanently_failed'
+ ↓
+No auto-restart (manual action required)
+```
+
+## Failure Status Mapping
+
+When tool execution fails after all retries, error messages are mapped to appropriate status values:
+
+```typescript
+// services/satellite/src/services/remote-tool-discovery-manager.ts
+
+static getStatusFromError(error: Error): { status: string; message: string } {
+ const msg = error.message.toLowerCase();
+
+ // Auth errors → requires_reauth
+ if (msg.includes('401') || msg.includes('unauthorized')) {
+ return { status: 'requires_reauth', message: 'Authentication failed (HTTP 401)' };
+ }
+ if (msg.includes('403') || msg.includes('forbidden')) {
+ return { status: 'requires_reauth', message: 'Access forbidden (HTTP 403)' };
+ }
+
+ // Connection errors → offline
+ if (msg.includes('econnrefused') || msg.includes('etimedout') ||
+ msg.includes('enotfound') || msg.includes('fetch failed')) {
+ return { status: 'offline', message: 'Server unreachable' };
+ }
+
+ // Other errors → error
+ return { status: 'error', message: error.message };
+}
+```
+
+## Debouncing Concurrent Recovery
+
+Multiple tool executions may detect recovery simultaneously. Debouncing prevents duplicate re-discoveries:
+
+```typescript
+class McpServerWrapper {
+ private recoveryInProgress: Set<string> = new Set();
+
+ private async handleServerRecovery(serverSlug: string, config: McpServerConfig) {
+ // Check if already recovering
+ if (this.recoveryInProgress.has(serverSlug)) {
+ return; // Skip duplicate recovery
+ }
+
+ this.recoveryInProgress.add(serverSlug);
+
+ try {
+ await this.performRecovery(serverSlug, config);
+ } finally {
+ this.recoveryInProgress.delete(serverSlug);
+ }
+ }
+}
+```
+
+**Scenario:**
+- LLM executes 3 tools from same server concurrently
+- All 3 detect recovery (server was offline)
+- Only first execution triggers re-discovery
+- Other 2 skip (already in progress)
+
+## Recovery Timing
+
+| Recovery Type | Detection Time | Re-Discovery Time | Total |
+|---------------|----------------|-------------------|-------|
+| **Tool Execution** | Immediate (on next tool call) | 1-5 seconds | ~1-5s |
+| **Health Check** | Up to 3 minutes (polling interval) | 1-5 seconds | up to ~3 min |
+
+**Note:** Tool execution recovery responds faster than health-check recovery because it triggers on the next tool call rather than waiting for the polling interval.
+
+## Manual Recovery (Requires User Action)
+
+Some failures cannot auto-recover:
+
+| Status | Reason | User Action |
+|--------|--------|-------------|
+| `requires_reauth` | OAuth token expired/revoked | Re-authenticate in dashboard |
+| `permanently_failed` | 3+ crashes in 5 minutes (stdio) | Check logs, fix issue, manual restart |
+
+See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays).
+
+## Implementation Components
+
+The recovery system consists of several integrated components:
+- Stdio auto-recovery and permanently_failed status
+- Tool execution retry logic and recovery detection
+- Health check recovery via backend
+
+## Related Documentation
+
+- [Status Tracking](/development/satellite/status-tracking) - Status values and transitions
+- [Event Emission](/development/satellite/event-emission) - Recovery status events
+- [Tool Discovery](/development/satellite/tool-discovery) - Re-discovery after recovery
+- [Process Management](/development/satellite/process-management) - Stdio crash recovery
diff --git a/development/satellite/status-tracking.mdx b/development/satellite/status-tracking.mdx
new file mode 100644
index 0000000..710408d
--- /dev/null
+++ b/development/satellite/status-tracking.mdx
@@ -0,0 +1,285 @@
+---
+title: Status Tracking
+description: MCP server installation status tracking system in the satellite
+---
+
+# Status Tracking
+
+The satellite tracks the health and availability of each MCP server installation through an 11-state status system. This enables real-time monitoring, automatic recovery, and tool availability filtering.
+
+## Overview
+
+Status tracking serves three primary purposes:
+
+1. **User Visibility**: Users see current server state in real-time via the frontend
+2. **Tool Availability**: Tools from unavailable servers are filtered from discovery
+3. **Automatic Recovery**: System detects and recovers from failures automatically
+
+The status system is managed by `UnifiedToolDiscoveryManager` and updated through:
+- Installation lifecycle events (provisioning → online)
+- Health check results (online → offline)
+- Tool execution failures (online → offline/error/requires_reauth)
+- Configuration changes (online → restarting)
+- Recovery detection (offline → connecting → online)
+
+## Status Values
+
+| Status | Description | Tools Available? | User Action Required |
+|--------|-------------|------------------|---------------------|
+| `provisioning` | Initial state after installation created | No | Wait |
+| `command_received` | Satellite received configuration command | No | Wait |
+| `connecting` | Connecting to MCP server | No | Wait |
+| `discovering_tools` | Running tool discovery | No | Wait |
+| `syncing_tools` | Syncing tools to backend | No | Wait |
+| `online` | Server healthy and responding | **Yes** | None |
+| `restarting` | Configuration updated, server restarting | No | Wait |
+| `offline` | Server unreachable (auto-recovers) | No | Wait or check server |
+| `error` | General error state (auto-recovers) | No | Check logs |
+| `requires_reauth` | OAuth token expired/revoked | No | Re-authenticate |
+| `permanently_failed` | 3+ crashes in 5 minutes (stdio only) | No | Manual restart required |
+
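+Taken together, the full set can be expressed as a union type (type name hypothetical; the satellite-side subset `ServerAvailabilityStatus` appears later on this page):
+
+```typescript
+// All 11 installation statuses from the table above
+type InstallationStatus =
+  | 'provisioning'
+  | 'command_received'
+  | 'connecting'
+  | 'discovering_tools'
+  | 'syncing_tools'
+  | 'online'
+  | 'restarting'
+  | 'offline'
+  | 'error'
+  | 'requires_reauth'
+  | 'permanently_failed';
+```
+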
+## Status Lifecycle
+
+### Initial Installation Flow
+
+```
+provisioning
+ ↓
+command_received (satellite received configure command)
+ ↓
+connecting (spawning MCP server process or connecting to HTTP/SSE)
+ ↓
+discovering_tools (calling tools/list)
+ ↓
+syncing_tools (sending tools to backend)
+ ↓
+online (ready for use)
+```
+
+### Configuration Update Flow
+
+```
+online
+ ↓
+restarting (user updated config, backend sets status immediately)
+ ↓
+connecting (satellite receives command, restarts server)
+ ↓
+discovering_tools
+ ↓
+online
+```
+
+### Failure and Recovery Flow
+
+```
+online
+ ↓
+offline/error (server unreachable or error response)
+ ↓
+[automatic recovery when server comes back]
+ ↓
+connecting
+ ↓
+discovering_tools
+ ↓
+online
+```
+
+### OAuth Failure Flow
+
+```
+online
+ ↓
+requires_reauth (401/403 response or token refresh failed)
+ ↓
+[user re-authenticates via dashboard]
+ ↓
+connecting
+ ↓
+discovering_tools
+ ↓
+online
+```
+
+### Stdio Crash Flow (Permanent Failure)
+
+```
+online
+ ↓
+(stdio process crashes)
+ ↓
+connecting (auto-restart attempt 1)
+ ↓
+(crashes again within 5 minutes)
+ ↓
+connecting (auto-restart attempt 2)
+ ↓
+(crashes again within 5 minutes)
+ ↓
+permanently_failed (manual intervention required)
+```
+
+## Status Tracking Implementation
+
+### UnifiedToolDiscoveryManager
+
+The status system is implemented in `UnifiedToolDiscoveryManager`:
+
+```typescript
+// services/satellite/src/services/unified-tool-discovery-manager.ts
+
+export type ServerAvailabilityStatus =
+ | 'online'
+ | 'offline'
+ | 'error'
+ | 'requires_reauth'
+ | 'permanently_failed'
+ | 'connecting'
+ | 'discovering_tools';
+
+export interface ServerStatusEntry {
+ status: ServerAvailabilityStatus;
+ lastUpdated: Date;
+ message?: string;
+}
+
+class UnifiedToolDiscoveryManager {
+ private serverStatus: Map<string, ServerStatusEntry> = new Map();
+
+ // Set server status (called by discovery managers and MCP wrapper)
+ setServerStatus(serverSlug: string, status: ServerAvailabilityStatus, message?: string): void {
+ this.serverStatus.set(serverSlug, {
+ status,
+ lastUpdated: new Date(),
+ message
+ });
+ }
+
+ // Check if server is available for tool execution
+ isServerAvailable(serverSlug: string): boolean {
+ const statusEntry = this.serverStatus.get(serverSlug);
+ if (!statusEntry) return true; // Unknown = available (safe default)
+ return statusEntry.status === 'online';
+ }
+
+ // Get all tools, filtered by server status
+ getAllTools(): ToolMetadata[] {
+ const allTools = this.getAllToolsUnfiltered();
+ return allTools.filter(tool => {
+ const serverSlug = tool.tool_path.split(':')[0];
+ return this.isServerAvailable(serverSlug);
+ });
+ }
+}
+```
+
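+For example, marking a server offline immediately hides its tools from discovery (a usage sketch of the methods above; the `manager` instance is hypothetical):
+
+```typescript
+manager.setServerStatus('github', 'offline', 'Server unreachable');
+
+manager.isServerAvailable('github'); // false: only 'online' counts as available
+manager.getAllTools();               // excludes tools whose tool_path starts with 'github:'
+```
+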
+### Status Callbacks
+
+Discovery managers call status callbacks when discovery succeeds or fails:
+
+**HTTP/SSE Discovery:**
+```typescript
+// services/satellite/src/services/remote-tool-discovery-manager.ts
+
+// On successful discovery
+this.statusCallback?.(serverSlug, 'online');
+
+// On connection error
+const { status, message } = RemoteToolDiscoveryManager.getStatusFromError(error);
+this.statusCallback?.(serverSlug, status, message);
+```
+
+**Stdio Discovery:**
+```typescript
+// services/satellite/src/services/stdio-tool-discovery-manager.ts
+
+// On successful discovery
+this.statusCallback?.(processId, 'online');
+
+// On discovery error
+this.statusCallback?.(processId, 'error', errorMessage);
+```
+
+## Tool Filtering by Status
+
+### Discovery Filtering
+
+When LLMs call `discover_mcp_tools`, only tools from available servers are returned:
+
+```typescript
+// UnifiedToolDiscoveryManager.getAllTools() filters by status
+const tools = toolDiscoveryManager.getAllTools(); // Only 'online' servers
+
+// Tools from offline/error/requires_reauth servers are hidden
+```
+
+### Execution Blocking
+
+When LLMs attempt to execute tools from unavailable servers:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+const serverSlug = toolPath.split(':')[0];
+const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug);
+
+// Block execution for non-recoverable states
+if (statusEntry?.status === 'requires_reauth') {
+ return {
+ error: `Tool cannot be executed - server requires re-authentication.
+
+Status: ${statusEntry.status}
+The server requires re-authentication. Please re-authorize in the dashboard.
+
+Unavailable server: ${serverSlug}`
+ };
+}
+
+// Allow execution for offline/error (enables recovery detection)
+```
+
+## Status Transition Triggers
+
+### Backend-Triggered (Database Updates)
+
+**Source:** Backend API routes
+
+| Trigger | New Status | When |
+|---------|-----------|------|
+| Installation created | `provisioning` | User installs MCP server |
+| Config updated | `restarting` | User modifies environment vars/args/headers |
+| OAuth callback success | `connecting` | User re-authenticates |
+| Health check fails | `offline` | Server unreachable (3-min interval) |
+| Credential validation fails | `requires_reauth` | OAuth token invalid |
+
+### Satellite-Triggered (Event Emission)
+
+**Source:** Satellite emits `mcp.server.status_changed` events to backend
+
+| Trigger | New Status | When |
+|---------|-----------|------|
+| Configure command received | `command_received` | Satellite polls backend |
+| Server connection starts | `connecting` | Spawning process or HTTP connect |
+| Tool discovery starts | `discovering_tools` | Calling tools/list |
+| Tool discovery succeeds | `online` | Discovery completed successfully |
+| Tool execution fails (3 retries) | `offline`/`error`/`requires_reauth` | Tool call failed after retries |
+| Server recovery detected | `connecting` | Previously offline server responds |
+| Stdio crashes 3 times | `permanently_failed` | 3 crashes within 5 minutes |
+
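+All satellite-side transitions are reported through one event shape, as used in the [Recovery System](/development/satellite/recovery-system) examples (IDs here are hypothetical):
+
+```typescript
+eventBus.emit('mcp.server.status_changed', {
+  installation_id: 'inst_abc123',
+  team_id: 'team_xyz',
+  status: 'connecting',
+  status_message: 'Server recovered, re-discovering tools',
+  timestamp: new Date().toISOString()
+});
+```
+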
+## Implementation Components
+
+The status tracking system consists of several integrated components:
+- Database schema for status field
+- Backend event handler for status updates
+- Satellite status event emission
+- Tool availability filtering by status
+- Configuration update status transitions
+- Tool execution status updates with auto-recovery
+
+## Related Documentation
+
+- [Event Emission](/development/satellite/event-emission) - Status change event details
+- [Recovery System](/development/satellite/recovery-system) - Automatic recovery logic
+- [Tool Discovery](/development/satellite/tool-discovery) - How status affects tool discovery
+- [Hierarchical Router](/development/satellite/hierarchical-router) - Status-based tool filtering
diff --git a/development/satellite/tool-discovery.mdx b/development/satellite/tool-discovery.mdx
index b4cdba1..00abbbe 100644
--- a/development/satellite/tool-discovery.mdx
+++ b/development/satellite/tool-discovery.mdx
@@ -281,7 +281,7 @@ The satellite uses `estimateMcpServerTokens()` from `token-counter.ts` to calcul
- Enable frontend tool catalog display with token consumption metrics
- Provide analytics on MCP server complexity and context window usage
-See [Event System](/development/satellite/event-system) for event batching and delivery details.
+For event payload structure and event batching details, see [Event Emission - mcp.tools.discovered](/development/satellite/event-emission#mcp-tools-discovered).
## Development Considerations
@@ -349,4 +349,66 @@ curl http://localhost:3001/api/status/debug
- Detailed usage and performance analytics
- Cache persistence for faster startup (HTTP only)
+## Status Integration
+
+Tool discovery integrates with the status tracking system to filter tools and enable automatic recovery. Discovery managers call status callbacks on success/failure to update installation status in real-time.
+
+See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete details on status-based tool filtering and execution blocking.
+
+## Recovery System
+
+When offline servers recover, tool discovery is automatically triggered. The satellite preserves existing tools during re-discovery attempts to prevent tool loss on failure.
+
+See [Recovery System - Recovery Detection](/development/satellite/recovery-system#recovery-detection) for complete recovery logic, retry strategy, and tool preservation implementation.
+
+## Tool Metadata Events
+
+Discovered tools are emitted to backend with token count estimates.
+
+**Event Structure:**
+```typescript
+eventBus.emit('mcp.tools.discovered', {
+ installation_id: string,
+ team_id: string,
+ tools: [{
+ tool_path: string,
+ name: string,
+ description?: string,
+ inputSchema: unknown,
+ token_count: number // Estimated token usage
+ }]
+});
+```
+
+**Token Calculation:**
+- Name + description + input schema serialized
+- Estimated using character count / 4 (approximate tokens; see the sketch below)
+- Used for analytics and optimization
+
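+A minimal sketch of the chars/4 heuristic (the real `estimateMcpServerTokens()` in `token-counter.ts` may differ in detail):
+
+```typescript
+function estimateToolTokens(tool: {
+  name: string;
+  description?: string;
+  inputSchema: unknown;
+}): number {
+  // Serialize name + description + input schema, then divide by 4
+  const serialized = tool.name + (tool.description ?? '') + JSON.stringify(tool.inputSchema);
+  return Math.ceil(serialized.length / 4);
+}
+```
+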
+See [Event Emission](/development/satellite/event-emission) for complete event types.
+
+## Request Logging
+
+Tool execution is logged with full request/response data for debugging.
+
+**Logged Information:**
+- Tool name and input parameters
+- Full MCP server response (captured)
+- Response time in milliseconds
+- Success/failure status and error messages
+- User attribution (who called the tool)
+
+**Privacy Control:**
+Request logging can be disabled per-installation via `settings.request_logging_enabled = false`.
+
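+A minimal sketch of the opt-out check (settings shape as above; the buffering helper is hypothetical):
+
+```typescript
+// Skip request logging entirely when the installation opts out
+if (config.settings?.request_logging_enabled === false) {
+  return; // Nothing is buffered or sent to the backend
+}
+this.bufferRequestLog(entry);
+```
+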
+See [Log Capture](/development/satellite/log-capture) for buffering and storage details.
+
+## Related Documentation
+
+- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status
+- [Recovery System](/development/satellite/recovery-system) - Automatic re-discovery on recovery
+- [Event Emission](/development/satellite/event-emission) - Tool metadata events
+- [Log Capture](/development/satellite/log-capture) - Request logging system
+- [Hierarchical Router](/development/satellite/hierarchical-router) - How tools are exposed to MCP clients
+
The unified tool discovery implementation provides a solid foundation for multi-transport MCP server integration while maintaining simplicity and reliability for development and production use.
diff --git a/docs.json b/docs.json
index 47479f6..25fd8ef 100644
--- a/docs.json
+++ b/docs.json
@@ -207,6 +207,15 @@
"/development/satellite/mcp-server-token-injection"
]
},
+ {
+ "group": "Status & Health Tracking",
+ "pages": [
+ "/development/satellite/status-tracking",
+ "/development/satellite/event-emission",
+ "/development/satellite/log-capture",
+ "/development/satellite/recovery-system"
+ ]
+ },
{
"group": "Backend Communication",
"pages": [