Commit 080b90a

author
Lasim
committed
docs(satellite): update documentation for status tracking, health checks, and OAuth token handling
1 parent 195f35a commit 080b90a

13 files changed

+331
-54
lines changed

development/backend/plugins.mdx

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -313,8 +313,8 @@ The `databaseExtension` property allows your plugin to:
313313
#### How Plugin Database Tables Work
314314

315315
**Security Architecture:**
316-
- **Phase 1 (Trusted)**: Core migrations run first (static, secure)
317-
- **Phase 2 (Untrusted)**: Plugin tables created dynamically (sandboxed)
316+
- **Stage 1 (Trusted)**: Core migrations run first (static, secure)
317+
- **Stage 2 (Untrusted)**: Plugin tables created dynamically (sandboxed)
318318
- **Clear Separation**: Plugin tables cannot interfere with core database structure
319319

320320
**Dynamic Table Creation:**
@@ -421,7 +421,7 @@ The database initialization follows a strict security-first approach:
421421

422422
```
423423
┌─────────────────────────────────────────┐
424-
│ Phase 1: Core System (Trusted) │
424+
│ Stage 1: Core System (Trusted) │
425425
├─────────────────────────────────────────┤
426426
│ 1. Apply core migrations │
427427
│ 2. Create core tables │
@@ -430,7 +430,7 @@ The database initialization follows a strict security-first approach:
430430
431431
▼ Security Boundary
432432
┌─────────────────────────────────────────┐
433-
│ Phase 2: Plugin System (Sandboxed) │
433+
│ Stage 2: Plugin System (Sandboxed) │
434434
├─────────────────────────────────────────┤
435435
│ 1. Generate CREATE TABLE SQL │
436436
│ 2. Drop existing plugin tables │

development/backend/satellite/commands.mdx

Lines changed: 49 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ The system supports 5 command types defined in the `command_type` enum:
3232
| `spawn` | Start MCP server process | Launch HTTP proxy or stdio process |
3333
| `kill` | Stop MCP server process | Terminate process gracefully |
3434
| `restart` | Restart MCP server | Stop and start process |
35-
| `health_check` | Verify server health | Call tools/list to check connectivity |
35+
| `health_check` | Verify server health and validate credentials | Check connectivity or validate OAuth tokens |
3636

3737
### Configure Commands
3838

@@ -74,6 +74,30 @@ interface CommandPayload {
7474
}
7575
```
7676

77+
## Status Changes Triggered by Commands
78+
79+
Commands trigger installation status changes through satellite event emission:
80+
81+
| Command | Status Before | Status After | When |
|---------|--------------|--------------|------|
| `configure` (install) | N/A | `provisioning` → `command_received` → `connecting` | Installation creation flow |
| `configure` (update) | `online` | `restarting` → `online` | Configuration change applied |
| `configure` (delete) | Any | Process terminated | Installation removal |
| `health_check` (credential) | `online` | `requires_reauth` | OAuth token invalid |
| `restart` | `online` | `restarting` → `online` | Manual restart requested |
88+
89+
**Status Lifecycle on Installation**:
90+
1. Backend creates installation → status=`provisioning`
91+
2. Backend sends `configure` command → status=`command_received`
92+
3. Satellite connects to server → status=`connecting`
93+
4. Satellite discovers tools → status=`discovering_tools`
94+
5. Satellite syncs tools to backend → status=`syncing_tools`
95+
6. Process complete → status=`online`
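
The six steps above can be read as an ordered happy-path sequence. The sketch below is illustrative only: `INSTALL_LIFECYCLE` and `nextLifecycleStatus` are hypothetical names that do not come from the DeployStack codebase, and failure statuses such as `error` or `requires_reauth` are deliberately left out.

```typescript
// Hypothetical sketch: the happy-path installation lifecycle as an ordered list.
// Status names come from the documentation; everything else is an assumption.
const INSTALL_LIFECYCLE = [
  "provisioning",
  "command_received",
  "connecting",
  "discovering_tools",
  "syncing_tools",
  "online",
] as const;

type LifecycleStatus = (typeof INSTALL_LIFECYCLE)[number];

// Returns the status that follows `current`, or null once `online` is reached.
function nextLifecycleStatus(current: LifecycleStatus): LifecycleStatus | null {
  const index = INSTALL_LIFECYCLE.indexOf(current);
  // Indexing past the end yields undefined, which is mapped to null.
  return INSTALL_LIFECYCLE[index + 1] ?? null;
}
```

Modelling the sequence as a constant tuple keeps the status names in one place and lets the compiler reject misspelled statuses.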
96+
97+
For complete status transition documentation, see [Backend Events - Status Values](/development/backend/satellite/events#mcp-server-status_changed).
98+
99+
---
100+
77101
## Command Event Types
78102

79103
All `configure` commands include an `event` field in the payload for tracking and logging:
@@ -168,6 +192,14 @@ await satelliteCommandService.notifyMcpRecovery(
168192

169193
**Payload**: `event: 'mcp_recovery'`
170194

195+
**Status Flow**:
196+
- Triggered by health check detecting offline installation
197+
- Sets status to `connecting`
198+
- Satellite rediscovers tools
199+
- Status progresses: offline → connecting → discovering_tools → online
200+
201+
For complete recovery system documentation, see [Backend Communication - Auto-Recovery](/development/backend/satellite/communication#auto-recovery-system).
202+
171203
## Critical Pattern
172204

173205
**ALWAYS use the correct convenience method**:
@@ -247,9 +279,22 @@ When satellites receive commands:
247279
3. Execute spawn sequence
248280

249281
**For `health_check` commands**:
250-
1. Call tools/list on target server
251-
2. Verify response
252-
3. Report health status
282+
1. Check `payload.check_type` field:
   - `connectivity` (default): Call tools/list to verify server responds
   - `credential_validation`: Validate OAuth tokens for installation
2. Execute appropriate validation
3. Report health status via `mcp.server.status_changed` event:
   - `online` - Health check passed
   - `requires_reauth` - OAuth token expired/revoked
   - `error` - Validation failed with error
290+
291+
**Credential Validation Flow**:
292+
- Backend cron job sends `health_check` command with `check_type: 'credential_validation'`
293+
- Satellite validates OAuth token (performs token refresh test)
294+
- Emits status event based on validation result
295+
- Backend updates `mcpServerInstallations.status` and `last_credential_check_at`
296+
297+
For satellite-side credential validation implementation, see [Satellite OAuth Authentication](/development/satellite/oauth-authentication).
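
As a rough illustration of the `health_check` handling described above, the following sketch dispatches on `check_type` and maps the outcome to one of the three reported statuses. The `Probes` interface, `ProbeResult` type, and `runHealthCheck` function are hypothetical; the real satellite code may be structured quite differently.

```typescript
type CheckType = "connectivity" | "credential_validation";
type ReportedStatus = "online" | "requires_reauth" | "error";

interface HealthCheckPayload {
  check_type?: CheckType; // omitted means "connectivity" (the documented default)
  installation_id: string;
  team_id: string;
}

// Hypothetical result shape for a single probe.
type ProbeResult =
  | { ok: true }
  | { ok: false; reason: "token_invalid" | "failure" };

// Hypothetical probes, injected so the dispatch logic can be exercised in isolation.
interface Probes {
  listTools(installationId: string): Promise<ProbeResult>;
  validateOAuthToken(installationId: string): Promise<ProbeResult>;
}

async function runHealthCheck(
  payload: HealthCheckPayload,
  probes: Probes,
): Promise<ReportedStatus> {
  const checkType = payload.check_type ?? "connectivity";
  try {
    const result =
      checkType === "credential_validation"
        ? await probes.validateOAuthToken(payload.installation_id)
        : await probes.listTools(payload.installation_id);
    if (result.ok) return "online";
    // An expired or revoked token maps to requires_reauth; anything else is an error.
    return result.reason === "token_invalid" ? "requires_reauth" : "error";
  } catch {
    // A probe that throws is treated as a failed validation.
    return "error";
  }
}
```

The returned value would then be emitted in an `mcp.server.status_changed` event; the emission itself is omitted here.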
253298

254299
## Example Usage
255300

development/backend/satellite/communication.mdx

Lines changed: 182 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -106,20 +106,20 @@ The system uses three distinct communication patterns:
106106

107107
### Security Architecture
108108

109-
The satellite pairing process implements a secure **two-phase JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication).
109+
The satellite pairing process implements a secure **two-step JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication).
110110

111-
**Phase 1: Token Generation**
111+
**Step 1: Token Generation**
112112
- Administrators generate temporary registration tokens through admin APIs
113113
- Scope-specific tokens (global vs team) with cryptographic signatures
114114
- Token management endpoints for generation, listing, and revocation
115115

116-
**Phase 2: Satellite Registration**
116+
**Step 2: Satellite Registration**
117117
- Satellites authenticate using `Authorization: Bearer deploystack_satellite_*` headers
118118
- Backend validates JWT tokens with single-use consumption
119119
- Permanent API keys issued after successful token validation
120120
- Token consumed to prevent replay attacks
121121

122-
**Breaking Change**: As of Phase 3 implementation, all new satellite registrations require valid registration tokens. The open registration system has been secured.
122+
**Note**: All new satellite registrations require valid registration tokens. The open registration system has been secured.
123123

124124
### Registration Middleware
125125

@@ -261,6 +261,153 @@ Configuration respects team boundaries and isolation:
261261
- Team-defined security policies
262262
- Internal resource access settings
263263

264+
## Frontend API Endpoints
265+
266+
The backend provides REST and SSE endpoints for frontend access to installation status, logs, and requests.
267+
268+
### Status & Monitoring Endpoints
269+
270+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/status`**
271+
- Returns current installation status, status message, and last update timestamp
272+
- Used by frontend for real-time status badges and progress indicators
273+
274+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs`**
275+
- Returns paginated server logs (stderr output, connection errors)
276+
- Query params: `limit`, `offset` for pagination
277+
- Limited to 100 lines per installation (enforced by cleanup cron job)
278+
279+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests`**
280+
- Returns paginated request logs (tool execution history)
281+
- Includes request params, duration, success status
282+
- Response data included if `request_logging_enabled=true`
283+
284+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/{requestId}`**
285+
- Returns detailed request log for specific execution
286+
- Includes full request/response payloads when available
287+
288+
### Settings Management
289+
290+
**PATCH `/api/teams/{teamId}/mcp/installations/{installationId}/settings`**
291+
- Updates installation settings (stored in `mcpServerInstallations.settings` jsonb column)
292+
- Settings distributed to satellites via config endpoint
293+
- Current settings:
  - `request_logging_enabled` (boolean) - Controls capture of tool responses
295+
296+
### Real-Time Streaming (SSE)
297+
298+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs/stream`**
299+
- Server-Sent Events endpoint for real-time log streaming
300+
- Frontend subscribes for live stderr output
301+
- Auto-reconnects on connection loss
302+
303+
**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/stream`**
304+
- Server-Sent Events endpoint for real-time request log streaming
305+
- Frontend subscribes for live tool execution updates
306+
- Includes duration, status, and optionally response data
307+
308+
**SSE vs REST Comparison**:
309+
| Feature | REST Endpoints | SSE Endpoints |
|---------|---------------|---------------|
| Use Case | Historical data, pagination | Real-time updates |
| Connection | Request/response | Persistent connection |
| Data Flow | Pull (client requests) | Push (server sends) |
| Frontend Usage | Initial load, manual refresh | Live monitoring |
315+
316+
**SSE Controller Implementation**: `services/backend/src/controllers/mcp/sse.controller.ts`
317+
318+
**Routes Implementation**: `services/backend/src/routes/api/teams/mcp/installations.routes.ts`
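
To make the endpoint layout above concrete, here is a small sketch of a helper that builds the REST and SSE paths for one installation. The helper name `installationUrl` and the `InstallationResource` type are assumptions made for illustration; only the path shapes and the `limit`/`offset` query parameters come from the documentation.

```typescript
// Resources documented for a single installation (REST and SSE).
type InstallationResource =
  | "status"
  | "logs"
  | "requests"
  | "logs/stream"
  | "requests/stream";

interface Page {
  limit: number;
  offset: number;
}

// Hypothetical helper: builds the documented path, optionally with pagination.
function installationUrl(
  teamId: string,
  installationId: string,
  resource: InstallationResource,
  page?: Page,
): string {
  const path =
    `/api/teams/${encodeURIComponent(teamId)}` +
    `/mcp/installations/${encodeURIComponent(installationId)}/${resource}`;
  if (!page) return path;
  return `${path}?limit=${page.limit}&offset=${page.offset}`;
}
```

In a browser, a stream path built this way could be passed to the standard `EventSource` constructor, which reconnects automatically after a dropped connection, while the paginated paths would be fetched on initial load or manual refresh.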
319+
320+
---
321+
322+
## Health Check & Recovery Systems
323+
324+
### Cumulative Health Check System
325+
326+
**Purpose**: Template-level health aggregation across all installations of an MCP server.
327+
328+
**McpHealthCheckService** (`services/backend/src/services/mcp-health-check.service.ts`):
329+
- Aggregates health status from all installations of each MCP server template
330+
- Updates `mcpServers.health_status` based on installation health
331+
- Provides template-level health visibility in admin dashboard
332+
333+
**Cron Job**: `mcp-health-check` runs every 3 minutes
334+
- Implementation: `services/backend/src/jobs/mcp-health-check.job.ts`
335+
- Checks all MCP server templates
336+
- Updates template health status for admin visibility
337+
338+
### Credential Validation System
339+
340+
**Purpose**: Per-installation OAuth token validation to detect expired/revoked credentials.
341+
342+
**McpCredentialValidationWorker** (`services/backend/src/workers/mcp-credential-validation.worker.ts`):
343+
- Validates OAuth tokens for each installation
344+
- Sends `health_check` command to satellite with `check_type: 'credential_validation'`
345+
- Satellite performs OAuth validation and reports status
346+
347+
**Cron Job**: `mcp-credential-validation` runs every minute
348+
- Implementation: `services/backend/src/jobs/mcp-credential-validation.job.ts`
349+
- Validates installations on a 15-minute rotation
350+
- Triggers `requires_reauth` status on validation failure
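
The documentation does not spell out how a job that runs every minute achieves a 15-minute rotation. One plausible reading, shown here purely as an assumption, is that each run selects only installations whose `last_credential_check_at` is missing or at least 15 minutes old. The `InstallationRow` shape and the `dueForCredentialCheck` function are hypothetical.

```typescript
// Hypothetical row shape; only last_credential_check_at is a documented column.
interface InstallationRow {
  id: string;
  last_credential_check_at: Date | null;
}

const ROTATION_MS = 15 * 60 * 1000; // 15-minute rotation window

// Assumed selection rule: never-checked rows, or rows last checked >= 15 minutes ago.
function dueForCredentialCheck(rows: InstallationRow[], now: Date): InstallationRow[] {
  return rows.filter((row) => {
    if (row.last_credential_check_at === null) return true;
    return now.getTime() - row.last_credential_check_at.getTime() >= ROTATION_MS;
  });
}
```

Under this reading, running the job every minute keeps the delay between an installation becoming due and its validation short without checking every token on every tick.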
351+
352+
**Health Check Command Payload**:
353+
```json
{
  "commandType": "health_check",
  "priority": "immediate",
  "payload": {
    "check_type": "credential_validation",
    "installation_id": "inst_123",
    "team_id": "team_xyz"
  }
}
```
364+
365+
Satellite validates credentials and emits `mcp.server.status_changed` with status:
366+
- `online` - Credentials valid
367+
- `requires_reauth` - OAuth token expired/revoked
368+
- `error` - Validation failed with error
369+
370+
### Auto-Recovery System
371+
372+
**Recovery Trigger**:
373+
- Health check system detects offline installations
374+
- Backend calls `notifyMcpRecovery(installation_id, team_id)`
375+
- Sends command to satellite: Set status=`connecting`, rediscover tools
376+
- Status progression: offline → connecting → discovering_tools → online
377+
378+
**Tool Execution Recovery**:
379+
- Satellite detects recovery during tool execution (offline server responds)
380+
- Emits immediate status change event (doesn't wait for health check)
381+
- Triggers asynchronous re-discovery
382+
383+
For satellite-side recovery implementation, see [Satellite Recovery System](/development/satellite/recovery-system).
384+
385+
---
386+
387+
## Background Cron Jobs
388+
389+
The backend runs three MCP-related cron jobs for maintenance and monitoring:
390+
391+
**cleanup-mcp-server-logs**:
392+
- **Schedule**: Every 10 minutes
393+
- **Purpose**: Enforce 100-line limit per installation in `mcpServerLogs` table
394+
- **Action**: Deletes oldest logs beyond 100-line limit
395+
- **Implementation**: `services/backend/src/jobs/cleanup-mcp-server-logs.job.ts`
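
The 100-line retention rule can be sketched in memory as follows. This is not the actual job: a real implementation would most likely express the rule as a database query, and the `id` field and `logsToDelete` function are hypothetical names introduced for the example.

```typescript
// Hypothetical log row; `id` is assumed, the other fields are documented columns.
interface LogRow {
  id: number;
  installation_id: string;
  timestamp: Date;
}

// Returns the ids of rows that exceed the per-installation limit, oldest first to go.
function logsToDelete(rows: LogRow[], maxPerInstallation = 100): number[] {
  const byInstallation = new Map<string, LogRow[]>();
  for (const row of rows) {
    const bucket = byInstallation.get(row.installation_id) ?? [];
    bucket.push(row);
    byInstallation.set(row.installation_id, bucket);
  }

  const doomed: number[] = [];
  byInstallation.forEach((bucket) => {
    // Sort newest first, then everything past the limit is the oldest overflow.
    bucket.sort((a, b) => b.timestamp.getTime() - a.timestamp.getTime());
    for (const row of bucket.slice(maxPerInstallation)) {
      doomed.push(row.id);
    }
  });
  return doomed;
}
```

Because the job runs every 10 minutes, an installation may briefly hold more than 100 lines between runs.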
396+
397+
**mcp-health-check**:
398+
- **Schedule**: Every 3 minutes
399+
- **Purpose**: Template-level health aggregation
400+
- **Action**: Updates `mcpServers.health_status` column
401+
- **Implementation**: `services/backend/src/jobs/mcp-health-check.job.ts`
402+
403+
**mcp-credential-validation**:
404+
- **Schedule**: Every minute
405+
- **Purpose**: Detect expired/revoked OAuth tokens
406+
- **Action**: Sends `health_check` commands to satellites
407+
- **Implementation**: `services/backend/src/jobs/mcp-credential-validation.job.ts`
408+
409+
---
410+
264411
## Database Schema Integration
265412

266413
### Core Table Structure
@@ -298,6 +445,37 @@ The satellite system integrates with existing DeployStack schema through 5 speci
298445
- Alert generation and notification triggers
299446
- Historical health trend analysis
300447

448+
### New Columns Added (Status & Health Tracking System)
449+
450+
**mcpServerInstallations** table:
451+
- `status` (text) - Current installation status (11 possible values)
452+
- `status_message` (text, nullable) - Human-readable status context or error details
453+
- `status_updated_at` (timestamp) - Last status change timestamp
454+
- `last_health_check_at` (timestamp, nullable) - Last health check execution time
455+
- `last_credential_check_at` (timestamp, nullable) - Last credential validation time
456+
- `settings` (jsonb, nullable) - Generic settings object (e.g., `request_logging_enabled`)
457+
458+
**mcpServers** table:
459+
- `health_status` (text, nullable) - Template-level aggregated health status
460+
- `last_health_check_at` (timestamp, nullable) - Last template health check time
461+
- `health_check_error` (text, nullable) - Last health check error message
462+
463+
**mcpServerLogs** table:
464+
- Stores batched stderr logs from satellites
465+
- 100-line limit per installation (enforced by cleanup cron job)
466+
- Fields: `installation_id`, `team_id`, `log_level`, `message`, `timestamp`
467+
468+
**mcpRequestLogs** table:
469+
- Stores batched tool execution logs
470+
- `tool_response` (jsonb, nullable) - MCP server response data
471+
- Privacy control: Only captured when `request_logging_enabled=true`
472+
- Fields: `installation_id`, `team_id`, `tool_name`, `request_params`, `tool_response`, `duration_ms`, `success`, `error_message`, `timestamp`
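
The privacy control on `tool_response` could look roughly like the sketch below, which drops the response unless `request_logging_enabled` is explicitly `true`. The `toStoredRequestLog` function and the TypeScript types are hypothetical, and treating missing settings as "do not capture" is an assumption consistent with, but not stated by, the documentation.

```typescript
interface InstallationSettings {
  request_logging_enabled?: boolean;
}

// Field names follow the documented mcpRequestLogs columns; types are assumed.
interface RequestLogEntry {
  installation_id: string;
  team_id: string;
  tool_name: string;
  request_params: unknown;
  tool_response: unknown;
  duration_ms: number;
  success: boolean;
  error_message: string | null;
  timestamp: Date;
}

// Returns a copy of the entry with tool_response nulled unless logging is enabled.
function toStoredRequestLog(
  entry: RequestLogEntry,
  settings: InstallationSettings | null,
): RequestLogEntry {
  const capture = settings?.request_logging_enabled === true;
  return { ...entry, tool_response: capture ? entry.tool_response : null };
}
```

Returning a copy rather than mutating the entry keeps the original available to callers that still need the response, for example a live stream.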
473+
474+
**mcpToolMetadata** table:
475+
- Stores discovered tools with token counts
476+
- Used for hierarchical router token savings calculations
477+
- Fields: `installation_id`, `server_slug`, `tool_name`, `description`, `input_schema`, `token_count`, `discovered_at`
478+
301479
### Team Isolation in Data Model
302480

303481
All satellite data respects team boundaries:
