
Commit 195f35a

Author: Lasim
docs(satellite): add comprehensive status & health tracking documentation
Add 4 new documentation files covering the MCP Status & Health Tracking System (18 implementation phases) and update 6 existing files with cross-references to maintain modular documentation structure.

New files:
- status-tracking.mdx: 11-state status system, lifecycle flows, tool filtering
- event-emission.mdx: Event types, payloads, batching configuration
- log-capture.mdx: Server/request logging, buffering, privacy controls
- recovery-system.mdx: Automatic recovery detection, retry logic, tool preservation

Updated files with cross-links:
- architecture.mdx: Add Status Tracking, Event System, Log Capture sections
- tool-discovery.mdx: Add Status Integration, Recovery System sections
- process-management.mdx: Add Status Events, Log Buffering sections
- backend-communication.mdx: Add Events vs Heartbeat, Health Check sections
- commands.mdx: Add health_check command documentation
- hierarchical-router.mdx: Add Status-Based Tool Filtering section

Navigation:
- docs.json: Add "Status & Health Tracking" group to Satellite Development tab

Technical details:
- 11 status values (provisioning → online → offline/error/requires_reauth)
- Event batching: 3-second interval, max 20 per batch
- Retry logic: exponential backoff (500ms, 1s, 2s)
- Log storage: 100-line limit per installation
- Request logging privacy control via settings
1 parent 8cfd64b commit 195f35a


11 files changed, +1787 -112 lines changed


development/satellite/architecture.mdx

Lines changed: 26 additions & 21 deletions
@@ -284,31 +284,36 @@ For complete implementation details, see [Backend Polling Implementation](/devel
284284

285285
### Real-Time Event System
286286

287-
**Event Emission with Batching:**
288-
```
289-
Satellite Operations EventBus Backend
290-
│ │ │
291-
│─── mcp.server.started ──▶│ │
292-
│─── mcp.tool.executed ───▶│ [Queue] │
293-
│─── mcp.client.connected ─▶│ │
294-
│ [Every 3 seconds] │
295-
│ │ │
296-
│ │─── POST /events ───▶│
297-
│ │◀─── 200 OK ─────────│
298-
```
299-
300-
**Event Features:**
301-
- **Immediate Emission**: Events emitted when actions occur (not delayed by 30s heartbeat)
302-
- **Automatic Batching**: Events collected for 3 seconds, then sent as single batch (max 100 events)
303-
- **Memory Management**: In-memory queue (10,000 event limit) with overflow protection
304-
- **Graceful Error Handling**: 429 exponential backoff, 400 drops invalid events, 500/network errors retry
305-
- **10 Event Types**: Server lifecycle, client connections, tool discovery, configuration updates
287+
The satellite emits typed events for status changes, logs, and tool metadata. Events enable real-time monitoring without polling.
306288

307289
**Difference from Heartbeat:**
308290
- **Heartbeat** (every 30s): Aggregate metrics, system health, resource usage
309-
- **Events** (immediate): Point-in-time occurrences, user actions, precise timestamps
291+
- **Events** (immediate): Point-in-time status updates, precise timestamps
292+
293+
See [Event Emission](/development/satellite/event-emission) for complete event types, payloads, and batching configuration.
294+
295+
### Status Tracking System
296+
297+
The satellite tracks MCP server installation health through an 11-state status system that drives tool availability and automatic recovery.
298+
299+
**Status Values:**
300+
- Installation lifecycle: `provisioning`, `command_received`, `connecting`, `discovering_tools`, `syncing_tools`
301+
- Healthy state: `online` (tools available)
302+
- Configuration changes: `restarting`
303+
- Failure states: `offline`, `error`, `requires_reauth`, `permanently_failed`
304+
305+
**Status Integration:**
306+
- **Tool Filtering**: Tools from non-online servers hidden from discovery
307+
- **Auto-Recovery**: Offline servers auto-recover when responsive
308+
- **Event Emission**: Status changes emitted immediately to backend
309+
310+
See [Status Tracking](/development/satellite/status-tracking) for complete status lifecycle and transitions.
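
As an illustration of the tool-filtering rule above (tools from non-online servers are hidden), here is a minimal TypeScript sketch. Only the 11 status strings come from the documentation; the `DiscoveredTool` shape, the `installationId` field, and the `filterAvailableTools` function are hypothetical names chosen for this example and are not taken from the satellite codebase.

```typescript
// The 11 status values listed in the Status Tracking section.
type InstallationStatus =
  | 'provisioning'
  | 'command_received'
  | 'connecting'
  | 'discovering_tools'
  | 'syncing_tools'
  | 'online'
  | 'restarting'
  | 'offline'
  | 'error'
  | 'requires_reauth'
  | 'permanently_failed';

// Hypothetical tool record: only the link back to its installation matters here.
interface DiscoveredTool {
  name: string;
  installationId: string;
}

// Keep only tools whose installation is currently `online`.
// Installations with no known status are treated as unavailable (an assumption).
function filterAvailableTools(
  tools: DiscoveredTool[],
  statusByInstallation: Map<string, InstallationStatus>,
): DiscoveredTool[] {
  return tools.filter(
    (tool) => statusByInstallation.get(tool.installationId) === 'online',
  );
}
```

Modelling the status as a string-literal union lets the compiler reject misspelled status names, which may be useful given how many states drive tool visibility.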
311+
312+
### Log Capture System
313+
314+
The satellite captures and batches two types of logs for debugging and monitoring: **server logs** (stderr output) and **request logs** (tool execution with full request/response data).
310315

311-
For complete event system documentation, see [Event System](/development/satellite/event-system).
316+
See [Log Capture](/development/satellite/log-capture) for buffering implementation, batching configuration, backend storage limits, and privacy controls.
312317

313318
## Security Architecture
314319

development/satellite/backend-communication.mdx

Lines changed: 66 additions & 2 deletions
@@ -157,11 +157,11 @@ For detailed event system documentation, see [Event System](/development/satelli
157157
- Performance metrics collection
158158

159159
**Terminate Process:**
160-
- Graceful shutdown with SIGTERM
161-
- Force kill with SIGKILL after timeout
162160
- Resource cleanup and deallocation
163161
- Final status report to Backend
164162

163+
See [Process Management - Graceful Termination](/development/satellite/process-management#graceful-termination) for SIGTERM/SIGKILL shutdown details.
164+
165165
## Internal Architecture
166166

167167
### Five Core Components
@@ -375,6 +375,70 @@ server.log.info({
375375
4. Add comprehensive monitoring and alerting
376376
5. End-to-end testing and performance validation
377377

378+
## Events vs Heartbeat
379+
380+
The satellite communicates status and metrics through two distinct channels:
381+
382+
**Events (Immediate):**
383+
- Emitted when actions occur (not delayed by heartbeat interval)
384+
- Point-in-time status updates with precise timestamps
385+
- Batched automatically (3-second interval, max 20 per batch)
386+
- Types: Status changes, logs, tool metadata, lifecycle events
387+
388+
**Heartbeat (Periodic, every 30s):**
389+
- Aggregate metrics and system health
390+
- Resource usage statistics
391+
- Overall satellite status
392+
393+
See [Event Emission](/development/satellite/event-emission) for complete event types and batching strategy.
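
The batching behaviour described above (3-second interval, at most 20 events per batch) could be sketched roughly as follows. This is not the satellite's implementation: `SatelliteEvent`, `EventBatcher`, and their methods are hypothetical names, and the sketch deliberately leaves out the HTTP call and any error handling.

```typescript
// Hypothetical event shape; the real payloads are described in the Event Emission docs.
interface SatelliteEvent {
  type: string;
  timestamp: string;
  data: unknown;
}

// Values stated in the documentation.
const BATCH_INTERVAL_MS = 3000;
const MAX_BATCH_SIZE = 20;

class EventBatcher {
  private queue: SatelliteEvent[] = [];

  // Events are queued immediately when they occur.
  enqueue(event: SatelliteEvent): void {
    this.queue.push(event);
  }

  // True when the interval has elapsed and there is something to send.
  shouldFlush(lastFlushAtMs: number, nowMs: number): boolean {
    return this.queue.length > 0 && nowMs - lastFlushAtMs >= BATCH_INTERVAL_MS;
  }

  // Remove and return the oldest events, capped at MAX_BATCH_SIZE.
  // Anything beyond the cap stays queued for the next interval.
  nextBatch(): SatelliteEvent[] {
    return this.queue.splice(0, MAX_BATCH_SIZE);
  }
}
```

A caller would check `shouldFlush` on a timer and send the result of `nextBatch()` to the backend; with 45 queued events, successive batches would contain 20, 20, and 5 events.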
394+
395+
## Health Check Command
396+
397+
The backend sends `health_check` commands for credential validation:
398+
399+
**Command Structure:**
400+
```typescript
401+
{
402+
commandType: 'health_check',
403+
priority: 'immediate',
404+
payload: {
405+
check_type: 'credential_validation',
406+
installation_id: string,
407+
team_id: string
408+
}
409+
}
410+
```
411+
412+
**Satellite Action:**
413+
- Calls `tools/list` on MCP server with credentials
414+
- Detects auth errors (401, 403)
415+
- Emits `requires_reauth` status if validation fails
416+
417+
See [Commands](/development/satellite/commands) for complete command reference.
418+
419+
## Recovery Commands
420+
421+
When offline servers recover, backend sends recovery commands:
422+
423+
**Command Structure:**
424+
```typescript
425+
{
426+
commandType: 'configure',
427+
priority: 'high',
428+
payload: {
429+
event: 'mcp_recovery',
430+
installation_id: string,
431+
team_id: string
432+
}
433+
}
434+
```
435+
436+
**Satellite Action:**
437+
- Triggers re-discovery for the recovered server
438+
- Status progresses: `offline` → `connecting` → `discovering_tools` → `online`
439+
440+
See [Recovery System](/development/satellite/recovery-system) for automatic recovery logic.
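
The commit message describes the retry logic as exponential backoff with delays of 500ms, 1s, and 2s. A minimal sketch of that schedule is shown below; the constant and function names are hypothetical, and where exactly the satellite applies these delays is not stated in this diff.

```typescript
// Values taken from the commit message: 500ms, 1s, 2s.
const BASE_DELAY_MS = 500;
const MAX_RETRIES = 3;

// Delay before retry number `attempt` (0-based): the base delay doubles each time.
function retryDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(2, attempt);
}

// Full schedule of delays for all allowed retries.
function retrySchedule(): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    delays.push(retryDelayMs(attempt));
  }
  return delays;
}

console.log(retrySchedule()); // [ 500, 1000, 2000 ]
```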
441+
378442
<Info>
379443
The satellite communication system is designed for enterprise deployment with complete team isolation, resource management, and audit logging while maintaining the developer experience that defines the DeployStack platform.
380444
</Info>

development/satellite/commands.mdx

Lines changed: 38 additions & 0 deletions
@@ -115,6 +115,44 @@ Each satellite command contains:
115115
4. Restart affected components
116116
5. Verify system integrity post-update
117117

118+
### health_check
119+
120+
**Purpose**: Validates MCP server credentials and connectivity
121+
122+
**Priority**: `immediate`
123+
124+
**Triggered By**:
125+
- Backend credential validation cron (every 1 minute)
126+
- Manual credential testing
127+
- OAuth token expiration detection
128+
129+
**Payload Structure**:
130+
```json
131+
{
132+
"check_type": "credential_validation",
133+
"installation_id": "installation-uuid",
134+
"team_id": "team-uuid"
135+
}
136+
```
137+
138+
**Satellite Actions**:
139+
1. Find MCP server configuration by installation_id
140+
2. Skip stdio servers (no HTTP credentials to validate)
141+
3. Build HTTP request with configured credentials (headers, query params)
142+
4. Call `tools/list` with 15-second timeout
143+
5. Detect authentication errors:
144+
- HTTP 401/403 responses
145+
- Error messages containing "auth", "unauthorized", "forbidden"
146+
6. Emit status event:
147+
- On auth failure → `requires_reauth` status
148+
- On success → credentials valid (no status change)
149+
150+
**Error Detection Patterns**:
151+
- HTTP status codes: 401, 403
152+
- Response body keywords: "auth", "unauthorized", "forbidden", "token", "credentials"
153+
154+
See [Status Tracking](/development/satellite/status-tracking) for credential validation status flow.
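
The error detection patterns above could be expressed as a small predicate. The sketch below is an assumption about how such a check might look, not the satellite's code: `HealthCheckResult` and `isAuthFailure` are hypothetical names, and matching keywords case-insensitively is a choice made for this example.

```typescript
// Patterns listed under "Error Detection Patterns".
const AUTH_STATUS_CODES: number[] = [401, 403];
const AUTH_KEYWORDS: string[] = ['auth', 'unauthorized', 'forbidden', 'token', 'credentials'];

// Hypothetical summary of a failed `tools/list` call.
interface HealthCheckResult {
  httpStatus?: number;
  errorMessage?: string;
}

// Returns true when the result looks like an authentication failure,
// in which case the satellite would emit the `requires_reauth` status.
function isAuthFailure(result: HealthCheckResult): boolean {
  if (result.httpStatus !== undefined && AUTH_STATUS_CODES.indexOf(result.httpStatus) !== -1) {
    return true;
  }
  const message = (result.errorMessage || '').toLowerCase();
  if (message.length === 0) {
    return false;
  }
  return AUTH_KEYWORDS.some((keyword) => message.indexOf(keyword) !== -1);
}
```

Substring matching on short keywords such as "auth" or "token" is broad, so a check like this is best applied only to error responses rather than to successful ones.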
155+
118156
## Command Lifecycle
119157

120158
### Creation
