Commit 12f06af

Merge pull request #7 from pvliesdonk/copilot/implement-epic-7-on-pr6
Implement Epic 7: Observability & Operations
2 parents 8b9f501 + 1cc2357 commit 12f06af

10 files changed

+1253
-16
lines changed

README.md

Lines changed: 123 additions & 4 deletions
@@ -15,6 +15,11 @@ MCP DevBench is a Docker container management server that implements the Model C
 - **Configuration Management**: Environment-based configuration with Pydantic Settings
 - **Structured Logging**: JSON-formatted logging for production observability
 - **Docker Integration**: Secure Docker daemon communication with connection pooling
+- **Audit Logging**: Complete audit trail for all operations with sensitive data redaction
+- **Prometheus Metrics**: Built-in metrics collection for monitoring and alerting
+- **Admin Tools**: System health status, container/exec listing, garbage collection, and reconciliation
+- **Graceful Shutdown**: Drains active operations before shutdown
+- **Automatic Recovery**: Reconciles Docker state with database on startup
 
 ## Requirements
 
@@ -247,6 +252,31 @@ This project has completed **Epic 1: Foundation Layer**, **Epic 2: Command Execu
   - Database vacuuming for optimization
   - Health monitoring and metrics collection
 
+### Epic 7: Observability & Operations ✅
+- [x] Feature 7.1: Structured Audit Logging
+  - AuditLogger with JSON structured logging for all operations
+  - Complete audit trail for container, exec, filesystem, security, and transfer events
+  - Automatic sensitive data redaction (passwords, tokens, keys, secrets)
+  - ISO8601 timestamps and correlation IDs
+  - Configurable detail level
+  - 17 unit tests covering audit functionality
+
+- [x] Feature 7.2: Metrics & Monitoring
+  - Prometheus metrics collection via MetricsCollector
+  - Counter metrics: container_spawns_total, exec_total, fs_operations_total
+  - Histogram metrics: exec_duration_seconds, output_bytes
+  - Gauge metrics: active_containers, active_attachments, memory_usage_bytes
+  - `metrics` tool to expose Prometheus-formatted metrics
+  - 14 unit tests covering metrics collection
+
+- [x] Feature 7.3: Debug & Admin Tools
+  - `system_status` tool for overall system health
+  - `list_containers` tool for detailed container information
+  - `list_execs` tool for active execution listing
+  - `garbage_collect` tool for manual cleanup
+  - `reconcile` tool with audit logging (from Epic 6)
+  - Docker connectivity and database status monitoring
+
 ### Current Status
 The project now has:
 - Full container lifecycle management with image policy enforcement
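Feature 7.1 lists automatic redaction of passwords, tokens, keys, and secrets. The commit page does not show the AuditLogger source, so the following is only a minimal sketch of how key-based redaction could work. The names `redact`, `SENSITIVE_MARKERS`, and `REDACTED` are hypothetical, and the substring matching rule is an assumption, not the project's implementation.

```python
from typing import Any

# Hypothetical marker list, taken from the categories named in the README.
SENSITIVE_MARKERS = ("password", "token", "key", "secret")
REDACTED = "[REDACTED]"


def _is_sensitive(field_name: Any) -> bool:
    """Assumed rule: a field is sensitive if its name contains a marker."""
    name = str(field_name).lower()
    return any(marker in name for marker in SENSITIVE_MARKERS)


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            k: REDACTED if _is_sensitive(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {
    "event": "container_spawn",
    "image": "python:3.11",
    "env": {"API_TOKEN": "abc123", "LANG": "C.UTF-8"},
}
print(redact(event))
```

Substring matching is deliberately broad: it errs toward masking too much rather than leaking a credential. The function builds new containers instead of mutating its input, so the caller's original event is left untouched.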
@@ -256,10 +286,13 @@ The project now has:
 - Image allow-list validation and resolution with digest pinning
 - Comprehensive security hardening (capability dropping, resource limits, audit logging)
 - Warm container pool for fast provisioning (<1s attach time)
-- **Graceful shutdown with operation draining**
-- **Boot recovery and automatic reconciliation**
-- **Background maintenance and health monitoring**
-- 170 unit and integration tests passing (100% success rate)
+- Graceful shutdown with operation draining
+- Boot recovery and automatic reconciliation
+- Background maintenance and health monitoring
+- **Complete audit logging for all operations with sensitive data redaction**
+- **Prometheus metrics collection and exposure**
+- **Admin tools for system status, container/exec listing, and manual operations**
+- 201 unit and integration tests passing (100% success rate)
 - Comprehensive error handling and resource management
 
 ## MCP Tools Reference
@@ -460,6 +493,92 @@ This tool performs:
 }
 ```
 
+### Observability & Admin Tools
+
+#### `metrics`
+Get Prometheus metrics for monitoring.
+
+Returns current metrics including:
+- Container spawn counts by image
+- Execution counts and durations
+- Filesystem operation counts
+- Active container and attachment gauges
+- Memory usage by container
+
+**Input:** None
+
+**Output:**
+- `metrics` (string): Prometheus-formatted metrics
+
+**Example metrics output:**
+```
+# HELP mcp_devbench_container_spawns_total Total number of container spawns
+# TYPE mcp_devbench_container_spawns_total counter
+mcp_devbench_container_spawns_total{image="python:3.11"} 5.0
+# HELP mcp_devbench_exec_total Total number of command executions
+# TYPE mcp_devbench_exec_total counter
+mcp_devbench_exec_total{container_id="c_123",status="success"} 10.0
+# HELP mcp_devbench_active_containers Number of active containers
+# TYPE mcp_devbench_active_containers gauge
+mcp_devbench_active_containers 3.0
+```
+
+#### `system_status`
+Get system health and status information.
+
+**Input:** None
+
+**Output:**
+- `status` (string): Overall system status (healthy, degraded)
+- `docker_connected` (boolean): Docker daemon connectivity
+- `database_initialized` (boolean): Database initialization status
+- `active_containers` (integer): Number of active containers
+- `active_attachments` (integer): Number of active client attachments
+- `version` (string): Server version
+
+**Example:**
+```json
+{
+  "status": "healthy",
+  "docker_connected": true,
+  "database_initialized": true,
+  "active_containers": 3,
+  "active_attachments": 2,
+  "version": "0.1.0"
+}
+```
+
+#### `garbage_collect`
+Trigger manual garbage collection.
+
+Cleans up:
+- Orphaned transient containers
+- Old completed exec records (>24h)
+- Abandoned attachments
+
+**Input:** None
+
+**Output:**
+- `containers_removed` (integer): Number of containers removed
+- `execs_cleaned` (integer): Number of exec records cleaned
+- `attachments_cleaned` (integer): Number of attachments cleaned
+
+#### `list_containers`
+List all containers with detailed information.
+
+**Input:** None
+
+**Output:**
+- `containers` (array): List of container objects with id, docker_id, alias, image, status, persistent, created_at, last_seen
+
+#### `list_execs`
+List active command executions.
+
+**Input:** None
+
+**Output:**
+- `execs` (array): List of execution objects with exec_id, container_id, cmd, as_root, started_at, status
+
 See [mcp-devbench-work-breakdown.md](mcp-devbench-work-breakdown.md) for the complete implementation roadmap.
 
 ## License
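The `metrics` tool returns the Prometheus text format as a single string. A client that wants to read one value without running a Prometheus server could parse it roughly as follows. This is a sketch only: `parse_samples` is a hypothetical helper, and it handles just the simple sample lines shown in the README example (no escaped quotes inside label values, no timestamps).

```python
import re
from typing import Dict, Tuple

# Matches 'name{labels} value' or 'name value'.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

# A sample is identified by its metric name plus its sorted label pairs.
SampleKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def parse_samples(text: str) -> Dict[SampleKey, float]:
    """Parse simple Prometheus sample lines, skipping comments and blanks."""
    samples: Dict[SampleKey, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blank lines carry no sample
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this sketch does not understand
        name, raw_labels, raw_value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(raw_value)
    return samples


example = """\
# HELP mcp_devbench_container_spawns_total Total number of container spawns
# TYPE mcp_devbench_container_spawns_total counter
mcp_devbench_container_spawns_total{image="python:3.11"} 5.0
# HELP mcp_devbench_active_containers Number of active containers
# TYPE mcp_devbench_active_containers gauge
mcp_devbench_active_containers 3.0
"""
parsed = parse_samples(example)
print(parsed[("mcp_devbench_active_containers", ())])
```

For anything beyond a quick check, a full parser that follows the Prometheus exposition format is a safer choice than these regular expressions.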

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ dependencies = [
     "alembic>=1.13.0",
     "aiosqlite>=0.19.0",
     "python-json-logger>=2.0.7",
+    "prometheus-client>=0.20.0",
 ]
 
 [project.optional-dependencies]

src/mcp_devbench/mcp_tools.py

Lines changed: 56 additions & 0 deletions
@@ -198,3 +198,59 @@ class ExecPollOutput(BaseModel):
 
     messages: List[ExecStreamMessage] = Field(..., description="Stream messages")
     complete: bool = Field(..., description="Whether execution is complete")
+
+
+# Admin and Monitoring Tools for Feature 7.2 and 7.3
+
+
+class MetricsOutput(BaseModel):
+    """Output model for metrics tool."""
+
+    metrics: str = Field(..., description="Prometheus metrics in text format")
+
+
+class SystemStatusOutput(BaseModel):
+    """Output model for system status tool."""
+
+    status: str = Field(..., description="Overall system status")
+    docker_connected: bool = Field(..., description="Docker daemon connectivity")
+    database_initialized: bool = Field(..., description="Database initialization status")
+    active_containers: int = Field(..., description="Number of active containers")
+    active_attachments: int = Field(..., description="Number of active attachments")
+    version: str = Field(..., description="Server version")
+
+
+class ReconcileInput(BaseModel):
+    """Input model for reconcile tool."""
+
+    force: bool = Field(default=False, description="Force reconciliation even if recently run")
+
+
+class ReconcileOutput(BaseModel):
+    """Output model for reconcile tool."""
+
+    discovered: int = Field(..., description="Number of containers discovered")
+    adopted: int = Field(..., description="Number of containers adopted into state")
+    cleaned_up: int = Field(..., description="Number of containers cleaned up")
+    orphaned: int = Field(..., description="Number of orphaned containers found")
+    errors: int = Field(..., description="Number of errors encountered")
+
+
+class GarbageCollectOutput(BaseModel):
+    """Output model for garbage collection tool."""
+
+    containers_removed: int = Field(..., description="Number of containers removed")
+    execs_cleaned: int = Field(..., description="Number of exec records cleaned")
+    attachments_cleaned: int = Field(..., description="Number of attachments cleaned")
+
+
+class ContainerListOutput(BaseModel):
+    """Output model for container list tool."""
+
+    containers: List[Dict[str, Any]] = Field(..., description="List of container information")
+
+
+class ExecListOutput(BaseModel):
+    """Output model for exec list tool."""
+
+    execs: List[Dict[str, Any]] = Field(..., description="List of active executions")
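On the server side, Pydantic validates these models. A client without Pydantic may still want to confirm that a `system_status` response has the documented shape. The sketch below does that with the standard library only. The field names and types are copied from `SystemStatusOutput`; the helper `check_system_status` and the `SYSTEM_STATUS_FIELDS` table are hypothetical and not part of the commit.

```python
from typing import Any, Dict, List

# Field names and types copied from SystemStatusOutput in this commit.
SYSTEM_STATUS_FIELDS = {
    "status": str,
    "docker_connected": bool,
    "database_initialized": bool,
    "active_containers": int,
    "active_attachments": int,
    "version": str,
}


def check_system_status(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems: List[str] = []
    for name, expected in SYSTEM_STATUS_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        if expected is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


response = {
    "status": "healthy",
    "docker_connected": True,
    "database_initialized": True,
    "active_containers": 3,
    "active_attachments": 2,
    "version": "0.1.0",
}
print(check_system_status(response))
```

Returning a list of problems instead of raising on the first mismatch lets a caller log every discrepancy in one pass.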

src/mcp_devbench/repositories/execs.py

Lines changed: 2 additions & 2 deletions
@@ -85,9 +85,9 @@ async def get_old_completed(self, hours: int = 24) -> List[Exec]:
         Returns:
             List of old completed execs
         """
-        from datetime import timedelta
+        from datetime import timedelta, timezone
 
-        cutoff = datetime.utcnow() - timedelta(hours=hours)
+        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
         stmt = select(Exec).where(Exec.ended_at.is_not(None), Exec.ended_at < cutoff)
         result = await self.session.execute(stmt)
         return list(result.scalars().all())
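This change swaps `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12, for `datetime.now(timezone.utc)`, which returns a timezone-aware one. The sketch below shows the practical difference. The helper `cutoff_before` is hypothetical and only mirrors the cutoff arithmetic in `get_old_completed`; how stored `ended_at` values compare against an aware cutoff in SQL depends on the column type and driver, which this page does not show.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_before(hours: int = 24, now: Optional[datetime] = None) -> datetime:
    """Return the timezone-aware instant `hours` before `now` (default: current UTC time)."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - timedelta(hours=hours)


# A fixed reference time keeps the example deterministic.
fixed_now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
cutoff = cutoff_before(24, now=fixed_now)
print(cutoff.isoformat())  # 2024-01-01T12:00:00+00:00

# Ordering a naive datetime against an aware one raises TypeError in Python.
naive = datetime(2024, 1, 1, 0, 0)
try:
    naive < cutoff
except TypeError as exc:
    print("cannot compare:", exc)
```

Accepting `now` as a parameter keeps the helper easy to test without patching the clock.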
