
Commit f764960

Add --auto-shutdown flag to server command for VPC-based lifecycle management
Monitors VPC existence on each status refresh cycle; exits cleanly when the cluster's VPC is removed. Fixes pre-existing test failures from the K3s DaemonSet sidecar migration and removes the stale per-host field from sidecar failure events.
1 parent 0056bb8 commit f764960

23 files changed: +439 -286 lines changed


docs/integrations/server.md

Lines changed: 16 additions & 0 deletions
@@ -236,6 +236,22 @@ Published every 5 seconds when the cluster is running Cassandra:
 | `compactionCompletedPerSec` | double | Compactions completed per second |
 | `compactionBytesWrittenPerSec` | double | Compaction write throughput (bytes/sec) |
 
+## Auto-Shutdown on Infrastructure Removal
+
+When running the server in unattended or automated scenarios, you can enable automatic shutdown if the cluster's AWS infrastructure is torn down:
+
+```bash
+easy-db-lab server --auto-shutdown
+```
+
+When `--auto-shutdown` is set, the server checks whether the cluster VPC still exists on each status refresh cycle (controlled by `--refresh`). If the VPC is no longer found, the server emits a shutdown event and exits cleanly with code 0.
+
+This is useful when:
+- Running the server alongside an automated test workflow that tears down infrastructure when done
+- Leaving the server running overnight and wanting it to stop automatically after `easy-db-lab down`
+
+**Note:** The check is skipped if no cluster state exists or the VPC name cannot be determined. AWS API errors during the check are logged and ignored; only a confirmed "VPC not found" result triggers shutdown.
+
 ## Notes
 
 - The server requires Docker to be installed
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-03-26
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+## Context
+
+The `server` command starts a long-lived process that exposes MCP and REST endpoints for cluster management. It is often started in automated or unattended workflows (e.g., running alongside an AI assistant session). When the underlying AWS infrastructure, specifically the VPC, is torn down (via `easy-db-lab down` or external deletion), the server process has no meaningful work to do but continues running indefinitely.
+
+The VPC is used as the sentinel because it is the root of the cluster's AWS infrastructure. If the VPC is gone, all associated EC2 instances, subnets, and security groups are also gone, so the cluster no longer exists.
+
+The VPC ID for the current cluster is stored in `ClusterState` and accessible via `ClusterStateManager`.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Provide an opt-in `--auto-shutdown` flag on the `server` command
+- Check whether the cluster's VPC still exists on each status refresh cycle (`--refresh`)
+- Emit a domain event and exit cleanly when the VPC is no longer found
+
+**Non-Goals:**
+- Automatic shutdown without the flag (opt-in only)
+- Monitoring resources other than the VPC (instances, subnets, etc.)
+- Reacting to partial infrastructure removal (only full VPC deletion triggers shutdown)
+- Persistent state or restart behavior after shutdown
+
+## Decisions
+
+### Decision 1: Integrate VPC check into `StatusCache` refresh cycle
+
+The VPC existence check runs inside `StatusCache` on each refresh, reusing the existing `--refresh` interval. When `autoShutdown` is enabled and the VPC is not found, `StatusCache` emits `Event.Server.InfrastructureGone` and calls `exitProcess(0)`. The `Server` command passes `autoShutdown` and the cluster VPC name directly to `StatusCache`.
+
+**Alternative considered**: A separate background service on its own timer. Rejected: `StatusCache` already polls AWS on a schedule; a second thread for VPC checks is redundant overhead with no benefit.
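
As a rough illustration of Decision 1, the sketch below folds the VPC check into a single scheduled refresh task instead of adding a second timer. `StatusCacheSketch`, `VpcLookup`, and `onInfrastructureGone` are hypothetical stand-ins and not the project's real `StatusCache` API; the only behavior taken from this design is that `findVpcByName()` returns `null` for a missing VPC.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for the VPC lookup; returns null when no VPC has that name.
fun interface VpcLookup {
    fun findVpcByName(name: String): String?
}

// Hypothetical, simplified stand-in for StatusCache: one timer drives both jobs.
class StatusCacheSketch(
    private val refreshSeconds: Long,
    private val autoShutdown: Boolean,
    private val vpcName: String?,
    private val vpcLookup: VpcLookup,
    // Assumed hook; per the design, the real code emits an event and calls exitProcess(0).
    private val onInfrastructureGone: () -> Unit,
) {
    private val scheduler = Executors.newSingleThreadScheduledExecutor()

    var refreshCount = 0
        private set

    fun start() {
        scheduler.scheduleAtFixedRate({ refresh() }, 0, refreshSeconds, TimeUnit.SECONDS)
    }

    fun stop() {
        scheduler.shutdownNow()
    }

    // Public so a cycle can be driven by hand without waiting for the timer.
    fun refresh() {
        refreshCount++ // placeholder for the existing status refresh work
        val name = vpcName
        if (!autoShutdown || name.isNullOrBlank()) return
        if (vpcLookup.findVpcByName(name) == null) onInfrastructureGone()
    }
}
```

Handling of AWS API exceptions is left out of this sketch to keep the focus on the shared timer; the Risks section describes how those should be treated.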
+
+### Decision 2: Opt-in via `--auto-shutdown` flag, not default behavior
+
+Auto-shutdown should not happen unexpectedly during interactive use. An explicit flag makes the intent clear and prevents surprises.
+
+**Alternative considered**: Always-on with a `--no-auto-shutdown` flag. Rejected: exiting by default would surprise users who run the server interactively.
+
+### Decision 3: Use `VpcDiscoveryOperations.findVpcByName()` as the existence check
+
+`findVpcByName()` returns `null` when the VPC doesn't exist. This is already implemented in `VpcService` / `VpcInfrastructure`. The cluster's VPC name is derivable from `ClusterState`.
+
+**Alternative considered**: Describe VPC by ID. Also valid, but the name-based lookup already has the right semantics (null = not found) and is used elsewhere.
+
+### Decision 4: A single miss triggers shutdown (no retry dampening)
+
+If the VPC is gone, it is gone. VPC deletion in AWS is not transient. There is no need to wait for N consecutive failures before shutting down.
+
+**Alternative considered**: Require 2–3 consecutive misses. Rejected: AWS VPC existence checks are reliable; the complexity of dampening is not warranted.
+
+### Decision 5: Emit a domain event, then exit the JVM
+
+The watchdog emits `Event.Server.InfrastructureGone` and then calls `exitProcess(0)`. This ensures MCP clients and REST consumers see a structured event before the process ends.
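
The ordering in Decision 5 can be sketched as below, assuming an injectable exit function so the sequence is observable without ending the JVM. `ServerEvent`, `RecordingEventBus`, and `shutdownForMissingVpc` are hypothetical names for illustration; the project's real `Event.kt` and event bus will differ. The `vpcId` and `message` fields follow task 1.1.

```kotlin
// Hypothetical minimal event type; stands in for Event.Server.InfrastructureGone.
sealed interface ServerEvent {
    data class InfrastructureGone(val vpcId: String, val message: String) : ServerEvent
}

// Hypothetical event bus that only records what was emitted.
class RecordingEventBus {
    val emitted = mutableListOf<ServerEvent>()

    fun emit(event: ServerEvent) {
        emitted += event
    }
}

// The exit function is injected so the ordering can be checked in isolation.
// Production code would pass { code -> kotlin.system.exitProcess(code) }.
fun shutdownForMissingVpc(
    vpcId: String,
    bus: RecordingEventBus,
    exit: (Int) -> Unit,
) {
    // Emit first, so listeners see the structured event before the process ends.
    bus.emit(ServerEvent.InfrastructureGone(vpcId, "VPC '$vpcId' no longer exists; shutting down"))
    exit(0)
}
```

Injecting `exit` is a design choice of this sketch only; it keeps the emit-then-exit order checkable in a unit test without mocking `exitProcess`.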
+
+## Risks / Trade-offs
+
+- **False positive on transient AWS API error** → The check should distinguish "not found" (null) from an AWS API exception. API exceptions should be logged and the watchdog should continue polling rather than triggering shutdown. Only a confirmed null result (VPC not found) causes shutdown.
+- **VPC ID not present in ClusterState** → If the cluster was provisioned before VPC tracking was introduced, `ClusterState.vpcId` may be null. The watchdog should skip polling and log a warning rather than crashing or shutting down.
+- **No clean shutdown hook** → The `exitProcess(0)` approach bypasses Ktor's graceful shutdown. This is acceptable for an infra-gone scenario since the cluster is already gone. A future improvement could use a coroutine cancellation signal instead.
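
The first two risks above amount to keeping four outcomes apart, of which only one leads to shutdown. A minimal sketch, assuming the lookup is passed in as a function; `VpcCheck` and `checkVpc` are hypothetical names that do not appear in the commit:

```kotlin
// Outcome of one watchdog check. Only Gone should lead to shutdown.
sealed interface VpcCheck {
    object Skipped : VpcCheck                           // no VPC name available to check
    object Present : VpcCheck                           // lookup returned a VPC
    object Gone : VpcCheck                              // lookup returned null: confirmed not found
    data class Failed(val cause: Exception) : VpcCheck  // API error: log it, retry next cycle
}

// findVpcByName is passed in as a plain function so the sketch stays self-contained.
fun checkVpc(vpcName: String?, findVpcByName: (String) -> String?): VpcCheck {
    if (vpcName.isNullOrBlank()) return VpcCheck.Skipped
    return try {
        if (findVpcByName(vpcName) == null) VpcCheck.Gone else VpcCheck.Present
    } catch (e: Exception) {
        // An exception is not evidence that the VPC is gone.
        VpcCheck.Failed(e)
    }
}
```

A caller would log a warning for `Skipped`, log the cause for `Failed`, and trigger shutdown only for `Gone`.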
+
+## Open Questions
+
+- Should the watchdog also check instance existence as a secondary signal, or is VPC-only sufficient? (Current proposal: VPC-only)
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+## Why
+
+When `easy-db-lab server` is running and the underlying AWS infrastructure is torn down (e.g., the VPC is deleted), the server process continues running indefinitely with no meaningful work to do. This wastes resources and leaves the user with a stale, misleading server process, which matters most in automated or unattended scenarios where no human is watching the terminal.
+
+## What Changes
+
+- Add an optional `--auto-shutdown` flag to the `server` command that enables infrastructure watchdog behavior.
+- When enabled, a background service checks whether the cluster's VPC still exists in AWS on each status refresh cycle.
+- If the VPC is no longer found, the server logs a final shutdown event and exits cleanly.
+
+## Capabilities
+
+### New Capabilities
+
+- `server-infra-watchdog`: Background watchdog that monitors AWS infrastructure health (VPC existence) while the server is running and triggers a clean shutdown if the infrastructure is gone.
+
+### Modified Capabilities
+
+- `server`: The `server` command gains a new `--auto-shutdown` flag and a new background service lifecycle hook.
+
+## Impact
+
+- `commands/Server.kt`: new CLI options
+- New background service class (e.g., `InfraWatchdogService`) in `services/` or similar
+- AWS VPC existence check via existing EC2 provider
+- `events/Event.kt`: new domain event for watchdog shutdown
+- `openspec/specs/server/spec.md`: new requirements for auto-shutdown behavior
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# Server Infrastructure Watchdog
+
+A background service that monitors whether the cluster's AWS VPC still exists while the server is running, and triggers a clean shutdown if the infrastructure has been removed.
+
+## Requirements
+
+### Requirement: Watchdog monitors VPC existence
+The watchdog service SHALL periodically check whether the cluster's VPC still exists in AWS and trigger server shutdown when it is no longer found.
+
+#### Scenario: VPC is present
+- **WHEN** the watchdog polls AWS and the cluster VPC is found
+- **THEN** the server continues running normally
+
+#### Scenario: VPC is gone
+- **WHEN** the watchdog polls AWS and the cluster VPC is not found
+- **THEN** the server emits an `Event.Server.InfrastructureGone` event and exits cleanly
+
+#### Scenario: AWS API error during check
+- **WHEN** the watchdog poll encounters an AWS API exception (not a not-found result)
+- **THEN** the exception is logged, the watchdog continues polling on the next interval, and the server does not shut down
+
+### Requirement: VPC ID not available
+The watchdog SHALL gracefully handle the case where no VPC ID is recorded in cluster state.
+
+#### Scenario: No VPC ID in cluster state
+- **WHEN** the watchdog starts and `ClusterState.vpcId` is null or blank
+- **THEN** the watchdog logs a warning and skips all polling without triggering shutdown
+
+### Requirement: Check runs on status refresh cycle
+The VPC existence check SHALL run on the same cadence as the existing status cache refresh (`--refresh` interval).
+
+#### Scenario: Check frequency matches refresh
+- **WHEN** the server is running with `--auto-shutdown` and `--refresh 30`
+- **THEN** the VPC check runs every 30 seconds alongside the status cache refresh
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+## ADDED Requirements
+
+### Requirement: Auto-shutdown CLI option
+The server command SHALL accept an `--auto-shutdown` flag that enables infrastructure watchdog behavior.
+
+#### Scenario: Flag not provided
+- **WHEN** the user starts the server without `--auto-shutdown`
+- **THEN** no watchdog is started and the server runs indefinitely
+
+#### Scenario: Flag provided
+- **WHEN** the user starts the server with `--auto-shutdown`
+- **THEN** the infrastructure watchdog service is started as a background service
+
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+## 1. Events
+
+- [x] 1.1 Add `Event.Server` sealed interface to `Event.kt` with `InfrastructureGone` data class (includes vpcId and a message field)
+
+## 2. StatusCache Integration
+
+- [x] 2.1 Add `autoShutdown: Boolean` and `vpcName: String?` parameters to `StatusCache`
+- [x] 2.2 On each refresh, if `autoShutdown` is true and `vpcName` is non-null, call `findVpcByName()`
+- [x] 2.3 If VPC is not found (null result), emit `Event.Server.InfrastructureGone` and call `exitProcess(0)`
+- [x] 2.4 Handle AWS API exceptions by logging and continuing (no shutdown on exception)
+- [x] 2.5 Skip the check with a logged warning if `vpcName` is null or blank
+
+## 3. Server Command
+
+- [x] 3.1 Add `--auto-shutdown` boolean flag to `Server.kt`
+- [x] 3.2 Pass `autoShutdown` and the cluster VPC name through to `StatusCache`
+
+## 4. Console Output
+
+- [x] 4.1 Add a `ConsoleEventListener` handler for `Event.Server.InfrastructureGone` that prints a message before exit
+
+## 5. Tests
+
+- [x] 5.1 Unit test `StatusCache`: verify it emits `InfrastructureGone` and calls `exitProcess` when VPC not found
+- [x] 5.2 Unit test: verify no shutdown when `findVpcByName` throws an exception
+- [x] 5.3 Unit test: verify check is skipped when `vpcName` is null or blank
+
+## 6. Documentation
+
+- [x] 6.1 Update `docs/` server reference page to document the `--auto-shutdown` option

src/main/kotlin/com/rustyrazorblade/easydblab/commands/Server.kt

Lines changed: 14 additions & 1 deletion
@@ -41,6 +41,12 @@ class Server : PicoBaseCommand() {
     )
     var refreshInterval: Long = Constants.Time.DEFAULT_STATUS_REFRESH_SECONDS
 
+    @Option(
+        names = ["--auto-shutdown"],
+        description = ["Shut down the server if the cluster VPC is removed"],
+    )
+    var autoShutdown: Boolean = false
+
     companion object {
         private val log = KotlinLogging.logger {}
     }
@@ -62,8 +68,15 @@ class Server : PicoBaseCommand() {
 
         log.info { "Starting easy-db-lab server..." }
 
+        val vpcName =
+            if (autoShutdown && clusterStateManager.exists()) {
+                "easy-db-lab-${clusterStateManager.load().name}"
+            } else {
+                null
+            }
+
         try {
-            val server = McpServer(refreshInterval)
+            val server = McpServer(refreshInterval, autoShutdown, vpcName)
             server.start(port, bind) { actualPort ->
                 // Generate .mcp.json with actual port
                 val config =

src/main/kotlin/com/rustyrazorblade/easydblab/commands/cassandra/Restart.kt

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ class Restart : PicoBaseCommand() {
             sidecarService
                 .restart(controlHost)
                 .onFailure { e ->
-                    eventBus.emit(Event.Cassandra.SidecarRestartFailed("", "${e.message}"))
+                    eventBus.emit(Event.Cassandra.SidecarRestartFailed("${e.message}"))
                 }
             eventBus.emit(Event.Cassandra.SidecarRestarted)
         }

src/main/kotlin/com/rustyrazorblade/easydblab/commands/cassandra/Start.kt

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ class Start : PicoBaseCommand() {
             sidecarService
                 .deploy(controlHost, sidecarImage)
                 .onFailure { e ->
-                    eventBus.emit(Event.Cassandra.SidecarStartFailed("", "${e.message}"))
+                    eventBus.emit(Event.Cassandra.SidecarStartFailed("${e.message}"))
                 }
         }
 