Reference ReadyStackGo Health Spec
The health system of ReadyStackGo (RSGO) should:
- Quickly show at Org / Environment / Stack level:
  - Is everything running normally?
  - Is something planned to be restricted (Migration / Maintenance)?
  - Is something unplanned broken?
- Support first-party stacks (ams.project etc.) with deep integration (NServiceBus, custom health endpoints).
- Sensibly aggregate third-party stacks (Keycloak, Postgres, external products), even if they only provide Docker status or generic HTTP health.
Important principle:
RSGO is the source of truth for the operation mode (Normal, Migration, Maintenance). Containers provide technical states (up/down, errors, bus status).
Enum for the technical state:
- `Healthy`
- `Degraded`
- `Unhealthy`
- `Unknown`
Used for:
- Overall status of a stack
- Bus
- Infra
- Self (Container/Services)
Enum for the operation mode of a stack deployment:
- `Normal` → Normal operation, no planned restriction.
- `Migrating` → Planned migration/upgrade running (e.g., DB migrations, stack upgrade).
- `Maintenance` → Stack deliberately put into maintenance mode (planned).
- `Stopped` → Stack intentionally stopped (not available, but no error).
- `Failed` (optional) → Last deploy/upgrade/migration failed, manual intervention required.
Important:
OperationMode is controlled by RSGO, not by the containers.
Additional states in the Deployment domain:
- `DeploymentStatus`: `Idle`, `Deploying`, `Upgrading`, `RollingBack`, `Failed`
- `MigrationStatus`: `None`, `Running`, `Succeeded`, `Failed`
These values help to set OperationMode consistently.
RSGO creates regular health snapshots per Org/Env/Stack:

```csharp
enum HealthStatus { Healthy, Degraded, Unhealthy, Unknown }
enum OperationMode { Normal, Migrating, Maintenance, Stopped, Failed }

class HealthSnapshot {
    Guid OrgId;
    Guid EnvironmentId;
    Guid StackId;
    DateTime CapturedAtUtc;
    HealthStatus Overall;
    OperationMode OperationMode;
    string? TargetVersion;  // e.g., "0.5.0" during upgrade
    string? CurrentVersion; // e.g., "0.4.2"
    BusHealth? Bus;
    InfraHealth? Infra;
    SelfHealth Self;
}
```

Primarily used for NServiceBus-based stacks (first-party or other NSB apps):

```csharp
class BusHealth {
    HealthStatus Status;           // Healthy/Degraded/Unhealthy/Unknown
    string? TransportKey;          // e.g., "primary-sql"
    bool HasCriticalError;
    string? CriticalErrorMessage;
    DateTime? LastHealthPingProcessedUtc;
    TimeSpan? TimeSinceLastPing;
    TimeSpan? UnhealthyAfter;      // Config value: the ping age beyond which it counts as "too old"
    // Optional: affected endpoints
    IReadOnlyList<BusEndpointHealth> Endpoints;
}

class BusEndpointHealth {
    string EndpointName;
    HealthStatus Status;
    DateTime? LastPingUtc;
    string? Reason;                // "NoMessagesRecently", "HeartbeatMissing", etc.
}
```

Generic infrastructure checks (optional per stack):

```csharp
class InfraHealth {
    IReadOnlyList<DatabaseHealth> Databases;
    IReadOnlyList<DiskHealth> Disks;
    IReadOnlyList<ExternalServiceHealth> ExternalServices;
}

class DatabaseHealth {
    string Id;            // e.g., "ams_project_db"
    HealthStatus Status;
    int? LatencyMs;
    string? Error;
}

class DiskHealth {
    string Mount;         // e.g., "/"
    HealthStatus Status;
    double? FreePercent;
    string? Error;
}

class ExternalServiceHealth {
    string Id;            // e.g., "smtp"
    HealthStatus Status;
    string? Error;
}
```

State of the containers/services controlled by the stack:

```csharp
class SelfHealth {
    IReadOnlyList<ServiceHealth> Services;
}

class ServiceHealth {
    string Name;          // e.g., "ams-api"
    HealthStatus Status;  // From container/health endpoint perspective
    string? ContainerId;
    string? Reason;       // "Restarting", "CrashLoop", "HealthCheckFailed"
    int? RestartCount;
}
```

- RSGO connects to the Environment host via Docker API:
  - Status of containers (running, exited, restarting)
  - Restart counts, exit codes
- Optional: HTTP health URL from manifest:

```yaml
services:
  - name: ams-api
    image: ...
    health:
      type: http
      url: http://ams-api:8080/health
      timeout: 5s
```

RSGO polls these endpoints and sets `ServiceHealth.Status` accordingly.
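
How the Docker state and the optional HTTP probe combine into a `ServiceHealth.Status` is not fixed by this spec. The following is a minimal sketch under stated assumptions: `ServiceHealthMapper` and its parameters are hypothetical names, the mapping choices (restarting → `Degraded`, exited → `Unhealthy`) are one plausible policy, and the `"Exited"` reason string is invented for illustration.

```csharp
using System;

public enum HealthStatus { Healthy, Degraded, Unhealthy, Unknown }

// Reduced version of ServiceHealth, just enough for the sketch.
public sealed record ServiceHealth(string Name, HealthStatus Status, string? Reason, int? RestartCount);

public static class ServiceHealthMapper
{
    // containerState: the Docker container state ("running", "exited", "restarting").
    // httpHealthy: null if the manifest defines no health URL, otherwise the probe result.
    public static ServiceHealth Map(string name, string? containerState, int? restartCount, bool? httpHealthy)
    {
        switch (containerState)
        {
            case "running":
                // A running container only counts as unhealthy if a configured probe fails.
                return httpHealthy == false
                    ? new ServiceHealth(name, HealthStatus.Unhealthy, "HealthCheckFailed", restartCount)
                    : new ServiceHealth(name, HealthStatus.Healthy, null, restartCount);
            case "restarting":
                return new ServiceHealth(name, HealthStatus.Degraded, "Restarting", restartCount);
            case "exited":
                // "Exited" is a hypothetical reason string, not one listed in the spec.
                return new ServiceHealth(name, HealthStatus.Unhealthy, "Exited", restartCount);
            default:
                // No usable Docker information: report Unknown instead of guessing.
                return new ServiceHealth(name, HealthStatus.Unknown, null, restartCount);
        }
    }
}
```

Keeping the mapping a pure function makes it testable without a Docker host; the polling loop would only gather the inputs.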
For NServiceBus-based applications:
- Each endpoint uses your package `Wiesenwischer.NServiceBus.EndpointHealth` and an ASP.NET health endpoint (`/health`).
- This provides:
  - MessagePump status
  - `HasCriticalError`
  - `LastHealthPingProcessedUtc`
  - `TransportKey`
RSGO:
- knows from the manifest which services are endpoints and where their health URLs are
- retrieves these health endpoints
- aggregates bus status per `TransportKey` and stack.
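
The aggregation per `TransportKey` could look roughly like the sketch below. It is an illustration, not the actual RSGO implementation: `EndpointReport` and `BusAggregator` are hypothetical names, and two policies are assumed, namely that a ping older than `UnhealthyAfter` makes an endpoint `Unhealthy` and that the worst endpoint status wins per transport.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum HealthStatus { Healthy, Degraded, Unhealthy, Unknown }

// Hypothetical per-endpoint report, shaped after the BusHealth fields.
public sealed record EndpointReport(
    string EndpointName, string TransportKey, bool HasCriticalError,
    DateTime? LastHealthPingProcessedUtc);

public static class BusAggregator
{
    // Status of one endpoint: a critical error or a stale ping means Unhealthy.
    public static HealthStatus Evaluate(EndpointReport report, DateTime nowUtc, TimeSpan unhealthyAfter)
    {
        if (report.HasCriticalError) return HealthStatus.Unhealthy;
        if (report.LastHealthPingProcessedUtc is null) return HealthStatus.Unknown;
        var timeSinceLastPing = nowUtc - report.LastHealthPingProcessedUtc.Value;
        return timeSinceLastPing > unhealthyAfter ? HealthStatus.Unhealthy : HealthStatus.Healthy;
    }

    // Groups endpoints by TransportKey and reports the worst status per group.
    public static Dictionary<string, HealthStatus> ByTransportKey(
        IEnumerable<EndpointReport> reports, DateTime nowUtc, TimeSpan unhealthyAfter)
    {
        return reports
            .GroupBy(r => r.TransportKey)
            .ToDictionary(
                g => g.Key,
                g => Worst(g.Select(r => Evaluate(r, nowUtc, unhealthyAfter))));
    }

    private static HealthStatus Worst(IEnumerable<HealthStatus> statuses)
    {
        var known = statuses.Where(s => s != HealthStatus.Unknown).ToList();
        if (known.Count == 0) return HealthStatus.Unknown;
        if (known.Contains(HealthStatus.Unhealthy)) return HealthStatus.Unhealthy;
        if (known.Contains(HealthStatus.Degraded)) return HealthStatus.Degraded;
        return HealthStatus.Healthy;
    }
}
```

Passing `nowUtc` in explicitly keeps the staleness check deterministic in tests.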
For non-NSB stacks:
- `BusHealth` can simply be `null` or `Status=Unknown`
- or you allow generic "messaging checks" via manifest (e.g., RabbitMQ HTTP API), but this is optional/"Advanced".
- For first-party stacks, the manifest can define DB/service checks:

```yaml
infra:
  databases:
    - id: ams_project_db
      connectionParam: DB_MAIN
  externalServices:
    - id: smtp
      url: smtp.example.com:587
```

- RSGO can:
  - make short connection pings to DBs (configurable)
  - query free disk space via agent/host metrics
- For third-party stacks:
  - `InfraHealth` can remain empty or contain only generic checks (e.g., "DB container is running").
Pseudo-rule:
- If `OperationMode == Migrating` or `Maintenance` or `Stopped`:
  - `overall` is at least `Degraded`
  - real errors can elevate `overall` to `Unhealthy` (e.g., migration failed).
- If `OperationMode == Normal`:
  - `overall` is the "maximum"/"worst" of Bus/Infra/Self
    - if one is `Unhealthy` → `overall = Unhealthy`
    - otherwise if one is `Degraded` → `overall = Degraded`
    - otherwise `overall = Healthy`
- If no data available:
  - `overall = Unknown`
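
The pseudo-rule above can be sketched as a single function. The spec leaves some points open, so this sketch makes assumptions: `OverallHealth.Aggregate` and the `hasRealError` flag are hypothetical names, the planned modes are checked before the "no data" case, and `Failed` is mapped to `Unhealthy` because the rule does not cover it.

```csharp
using System;
using System.Linq;

public enum HealthStatus { Healthy, Degraded, Unhealthy, Unknown }
public enum OperationMode { Normal, Migrating, Maintenance, Stopped, Failed }

public static class OverallHealth
{
    // Ranking used for the "worst of" comparison; Unknown is filtered out beforehand.
    private static int Severity(HealthStatus s) => s switch
    {
        HealthStatus.Healthy => 0,
        HealthStatus.Degraded => 1,
        HealthStatus.Unhealthy => 2,
        _ => -1
    };

    // bus/infra/self are null when the stack does not report that component.
    // hasRealError marks an actual failure during a planned mode (e.g. migration failed).
    public static HealthStatus Aggregate(
        OperationMode mode, HealthStatus? bus, HealthStatus? infra, HealthStatus? self,
        bool hasRealError = false)
    {
        // Assumption: Failed is not covered by the pseudo-rule; treated as Unhealthy here.
        if (mode == OperationMode.Failed) return HealthStatus.Unhealthy;

        // Planned restriction: at least Degraded, elevated to Unhealthy on a real error.
        if (mode is OperationMode.Migrating or OperationMode.Maintenance or OperationMode.Stopped)
            return hasRealError ? HealthStatus.Unhealthy : HealthStatus.Degraded;

        // Normal mode: worst of the components that actually reported something.
        var known = new[] { bus, infra, self }
            .OfType<HealthStatus>()                      // drops null entries
            .Where(s => s != HealthStatus.Unknown)
            .ToList();
        if (known.Count == 0) return HealthStatus.Unknown;
        return known.OrderByDescending(s => Severity(s)).First();
    }
}
```

For example, `Aggregate(OperationMode.Migrating, HealthStatus.Unhealthy, null, HealthStatus.Degraded)` yields `Degraded`, which matches the `ams-project` row in the stack table further down.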
- During Deploy/Upgrade/Migration:
  - `OperationMode = Migrating`
  - `DeploymentStatus = Upgrading`
  - `MigrationStatus = Running`
- During planned maintenance:
  - `OperationMode = Maintenance`
- During deliberate stop:
  - `OperationMode = Stopped`
- On failed upgrade/migration:
  - `OperationMode = Failed`
  - `DeploymentStatus = Failed`
  - `MigrationStatus = Failed`
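
One way to keep `OperationMode` consistent with `DeploymentStatus` and `MigrationStatus` is to derive it in one place. The sketch below is an assumption-laden illustration: `OperationModeResolver` is a hypothetical name, `requested` stands for the mode an admin last chose (Normal, Maintenance or Stopped), and treating `Deploying` and `RollingBack` like `Upgrading` is a guess that the spec does not state.

```csharp
using System;

public enum OperationMode { Normal, Migrating, Maintenance, Stopped, Failed }
public enum DeploymentStatus { Idle, Deploying, Upgrading, RollingBack, Failed }
public enum MigrationStatus { None, Running, Succeeded, Failed }

public static class OperationModeResolver
{
    public static OperationMode Resolve(
        OperationMode requested, DeploymentStatus deployment, MigrationStatus migration)
    {
        // A failed deploy or migration always wins: manual intervention is required.
        if (deployment == DeploymentStatus.Failed || migration == MigrationStatus.Failed)
            return OperationMode.Failed;

        // Any active deployment step or running migration counts as a planned restriction.
        if (deployment != DeploymentStatus.Idle || migration == MigrationStatus.Running)
            return OperationMode.Migrating;

        // Nothing in progress: fall back to the mode the admin asked for.
        return requested;
    }
}
```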
The health engine always reads OperationMode first and interprets container states in context:
- `OperationMode=Normal` + many broken containers → real incident.
- `OperationMode=Migrating` + services down → expected restriction (`Degraded`).
When switching to maintenance mode, containers are automatically stopped:
- Normal → Maintenance: All stack containers are stopped
- Maintenance → Normal: All stack containers are started

Exception: Containers with the label `rsgo.maintenance=ignore` are not stopped/started.
This is useful for infrastructure containers (e.g., databases) that should continue running during maintenance.

Example in `docker-compose.yml`:

```yaml
services:
  postgres:
    image: postgres:16
    labels:
      rsgo.stack: my-app
      rsgo.maintenance: ignore  # Container stays active during Maintenance
```

Container lifecycle management is done via the Docker API and coordinated by the `ChangeOperationModeHandler`.
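
The label exception boils down to a filter over the stack's containers. The following is only a sketch of that selection step, not the actual `ChangeOperationModeHandler`: `ContainerInfo` and `MaintenanceSelector` are hypothetical names, and the real handler would pass the result on to the Docker API to stop or start the containers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal view of a container: its name plus its Docker labels.
public sealed record ContainerInfo(string Name, IReadOnlyDictionary<string, string> Labels);

public static class MaintenanceSelector
{
    public const string MaintenanceLabel = "rsgo.maintenance";
    public const string IgnoreValue = "ignore";

    // Returns the containers that are stopped when entering Maintenance
    // and started again when returning to Normal.
    public static IReadOnlyList<ContainerInfo> ContainersToToggle(IEnumerable<ContainerInfo> stackContainers)
    {
        return stackContainers
            .Where(c => !(c.Labels.TryGetValue(MaintenanceLabel, out var value) && value == IgnoreValue))
            .ToList();
    }
}
```

Using the same selection for both directions ensures that a container left running during maintenance is not restarted on the way back.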
- `GET /api/orgs/{orgId}/envs/{envId}/stacks/{stackId}/health`
  Response: `HealthSnapshot` as JSON
- `GET /api/orgs/{orgId}/envs/{envId}/health-summary`
  Response: List of all stacks + `overall` + `operationMode`
- Optional: `GET /api/orgs/{orgId}/envs/{envId}/stacks/{stackId}/health/history`
  → Last X snapshots to see trends/outages.
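
The `health-summary` response could be built by reducing stored snapshots to one entry per stack. This is a sketch under assumptions: `Snapshot`, `StackSummary` and `HealthSummary.Build` are hypothetical names, and it assumes the summary reflects the most recent snapshot of each stack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum HealthStatus { Healthy, Degraded, Unhealthy, Unknown }
public enum OperationMode { Normal, Migrating, Maintenance, Stopped, Failed }

// Reduced snapshot and summary shapes, just enough for the sketch.
public sealed record Snapshot(Guid StackId, DateTime CapturedAtUtc, HealthStatus Overall, OperationMode OperationMode);
public sealed record StackSummary(Guid StackId, HealthStatus Overall, OperationMode OperationMode);

public static class HealthSummary
{
    // Picks the newest snapshot per stack and projects it to the summary fields.
    public static List<StackSummary> Build(IEnumerable<Snapshot> snapshots)
    {
        return snapshots
            .GroupBy(s => s.StackId)
            .Select(g => g.OrderByDescending(s => s.CapturedAtUtc).First())
            .Select(s => new StackSummary(s.StackId, s.Overall, s.OperationMode))
            .ToList();
    }
}
```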
Example display for an organization:
- Org A
  - `Test` – 🟢 Healthy (2 stacks, all Healthy)
  - `Prod` – 🟠 Degraded – Migration running (ams-project 0.4.2 → 0.5.0)
Table:
| Stack | Overall | Mode | Bus | Infra | Self |
|---|---|---|---|---|---|
| identity | Healthy | Normal | Healthy | – | 3/3 Services ok |
| ams-project | Degraded (Migration 0.5.0) | Migrating | Unhealthy | DB: ERP busy | 9/10 Services ok |
| monitoring | Healthy | Normal | – | – | 2/2 Services ok |
Header:
🟠 Degraded – Migration running (0.4.2 → 0.5.0) Step 2/4: Database migration.
Below that, tabs:
- Overview (Bus/Infra/Self summarized)
- Services (Container status, Health endpoints)
- Bus (NSB endpoints & TransportKeys)
- Infra (DB, Disk, external services)
For external stacks that are just "regular" containers:
- Manifest contains only services with image/ports etc.
- No special health URLs, no NSB, no DB checks.
- RSGO can still:
  - Query container state (running, restarting, exited)
  - Derive `SelfHealth` from it
  - Calculate `overall` (from Self + OperationMode)

→ For these stacks, RSGO at least shows:
- "Are the containers running?"
- "Are we in Maintenance/Stopped?"
- Possibly `Unknown`, if no info available.
Many products already have HTTP health:
- Keycloak: e.g., `/health` or product-specific endpoints
- Databases via Admin API
- External web apps with `/health` or `/status`
In the manifest, you can also define health blocks for third-party services:

```yaml
services:
  - name: keycloak
    image: quay.io/keycloak/keycloak:latest
    health:
      type: http
      url: http://keycloak:8080/health
      timeout: 5s
```

RSGO treats them the same as your own services, just without NSB/bus specifics.
Since RSGO does not migrate external products itself (e.g., complex SAP/ERP applications):
- `OperationMode = Migrating` is usually only set if:
  - RSGO has its own upgrade/migration workflow for this stack
  - or the admin manually puts the stack into a maintenance/migration mode.
You can offer for example:
- Button: "Put Stack in Maintenance"
  - `OperationMode = Maintenance`
  - optional: RSGO automatically stops certain services
This way you can also clearly signal for external stacks:
- "Degraded (Maintenance)" instead of "Unhealthy".
For important external products, you can later offer "adapter manifest extensions":
- Special health definitions:
  - `kind: "keycloak"` → RSGO knows where to check sensibly.
  - `kind: "prometheus"` → specific status endpoints.
- Possibly also simple lifecycle commands:
- e.g., "Reload config", "Restart gracefully".
But this is not necessary for the first version – basic health functionality already works with:
- Docker status
- optional HTTP health URLs
- RSGO controls the operation mode (`OperationMode`):
  - Normal, Migrating, Maintenance, Stopped, Failed
- Containers/Services provide technical health data:
  - via Docker status, HTTP health, NSB EndpointHealth, etc.
- Overall health is a combination of:
  - OperationMode + Bus + Infra + Self
- Third-party stacks:
  - work at least with Docker status
  - can be integrated via optional HTTP health
  - can be manually put into Maintenance
- First-party stacks:
  - additionally use BusHealth (NServiceBus + WatchDog)
  - can run migrations/upgrades fully automated in RSGO mode "Migrating".
This concept allows:
- a simple, clear UX ("green/yellow/red + Migration/Maintenance"),
- clean integration of your existing EndpointHealth/WatchDog ideas,
- and a sensible health display even for stacks that don't come from you.