Open
Labels
P2: Medium (medium priority - fix when possible), enhancement (new feature or request)
Description
Problem
The health check endpoint (likely GET / and GET /version) only returns a static response without verifying that critical components are actually functional:
- Backend connectivity not checked
- Signing keys not verified
- Cache not tested
- dstack KMS not validated
This means the health check can return "healthy" even when the service cannot process requests.
Impact
High - Load balancers and orchestrators (Kubernetes, etc.) rely on health checks. A false positive can route traffic to a broken instance.
Solution
1. Add comprehensive health endpoint
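// src/routes/health.rs
// Assumes an axum application. AppState, SigningPair, and ChatCache are crate
// types defined elsewhere; their import paths are omitted here.
use std::sync::Arc;
use std::time::{Duration, Instant};

use axum::{extract::State, http::StatusCode, response::IntoResponse, Json};
use serde::Serialize;

/// Top-level body returned by the health endpoint.
#[derive(Serialize)]
struct HealthStatus {
    status: String,
    version: String,
    checks: HealthChecks,
}

/// One result per critical component.
#[derive(Serialize)]
struct HealthChecks {
    signing: CheckResult,
    backend: CheckResult,
    cache: CheckResult,
}

/// Outcome of a single check: "healthy", "degraded", or "unhealthy".
#[derive(Serialize)]
struct CheckResult {
    status: String,
    message: Option<String>,
    latency_ms: Option<u64>,
}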
pub async fn health_check(State(state): State<AppState>) -> impl IntoResponse {
    let checks = HealthChecks {
        signing: check_signing(&state.signing).await,
        backend: check_backend(&state).await,
        cache: check_cache(&state.cache).await,
    };

    // Any check that is not "healthy" (including "degraded") fails the endpoint.
    let all_healthy = checks.signing.status == "healthy"
        && checks.backend.status == "healthy"
        && checks.cache.status == "healthy";

    let status_code = if all_healthy {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    let status = if all_healthy { "healthy" } else { "unhealthy" };

    (
        status_code,
        Json(HealthStatus {
            // The field is a String, so the &str literal must be converted.
            status: status.to_string(),
            version: state.config.git_rev.clone(),
            checks,
        }),
    )
}
/// Signs a fixed payload to confirm the signing keys are loaded and usable.
async fn check_signing(signing: &Arc<SigningPair>) -> CheckResult {
    let start = Instant::now();
    match signing.sign_chat("health-check") {
        Ok(_) => CheckResult {
            status: "healthy".to_string(),
            message: None,
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        Err(e) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(e.to_string()),
            latency_ms: None,
        },
    }
}
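/// Requests the backend models URL, bounded by a timeout.
async fn check_backend(state: &AppState) -> CheckResult {
    let start = Instant::now();
    // Hardcoded here; the Configuration section proposes HEALTH_CHECK_TIMEOUT_SECS for this value.
    let timeout = Duration::from_secs(2);
    let request = state.http_client.get(&state.config.models_url).send();
    match tokio::time::timeout(timeout, request).await {
        Ok(Ok(resp)) if resp.status().is_success() => CheckResult {
            status: "healthy".to_string(),
            message: None,
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        // The backend answered, but with a non-2xx status.
        Ok(Ok(resp)) => CheckResult {
            status: "degraded".to_string(),
            message: Some(format!("Backend returned {}", resp.status())),
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        // The request itself failed (connection refused, DNS, TLS, ...).
        Ok(Err(e)) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(format!("Backend unreachable: {}", e)),
            latency_ms: None,
        },
        // The timeout elapsed first. This arm is separate because no error
        // value `e` exists here, so it cannot share a pattern with the arm above.
        Err(_) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(format!("Backend timed out after {}s", timeout.as_secs())),
            latency_ms: None,
        },
    }
}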
/// Writes a sentinel entry and reads it back to confirm the cache round-trips.
async fn check_cache(cache: &Arc<ChatCache>) -> CheckResult {
    let start = Instant::now();
    let test_key = "__health_check__";
    let test_value = "ok";
    cache.set_chat(test_key, test_value);
    let retrieved = cache.get_chat(test_key);
    if retrieved.as_deref() == Some(test_value) {
        CheckResult {
            status: "healthy".to_string(),
            message: None,
            // Report the measured round-trip time rather than a fixed value.
            latency_ms: Some(start.elapsed().as_millis() as u64),
        }
    } else {
        CheckResult {
            status: "unhealthy".to_string(),
            message: Some("Cache read/write failed".to_string()),
            latency_ms: None,
        }
    }
}
2. Add separate liveness and readiness probes
- GET /health/live - Liveness (is the process running?)
- GET /health/ready - Readiness (can it serve traffic?)
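The difference between the two probes can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the `Status` enum and the `overall`, `liveness_code`, and `readiness_code` functions are hypothetical names, and it assumes (as the proposed handler does) that any check other than "healthy" should take the instance out of rotation.

```rust
/// Outcome of a single dependency check, ordered from best to worst.
/// Hypothetical type: the proposal itself stores the status as a String.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Healthy,
    Degraded,
    Unhealthy,
}

/// The overall status is the worst individual status.
/// With no checks configured, the result is Healthy.
pub fn overall(checks: &[Status]) -> Status {
    checks.iter().copied().max().unwrap_or(Status::Healthy)
}

/// Liveness: if the process can answer at all, it is alive.
/// No dependency is consulted, so a backend outage never triggers a restart.
pub fn liveness_code() -> u16 {
    200
}

/// Readiness: 200 only when every check is healthy, otherwise 503,
/// so the orchestrator stops routing traffic to this instance.
pub fn readiness_code(checks: &[Status]) -> u16 {
    if overall(checks) == Status::Healthy {
        200
    } else {
        503
    }
}
```

Keeping liveness free of dependency checks means a failing backend makes the instance unready without causing the orchestrator to restart a process that is otherwise fine.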
Configuration
Add environment variable to configure health check timeout:
- HEALTH_CHECK_TIMEOUT_SECS (default: 2)
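One way to read this variable is sketched below, using only the standard library. The function names `parse_timeout` and `health_check_timeout` are hypothetical, and falling back to the default for a missing, unparsable, or zero value is an assumption rather than something the issue specifies.

```rust
use std::time::Duration;

/// Default from the issue text: 2 seconds.
const DEFAULT_TIMEOUT_SECS: u64 = 2;

/// Turns the raw variable value into a Duration.
/// Assumption: missing, non-numeric, or zero values fall back to the default.
pub fn parse_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Reads HEALTH_CHECK_TIMEOUT_SECS from the process environment.
pub fn health_check_timeout() -> Duration {
    parse_timeout(std::env::var("HEALTH_CHECK_TIMEOUT_SECS").ok().as_deref())
}
```

Separating the parsing from the environment lookup keeps the fallback rules testable without mutating process-wide state.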
File Location
src/routes/health.rs