Health check endpoint too simplistic - doesn't verify critical dependencies #9

@Evrard-Nil

Problem

The health check endpoint (likely GET / and GET /version) only returns a static response without verifying that critical components are actually functional:

  • Backend connectivity not checked
  • Signing keys not verified
  • Cache not tested
  • dstack KMS not validated

This means the health check can return "healthy" even when the service cannot process requests.

Impact

High - Load balancers and orchestrators (Kubernetes, etc.) rely on health checks. A false positive can route traffic to a broken instance.

Solution

1. Add comprehensive health endpoint

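// src/routes/health.rs

// Imports assume axum, serde, and tokio, as the handler signature suggests.
// AppState, SigningPair, and ChatCache are expected to come from this crate.
use std::sync::Arc;
use std::time::{Duration, Instant};

use axum::{extract::State, http::StatusCode, response::IntoResponse, Json};
use serde::Serialize;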

#[derive(Serialize)]
struct HealthStatus {
    status: String,
    version: String,
    checks: HealthChecks,
}

#[derive(Serialize)]
struct HealthChecks {
    signing: CheckResult,
    backend: CheckResult,
    cache: CheckResult,
}

#[derive(Serialize)]
struct CheckResult {
    status: String,
    message: Option<String>,
    latency_ms: Option<u64>,
}

pub async fn health_check(State(state): State<AppState>) -> impl IntoResponse {
    // No `mut` needed: the checks are never modified after construction.
    let checks = HealthChecks {
        signing: check_signing(&state.signing).await,
        backend: check_backend(&state).await,
        cache: check_cache(&state.cache).await,
    };
    
    let all_healthy = checks.signing.status == "healthy"
        && checks.backend.status == "healthy"
        && checks.cache.status == "healthy";
    
    let status_code = if all_healthy {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    
    (status_code, Json(HealthStatus {
        // `status` is a String, so the &str literal must be converted.
        status: String::from(if all_healthy { "healthy" } else { "unhealthy" }),
        version: state.config.git_rev.clone(),
        checks,
    }))
}

async fn check_signing(signing: &Arc<SigningPair>) -> CheckResult {
    let start = Instant::now();
    match signing.sign_chat("health-check") {
        Ok(_) => CheckResult {
            status: "healthy".to_string(),
            message: None,
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        Err(e) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(e.to_string()),
            latency_ms: None,
        },
    }
}

async fn check_backend(state: &AppState) -> CheckResult {
    let start = Instant::now();
    // Hard-coded for now; the Configuration section proposes HEALTH_CHECK_TIMEOUT_SECS.
    let timeout = Duration::from_secs(2);
    
    match tokio::time::timeout(timeout, state.http_client.get(&state.config.models_url).send()).await {
        Ok(Ok(resp)) if resp.status().is_success() => CheckResult {
            status: "healthy".to_string(),
            message: None,
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
        Ok(Ok(resp)) => CheckResult {
            status: "degraded".to_string(),
            message: Some(format!("Backend returned {}", resp.status())),
            latency_ms: Some(start.elapsed().as_millis() as u64),
        },
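        // The request completed with a transport error.
        Ok(Err(e)) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(format!("Backend unreachable: {}", e)),
            latency_ms: None,
        },
        // The outer Err means the timeout elapsed before the request finished.
        // It needs its own arm because `e` is not bound in this pattern.
        Err(_) => CheckResult {
            status: "unhealthy".to_string(),
            message: Some(format!("Backend timed out after {}s", timeout.as_secs())),
            latency_ms: None,
        },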
    }
}

async fn check_cache(cache: &Arc<ChatCache>) -> CheckResult {
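    let start = Instant::now();
    // Sentinel entry: written and read back to prove the cache round-trips.
    let test_key = "__health_check__";
    let test_value = "ok";
    
    cache.set_chat(test_key, test_value);
    let retrieved = cache.get_chat(test_key);
    
    if retrieved.as_deref() == Some(test_value) {
        CheckResult {
            status: "healthy".to_string(),
            message: None,
            // Measure the round trip instead of reporting a fixed 1 ms.
            latency_ms: Some(start.elapsed().as_millis() as u64),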
        }
    } else {
        CheckResult {
            status: "unhealthy".to_string(),
            message: Some("Cache read/write failed".to_string()),
            latency_ms: None,
        }
    }
}

2. Add separate liveness and readiness probes

  • GET /health/live - Liveness (is the process running?)
  • GET /health/ready - Readiness (can it serve traffic?)
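
A minimal sketch of how the two probes could differ, written against the standard library only so it does not depend on the web framework. The names `CheckStatus`, `liveness_code`, and `readiness_code` are hypothetical, and counting `Degraded` as not ready is an assumption that mirrors the `all_healthy` comparison in the handler above, not a settled design decision.

```rust
/// Outcome of one dependency check (hypothetical type; the proposal uses strings).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CheckStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Liveness: the process is up and able to answer, so this is always 200.
/// No dependencies are consulted here.
pub fn liveness_code() -> u16 {
    200
}

/// Readiness: 200 only if every dependency check is healthy, otherwise 503.
/// Assumption: `Degraded` counts as not ready.
pub fn readiness_code(checks: &[CheckStatus]) -> u16 {
    if checks.iter().all(|s| *s == CheckStatus::Healthy) {
        200
    } else {
        503
    }
}
```

Keeping liveness free of dependency checks matters in Kubernetes: a failing liveness probe restarts the container, while a failing readiness probe only removes the pod from Service endpoints. A backend outage should therefore fail readiness, not liveness.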

Configuration

Add environment variable to configure health check timeout:

  • HEALTH_CHECK_TIMEOUT_SECS (default: 2)
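
One possible way to read this variable, as a sketch using only the standard library. The function names and the constant are hypothetical, and falling back to the default for a missing, non-numeric, or zero value is an assumption rather than part of the proposal.

```rust
use std::time::Duration;

/// Default from the proposal: 2 seconds.
const DEFAULT_HEALTH_CHECK_TIMEOUT_SECS: u64 = 2;

/// Parse the raw variable value. Falls back to the default when the value
/// is missing, not a non-negative integer, or zero.
pub fn parse_health_check_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_HEALTH_CHECK_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read HEALTH_CHECK_TIMEOUT_SECS from the process environment.
pub fn health_check_timeout_from_env() -> Duration {
    let raw = std::env::var("HEALTH_CHECK_TIMEOUT_SECS").ok();
    parse_health_check_timeout(raw.as_deref())
}
```

The resulting `Duration` could replace the hard-coded `Duration::from_secs(2)` in `check_backend`, ideally read once at startup and stored in the config rather than on every probe.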

File Location

src/routes/health.rs

Labels

  • P2: Medium (Medium priority - fix when possible)
  • enhancement (New feature or request)