Health Check Endpoint Should Not Depend on Event Processing Activity #918

@rob-ellison-jet

Description

Summary

The current health check endpoint for the Snowplow Enrich image determines service health based on whether events have been processed within a configured time window. This creates false negatives in non-production environments where event traffic is sparse or non-existent, causing healthy services to appear unhealthy.

Current Behavior

The health check endpoint monitors the timestamp of the last processed event and marks the service as unhealthy if no events have been processed within the configured time threshold. This approach has the following issues:

  • False negatives in low-traffic environments: Non-production environments (dev, staging, QA) often have little to no event traffic, causing healthy services to fail health checks
  • Orchestration problems: Kubernetes/container orchestrators may unnecessarily restart healthy pods that aren't processing events simply due to lack of traffic
  • Difficult testing: Makes it challenging to validate deployment health without generating synthetic event traffic
  • Masks real issues: When health checks fail due to lack of traffic, it becomes harder to identify actual service degradation

Proposed Solution

Replace or supplement the event-processing-based health check with checks that verify the service's readiness to process events rather than whether it has processed events recently. The health check should validate:

  1. Message Broker Connectivity: Verify connection to the input source (Kafka, Kinesis, PubSub, NSQ)

    • Can the service successfully poll/consume from the input topic/stream?
    • Are consumer group assignments healthy?
    • Is the connection authenticated and authorized?
  2. Enrichment Dependencies: Verify that all configured enrichments can be initialized and are operational

    • Database connections (if applicable)
    • External API availability (for API-based enrichments)
    • Local file resources (MaxMind databases, custom enrichment files)
  3. Output Sink Connectivity: Verify connection to output destinations

    • Can the service write to good/bad event topics/streams?
    • Are credentials valid and permissions sufficient?
  4. JVM Health: Basic application health indicators

    • Service is running and accepting requests
    • No critical errors in initialization
    • Sufficient memory/resources available
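Taken together, these checks could be aggregated into a single readiness result: the service is ready only when every dependency check passes. A minimal sketch in Python (the check names and lambdas are hypothetical stand-ins, not part of the Enrich codebase):

```python
from typing import Callable, Dict


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named readiness check and aggregate the results.

    A single failing check marks the whole service as not ready;
    a check that raises counts as a failure rather than crashing
    the probe endpoint itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return {"ready": all(results.values()), "checks": results}


# Hypothetical checks standing in for broker, enrichment, and sink probes.
checks = {
    "broker_connectivity": lambda: True,   # e.g. consumer poll succeeds
    "enrichment_deps": lambda: True,       # e.g. MaxMind DB file is readable
    "sink_connectivity": lambda: False,    # e.g. output stream write denied
}
print(evaluate_readiness(checks)["ready"])  # one failing check => not ready
```

The point of the aggregation is that a readiness failure names the broken dependency, which addresses the "masks real issues" problem above.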

Implementation Approach

Option 1 (Recommended): Dual Health Check Strategy

  • /health/liveness - Basic JVM health (is the process running?)
  • /health/readiness - Comprehensive connectivity and dependency checks (can it process events?)
  • /health/processing - Current behavior (is it actively processing events?) - useful for monitoring but not for container orchestration
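Routed as three separate endpoints, the split might behave like the following sketch. The paths come from the proposal above; the function shape, parameters, and the 300-second default threshold are illustrative, not Enrich's actual HTTP layer:

```python
def route_health(path: str, last_event_ts: float, ready: bool,
                 now: float, threshold_s: float = 300.0) -> int:
    """Map a health endpoint path to an HTTP status code.

    Liveness only proves the process answers requests; readiness
    reflects dependency checks; the processing endpoint keeps the
    current traffic-based behaviour for monitoring, not orchestration.
    """
    if path == "/health/liveness":
        return 200                        # process is up and serving
    if path == "/health/readiness":
        return 200 if ready else 503      # can it process events?
    if path == "/health/processing":
        stale = (now - last_event_ts) > threshold_s
        return 503 if stale else 200      # is it actively processing?
    return 404
```

Wiring the orchestrator's probes to `/health/readiness` while pointing dashboards at `/health/processing` keeps restarts decoupled from traffic volume.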

Option 2: Configurable Health Check Mode

  • Add configuration to switch between "connectivity-based" and "processing-based" health checks
  • Default to connectivity-based for better operator experience
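The mode switch could be a single configuration value consulted when building the probe. A sketch, with the `healthcheck.mode` key name invented for illustration:

```python
def select_health_mode(config: dict) -> str:
    """Pick the health check strategy from configuration.

    Defaults to "connectivity" so low-traffic environments get
    sensible behaviour out of the box; operators who want the old
    behaviour can opt back into "processing".
    """
    mode = config.get("healthcheck.mode", "connectivity")
    if mode not in ("connectivity", "processing"):
        raise ValueError(f"unknown healthcheck.mode: {mode!r}")
    return mode
```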

Benefits

  • ✅ Health checks pass in low-traffic environments when service is genuinely healthy
  • ✅ Container orchestrators can make better decisions about pod lifecycle
  • ✅ Easier to validate deployments without synthetic traffic
  • ✅ More accurate representation of service health
  • ✅ Better separation of concerns (health vs. monitoring metrics)

Additional Context

The current processing-based metric is still valuable for monitoring and alerting purposes but should not be the primary determinant of service health for container orchestration. This metric should be exposed via metrics endpoints (Prometheus, etc.) rather than the health check endpoint.
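For example, the time since the last processed event could be published as a gauge in Prometheus text exposition format rather than folded into the health endpoint. The metric name below is a suggestion, not an existing Enrich metric:

```python
def seconds_since_last_event_metric(last_event_ts: float, now: float) -> str:
    """Render the processing-staleness signal as a Prometheus gauge.

    Alerting rules can then decide per environment what "too stale"
    means, instead of the health check hard-coding a threshold.
    """
    value = max(0.0, now - last_event_ts)
    return (
        "# HELP enrich_seconds_since_last_event "
        "Seconds since the last event was processed.\n"
        "# TYPE enrich_seconds_since_last_event gauge\n"
        f"enrich_seconds_since_last_event {value}\n"
    )
```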
