Description
Summary
The current health check endpoint for the Snowplow Enrich image determines service health based on whether events have been processed within a configured time window. This creates false negatives in non-production environments where event traffic is sparse or non-existent, causing healthy services to appear unhealthy.
Current Behavior
The health check endpoint monitors the timestamp of the last processed event and marks the service as unhealthy if no events have been processed within the configured time threshold. This approach has the following issues:
- False negatives in low-traffic environments: Non-production environments (dev, staging, QA) often have little to no event traffic, causing healthy services to fail health checks
- Orchestration problems: Kubernetes/container orchestrators may unnecessarily restart healthy pods that aren't processing events simply due to lack of traffic
- Difficult testing: validating deployment health requires generating synthetic event traffic
- Masks real issues: When health checks fail due to lack of traffic, it becomes harder to identify actual service degradation
Proposed Solution
Replace or supplement the event-processing-based health check with checks that verify the service's readiness to process events rather than whether it has processed events recently. The health check should validate:
- Message Broker Connectivity: Verify connection to the input source (Kafka, Kinesis, PubSub, NSQ)
  - Can the service successfully poll/consume from the input topic/stream?
  - Are consumer group assignments healthy?
  - Is the connection authenticated and authorized?
- Enrichment Dependencies: Verify that all configured enrichments can be initialized and are operational
  - Database connections (if applicable)
  - External API availability (for API-based enrichments)
  - Local file resources (MaxMind databases, custom enrichment files)
- Output Sink Connectivity: Verify connection to output destinations
  - Can the service write to good/bad event topics/streams?
  - Are credentials valid and permissions sufficient?
- JVM Health: Basic application health indicators
  - Service is running and accepting requests
  - No critical errors in initialization
  - Sufficient memory/resources available
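The readiness checks above can be sketched as a set of independent probes aggregated into a single verdict. A minimal illustration in Python (the actual service is a JVM application; the function names, broker address, and file path below are hypothetical, and a real broker check would use the client library rather than a raw TCP probe):

```python
import os
import socket

def check_broker(host: str, port: int, timeout: float = 2.0) -> bool:
    """Probe TCP connectivity to the input source (e.g. a Kafka broker)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_enrichment_files(paths: list) -> bool:
    """Verify local enrichment resources (e.g. MaxMind databases) are readable."""
    return all(os.path.isfile(p) and os.access(p, os.R_OK) for p in paths)

def readiness(checks: dict) -> tuple:
    """The service is ready only if every dependency check passes."""
    return all(checks.values()), checks

# Example: aggregate individual probe results into one readiness verdict.
ready, detail = readiness({
    "broker": check_broker("localhost", 9092),
    "enrichment_files": check_enrichment_files(["/tmp/GeoLite2-City.mmdb"]),
})
```

The key property is that every probe tests a precondition for processing, so the aggregate stays green in an idle environment as long as dependencies are reachable.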
Implementation Approach
Option 1 (Recommended): Dual Health Check Strategy
- `/health/liveness` - Basic JVM health (is the process running?)
- `/health/readiness` - Comprehensive connectivity and dependency checks (can it process events?)
- `/health/processing` - Current behavior (is it actively processing events?) - useful for monitoring but not for container orchestration
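A sketch of how the three endpoints could be routed, with the individual checks stubbed out (the endpoint paths follow the proposal; the handler wiring is illustrative, not Snowplow's actual HTTP layer):

```python
import json

# Stubbed checks; in the real service these would probe the JVM,
# dependency connectivity, and the last-processed-event timestamp.
def liveness_ok() -> bool: return True
def readiness_ok() -> bool: return True
def processing_ok() -> bool: return False  # no recent events: normal in dev

def handle(path: str) -> tuple:
    """Map a health endpoint path to an HTTP status code and JSON body."""
    routes = {
        "/health/liveness": liveness_ok,
        "/health/readiness": readiness_ok,
        "/health/processing": processing_ok,
    }
    check = routes.get(path)
    if check is None:
        return 404, json.dumps({"error": "unknown endpoint"})
    ok = check()
    return (200 if ok else 503), json.dumps({"healthy": ok})
```

Orchestrators would point their liveness and readiness probes at the first two routes, while `/health/processing` feeds dashboards and alerting instead of pod lifecycle decisions.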
Option 2: Configurable Health Check Mode
- Add configuration to switch between "connectivity-based" and "processing-based" health checks
- Default to connectivity-based for better operator experience
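Option 2 could amount to a single configuration key selecting which signal backs the health endpoint. A hedged sketch (the `HEALTH_CHECK_MODE` variable name is hypothetical, not an existing Enrich setting):

```python
import os

def health_probe(connectivity_ok: bool, processing_ok: bool, mode: str = "") -> bool:
    """Select the health signal based on configuration.

    Falls back to the HEALTH_CHECK_MODE environment variable, then to
    "connectivity" for a better out-of-the-box operator experience.
    """
    mode = mode or os.environ.get("HEALTH_CHECK_MODE", "connectivity")
    if mode == "processing":
        return processing_ok
    return connectivity_ok
```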
Benefits
- ✅ Health checks pass in low-traffic environments when service is genuinely healthy
- ✅ Container orchestrators can make better decisions about pod lifecycle
- ✅ Easier to validate deployments without synthetic traffic
- ✅ More accurate representation of service health
- ✅ Better separation of concerns (health vs. monitoring metrics)
Additional Context
The current processing-based metric is still valuable for monitoring and alerting purposes but should not be the primary determinant of service health for container orchestration. This metric should be exposed via metrics endpoints (Prometheus, etc.) rather than the health check endpoint.
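For example, the last-processed timestamp could be published in Prometheus text exposition format on the metrics endpoint instead of gating the health check (the metric name below is hypothetical):

```python
import time

def render_processing_metric(last_event_ts: float) -> str:
    """Render the last-processed-event age as a Prometheus gauge."""
    age = time.time() - last_event_ts
    return (
        "# HELP enrich_last_event_age_seconds Seconds since the last processed event\n"
        "# TYPE enrich_last_event_age_seconds gauge\n"
        f"enrich_last_event_age_seconds {age:.1f}\n"
    )
```

Alerting rules can then fire on a high age in production traffic, while the health endpoint stays green in idle environments.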