Add health check endpoint#395
Conversation
For now the only source of 'health' is a source, e.g. in this draft for kinesis: * We keep track of events currently being processed downstream. If there is any message in memory that hasn't been acked for a while (exceeding certain configured threshold) == unhealthy. * We check if underlying kinsumer client attempts to fetch records from Kinesis. Even we have no records on input - it's fine, we just have to know if kinsumer is not stuck for some unknown reason. If there is no fetch coming from kinsumer in a while (exceeding certain configured threshold) == unhealthy. Health is exposed through `/health` HTTP endpoint.
5108ac9 to
77c8e17
Compare
Ian Streeter (istreeter)
left a comment
There was a problem hiding this comment.
In your implementation, two different things contribute to bad health:
- Problems receiving events from the external source
- Events stuck in the app, i.e. a processing problem or sink problem.
For 1, it completely makes sense to implement it in the source.
But for 2... isn't this a helpful health check for all sources? You could implement 2 exactly the same in the pubsub source and kafka source. But then you would be repeating the same code in all sources.
In other words, is there any part of this that can be moved out of the source?
I appreciate this is a difficult problem! Because I have struggled with these same questions in common-streams.
| os.Exit(1) | ||
| } | ||
|
|
||
| func runHealthServer(source sourceiface.Source) { |
There was a problem hiding this comment.
Sharing this just for interest:
In common-streams the health probe actually responds to all requests with a ok. Not just requests to the /health endpoint.
That might be a mistake in common-streams: time will tell! I did it because it felt strange to hard-code the completely arbitrary string /health.
|
|
||
| log *log.Entry | ||
| statsReceiver *kinsumerActivityRecorder | ||
| unackedMsgs map[string]int64 |
There was a problem hiding this comment.
Did you consider making this map[uuid.UUID]int64? It looks like the keys are always stringified UUIDs.
| checkpointer() | ||
| ks.removeUnacked(randomUUID) |
There was a problem hiding this comment.
How deeply to you understand what checkpointer() does? Does it block until this record is actually checkpointed to dynamodb? Or does it return immediately, so kinsumer can checkpoint it later?
From conversations I've had with others, I think it might be the latter.
If it fails to checkpoint later, then what does kinsumer do next? Does it stop calling the EventsFromKinesis function?
Does any of this matter? I think possibly no.... because I think your health checkpoint endpoint works correctly anyway. But it's worth thinking about.
There was a problem hiding this comment.
I understand it quite well but did have to refresh my memory!
You are correct, it's the latter. When checkpointer() is called, kinsumer keeps record of it but only commits the checkpoint if every sequence number up to that one has been checkpointed.
So for example if you checkpoint a record with sequence number 5, it'll wait for sequence number 1-4 before it commits to DDB. This can cause blocking something like how you describe, but that's a slightly different case - if Snowbridge never calls checkpointer() on sequence number 4, then kinsumer will block.
There is a separate process to commit the checkpoint to DDB on regular intervals. If that goes wrong, kinsumer will return an error. This will in turn cause Snowbridge to encounter a source error and the app will crash. On reboot it will re-start and the new instance of kinsumer will start at the last sequence number that did get committed to DDB.
There was a problem hiding this comment.
and the app will crash
Imagine it didn't crash. For some unexpected reason. Bearing in mind... the whole reason you are working on this feature is to allow for unexpected scenarios.
What I'm getting at is.... would the health endpoint become unhealthy in that scenario?
62edbc1 to
f7dccba
Compare
f7dccba to
767d4d9
Compare
PDP-1557
For now the only source of 'health' is a source, e.g. in this draft for kinesis:
Health is exposed through
/healthHTTP endpoint.