Skip to content

Add health check endpoint#395

Draft
Piotr Poniedziałek (pondzix) wants to merge 1 commit intomasterfrom
spike/health_checking
Draft

Add health check endpoint#395
Piotr Poniedziałek (pondzix) wants to merge 1 commit intomasterfrom
spike/health_checking

Conversation

@pondzix
Copy link
Contributor

@pondzix Piotr Poniedziałek (pondzix) commented Jan 29, 2025

PDP-1557

For now the only source of 'health' is a source, e.g. in this draft for kinesis:

  • We keep track of events currently being processed downstream. If there is any message in memory that hasn't been acked for a while (exceeding certain configured threshold) == unhealthy.
  • We check if underlying kinsumer client attempts to fetch records from Kinesis. Even we have no records on input - it's fine, we just have to know if kinsumer is not stuck for some unknown reason. If there is no fetch coming from kinsumer in a while (exceeding certain configured threshold) == unhealthy.

Health is exposed through /health HTTP endpoint.

For now the only source of 'health' is a source, e.g. in this draft for kinesis:

* We keep track of events currently being processed downstream. If there is any message in memory that hasn't been acked for a while (exceeding certain configured threshold) == unhealthy.
* We check if underlying kinsumer client attempts to fetch records from Kinesis. Even we have no records on input - it's fine, we just have to know if kinsumer is not stuck for some unknown reason. If there is no fetch coming from kinsumer in a while (exceeding certain configured threshold) == unhealthy.

Health is exposed through `/health` HTTP endpoint.
Copy link

@istreeter Ian Streeter (istreeter) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your implementation, two different things contribute to bad health:

  1. Problems receiving events from the external source
  2. Events stuck in the app, i.e. a processing problem or sink problem.

For 1, it completely makes sense to implement it in the source.

But for 2... isn't this a helpful health check for all sources? You could implement 2 exactly the same in the pubsub source and kafka source. But then you would be repeating the same code in all sources.

In other words, is there any part of this that can be moved out of the source?

I appreciate this is a difficult problem! Because I have struggled with these same questions in common-streams.

os.Exit(1)
}

func runHealthServer(source sourceiface.Source) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing this just for interest:

In common-streams the health probe actually responds to all requests with a ok. Not just requests to the /health endpoint.

That might be a mistake in common-streams: time will tell! I did it because it felt strange to hard-code the completely arbitrary string /health.


log *log.Entry
statsReceiver *kinsumerActivityRecorder
unackedMsgs map[string]int64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider making this map[uuid.UUID]int64? It looks like the keys are always stringified UUIDs.

Comment on lines 242 to +243
checkpointer()
ks.removeUnacked(randomUUID)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How deeply to you understand what checkpointer() does? Does it block until this record is actually checkpointed to dynamodb? Or does it return immediately, so kinsumer can checkpoint it later?

From conversations I've had with others, I think it might be the latter.

If it fails to checkpoint later, then what does kinsumer do next? Does it stop calling the EventsFromKinesis function?

Does any of this matter? I think possibly no.... because I think your health checkpoint endpoint works correctly anyway. But it's worth thinking about.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it quite well but did have to refresh my memory!

You are correct, it's the latter. When checkpointer() is called, kinsumer keeps record of it but only commits the checkpoint if every sequence number up to that one has been checkpointed.

So for example if you checkpoint a record with sequence number 5, it'll wait for sequence number 1-4 before it commits to DDB. This can cause blocking something like how you describe, but that's a slightly different case - if Snowbridge never calls checkpointer() on sequence number 4, then kinsumer will block.

There is a separate process to commit the checkpoint to DDB on regular intervals. If that goes wrong, kinsumer will return an error. This will in turn cause Snowbridge to encounter a source error and the app will crash. On reboot it will re-start and the new instance of kinsumer will start at the last sequence number that did get committed to DDB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the app will crash

Imagine it didn't crash. For some unexpected reason. Bearing in mind... the whole reason you are working on this feature is to allow for unexpected scenarios.

What I'm getting at is.... would the health endpoint become unhealthy in that scenario?

@colmsnowplow colmsnowplow force-pushed the develop branch 2 times, most recently from 62edbc1 to f7dccba Compare February 14, 2025 14:22
Base automatically changed from develop to master March 5, 2025 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants