Add health check endpoint by pondzix · Pull Request #395 · snowplow/snowbridge

Piotr Poniedziałek (pondzix) · 2025-01-29T07:07:31Z

PDP-1557

For now the only source of 'health' is the source itself, e.g. in this draft for Kinesis:

* We keep track of events currently being processed downstream. If any message in memory hasn't been acked for a while (exceeding a configured threshold), the app is unhealthy.
* We check whether the underlying kinsumer client is attempting to fetch records from Kinesis. Even if we have no records on input, that's fine; we just have to know that kinsumer is not stuck for some unknown reason. If no fetch has come from kinsumer for a while (exceeding a configured threshold), the app is unhealthy.

Health is exposed through the `/health` HTTP endpoint.

Ian Streeter (istreeter)

In your implementation, two different things contribute to bad health:

1. Problems receiving events from the external source.
2. Events stuck in the app, i.e. a processing problem or sink problem.

For 1, it completely makes sense to implement it in the source.

But for 2... isn't this a helpful health check for all sources? You could implement 2 exactly the same in the pubsub source and kafka source. But then you would be repeating the same code in all sources.

In other words, is there any part of this that can be moved out of the source?

I appreciate this is a difficult problem! Because I have struggled with these same questions in common-streams.

Ian Streeter (istreeter) · 2025-02-03T09:21:01Z

cmd/cli/cli.go

 	os.Exit(1)
 }
+
+func runHealthServer(source sourceiface.Source) {


Sharing this just for interest:

In common-streams the health probe actually responds to all requests with an `ok`, not just requests to the `/health` endpoint.

That might be a mistake in common-streams: time will tell! I did it because it felt strange to hard-code the completely arbitrary string /health.

Ian Streeter (istreeter) · 2025-02-03T09:41:48Z

pkg/source/kinesis/kinesis_source.go

-
-	log *log.Entry
+	statsReceiver    *kinsumerActivityRecorder
+	unackedMsgs      map[string]int64


Did you consider making this map[uuid.UUID]int64? It looks like the keys are always stringified UUIDs.
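A quick sketch of why this works: Go array types are comparable, so a UUID type defined as a 16-byte array (as `github.com/google/uuid` defines `uuid.UUID`) can be used as a map key directly, without stringifying it. The `msgID` type below is a hypothetical stand-in so the example needs only the standard library; which UUID package this source actually uses is not shown here.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// msgID stands in for a UUID type backed by a 16-byte array (hypothetical).
type msgID [16]byte

// newMsgID fills the ID with random bytes; a real UUID library would also
// set the version and variant bits.
func newMsgID() msgID {
	var id msgID
	if _, err := rand.Read(id[:]); err != nil {
		panic(err)
	}
	return id
}

func main() {
	// Keyed by the array value itself, with no string conversion on insert or delete.
	unacked := map[msgID]int64{}

	id := newMsgID()
	unacked[id] = 1738576052 // e.g. a Unix timestamp of when the message arrived
	fmt.Println("tracked:", len(unacked))

	delete(unacked, id)
	fmt.Println("tracked after ack:", len(unacked))
}
```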

Ian Streeter (istreeter) · 2025-02-03T10:07:32Z

pkg/source/kinesis/kinesis_source.go

 			checkpointer()
+			ks.removeUnacked(randomUUID)


How deeply do you understand what checkpointer() does? Does it block until this record is actually checkpointed to DynamoDB? Or does it return immediately, so kinsumer can checkpoint it later?

From conversations I've had with others, I think it might be the latter.

If it fails to checkpoint later, then what does kinsumer do next? Does it stop calling the EventsFromKinesis function?

Does any of this matter? I think possibly not, because I think your health check endpoint works correctly anyway. But it's worth thinking about.

colmsnowplow · Feb 10, 2025

I understand it quite well but did have to refresh my memory!

You are correct, it's the latter. When checkpointer() is called, kinsumer keeps record of it but only commits the checkpoint if every sequence number up to that one has been checkpointed.

So for example, if you checkpoint a record with sequence number 5, it'll wait for sequence numbers 1-4 before it commits to DDB. This can cause blocking somewhat like what you describe, but that's a slightly different case: if Snowbridge never calls checkpointer() on sequence number 4, then kinsumer will block.

There is a separate process to commit the checkpoint to DDB on regular intervals. If that goes wrong, kinsumer will return an error. This will in turn cause Snowbridge to encounter a source error and the app will crash. On reboot it will re-start and the new instance of kinsumer will start at the last sequence number that did get committed to DDB.
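The "commit only once everything before it is checkpointed" behaviour can be modelled with a small sketch. This is a simplification for illustration and not kinsumer's code: `checkpointTracker` is a hypothetical type, and it uses small consecutive integers where real Kinesis sequence numbers are large opaque values.

```go
package main

import "fmt"

// checkpointTracker models committing only a contiguous prefix of acked records.
type checkpointTracker struct {
	committed int          // highest sequence number that may be committed
	acked     map[int]bool // acked records still waiting on an earlier one
}

func newCheckpointTracker() *checkpointTracker {
	return &checkpointTracker{acked: make(map[int]bool)}
}

// ack marks seq as checkpointed by the caller and returns the highest
// sequence number that is now safe to commit.
func (c *checkpointTracker) ack(seq int) int {
	c.acked[seq] = true
	// Advance while the next record in order has also been acked.
	for c.acked[c.committed+1] {
		c.committed++
		delete(c.acked, c.committed)
	}
	return c.committed
}

func main() {
	c := newCheckpointTracker()
	// Record 5 is acked first, but nothing can be committed until 1-4 arrive.
	for _, seq := range []int{5, 1, 2, 3, 4} {
		fmt.Printf("ack %d -> committable up to %d\n", seq, c.ack(seq))
	}
}
```

If record 4 were never acked, the committable position would stay at 3 indefinitely, which is the blocking case described above.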

Ian Streeter (istreeter) · Feb 10, 2025

> and the app will crash

Imagine it didn't crash. For some unexpected reason. Bearing in mind... the whole reason you are working on this feature is to allow for unexpected scenarios.

What I'm getting at is: would the health endpoint become unhealthy in that scenario?

Piotr Poniedziałek (pondzix) force-pushed the spike/health_checking branch from 5108ac9 to 77c8e17 Compare January 29, 2025 07:09

Piotr Poniedziałek (pondzix) requested a review from colmsnowplow January 29, 2025 17:12

Ian Streeter (istreeter) reviewed Feb 3, 2025

colmsnowplow force-pushed the develop branch 2 times, most recently from 62edbc1 to f7dccba Compare February 14, 2025 14:22

Piotr Poniedziałek (pondzix) force-pushed the develop branch from f7dccba to 767d4d9 Compare February 17, 2025 12:01

colmsnowplow force-pushed the develop branch from 008aba2 to e9ee584 Compare March 5, 2025 10:19

Base automatically changed from develop to master March 5, 2025 11:42

Piotr Poniedziałek (pondzix) wants to merge 1 commit into master from spike/health_checking

colmsnowplow Feb 10, 2025

Ian Streeter (istreeter) Feb 10, 2025