
Conversation

@john-z-yang (Contributor) commented Nov 8, 2024

Overview

Implements a basic asynchronous Kafka consumer.

TODO

  • Test rebalancing correctness. After an arbitrary number of shutdowns/rebalances, each message must be either in Kafka or in the DB, never both, and only once. This requires some kind of validation on all the *.sqlite files after the consumer has shut down. It would be really great if we could automate this as part of our CI.
  • DLQ messages by producing them back to Kafka instead of printing them to stderr
  • Do some performance tuning on insert throughput

Local testing

Increase the number of partitions on the hackweek topic to 16 to observe more interesting rebalancing behaviour (it defaults to 1):

docker exec -it sentry_kafka bash
kafka-topics --bootstrap-server localhost:9092 --alter --topic hackweek  --partitions 16

Start a django shell (make sure to use the hackweek demo branch)

sentry django shell

Send some messages to kafka

for i in range(1_000_000):
    say_hello.delay('hello world')

Start the consumer on multiple terminals

cargo run

You should see output about inserting into the store:

2024-11-08T05:42:38.680512Z  INFO reduce:flush_reducer: taskworker::consumer::inflight_activation_writer: Inserted 2048 entries
2024-11-08T05:42:38.680657Z DEBUG commit: taskworker::consumer::kafka: Storing offsets
2024-11-08T05:42:39.686230Z  INFO reduce:flush_reducer: taskworker::consumer::inflight_activation_writer: Inserted 2048 entries
2024-11-08T05:42:39.686397Z DEBUG commit: taskworker::consumer::kafka: Storing offsets
2024-11-08T05:42:40.685072Z  INFO reduce:flush_reducer: taskworker::consumer::inflight_activation_writer: Inserted 2048 entries
2024-11-08T05:42:40.685220Z DEBUG commit: taskworker::consumer::kafka: Storing offsets
2024-11-08T05:42:41.671410Z  INFO reduce:flush_reducer: taskworker::consumer::inflight_activation_writer: Inserted 2048 entries
2024-11-08T05:42:41.671492Z DEBUG commit: taskworker::consumer::kafka: Storing offsets

Use this sqlite query to check for exactly-once delivery; delta=0 for a partition means that (maximum offset - minimum offset + 1) is equal to the number of distinct messages stored for that partition.

SELECT
    partition,
    (max(offset) - min(offset)) + 1 AS offset_diff,
    count(offset) AS occ,
    (max(offset) - min(offset)) + 1 - count(offset) AS delta
FROM (
    SELECT DISTINCT * FROM inflight_taskactivations
)
GROUP BY partition
ORDER BY partition;
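
To automate the exactly-once check from the TODO list in CI, a minimal sketch could look like the following. It assumes the rusqlite crate and the table/column names from the query above, and checks one *.sqlite file at a time, each passed on the command line; it mirrors the query rather than the consumer's real validation (which doesn't exist yet).

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    // Each argument is one of the consumer's *.sqlite files.
    for path in std::env::args().skip(1) {
        let conn = Connection::open(&path)?;
        let mut stmt = conn.prepare(
            "SELECT partition,
                    (max(offset) - min(offset)) + 1 - count(offset) AS delta
             FROM (SELECT DISTINCT * FROM inflight_taskactivations)
             GROUP BY partition",
        )?;
        let rows = stmt.query_map([], |row| {
            Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
        })?;
        for row in rows {
            let (partition, delta) = row?;
            // delta != 0 means a gap or duplicate within that partition.
            assert_eq!(delta, 0, "{path}: partition {partition} has delta {delta}");
        }
    }
    Ok(())
}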

@john-z-yang requested a review from markstory on November 8, 2024 00:59
Comment on lines +191 to +198
macro_rules! processing_strategy {
    (
        @reducers,
        ($reduce:expr),
        $prev_receiver:ident,
        $err_sender:ident,
        $shutdown_signal:ident,
        $handles:ident,
    ) => {{
@john-z-yang (Contributor, Author) commented Nov 8, 2024:

I don't love this macro, but I can't figure out for the life of me how to create a function A that accepts another function B, where B returns an impl Trait, without using dyn.
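
For context, that pattern is usually expressed with a generic type parameter rather than dyn. Below is a minimal sketch using a toy Reducer trait, not the actual types in this PR; whether it fits the async/spawning constraints of the macro is a separate question.

trait Reducer {
    fn reduce(&mut self, item: u32);
    fn total(&self) -> u32;
}

struct SumReducer {
    sum: u32,
}

impl Reducer for SumReducer {
    fn reduce(&mut self, item: u32) {
        self.sum += item;
    }
    fn total(&self) -> u32 {
        self.sum
    }
}

// "Function A": generic over F ("function B"), whose concrete return type R
// only has to implement Reducer, so no dyn or boxing is required.
fn run_pipeline<R, F>(make_reducer: F) -> u32
where
    R: Reducer,
    F: FnOnce() -> R,
{
    let mut reducer = make_reducer();
    for i in 0..3 {
        reducer.reduce(i);
    }
    reducer.total()
}

fn main() {
    // The closure's return type is inferred as SumReducer.
    println!("{}", run_pipeline(|| SumReducer { sum: 0 }));
}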

Member:

I don't have enough experience with Rust + rdkafka consumers to give useful detailed feedback yet. Building a custom macro strikes me as a complex solution, though.

@john-z-yang force-pushed the kafka-consumer branch 9 times, most recently from 51bc418 to 7327f42 on November 8, 2024 06:31
partition: msg.partition(),
offset: msg.offset(),
added_at: Utc::now(),
deadletter_at: None,
Member:

We'll eventually need to read this from configuration, but we can add that in later.

@john-z-yang (Contributor, Author) commented Nov 8, 2024:

Okay, I made some changes to the typing, so now we can pass configs into the parsing function:

let deadletter_duration = ...;

processing_strategy!({
    map: deserialize_activation::new(DeserializerConfig {
        deadletter_duration,
    }),

    reduce: InflightTaskWriter::new(
        store.clone(),
        InflightTaskWriterConfig {
            max_buf_len: 2048,
            flush_interval: Some(Duration::from_secs(1)),
            when_full_behaviour: ReducerWhenFullBehaviour::Backpressure,
            shutdown_behaviour: ReduceShutdownBehaviour::Flush,
        }
    ),

    err: OsStreamWriter::new(
        Duration::from_secs(1),
        OsStream::StdErr,
    ),
}),

@john-z-yang (Contributor, Author):

By the way, how come the deadletter duration is configured on the consumer side, instead of being defined in Python and passed along in Kafka?

Comment on lines 71 to 72
fn is_full(&self) -> bool {
    self.buffer.len() >= self.max_buf_len
Member:

Couldn't this change based on how full the inflight store is? I think we'd want the reduce batch to only take as many messages as we could store in sqlite.

@john-z-yang (Contributor, Author) commented Nov 8, 2024:

Yes, I agree. I think this behaviour sort of depends on what kind of insert behaviour works best for sqlite. This was originally built with ClickHouse in mind, which cannot accept inserts faster than once per second and has no constraint on the number of rows already in the database, so this reduce step batches messages up to a predetermined amount and then writes them.

I see 2 options here:

  1. We batch to a predetermined amount. Each time we try to write (in the flush function), we check if the db is full; if it's not, we write, otherwise we sleep (which emits backpressure).
  2. We batch to a dynamic amount. In this is_full function, we query the db to calculate how many rows we can accept. But I feel this makes the code more unpredictable, and we may need to cache the row count, because the is_full call is on a very hot path (it is called for every message from kafka that is sent to the reducer).

Member:

SQLite should be able to take row-by-row writes. We're likely going to have small sqlite stores (probably less than 1000 pending tasks at any given time). Batching pre-determined amounts into the batch buffer and then checking for db fullness on write seems like a good approach. That would let us limit how frequently the 'is-full' query is run, and if there is no room in the DB we can sleep/backpressure.

I've also been thinking about the partial-batch-fit problem we talked about last week. If we've built up a batch of messages in memory and, when we get to flushing to sqlite, the batch doesn't fully fit, I'm thinking we should proceed with the insert and overfill sqlite rather than give up on the batch and have to reprocess messages again. If the db is at capacity after flushing to sqlite, we could have the consumer sleep before consuming more messages.
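
A rough sketch of that flush-then-backpressure behaviour, with hypothetical Store/Writer types, a tokio sleep, and an illustrative capacity constant; this is not the PR's actual implementation.

use std::time::Duration;

const MAX_PENDING: usize = 2048;

// Stand-in for the sqlite-backed activation store.
struct Store {
    pending: usize,
}

impl Store {
    async fn count_pending(&self) -> usize {
        self.pending
    }
    async fn insert_batch(&mut self, batch: &[String]) {
        self.pending += batch.len();
    }
}

struct Writer {
    store: Store,
    buffer: Vec<String>,
}

impl Writer {
    // Write the whole in-memory batch even if it overfills the store, then
    // sleep (i.e. backpressure the consumer) while the store is at capacity.
    async fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        self.store.insert_batch(&self.buffer).await;
        self.buffer.clear();
        while self.store.count_pending().await >= MAX_PENDING {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}

#[tokio::main]
async fn main() {
    let mut writer = Writer {
        store: Store { pending: 0 },
        buffer: vec!["activation".to_string()],
    };
    writer.flush().await;
}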

@john-z-yang (Contributor, Author) commented Nov 8, 2024:

Nice! Okay, so the insert config now looks like this:

reduce: InflightTaskWriter::new(
    store.clone(),
    InflightTaskWriterConfig {
        max_buf_len: 128,
        max_pending_tasks: 2048,
        flush_interval: None,
        when_full_behaviour: ReducerWhenFullBehaviour::Flush,
        shutdown_behaviour: ReduceShutdownBehaviour::Drop,
    }
),

It has an internal buffer of max_buf_len entries and targets at most max_pending_tasks pending tasks in the storage.

The is_full function is redefined as follows:

async fn is_full(&self) -> bool {
    self.buffer.len() >= self.config.max_buf_len
        || self
            .store
            .count_pending_activations()
            .await
            .expect("Error communicating with activation store")
            + self.buffer.len()
            >= self.config.max_pending_tasks
}

It accepts messages only if is_full returns false, and only flushes (writes to the db) when is_full returns true.

I added this logic as a separate commit, because there's a chance that I got something wrong somewhere

@john-z-yang force-pushed the kafka-consumer branch 5 times, most recently from 9ba3600 to a9dc8ca on November 8, 2024 22:24
@john-z-yang marked this pull request as ready for review on November 8, 2024 23:56
@john-z-yang requested a review from a team on November 8, 2024 23:57
@evanh (Member) left a comment:

I'm going to rubber stamp this but I can't say I really understand a lot of what is happening here.

Co-authored-by: Evan Hicks <[email protected]>
Comment on lines 12 to 18
pub struct Config {
    pub max_buf_len: usize,
    pub max_pending_tasks: usize,
    pub flush_interval: Option<Duration>,
    pub when_full_behaviour: ReducerWhenFullBehaviour,
    pub shutdown_behaviour: ReduceShutdownBehaviour,
}
Member:

Is your plan that we can generate these configuration structs from the application configuration?

@john-z-yang (Contributor, Author) commented Nov 13, 2024:

Yes, it's being used like this right now in main.rs:

InflightActivationWriter::new(
    store.clone(),
    inflight_activation_writer::Config {
        max_buf_len: 128,
        max_pending_activations: 2048,
        flush_interval: None,
        when_full_behaviour: ReducerWhenFullBehaviour::Flush,
        shutdown_behaviour: ReduceShutdownBehaviour::Drop,
    }
)

In the future, we can add CLI arguments to supply max_buf_len and max_pending_tasks, so we can do:

InflightActivationWriter::new(
    store.clone(),
    inflight_activation_writer::Config {
        max_buf_len: config.max_batch_size,
        max_pending_activations: config.max_pending_activations,
        flush_interval: None,
        when_full_behaviour: ReducerWhenFullBehaviour::Flush,
        shutdown_behaviour: ReduceShutdownBehaviour::Drop,
    }
)
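
A sketch of what that CLI wiring could look like, assuming clap's derive API; the flag names mirror the hypothetical config fields above and are not existing options.

use clap::Parser;

#[derive(Parser)]
struct Cli {
    // Upper bound for the reducer's in-memory batch.
    #[arg(long, default_value_t = 128)]
    max_batch_size: usize,
    // Target number of pending activations in the sqlite store.
    #[arg(long, default_value_t = 2048)]
    max_pending_activations: usize,
}

fn main() {
    let config = Cli::parse();
    println!(
        "max_batch_size={} max_pending_activations={}",
        config.max_batch_size, config.max_pending_activations
    );
}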

 let mut query_builder = QueryBuilder::<Sqlite>::new(
     "INSERT INTO inflight_taskactivations \
-    (id, activation, offset, added_at, deadletter_at, processing_deadline_duration, status)",
+    (id, activation, partition, offset, added_at, deadletter_at, processing_deadline_duration, status)",
Member:

Why do we need to store the partition? If a consumer is rebalanced, don't we still need to deliver all the tasks that were added to sqlite even if they came from historical partitions?

@john-z-yang (Contributor, Author):

Yeah, I just thought that if we're storing the offset we should store the partition as well, because with just an offset we can't do much (an offset can belong to any number of partitions). I think this will be useful in the future if we want to commit offsets once the tasks are complete.

But I don't feel really strongly either way.
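
For illustration, committing from stored (partition, offset) pairs could look roughly like this with rust-rdkafka; the topic name and the shape of the completed rows are assumptions, not the PR's code.

use rdkafka::consumer::{CommitMode, Consumer, StreamConsumer};
use rdkafka::error::KafkaResult;
use rdkafka::{Offset, TopicPartitionList};

fn commit_completed(
    consumer: &StreamConsumer,
    completed: &[(i32, i64)], // (partition, offset) rows read back from sqlite
) -> KafkaResult<()> {
    let mut tpl = TopicPartitionList::new();
    for &(partition, offset) in completed {
        // Kafka convention: commit the next offset to be consumed.
        tpl.add_partition_offset("hackweek", partition, Offset::Offset(offset + 1))?;
    }
    consumer.commit(&tpl, CommitMode::Async)
}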

@markstory (Member) left a comment:

Let's move forward with this. We can adapt and evolve it as necessary.

@john-z-yang merged commit 9019744 into main on Nov 13, 2024
3 checks passed
@john-z-yang deleted the kafka-consumer branch on November 13, 2024 19:09
@john-z-yang linked an issue on Nov 13, 2024 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Create a consumer that works locally