Skip to content

feat: add a backlog-queue to the snuba consumers#7856

Merged
kylemumma merged 14 commits intomasterfrom
krm/blq-router
Apr 8, 2026
Merged

feat: add a backlog-queue to the snuba consumers#7856
kylemumma merged 14 commits intomasterfrom
krm/blq-router

Conversation

@kylemumma
Copy link
Copy Markdown
Member

@kylemumma kylemumma commented Apr 1, 2026

https://www.notion.so/sentry/project-snuba-backlog-queue-BLQ-3148b10e4b5d809ba443ed6b79606bc6

this adds a BLQRouter step to snuba consumers, if a message is fresh it gets passed along as usual, if a message is stale it is routed onto the backlog queue. We are using the DLQ as the backlog queue. It is guarded behind a feature flag that is off by default.

todo: the following are todos I will ensure get done as follow ups before enabling

  • add alerting and observability so we know when things are written to the dlq topic
  • ensure we have a consumer set up to consume from the dlq topic

@kylemumma kylemumma marked this pull request as ready for review April 2, 2026 17:04
@kylemumma kylemumma requested a review from a team as a code owner April 2, 2026 17:04
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug target for the consumer in vscode, runs maturin to build rust as pre-step

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine this time but in the future let's split out vscode changes (if you have to revert the consumer change, you're going to have to revert this too)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this my new arroyo strategy for the blq, see factory_v2 for usage

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is where i steal the info about the dlq and use it for the blq. if there is a dlq it will be used for the blq.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I wire up our existing consumer pipeline to have the BLQ step at the beginning

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here is my feature flag, thanks @kenzoengineer

Copy link
Copy Markdown
Contributor

@onewland onewland Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though in general I like controlling things via feature flag, does this actually make management of this any easier? if you still need to restart the process to affect the change (because the ProcessingStrategy is built on startup), that feels like maybe it should be a CLI argument

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: ConcurrencyConfig runtime dropped before strategy uses it
    • Moved BLQ concurrency to a long-lived blq_concurrency field on ConsumerStrategyFactoryV2 and used it when constructing Produce so the runtime outlives the strategy chain.
  • ✅ Fixed: Duplicated BLQ router setup across two methods
    • Extracted the repeated BLQ feature-flag/producer/router block into a shared wrap_with_blq_router helper used by both pipeline constructors.

Create PR

Or push these changes by commenting:

@cursor push 58d1d747f9
Preview (58d1d747f9)
diff --git a/rust_snuba/benches/processors.rs b/rust_snuba/benches/processors.rs
--- a/rust_snuba/benches/processors.rs
+++ b/rust_snuba/benches/processors.rs
@@ -103,6 +103,7 @@
         join_timeout_ms: None,
         health_check: "arroyo".to_string(),
         use_row_binary: false,
+        blq_concurrency: ConcurrencyConfig::with_runtime(concurrency, RUNTIME.handle().to_owned()),
         blq_producer_config: None,
         blq_topic: None,
     };

diff --git a/rust_snuba/src/consumer.rs b/rust_snuba/src/consumer.rs
--- a/rust_snuba/src/consumer.rs
+++ b/rust_snuba/src/consumer.rs
@@ -285,6 +285,7 @@
         join_timeout_ms,
         health_check: health_check.to_string(),
         use_row_binary,
+        blq_concurrency: ConcurrencyConfig::new(10),
         blq_producer_config: dlq_producer_config.clone(),
         blq_topic: dlq_topic,
     };

diff --git a/rust_snuba/src/factory_v2.rs b/rust_snuba/src/factory_v2.rs
--- a/rust_snuba/src/factory_v2.rs
+++ b/rust_snuba/src/factory_v2.rs
@@ -69,6 +69,7 @@
     pub join_timeout_ms: Option<u64>,
     pub health_check: String,
     pub use_row_binary: bool,
+    pub blq_concurrency: ConcurrencyConfig,
     pub blq_producer_config: Option<KafkaConfig>,
     pub blq_topic: Option<Topic>,
 }
@@ -267,43 +268,7 @@
             Some(Duration::from_millis(self.join_timeout_ms.unwrap_or(0))),
         );
 
-        let blq_enabled_flag = options("snuba")
-            .ok()
-            .and_then(|o| o.get("consumer.blq_enabled").ok())
-            .and_then(|v| v.as_bool())
-            .unwrap_or(false);
-        let next_step: Box<dyn ProcessingStrategy<KafkaPayload>> =
-            if let (true, Some(blq_producer_config), Some(blq_topic)) =
-                (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
-            {
-                let stale_threshold = TimeDelta::minutes(30);
-                let static_friction = TimeDelta::minutes(2);
-                tracing::info!(
-                "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
-                stale_threshold,
-                self.blq_topic,
-                static_friction
-            );
-                let concurrency = ConcurrencyConfig::new(10);
-                let blq_producer = Produce::new(
-                    CommitOffsets::new(Duration::from_millis(250)),
-                    KafkaProducer::new(blq_producer_config.clone()),
-                    &concurrency,
-                    TopicOrPartition::Topic(blq_topic),
-                );
-                Box::new(
-                    BLQRouter::new(
-                        next_step,
-                        blq_producer,
-                        stale_threshold,
-                        Some(static_friction),
-                    )
-                    .expect("invalid BLQRouter config"),
-                )
-            } else {
-                tracing::info!("Not using a backlog-queue",);
-                Box::new(next_step)
-            };
+        let next_step = self.wrap_with_blq_router(Box::new(next_step));
 
         if let Some(path) = &self.health_check_file {
             {
@@ -324,6 +289,49 @@
 }
 
 impl ConsumerStrategyFactoryV2 {
+    fn wrap_with_blq_router(
+        &self,
+        next_step: Box<dyn ProcessingStrategy<KafkaPayload>>,
+    ) -> Box<dyn ProcessingStrategy<KafkaPayload>> {
+        let blq_enabled_flag = options("snuba")
+            .ok()
+            .and_then(|o| o.get("consumer.blq_enabled").ok())
+            .and_then(|v| v.as_bool())
+            .unwrap_or(false);
+
+        if let (true, Some(blq_producer_config), Some(blq_topic)) =
+            (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
+        {
+            let stale_threshold = TimeDelta::minutes(30);
+            let static_friction = TimeDelta::minutes(2);
+            tracing::info!(
+                "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
+                stale_threshold,
+                self.blq_topic,
+                static_friction
+            );
+
+            let blq_producer = Produce::new(
+                CommitOffsets::new(Duration::from_millis(250)),
+                KafkaProducer::new(blq_producer_config.clone()),
+                &self.blq_concurrency,
+                TopicOrPartition::Topic(blq_topic),
+            );
+            Box::new(
+                BLQRouter::new(
+                    next_step,
+                    blq_producer,
+                    stale_threshold,
+                    Some(static_friction),
+                )
+                .expect("invalid BLQRouter config"),
+            )
+        } else {
+            tracing::info!("Not using a backlog-queue",);
+            next_step
+        }
+    }
+
     fn create_row_binary_pipeline<
         T: clickhouse::Row + serde::Serialize + Clone + Send + Sync + 'static,
     >(
@@ -420,43 +428,7 @@
             Some(Duration::from_millis(self.join_timeout_ms.unwrap_or(0))),
         );
 
-        let blq_enabled_flag = options("snuba")
-            .ok()
-            .and_then(|o| o.get("consumer.blq_enabled").ok())
-            .and_then(|v| v.as_bool())
-            .unwrap_or(false);
-        let next_step: Box<dyn ProcessingStrategy<KafkaPayload>> =
-            if let (true, Some(blq_producer_config), Some(blq_topic)) =
-                (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
-            {
-                let stale_threshold = TimeDelta::minutes(30);
-                let static_friction = TimeDelta::minutes(2);
-                tracing::info!(
-                "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
-                stale_threshold,
-                self.blq_topic,
-                static_friction
-            );
-                let concurrency = ConcurrencyConfig::new(10);
-                let blq_producer = Produce::new(
-                    CommitOffsets::new(Duration::from_millis(250)),
-                    KafkaProducer::new(blq_producer_config.clone()),
-                    &concurrency,
-                    TopicOrPartition::Topic(blq_topic),
-                );
-                Box::new(
-                    BLQRouter::new(
-                        next_step,
-                        blq_producer,
-                        stale_threshold,
-                        Some(static_friction),
-                    )
-                    .expect("invalid BLQRouter config"),
-                )
-            } else {
-                tracing::info!("Not using a backlog-queue",);
-                Box::new(next_step)
-            };
+        let next_step = self.wrap_with_blq_router(Box::new(next_step));
 
         if let Some(path) = &self.health_check_file {
             if self.health_check == "snuba" {

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

@kylemumma kylemumma marked this pull request as draft April 2, 2026 17:59
Copy link
Copy Markdown
Member

@kenzoengineer kenzoengineer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. i'd like to see tests that validate behaviour when the option value is changed (using the thread local override guard)
  2. make sure that the codepath always includes init_with_schemas first

@kylemumma kylemumma marked this pull request as ready for review April 2, 2026 21:58
@kylemumma kylemumma requested a review from kenzoengineer April 2, 2026 22:31
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@kylemumma kylemumma marked this pull request as draft April 6, 2026 16:45
@kylemumma kylemumma marked this pull request as ready for review April 6, 2026 22:31
Copy link
Copy Markdown
Member

@kenzoengineer kenzoengineer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving the sentry-options portion of this pr

Copy link
Copy Markdown
Contributor

@onewland onewland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changes seem fine, but I would just ask: have you actually run snuba rust-consumer with this option enabled locally? because I feel like that often reveals some trivial detail that's broken (and requires you to revert) but isn't exercised by the tests

Copy link
Copy Markdown
Contributor

@onewland onewland Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though in general I like controlling things via feature flag, does this actually make management of this any easier? if you still need to restart the process to affect the change (because the ProcessingStrategy is built on startup), that feels like maybe it should be a CLI argument

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine this time but in the future let's split out vscode changes (if you have to revert the consumer change, you're going to have to revert this too)

@kylemumma
Copy link
Copy Markdown
Member Author

yes i have tested the feature locally to verify. I agree about the feature flag, it would require a restart anyways so its not so helpful. I will merge as-is and follow up with a PR to change the feature flag to a cli option.

@kylemumma kylemumma merged commit dacc4f4 into master Apr 8, 2026
63 of 65 checks passed
@kylemumma kylemumma deleted the krm/blq-router branch April 8, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants