feat(cbrs): load based routing strategy #7519

xurui-c · 2025-11-09T23:35:44Z

Behaves like OutcomesBasedRoutingStrategy, except that when cluster load is low enough, the query is let through even if the allocation policies reject it.

Rollout plan:

1. Deploy this strategy only for org 1 (Sentry) via Snuba Admin by setting storage_routing_config_override = '{"1": {"version": 1, "config": {"LoadBasedRoutingStrategy": 1.0}}}'
2. Deploy to 50% of our customers by setting default_storage_routing_config = '{"version": 1, "config": {"LoadBasedRoutingStrategy": 0.5, "OutcomesBasedRoutingStrategy": 0.5}}'
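To make the two rollout values concrete, here is a minimal sketch that builds them with `json.dumps` (which avoids hand-quoting mistakes) and shows one way a per-org override and a weighted default could be resolved. The helpers `effective_config` and `pick_strategy` are hypothetical and written only for illustration; they are not Snuba's actual selection code, and the assumption that weights must sum to 1.0 is mine.

```python
import json
import random

# Rollout step 1: per-organization override, keyed by org ID as a string.
storage_routing_config_override = json.dumps(
    {"1": {"version": 1, "config": {"LoadBasedRoutingStrategy": 1.0}}}
)

# Rollout step 2: default split for every org without an override.
default_storage_routing_config = json.dumps(
    {
        "version": 1,
        "config": {
            "LoadBasedRoutingStrategy": 0.5,
            "OutcomesBasedRoutingStrategy": 0.5,
        },
    }
)


def effective_config(org_id: int, default_json: str, override_json: str) -> dict:
    """Hypothetical helper: use the org's override if present, else the default."""
    overrides = json.loads(override_json)
    return overrides.get(str(org_id), json.loads(default_json))


def pick_strategy(config: dict, rng: random.Random) -> str:
    """Hypothetical helper: weighted random choice over strategy names."""
    weights = config["config"]
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"strategy weights must sum to 1.0, got {total}")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With these values, org 1 always resolves to LoadBasedRoutingStrategy, while any other org gets one of the two strategies with equal probability.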

sentry · 2025-11-10T17:39:17Z

snuba/web/rpc/storage_routing/routing_strategies/load_based.py

+        if load_info.cluster_load < pass_through_threshold:
+            routing_decision.can_run = True
+            routing_decision.is_throttled = False
+            routing_decision.clickhouse_settings["max_threads"] = pass_through_max_threads
+            routing_decision.routing_context.extra_info["load_based_pass_through"] = {
+                "threshold": pass_through_threshold,
+                "max_threads": pass_through_max_threads,
+            }


Bug: New logic incorrectly allows queries to bypass allocation policies when cluster_load retrieval fails and returns -1.0.
Severity: CRITICAL | Confidence: 1.00

🔍 Detailed Analysis

When get_cluster_loadinfo() fails to retrieve cluster load information, it returns a LoadInfo object with cluster_load set to -1.0. The new code checks if load_info is None but does not account for this specific sentinel value. As a result, the condition load_info.cluster_load < pass_through_threshold evaluates to True (e.g., -1.0 < 20), causing queries to bypass all allocation policies and run, despite the unavailability of actual load data. This occurs silently without explicit error handling in the new logic.

💡 Suggested Fix

Modify the LoadBasedRoutingStrategy to explicitly check for the -1.0 sentinel value in load_info.cluster_load when determining if load information is available, or ensure load_info is None on failure.
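A minimal sketch of the first option, assuming the sentinel is exactly as described above (`cluster_load == -1.0` on failure). `LoadInfo` here is a stand-in dataclass, not the real Snuba class, and `should_pass_through` is a hypothetical helper name. It treats any negative reading as "no data", which is a slightly broader guard than checking for -1.0 alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadInfo:
    """Stand-in for the real LoadInfo; only the field used here is modeled."""

    cluster_load: float


def should_pass_through(
    load_info: Optional[LoadInfo], pass_through_threshold: float
) -> bool:
    """Return True only when a real load reading is below the threshold."""
    if load_info is None:
        return False
    # A negative reading, including the -1.0 sentinel, means the load is unknown,
    # so the allocation policies must stay in charge.
    if load_info.cluster_load < 0:
        return False
    return load_info.cluster_load < pass_through_threshold
```

With this guard, `-1.0 < 20` no longer opens the pass-through path: a failed load lookup falls back to whatever the allocation policies decided.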

🤖 Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: snuba/web/rpc/storage_routing/routing_strategies/load_based.py#L46-L53 Potential issue: When `get_cluster_loadinfo()` fails to retrieve cluster load information, it returns a `LoadInfo` object with `cluster_load` set to `-1.0`. The new code checks `if load_info is None` but does not account for this specific sentinel value. As a result, the condition `load_info.cluster_load < pass_through_threshold` evaluates to `True` (e.g., `-1.0 < 20`), causing queries to bypass all allocation policies and run, despite the unavailability of actual load data. This occurs silently without explicit error handling in the new logic.


onewland · 2025-11-12T18:06:19Z

snuba/web/rpc/storage_routing/routing_strategies/load_based.py

+class LoadBasedRoutingStrategy(OutcomesBasedRoutingStrategy):
+    """
+    If cluster load is under a threshold, ignore recommendations and allow the query to pass through with the tier decided based on outcomes-based routing.
+    """


Why is this inheriting from OutcomesBasedRoutingStrategy? Shouldn't it inherit from BaseRoutingStrategy?

I think it mixes concerns to make load-based routing in any way aware of, or coupled to, outcomes-based routing. If chaining is necessary, some third entity/module should chain the two together.
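One way to read the "third entity" suggestion is composition instead of inheritance. The sketch below is purely illustrative: `RoutingDecision`, `RoutingStrategy`, the `decide` method, `AlwaysReject`, and `LowLoadPassThrough` are all hypothetical stand-ins, and the real Snuba strategies take routing contexts and more state than is modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class RoutingDecision:
    """Stand-in for the real routing decision; only illustrative fields."""

    can_run: bool
    is_throttled: bool
    clickhouse_settings: dict = field(default_factory=dict)


class RoutingStrategy(Protocol):
    """Hypothetical minimal interface shared by all strategies."""

    def decide(self) -> RoutingDecision: ...


class AlwaysReject:
    """Toy inner strategy standing in for a policy-driven rejection."""

    def decide(self) -> RoutingDecision:
        return RoutingDecision(can_run=False, is_throttled=True)


class LowLoadPassThrough:
    """Hypothetical wrapper: chains a load check onto any inner strategy."""

    def __init__(
        self,
        inner: RoutingStrategy,
        get_cluster_load: Callable[[], float],
        threshold: float,
        max_threads: int,
    ) -> None:
        self._inner = inner
        self._get_cluster_load = get_cluster_load
        self._threshold = threshold
        self._max_threads = max_threads

    def decide(self) -> RoutingDecision:
        decision = self._inner.decide()
        load = self._get_cluster_load()
        # Negative readings are treated as "unknown" and never trigger pass-through.
        if 0 <= load < self._threshold:
            decision.can_run = True
            decision.is_throttled = False
            decision.clickhouse_settings["max_threads"] = self._max_threads
        return decision
```

In this shape the load check knows nothing about outcomes-based routing; whichever strategy is passed as `inner` decides the tier, and the wrapper only overrides the can-run outcome under low load.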

Rachel Chen and others added 4 commits November 8, 2025 15:49

1d2cd49 idk
f85a644 feat(cbrs): load based routing strategy
300ca77 revert file
0c56783 fix

xurui-c marked this pull request as ready for review November 10, 2025 17:37

xurui-c requested review from a team as code owners November 10, 2025 17:37

sentry bot reviewed Nov 10, 2025

onewland reviewed Nov 12, 2025
