Skip to content

Conversation

@xurui-c
Copy link
Member

@xurui-c xurui-c commented Nov 9, 2025

Basically does what OutcomesBasedRoutingStrategy does except when cluster load is low enough, we let the query through even if allocation policies say no

Rollout plan:

  1. only deploy this strategy for org 1 (Sentry) via Snuba Admin by setting storage_routing_config_override = {'{"1": {"version": 1, "config": {"LoadBasedRoutingStrategy": 1.0}}}',
  2. deploy to 50% of our customers via setting default_storage_routing_config = '{"version": 1, "config": {"LoadBasedRoutingStrategy": 0.5, "OutcomesBasedRoutingStrategy": 0.5}}'

@xurui-c xurui-c marked this pull request as ready for review November 10, 2025 17:37
@xurui-c xurui-c requested review from a team as code owners November 10, 2025 17:37
Comment on lines +46 to +53
if load_info.cluster_load < pass_through_threshold:
routing_decision.can_run = True
routing_decision.is_throttled = False
routing_decision.clickhouse_settings["max_threads"] = pass_through_max_threads
routing_decision.routing_context.extra_info["load_based_pass_through"] = {
"threshold": pass_through_threshold,
"max_threads": pass_through_max_threads,
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: New logic incorrectly allows queries to bypass allocation policies when cluster_load retrieval fails and returns -1.0.
Severity: CRITICAL | Confidence: 1.00

🔍 Detailed Analysis

When get_cluster_loadinfo() fails to retrieve cluster load information, it returns a LoadInfo object with cluster_load set to -1.0. The new code checks if load_info is None but does not account for this specific sentinel value. As a result, the condition load_info.cluster_load < pass_through_threshold evaluates to True (e.g., -1.0 < 20), causing queries to bypass all allocation policies and run, despite the unavailability of actual load data. This occurs silently without explicit error handling in the new logic.

💡 Suggested Fix

Modify the LoadBasedRoutingStrategy to explicitly check for the -1.0 sentinel value in load_info.cluster_load when determining if load information is available, or ensure load_info is None on failure.

🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: snuba/web/rpc/storage_routing/routing_strategies/load_based.py#L46-L53

Potential issue: When `get_cluster_loadinfo()` fails to retrieve cluster load
information, it returns a `LoadInfo` object with `cluster_load` set to `-1.0`. The new
code checks `if load_info is None` but does not account for this specific sentinel
value. As a result, the condition `load_info.cluster_load < pass_through_threshold`
evaluates to `True` (e.g., `-1.0 < 20`), causing queries to bypass all allocation
policies and run, despite the unavailability of actual load data. This occurs silently
without explicit error handling in the new logic.

Did we get this right? 👍 / 👎 to inform future reviews.

Comment on lines +12 to +15
class LoadBasedRoutingStrategy(OutcomesBasedRoutingStrategy):
"""
If cluster load is under a threshold, ignore recommendations and allow the query to pass through with the tier decided based on outcomes-based routing.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this inheriting from OutcomesBasedRoutingStrategy? shouldn't it inherit BaseRoutingStrategy?

I think it's mixing concerns weirdly to make the load-based routing in any way aware or coupled to outcomes-based routing. Some third entity/module should chain the two together if that's necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants