-
-
Notifications
You must be signed in to change notification settings - Fork 61
feat(cbrs): load based routing strategy #7519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| if load_info.cluster_load < pass_through_threshold: | ||
| routing_decision.can_run = True | ||
| routing_decision.is_throttled = False | ||
| routing_decision.clickhouse_settings["max_threads"] = pass_through_max_threads | ||
| routing_decision.routing_context.extra_info["load_based_pass_through"] = { | ||
| "threshold": pass_through_threshold, | ||
| "max_threads": pass_through_max_threads, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: New logic incorrectly allows queries to bypass allocation policies when cluster_load retrieval fails and returns -1.0.
Severity: CRITICAL | Confidence: 1.00
🔍 Detailed Analysis
When get_cluster_loadinfo() fails to retrieve cluster load information, it returns a LoadInfo object with cluster_load set to -1.0. The new code checks if load_info is None but does not account for this specific sentinel value. As a result, the condition load_info.cluster_load < pass_through_threshold evaluates to True (e.g., -1.0 < 20), causing queries to bypass all allocation policies and run, despite the unavailability of actual load data. This occurs silently without explicit error handling in the new logic.
💡 Suggested Fix
Modify the LoadBasedRoutingStrategy to explicitly check for the -1.0 sentinel value in load_info.cluster_load when determining if load information is available, or ensure load_info is None on failure.
🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: snuba/web/rpc/storage_routing/routing_strategies/load_based.py#L46-L53
Potential issue: When `get_cluster_loadinfo()` fails to retrieve cluster load
information, it returns a `LoadInfo` object with `cluster_load` set to `-1.0`. The new
code checks `if load_info is None` but does not account for this specific sentinel
value. As a result, the condition `load_info.cluster_load < pass_through_threshold`
evaluates to `True` (e.g., `-1.0 < 20`), causing queries to bypass all allocation
policies and run, despite the unavailability of actual load data. This occurs silently
without explicit error handling in the new logic.
Did we get this right? 👍 / 👎 to inform future reviews.
| class LoadBasedRoutingStrategy(OutcomesBasedRoutingStrategy): | ||
| """ | ||
| If cluster load is under a threshold, ignore recommendations and allow the query to pass through with the tier decided based on outcomes-based routing. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this inheriting from OutcomesBasedRoutingStrategy? shouldn't it inherit BaseRoutingStrategy?
I think it's mixing concerns weirdly to make the load-based routing in any way aware or coupled to outcomes-based routing. Some third entity/module should chain the two together if that's necessary.
Basically does what
OutcomesBasedRoutingStrategydoes except when cluster load is low enough, we let the query through even if allocation policies say noRollout plan:
storage_routing_config_override={'{"1": {"version": 1, "config": {"LoadBasedRoutingStrategy": 1.0}}}',default_storage_routing_config= '{"version": 1, "config": {"LoadBasedRoutingStrategy": 0.5, "OutcomesBasedRoutingStrategy": 0.5}}'