Skip to content

SAI-6021 : Fix slow node detection metric context leaks#311

Open
patsonluk wants to merge 3 commits intofs/branch_9_7from
patsonluk/SAI-6021-slow-node-detection-metric-context
Open

SAI-6021 : Fix slow node detection metric context leaks#311
patsonluk wants to merge 3 commits intofs/branch_9_7from
patsonluk/SAI-6021-slow-node-detection-metric-context

Conversation

@patsonluk
Copy link
Collaborator

@patsonluk patsonluk commented Jan 20, 2026

Description

There are 2 potential SolrMetricsContextLeak (1 discovered by AI and another by @magibney, assuming he's not AI 🤖 )

  1. TimeLimitingHttpShardHandlerFactory#initializeMetrics calls both super.initializeMetrics(parentContext, scope); AND solrMetricsContext = parentContext.getChildContext(this);. Both calls create new metrics, but only the latter one got closed via the SolrMetricProducer.super.close()
  2. SlowNodeDetector registers gauge metrics via SolrMetricManager directly and never unregister them. This technically does not cause any leaks (if the parent context is closed properly), however it's probably better to call SolrMetricsContext#gauge method instead.

In reality this should be rather minimal in impact:

  1. As @magibney pointed out, there's only a single synthetic core created on QA node. So only one instance of TimeLimitingHttpShardHandlerFactory and SlowNodeDetector (the core container one is not using the same config, hence not a TimeLimitingHttpShardHandlerFactory). TimeLimitingHttpShardHandlerFactory is not applied to non QA node (which could have way more cores hence more problematic)
  2. The metrics created in TimeLimitingHttpShardHandlerFactory are only counters. They are likely not an issue for memory as:
    1. SolrMetricsContext#unregister only unregister gauges and not care about other metrics. My guess is that other metrics are just not a concern?
    2. Further to point 1. Existing usages of counters in other classes do not manually unregister them neither

Solution

  1. TimeLimitingHttpShardHandlerFactory should no longer call parentContext.getChildContext(this). It should just use the context from its parent class HttpShardHandlerFactory. Remove the overridden getSolrMetricsContext method
  2. SlowNodeDetector will no longer directly register gauges via the manager, it should use the create a child context from the parent context (in this case from TimeLimitingHttpShardHandlerFactory)'s gauge method, which should take care of the gauges on closing. The overridden getSolrMetricsContext should return this new child context instead of null. Take note that in theory SlowNodeDetector can just use the parent context from TimeLimitingHttpShardHandlerFactory since they share the same life cycle (for now). However, I don't really like that coupling, and is probably better oof having its own dedicated context

Note

Low Risk
Primarily refactors metrics registration/cleanup to use SolrMetricsContext correctly and adds explicit close handling; minimal behavioral impact outside metrics lifecycle management.

Overview
Fixes potential metrics context leaks in slow-node detection and query shard handling.

TimeLimitingHttpShardHandlerFactory now uses the metrics context created by HttpShardHandlerFactory (removing its extra child context) and ensures SlowNodeDetector is closed during factory shutdown. SlowNodeDetector switches from direct SolrMetricManager.registerGauge calls to a dedicated child SolrMetricsContext with gauge(...), and returns that context via getSolrMetricsContext() so gauges are properly unregistered on close.

Written by Cursor Bugbot for commit 9983ffc. This will update automatically on new commits. Configure here.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@Override
public SolrMetricsContext getSolrMetricsContext() { // using the same context as parent
return null;
return metricsContext;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading comment on method returning child context

Low Severity

The comment // using the same context as parent in getSolrMetricsContext() is inaccurate. The method now returns a dedicated child metricsContext instead of the parent or null, which could mislead future maintainers.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant