Skip to content

Conversation

@kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Dec 3, 2025

Description

With the support of multiple supervisors ingesting into the same datasource, it becomes difficult to distinguish the metrics originating from the tasks of two different supervisors.

Changes

  • Add dimension supervisorId to all metrics emitted by streaming tasks
  • Simplify constructor of TaskRealtimeMetricsMonitor
  • Remove TaskRealtimeMetricsMonitorBuilder

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

}

@Override
protected ServiceMetricEvent.Builder getMetricBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is off here. super.getMetricBuilder() just returns metricBuilder, so what's happening in this method is redundant. It would be equivalent to do this in the constructor rather than getMetricBuilder. i.e. simply do metricBuilder.setDimension(DruidMetrics.SUPERVISOR_ID, supervisorId).

However, this highlights a problem: the metricBuilder is not thread safe, yet it appears to be used from multiple threads with the pattern emitter.emit(getMetricBuilder().setMetric(metric, value)). For example, as far as I can tell, ingest/segments/count is emitted in a push thread but certain other metrics are emitted in the main thread. It probably is called infrequently enough that this doesn't cause problems in practice. But it's still a thread safety bug and we should fix it.

How about redefining this method to return a new instance of ServiceMetricEvent.Builder each time it's called? That should fix it, and then this implementation you have here would be correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I have updated TaskRealtimeMetricsMonitor accordingly.

DruidMetrics.TASK_TYPE, new String[]{task.getType()},
DruidMetrics.GROUP_ID, new String[]{task.getGroupId()}
),
getMetricDimensions(task),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Map<String, String[]> dimensions on TaskRealtimeMetricsMonitor could be changed to ServiceMetricEvent.Builder, then it could be passed task.getMetricBuilder() and avoid the duplicated logic / instanceof check below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 👍🏻 .

@kfaraz kfaraz requested a review from gianm December 3, 2025 06:13
@kfaraz
Copy link
Contributor Author

kfaraz commented Dec 3, 2025

Merging as the IT failure is due to the flaky test ITNestedQueryPushDownTest.

@kfaraz kfaraz merged commit 71fbba6 into apache:master Dec 3, 2025
101 of 104 checks passed
@kfaraz kfaraz deleted the add_supervisor_id_dimension branch December 3, 2025 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants