
Conversation

@aaaugustine29

Overview:
This change introduces Kafka Connect as a first‑class JMX target system in the JMX metrics library. It adds a ruleset and documentation that cover both Apache Kafka Connect and Confluent Platform variants from the outset, so users can enable Kafka Connect monitoring without custom YAML.

Details:
Added kafka-connect.yaml JMX rules that map worker, rebalance, connector, task, source/sink task, and task-error MBeans into OpenTelemetry metrics, including Apache‑only metrics (e.g., worker rebalance protocol, per‑connector task counts, predicate/transform metadata, converter metadata, source transaction sizes, sink record lag max).
Defined connector and task status as state metrics using the superset of status values across Apache and Confluent, to avoid vendor‑specific enum mismatches.
Documented the new target in kafka-connect.md, including metric groups, attributes, and the dual‑vendor compatibility model (no renames; Apache list as a superset of Confluent docs).
Added self‑contained tests for the Kafka Connect rules that load the YAML, build metric definitions, and validate key state mappings and metric presence, ensuring the new target is ready to consume from day one.

Testing:
./gradlew -Dorg.gradle.configuration-cache.parallel=false instrumentation:jmx-metrics:library:test

@aaaugustine29 aaaugustine29 requested a review from a team as a code owner December 6, 2025 18:08
@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@laurit
Contributor

laurit commented Dec 8, 2025

@SylvainJuge could you review this?

Contributor

@SylvainJuge SylvainJuge left a comment


Hi @aaaugustine29, thanks for opening this!

There are quite a lot of metrics added here, which makes it quite challenging to review them all.

I don't have any expertise in Kafka Connect, so you are probably more knowledgeable here.

I would suggest to:

  • implement tests with a real instance of the target system, ideally both the Apache and Confluent variants
  • as a first step, focus on the "essential" metrics and do not include everything that is available; this is where your knowledge might be useful
  • simplify as much as possible by using metric attributes to provide a breakdown when the metrics represent a partition (for example, on state)

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;

class KafkaConnectRuleTest {
Contributor


This test mocks the actual Kafka Connect instance's JMX. While that makes the tests fast and easy to run without the target Kafka Connect system, it also means the tests and metric definitions can easily drift from the actual implementation.

So, the tests here only verify that the metric mapping is what we expect, not that the mapping actually works as expected against a real Kafka Connect instance. To solve this, I would recommend adding a test with a real Kafka Connect, like we have for other systems.

Author


Ok, will do. I was trying to avoid adding overly heavy tests, but I'm glad to hear that's acceptable. I'll look into the lightest way to do so and commit that soon.

@aaaugustine29
Author

aaaugustine29 commented Dec 8, 2025

@SylvainJuge Thanks for your help and guidance. At this point, the metrics have been reduced to the minimum set without losing any information. That said, that doesn't mean we need to keep everything. In particular, your previous comment brings up the opportunity to consolidate some of them with metric attributes; however, that would lose information for a niche, advanced group of users. What's your guidance on this?

And to clarify your comment about testing: tests that actually instantiate a Kafka Connect cluster would be very heavy. Would it be sufficient if I instead emulated what the Apache JMX server would produce?

Comment on lines +14 to +30
connector-startup-attempts-total:
  metric: connector.startup.attempts
  type: counter
  unit: "{attempt}"
  desc: The total number of connector startups that this worker has attempted.
# kafka.connect.worker.connector.startup.failure.total
connector-startup-failure-total:
  metric: connector.startup.failures
  type: counter
  unit: "{startup}"
  desc: The total number of connector starts that failed.
# kafka.connect.worker.connector.startup.success.total
connector-startup-success-total:
  metric: connector.startup.successes
  type: counter
  unit: "{startup}"
  desc: The total number of connector starts that succeeded.
Contributor


From the metrics definitions we can assume that connector-startup-attempts-total = connector-startup-failure-total + connector-startup-success-total.
This assumption is confirmed when we look at the probable implementation in https://github.com/apache/kafka/blob/83dc0d7eae2940ea26781276b3dfee5ed65dba15/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerMetricsGroup.java#L80

From the implementation we know the following:

  • the total startup attempt count is incremented when the startup is known to be a failure/success, so after the startup attempt is completed/failed. There is no state where the attempt is "in progress"
  • the sum of failure + success will always be equal to the total, thus we don't need an extra metric

This means that we could capture a single kafka.connect.worker.connector.startup metric with a breakdown of the startup result by kafka.connect.worker.connector.startup.result = failure | success

With such a metric, we have the following:

  • the total number of startup attempts is provided by discarding the kafka.connect.worker.connector.startup.result attribute and aggregating (sum)
  • the total number of startup success/failures is provided by filtering on the value of kafka.connect.worker.connector.startup.result attribute.
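To make the two bullets above concrete, here is a minimal sketch with made-up numbers showing how both former totals fall out of a single consolidated metric by aggregating or filtering on the result attribute (the data-point shape is simplified, not the actual SDK representation):

```python
# Hypothetical data points of the consolidated kafka.connect.worker.connector.startup
# metric; "result" stands in for the
# kafka.connect.worker.connector.startup.result metric attribute.
data_points = [
    {"result": "success", "value": 7},
    {"result": "failure", "value": 3},
]

# Discarding the result attribute and summing reproduces the former attempts total.
attempts = sum(p["value"] for p in data_points)

# Filtering on the attribute value reproduces the former success/failure totals.
successes = sum(p["value"] for p in data_points if p["result"] == "success")
failures = sum(p["value"] for p in data_points if p["result"] == "failure")

print(attempts, successes, failures)  # 10 7 3
```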

Also, we can use yaml anchors to avoid duplication and ensure this is captured in the same metric. The implementation should look like this:

Suggested change (replace the three metrics above with a single consolidated metric):

# kafka.connect.worker.connector.startup
connector-startup-failure-total:
  metric: &metric connector.startup
  type: &type counter
  unit: &unit "{startup}"
  desc: &desc The total number of connector starts.
  metricAttribute:
    kafka.connect.worker.connector.startup.result: const(failure)
# kafka.connect.worker.connector.startup.success.total
connector-startup-success-total:
  metric: *metric
  type: *type
  unit: *unit
  desc: *desc
  metricAttribute:
    kafka.connect.worker.connector.startup.result: const(success)

Comment on lines +37 to +54
# kafka.connect.worker.task.startup.attempts
task-startup-attempts-total:
  metric: task.startup.attempts
  type: counter
  unit: "{attempt}"
  desc: The total number of task startups that this worker has attempted.
# kafka.connect.worker.task.startup.failure.total
task-startup-failure-total:
  metric: task.startup.failures
  type: counter
  unit: "{startup}"
  desc: The total number of task starts that failed.
# kafka.connect.worker.task.startup.success.total
task-startup-success-total:
  metric: task.startup.successes
  type: counter
  unit: "{startup}"
  desc: The total number of task starts that succeeded.
Contributor


We can do the same here with a single kafka.connect.worker.task.startup metric.
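Following the same pattern, a sketch of what that consolidated rule could look like; the attribute name kafka.connect.worker.task.startup.result mirrors the connector case and is an assumption, not something from the PR:

```yaml
# kafka.connect.worker.task.startup
task-startup-failure-total:
  metric: &task_startup_metric task.startup
  type: &task_startup_type counter
  unit: &task_startup_unit "{startup}"
  desc: &task_startup_desc The total number of task starts.
  metricAttribute:
    kafka.connect.worker.task.startup.result: const(failure)
# kafka.connect.worker.task.startup.success.total
task-startup-success-total:
  metric: *task_startup_metric
  type: *task_startup_type
  unit: *task_startup_unit
  desc: *task_startup_desc
  metricAttribute:
    kafka.connect.worker.task.startup.result: const(success)
```

As in the connector case, the attempts total falls out by aggregating over the result attribute, so task-startup-attempts-total can be dropped.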

Comment on lines +63 to +90
# kafka.connect.worker.connector.task.destroyed
connector-destroyed-task-count:
  metric: destroyed
  desc: The number of destroyed tasks of the connector on the worker.
# kafka.connect.worker.connector.task.failed
connector-failed-task-count:
  metric: failed
  desc: The number of failed tasks of the connector on the worker.
# kafka.connect.worker.connector.task.paused
connector-paused-task-count:
  metric: paused
  desc: The number of paused tasks of the connector on the worker.
# kafka.connect.worker.connector.task.restarting
connector-restarting-task-count:
  metric: restarting
  desc: The number of restarting tasks of the connector on the worker.
# kafka.connect.worker.connector.task.running
connector-running-task-count:
  metric: running
  desc: The number of running tasks of the connector on the worker.
# kafka.connect.worker.connector.task.total
connector-total-task-count:
  metric: total
  desc: The number of tasks of the connector on the worker.
# kafka.connect.worker.connector.task.unassigned
connector-unassigned-task-count:
  metric: unassigned
  desc: The number of unassigned tasks of the connector on the worker.
Contributor


Those metrics were added about 6 years ago in this PR: https://github.com/apache/kafka/pull/6843/files

A quick look at the implementation seems to indicate that those metrics are produced by iterating over a list of tasks and counting each state. That means the list of states expressed here is a complete partition over all the possible task states.

So here we can replace those 7 metrics with a single kafka.connect.worker.connector.task metric with a breakdown on a kafka.connect.worker.connector.task.state metric attribute.
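A sketch of what that could look like, using yaml anchors and const(...) metric attributes as in the startup suggestion earlier in this review; the exact metric name, type, and unit here are illustrative assumptions:

```yaml
# kafka.connect.worker.connector.task (single metric, state breakdown)
connector-running-task-count:
  metric: &connector_task_metric connector.task.count
  type: &connector_task_type updowncounter
  unit: &connector_task_unit "{task}"
  desc: &connector_task_desc The number of tasks of the connector on the worker.
  metricAttribute:
    kafka.connect.worker.connector.task.state: const(running)
connector-failed-task-count:
  metric: *connector_task_metric
  type: *connector_task_type
  unit: *connector_task_unit
  desc: *connector_task_desc
  metricAttribute:
    kafka.connect.worker.connector.task.state: const(failed)
# ...and likewise for paused, restarting, destroyed, and unassigned.
```

Since the states form a complete partition, connector-total-task-count becomes redundant: aggregating over the state attribute reproduces it.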

