Skip to content

Emit JMX metrics using OTEL SDK instead of APM Agent#142133

Closed
mamazzol wants to merge 17 commits intoelastic:mainfrom
mamazzol:otel-dependencies
Closed

Emit JMX metrics using OTEL SDK instead of APM Agent#142133
mamazzol wants to merge 17 commits intoelastic:mainfrom
mamazzol:otel-dependencies

Conversation

@mamazzol
Copy link
Contributor

@mamazzol mamazzol commented Feb 9, 2026

In this PR, we introduce OTEL SDK and use it to emit the registered metrics using the OTEL SDK directly.

Summary

  • OTEL Version bump: this is extracted to Bump OpenTelemetry dependencies #142323 to separate the possible side effect investigation. I will merge main back when that is merged.
  • The switch between APM Agent and OTEL is driven by a sysprop that enables the OTEL SDK initialisation and disables the APM Agent metrics via setting. This allows to still attach the agent for traces, if needed.
  • APMMeterService is agnostic after creation, as it just has a Supplier. The new OTelSdkMeterSupplier will be responsible for the initialisation and setup. For now this is minimal and mostly default.
  • SystemMetrics class exists to close the gap between the two metrics set. Eventually, we can consider what we want to support, but for migration purposes we are trying to cover all the JMX metrics emitted by APM Agent.

When possible, I used the names from the OTEL Semantic convention so that we'll already be one extra step closer to supporting that.

APM Agent vs OTEL SDK metrics

I collected the APM Agent emitted metrics by using --with-apm-agent but a list is also available here

  • system.process.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
  • system.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
  • system.process.memory.size: total virtual memory size of the process.
  • system.memory.actual.free: “actual free” memory
  • system.memory.total: total physical memory.
  • jvm.thread.count: current number of live JVM threads (daemon + non-daemon).
  • jvm.fd.used / jvm.fd.max: current / maximum number of open file descriptors.
  • jvm.gc.alloc: approximation of total heap bytes allocated.
  • jvm.gc.count: total number of GC collections that have occurred (by GC name).
    • attribute : name="G1 Concurrent GC"
  • jvm.gc.time: approximate accumulated GC elapsed time in milliseconds (by GC name).
    • attribute : name="G1 Concurrent GC"
  • jvm.memory.heap.used / jvm.memory.heap.committed / jvm.memory.heap.max: used / committed / max heap memory.
  • jvm.memory.non_heap.used / jvm.memory.non_heap.committed / jvm.memory.non_heap.max
  • jvm.memory.heap.pool.used / jvm.memory.heap.pool.committed / jvm.memory.heap.pool.max: used / committed / max heap memory in a heap pool (by pool name)
    • attribute : name="G1 Survivor Space"
  • jvm.memory.non_heap.pool.used / jvm.memory.non_heap.pool.committed / jvm.memory.non_heap.pool.max: used / committed / max non-heap memory in a heap pool (by pool name)
    • attribute : name="Metaspace" / name="CodeHeap 'profiled nmethods'"

The OTEL SDK with the code currently added, emits the following metrics.

  1. Manually added in SystemMetrics:
  • system.cou.total.norm.pct: The percentage of CPU time in states other than Idle.
  • system.memory.actual.free: Actual free memory in bytes, calculated based on the OS.
  • system.memory.total: Total memory.
  • system.process.memory.size: The total virtual memory the process has.
  • jvm.file_descriptor.count/ jvm.file_descriptor.limit: The number of opened / max file descriptors.
  • jvm.gc.count: The total number of collections that have occurred.
    • attribute : name="G1 Concurrent GC"
  • jvm.gc.time: The approximate accumulated collection elapsed time in milliseconds.
    • attribute : name="G1 Concurrent GC"
  • system.process.cgroup.memory.mem.limit.bytes: Memory limit for current cgroup slice. (always 0 when not supported)
  • system.process.cgroup.memory.mem.usage.bytes: Memory usage in current cgroup slice. (always 0 when not supported)
  1. Emitted by the JMX from the RuntimeMetrics library:
  • jvm.class.count: Number of classes currently loaded.
  • jvm.class.loaded: Number of classes loaded since JVM start.
  • jvm.class.unloaded: Number of classes unloaded since JVM start.
  • jvm.cpu.count: Number of processors available to the Java virtual machine.
  • jvm.cpu.recent_utilization: Recent CPU utilization for the process as reported by the JVM.
  • jvm.cpu.time: CPU time used by the process as reported by the JVM.
  • jvm.gc.alloc: An approximation of the total amount of memory, in bytes, allocated in heap memory.
  • jvm.gc.duration: Duration of JVM garbage collection actions.
  • jvm.memory.committed: Measure of memory committed.
  • jvm.memory.limit: Measure of max obtainable memory.
  • jvm.memory.used: Measure of memory used.
  • jvm.memory.used_after_last_gc: Measure of memory used, as measured after the most recent garbage collection event on this pool.
  • jvm.thread.count: Number of executing platform threads.

Mapping between APM Agent and OTEL SDK metrics

  • Most metrics are either emitted with same or similar name (e.g. SystemMetrics).
  • JVM memory metrics are emitted in OTEL under a single metric (jvm.memory.used) and the differentiation between heap/non_heap or with the pools is done with attributes.
  • jvm.thread.count in OTEL is split in two attributes (jvm.thread.daemon and jvm.thread.state).
  • system.cpu.total.norm.pct and system.process.cpu.total.norm.pct need to be evaluated closely to understand if jvm.system.cpu.utilization and jvm.cpu.recent.utilization are good enough replacements or if we need to emit/calculate them differently.

ES-14013

@mamazzol mamazzol requested review from a team as code owners February 9, 2026 13:33
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.4.0 labels Feb 9, 2026
@mamazzol mamazzol added >refactoring :Core/Infra/Metrics Metrics and metering infrastructure Team:Core/Infra Meta label for core/infra team and removed needs:triage Requires assignment of a team area label v9.4.0 labels Feb 9, 2026
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@mamazzol mamazzol added the v9.4.0 label Feb 9, 2026
Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I wonder if we could do this slightly differently. Instead of blanket disabling the metrics within the agent, we could decide in APMJvmOptions whether we want to use them, based on a system property. The same system property would then be used in the apm module to conditionally add the new collection (and it also doesn't even need to be in the apm module then, it could be in server and be registered like any other metrics?).

Also, please write a thorough PR description and a more descriptive title. It would be nice to know exactly what jvm metrics existed before this PR and which will exist after.

private static final OperatingSystemMXBean OS_MX_BEAN = ManagementFactory.getOperatingSystemMXBean();
private static final ThreadMXBean THREAD_MX_BEAN = ManagementFactory.getThreadMXBean();

private static final MethodHandle GET_MAX_FILE_DESCRIPTOR_COUNT = osLongHandle("getMaxFileDescriptorCount");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most or probably all of these should already be available with the probe classes that we have for info/stats apis. For example, this one is available via ProcessProbe.getInstance().getMaxFileDescriptorCount().

I think we should aim to keep those probes as the entry point to system stats/info (so we don't duplicate that work), and use this class as a place for the glue to producing metrics from those classes.

@Override
protected void doStop() {
meterRegistry.setProvider(createNoopMeter());
if (otelMeterSupplier instanceof OtelSdkMeterSupplier otelSdk) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this would be less clunky if checking for Closeable.

NodeScope
);

public static final Setting<Boolean> TELEMETRY_OTEL_METRICS_ENABLED_SETTING = Setting.boolSetting(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend this setting to last after we have migrated to the otel sdk? If not, I think this would be better as a system property, otherwise proper deprecation/removal would be needed.

.registerMetricReader(reader)
.build();
var otelSdk = OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
runtimeMetrics = RuntimeMetrics.builder(otelSdk).enableAllFeatures().emitExperimentalTelemetry().build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which telemetry is experimental? Should we be cautious in relying on it?

if (serverUrl == null || serverUrl.isEmpty()) {
throw new IllegalStateException("telemetry.otel.metrics.enabled=true requires telemetry.agent.server_url to be configured");
}
String endpoint = serverUrl + (serverUrl.endsWith("/") ? "" : "/") + "v1/metrics";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this kind of adaptation or can we expect the url configuration to be exactly the endpoint we should send to?

public Meter get() {
if (meterProvider == null) {
var exporter = createOTLPExporter();
TimeValue intervalTimeValue = settings.getAsTime("telemetry.agent.metrics_interval", TimeValue.timeValueSeconds(10));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the agent is gone, I suspect we won't want to use an "agent" setting, so could we consider a new namespace for our own collection that has settings like this interval?

}

private OtlpHttpMetricExporter createOTLPExporter() {
String serverUrl = APMAgentSettings.APM_AGENT_SETTINGS.getConcreteSetting("telemetry.agent.server_url").get(settings);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I think we should avoid reusing old agent settings, let's have a clear set of settings with the new system that don't have any legacy restrictions/expectations.

- elastic.apm.application_packages
- elastic.apm.stack_trace_limit
- elastic.apm.span_stack_trace_min_duration
- files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these necessary when we have our new collection disabled, using the agent's global supplier but now from the apm module, so it needs these permissions? If so, can we please add a comment here that these can be removed in the future once the agent is gone (probably also true for the system property reads above)?

def otelVersion = '1.31.0'
def otelSemconvVersion = '1.21.0-alpha'
def otelVersion = '1.53.0'
def otelInstrumentationVersion = '2.23.0-alpha'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a major bump, should we do that in a separate change to ensure it doesn't have any side effects, apart from this PR? Also, why are we using "alpha" versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the first things I touched, thanks for highlighting, it needs more work!

For the version bumps, there's a couple of reasons:

  1. Between 1.31.0 and 1.53.0 (but now 1.59.0) there are quite a few quality of life improvements around OTLP exporter, but most importantly 1.52.0 contains this fix for flushing metrics, which I think is important for our use case.
  2. For the instrumentation version, I want to use the release that targets the specific SDK version picked, which would be 2.19.0 as recommended here

I am happy to separate this into it's own PR to be merged before all this if it's better!

@mamazzol mamazzol changed the title Otel dependencies Emit JMX metrics using OTEL SDK instead of APM Agent Feb 12, 2026
@mamazzol mamazzol requested review from a team as code owners February 25, 2026 12:55
@mamazzol mamazzol requested a review from jeramysoucy February 25, 2026 12:55
@mamazzol mamazzol requested review from a team and removed request for a team and jeramysoucy February 25, 2026 13:02
@consulthys
Copy link
Contributor

consulthys commented Feb 25, 2026

@mamazzol
Along with the .monitoring-* template changes, you also need to bump the STACK_MONITORING_REGISTRY_VERSION field in https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/MonitoringTemplateRegistry.java#L80

@mamazzol mamazzol closed this Feb 25, 2026
@mamazzol mamazzol deleted the otel-dependencies branch February 25, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Metrics Metrics and metering infrastructure >refactoring Team:Core/Infra Meta label for core/infra team v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants