Emit JMX metrics using OTEL SDK instead of APM Agent#142133
Emit JMX metrics using OTEL SDK instead of APM Agent#142133mamazzol wants to merge 17 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
rjernst
left a comment
There was a problem hiding this comment.
Overall, I wonder if we could do this slightly differently. Instead of blanket disabling the metrics within the agent, we could decide in APMJvmOptions whether we want to use them, based on a system property. The same system property would then be used in the apm module to conditionally add the new collection (and it also doesn't even need to be in the apm module then, it could be in server and be registered like any other metrics?).
Also, please write a thorough PR description and a more descriptive title. It would be nice to know exactly what jvm metrics existed before this PR and which will exist after.
| private static final OperatingSystemMXBean OS_MX_BEAN = ManagementFactory.getOperatingSystemMXBean(); | ||
| private static final ThreadMXBean THREAD_MX_BEAN = ManagementFactory.getThreadMXBean(); | ||
|
|
||
| private static final MethodHandle GET_MAX_FILE_DESCRIPTOR_COUNT = osLongHandle("getMaxFileDescriptorCount"); |
There was a problem hiding this comment.
Most or probably all of these should already be available with the probe classes that we have for info/stats apis. For example, this one is available via ProcessProbe.getInstance().getMaxFileDescriptorCount().
I think we should aim to keep those probes as the entry point to system stats/info (so we don't duplicate that work), and use this class as a place for the glue to producing metrics from those classes.
| @Override | ||
| protected void doStop() { | ||
| meterRegistry.setProvider(createNoopMeter()); | ||
| if (otelMeterSupplier instanceof OtelSdkMeterSupplier otelSdk) { |
There was a problem hiding this comment.
nit: this would be less clunky if checking for Closeable.
| NodeScope | ||
| ); | ||
|
|
||
| public static final Setting<Boolean> TELEMETRY_OTEL_METRICS_ENABLED_SETTING = Setting.boolSetting( |
There was a problem hiding this comment.
Do we intend this setting to last after we have migrated to the otel sdk? If not, I think this would be better as a system property, otherwise proper deprecation/removal would be needed.
| .registerMetricReader(reader) | ||
| .build(); | ||
| var otelSdk = OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build(); | ||
| runtimeMetrics = RuntimeMetrics.builder(otelSdk).enableAllFeatures().emitExperimentalTelemetry().build(); |
There was a problem hiding this comment.
Which telemetry is experimental? Should we be cautious in relying on it?
| if (serverUrl == null || serverUrl.isEmpty()) { | ||
| throw new IllegalStateException("telemetry.otel.metrics.enabled=true requires telemetry.agent.server_url to be configured"); | ||
| } | ||
| String endpoint = serverUrl + (serverUrl.endsWith("/") ? "" : "/") + "v1/metrics"; |
There was a problem hiding this comment.
Do we need this kind of adaptation or can we expect the url configuration to be exactly the endpoint we should send to?
| public Meter get() { | ||
| if (meterProvider == null) { | ||
| var exporter = createOTLPExporter(); | ||
| TimeValue intervalTimeValue = settings.getAsTime("telemetry.agent.metrics_interval", TimeValue.timeValueSeconds(10)); |
There was a problem hiding this comment.
After the agent is gone, I suspect we won't want to use an "agent" setting, so could we consider a new namespace for our own collection that has settings like this interval?
| } | ||
|
|
||
| private OtlpHttpMetricExporter createOTLPExporter() { | ||
| String serverUrl = APMAgentSettings.APM_AGENT_SETTINGS.getConcreteSetting("telemetry.agent.server_url").get(settings); |
There was a problem hiding this comment.
Same here, I think we should avoid reusing old agent settings, let's have a clear set of settings with the new system that don't have any legacy restrictions/expectations.
| - elastic.apm.application_packages | ||
| - elastic.apm.stack_trace_limit | ||
| - elastic.apm.span_stack_trace_min_duration | ||
| - files: |
There was a problem hiding this comment.
Are these necessary when we have our new collection disabled, using the agent's global supplier but now from the apm module, so it needs these permissions? If so, can we please add a comment here that these can be removed in the future once the agent is gone (probably also true for the system property reads above)?
| def otelVersion = '1.31.0' | ||
| def otelSemconvVersion = '1.21.0-alpha' | ||
| def otelVersion = '1.53.0' | ||
| def otelInstrumentationVersion = '2.23.0-alpha' |
There was a problem hiding this comment.
This seems like a major bump, should we do that in a separate change to ensure it doesn't have any side effects, apart from this PR? Also, why are we using "alpha" versions?
There was a problem hiding this comment.
This is one of the first things I touched, thanks for highlighting, it needs more work!
For the version bumps, there's a couple of reasons:
- Between 1.31.0 and 1.53.0 (but now 1.59.0) there are quite a few quality of life improvements around OTLP exporter, but most importantly 1.52.0 contains this fix for flushing metrics, which I think is important for our use case.
- For the instrumentation version, I want to use the release that targets the specific SDK version picked, which would be 2.19.0 as recommended here
I am happy to separate this into it's own PR to be merged before all this if it's better!
Co-authored-by: Cursor <cursoragent@cursor.com>
4e8d035 to
9048950
Compare
|
@mamazzol |
In this PR, we introduce OTEL SDK and use it to emit the registered metrics using the OTEL SDK directly.
Summary
APMMeterServiceis agnostic after creation, as it just has a Supplier. The newOTelSdkMeterSupplierwill be responsible for the initialisation and setup. For now this is minimal and mostly default.SystemMetricsclass exists to close the gap between the two metrics set. Eventually, we can consider what we want to support, but for migration purposes we are trying to cover all the JMX metrics emitted by APM Agent.When possible, I used the names from the OTEL Semantic convention so that we'll already be one extra step closer to supporting that.
APM Agent vs OTEL SDK metrics
I collected the APM Agent emitted metrics by using --with-apm-agent but a list is also available here
system.process.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.system.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.system.process.memory.size: total virtual memory size of the process.system.memory.actual.free: “actual free” memorysystem.memory.total: total physical memory.jvm.thread.count: current number of live JVM threads (daemon + non-daemon).jvm.fd.used/jvm.fd.max: current / maximum number of open file descriptors.jvm.gc.alloc: approximation of total heap bytes allocated.jvm.gc.count: total number of GC collections that have occurred (by GC name).attribute: name="G1 Concurrent GC"jvm.gc.time: approximate accumulated GC elapsed time in milliseconds (by GC name).attribute: name="G1 Concurrent GC"jvm.memory.heap.used/jvm.memory.heap.committed/jvm.memory.heap.max: used / committed / max heap memory.jvm.memory.non_heap.used/jvm.memory.non_heap.committed/jvm.memory.non_heap.maxjvm.memory.heap.pool.used/jvm.memory.heap.pool.committed/jvm.memory.heap.pool.max: used / committed / max heap memory in a heap pool (by pool name)attribute: name="G1 Survivor Space"jvm.memory.non_heap.pool.used/jvm.memory.non_heap.pool.committed/jvm.memory.non_heap.pool.max: used / committed / max non-heap memory in a heap pool (by pool name)attribute: name="Metaspace" / name="CodeHeap 'profiled nmethods'"The OTEL SDK with the code currently added, emits the following metrics.
SystemMetrics:system.cou.total.norm.pct: The percentage of CPU time in states other than Idle.system.memory.actual.free: Actual free memory in bytes, calculated based on the OS.system.memory.total: Total memory.system.process.memory.size: The total virtual memory the process has.jvm.file_descriptor.count/jvm.file_descriptor.limit: The number of opened / max file descriptors.jvm.gc.count: The total number of collections that have occurred.attribute: name="G1 Concurrent GC"jvm.gc.time: The approximate accumulated collection elapsed time in milliseconds.attribute: name="G1 Concurrent GC"system.process.cgroup.memory.mem.limit.bytes: Memory limit for current cgroup slice. (always 0 when not supported)system.process.cgroup.memory.mem.usage.bytes: Memory usage in current cgroup slice. (always 0 when not supported)jvm.class.count: Number of classes currently loaded.jvm.class.loaded: Number of classes loaded since JVM start.jvm.class.unloaded: Number of classes unloaded since JVM start.jvm.cpu.count: Number of processors available to the Java virtual machine.jvm.cpu.recent_utilization: Recent CPU utilization for the process as reported by the JVM.jvm.cpu.time: CPU time used by the process as reported by the JVM.jvm.gc.alloc: An approximation of the total amount of memory, in bytes, allocated in heap memory.jvm.gc.duration: Duration of JVM garbage collection actions.jvm.memory.committed: Measure of memory committed.jvm.memory.limit: Measure of max obtainable memory.jvm.memory.used: Measure of memory used.jvm.memory.used_after_last_gc: Measure of memory used, as measured after the most recent garbage collection event on this pool.jvm.thread.count: Number of executing platform threads.Mapping between APM Agent and OTEL SDK metrics
jvm.memory.used) and the differentiation between heap/non_heap or with the pools is done with attributes.jvm.thread.countin OTEL is split in two attributes (jvm.thread.daemonandjvm.thread.state).system.cpu.total.norm.pctandsystem.process.cpu.total.norm.pctneed to be evaluated closely to understand ifjvm.system.cpu.utilizationandjvm.cpu.recent.utilizationare good enough replacements or if we need to emit/calculate them differently.ES-14013