Skip to content

Conversation

@jhayes2-chwy
Copy link
Contributor

As outlined in #13407 and #14047, the thread.name attribute can create very high cardinality in many cases, and also contributes to a memory leak in the collection mechanism of those metrics. This PR is aimed at fixing both issues.

The affected metrics are:

  • jvm.memory.allocation
  • jvm.cpu.longlock
  • jvm.network.io
  • jvm.network.time

@jhayes2-chwy jhayes2-chwy requested a review from a team as a code owner June 18, 2025 15:29
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 18, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Comment on lines 18 to 26
// Use an access-ordered LinkedHashMap so we get a bounded LRU cache
private final Map<String, Consumer<RecordedEvent>> perThread =
new LinkedHashMap<String, Consumer<RecordedEvent>>(16, 0.75F, true) {
@Override
protected boolean removeEldestEntry(Map.Entry<String, Consumer<RecordedEvent>> eldest) {
// Bound this map to prevent memory leaks with fast-cycling thread frameworks
return size() > 512;
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this map needed now that the thread name isn't being on the metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly not for correctness reasons. I didn't see any explicit documentation around this map, but after reading through the code my impression was this was mostly a performance optimization to reduce allocations of Consumer<RecordedEvent> instances.

Since this was previously using an unsynchronized hashmap, it does appear to me that the invocation of these consumers is all single-threaded (I haven't worked directly with JFR before, so maybe that's not true?); it smells to me like there's no contention or throughput reasons to have this cache other than to reduce allocations.

Copy link
Contributor Author

@jhayes2-chwy jhayes2-chwy Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that jives with your understanding, I could simply remove the cache. While all the little allocations of the PerThread*Handler inner classes probably aren't a problem for the use-cases I'm coming from (high-scale ecommerce), I imagine there are definitely existing OTEL use-cases where it would be, especially on older and/or smaller JVMs.

With that in mind, I'd actually prefer to jump all the way to inlining the PerThread*Handler-based logic directly into the AbstractThreadDispatchingHandler-subclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that in mind, I'd actually prefer to jump all the way to inlining the PerThread*Handler-based logic directly into the AbstractThreadDispatchingHandler-subclasses.

I've gone ahead and done this, should be easy to revert if needed.

// FIXME only handles substrings of contiguous digits -> a single `x`, but should be good
// enough for now
@Nullable
public String groupedName(RecordedEvent ev) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere now that AbstractThreadDispatchingHandler was deleted?

Copy link
Contributor Author

@jhayes2-chwy jhayes2-chwy Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, addressed.

@laurit
Copy link
Contributor

laurit commented Jun 19, 2025

@jhayes2-chwy you need to sign the CLA in order to get the PR merged

@jhayes2-chwy
Copy link
Contributor Author

@jhayes2-chwy you need to sign the CLA in order to get the PR merged

Indeed; I've been working with my company to determine if we have a Corporate CLA, so I'm still waiting on that.

public Consumer<RecordedEvent> createPerThreadSummarizer(String threadName) {
return new PerThreadLongLockHandler(histogram, threadName);
public void accept(RecordedEvent recordedEvent) {
if (recordedEvent.hasField(EVENT_THREAD)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check needed? I think that previously this check was always true because ThreadGrouper would have returned null if the thread was missing. I'd remove this check unless anybody sees a reason why to keep it.

Copy link
Contributor Author

@jhayes2-chwy jhayes2-chwy Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it may have been needed in order to add an attribute for the monitor's className, but I'd suspect that would have the same high-cardinality concerns as threadName, presuming it was used on Metrics.

Though I can personally see where that could provide a lot of value without cardinality issues, e.g. as an attribute on a Span Event or Profile. But unless there's active plans to build out that logic super soon, IMHO it'd be a premature optimization to keep the check in, at least for this specific reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone ahead and removed it.

@otelbot-java-instrumentation
Copy link
Contributor

🔧 The result from spotlessApply was committed to the PR branch.

@jhayes2-chwy
Copy link
Contributor Author

jhayes2-chwy commented Jun 20, 2025

Is there a place I can see when the next release is and what'll be in it? This fix is something we're banking on having relatively soon (within roughly a calendar month or so). If the next full stable release is quite a ways away, I think we'd be okay even with an RC or pre-release to get us through the next few months.

@trask
Copy link
Member

trask commented Jun 20, 2025

this repo releases monthly: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/RELEASING.md#release-cadence

there are snapshots published with every merge to main if that helps: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/CONTRIBUTING.md#snapshot-builds

@laurit
Copy link
Contributor

laurit commented Jun 20, 2025

Is there a place I can see when the next release is and what'll be in it? This fix is something we're banking on having relatively soon (within roughly a calendar month or so). If the next full stable release is quite a ways away, I think we'd be okay even with an RC or pre-release to get us through the next few months.

Since these metrics are not enabled in the default configuration you could consider not enabling them as a workaround.

@jhayes2-chwy
Copy link
Contributor Author

Looks like the latest deps test for Spring Cloud Gateway is failing because there's a change to property keys and nullibility:

Caused by: org.springframework.boot.context.properties.bind.validation.BindValidationException: Binding validation errors on spring.cloud.gateway.server.webflux
       - Field error in object 'spring.cloud.gateway.server.webflux' on field 'routes[0].uri': rejected value [null]; codes [NotNull.spring.cloud.gateway.server.webflux.routes[0].uri,NotNull.spring.cloud.gateway.server.webflux.routes.uri,NotNull.routes[0].uri,NotNull.routes.uri,NotNull.uri,NotNull.java.net.URI,NotNull]; arguments [org.springframework.context.support.DefaultMessageSourceResolvable: codes [spring.cloud.gateway.server.webflux.routes[0].uri,routes[0].uri]; arguments []; default message [routes[0].uri]]; default message [must not be null]

This shouldn't have been from my changes, and neither are the failing markdown linter checks. Is there any other step I'm missing to get approval?

@trask
Copy link
Member

trask commented Jun 23, 2025

Looks like the latest deps test for Spring Cloud Gateway is failing because there's a change to property keys and nullibility:

Caused by: org.springframework.boot.context.properties.bind.validation.BindValidationException: Binding validation errors on spring.cloud.gateway.server.webflux
       - Field error in object 'spring.cloud.gateway.server.webflux' on field 'routes[0].uri': rejected value [null]; codes [NotNull.spring.cloud.gateway.server.webflux.routes[0].uri,NotNull.spring.cloud.gateway.server.webflux.routes.uri,NotNull.routes[0].uri,NotNull.routes.uri,NotNull.uri,NotNull.java.net.URI,NotNull]; arguments [org.springframework.context.support.DefaultMessageSourceResolvable: codes [spring.cloud.gateway.server.webflux.routes[0].uri,routes[0].uri]; arguments []; default message [routes[0].uri]]; default message [must not be null]

This shouldn't have been from my changes, and neither are the failing markdown linter checks. Is there any other step I'm missing to get approval?

ya, you can ignore these failures, they won't prevent review or merging

@jhayes2-chwy
Copy link
Contributor Author

jhayes2-chwy commented Jun 27, 2025

@trask @laurit have I missed a step in getting approvals or any other "merge-readiness" tasks? I just want to confirm if the ball is in my court or yours (or I guess the codeowners in general); this is my first contribution to a significant OSS project.

@jhayes2-chwy
Copy link
Contributor Author

Addressed the comments for the lingering empty-args super constructor calls. Also did a full run through on the current diff, I didn't see any other remaining cleanliness issues leftover from the refactorings.

@trask trask merged commit 8bcdeff into open-telemetry:main Jun 30, 2025
89 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants