
Conversation

michaelvanstraten

I’ve observed that the current implementation of GetSpan performs a new allocation each time it is called if no active span is found. Since DefaultSpan is immutable (or at least intended to be), it seems reasonable to return a statically allocated invalid span instead.

This approach could reduce unnecessary heap allocations and improve performance. However, it also introduces the possibility of someone inadvertently replacing the default invalid span with a valid one.

I’m not certain whether this change aligns with the intended design or if the allocation was deliberately chosen to provide more isolation for the default span. An alternative could be to initialize an invalid span at the start of the program if the goal is to eliminate the allocation overhead in production. Nonetheless, there might be other implications to consider.

Please consider this PR as a suggestion and an opportunity for discussion.
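
For concreteness, here is a minimal sketch of the direction this change proposes (not the exact diff; the names follow api/include/opentelemetry/trace/context.h):

// Sketch only: return one lazily constructed invalid span instead of
// allocating a new DefaultSpan on every miss.
inline nostd::shared_ptr<Span> GetSpan(const context::Context &context) noexcept
{
  context::ContextValue active_span = context.GetValue(kSpanKey);
  if (nostd::holds_alternative<nostd::shared_ptr<Span>>(active_span))
  {
    return nostd::get<nostd::shared_ptr<Span>>(active_span);
  }
  static const nostd::shared_ptr<Span> invalid_span{
      new DefaultSpan(SpanContext::GetInvalid())};
  return invalid_span;
}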


linux-foundation-easycla bot commented Aug 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@michaelvanstraten
Author

michaelvanstraten commented Aug 19, 2024

By initializing the invalid span at process start time, I mean setting it up like this:

int main()
{
    // Create one invalid span up front and attach it to the runtime context,
    // so later GetSpan() calls find an active span and never allocate.
    opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> invalid_span(
        new opentelemetry::trace::DefaultSpan(opentelemetry::trace::SpanContext::GetInvalid()));
    auto current = opentelemetry::context::RuntimeContext::GetCurrent();
    auto token   = opentelemetry::context::RuntimeContext::Attach(
        opentelemetry::trace::SetSpan(current, invalid_span));
    // ...
}


codecov bot commented Aug 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.87%. Comparing base (497eaf4) to head (1de2123).
Report is 152 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3037      +/-   ##
==========================================
+ Coverage   87.12%   87.87%   +0.75%     
==========================================
  Files         200      195       -5     
  Lines        6109     6138      +29     
==========================================
+ Hits         5322     5393      +71     
+ Misses        787      745      -42     
Files with missing lines Coverage Δ
api/include/opentelemetry/trace/context.h 92.31% <100.00%> (+0.65%) ⬆️

... and 99 files with indirect coverage changes

@lalitb
Member

lalitb commented Aug 19, 2024

Just curious, where are you running into an issue with the usage of GetSpan?

  • One place it is used internally is during the creation of a new span in Tracer::StartSpan(), where usually an object of type SpanContext (and not Context) is passed as the options.parent argument. This eliminates the internal call to GetSpan (a sketch of this follows below).
  • Another place it is used is in the TextMapPropagator::Inject() implementation. The application can maintain a static invalid span and pass it to the Inject method whenever the current context is invalid.

Just a couple of thoughts that came to mind as potential solutions/workarounds to avoid any changes to GetSpan.
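
For the first point, a minimal sketch of passing an explicit SpanContext as options.parent (the function name start_child and the tracer name are just for illustration):

#include <opentelemetry/trace/provider.h>

void start_child(const opentelemetry::trace::SpanContext &parent_context)
{
  auto provider = opentelemetry::trace::Provider::GetTracerProvider();
  auto tracer   = provider->GetTracer("example_library", "1.0.0");

  opentelemetry::trace::StartSpanOptions options;
  options.parent = parent_context;  // a SpanContext, not a Context: skips the internal GetSpan call

  auto span = tracer->StartSpan("child_operation", options);
  span->End();
}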

@michaelvanstraten
Author

Just curious, where are you running into an issue with the usage of GetSpan?

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context.

Are there any potential downsides to returning a statically allocated span as the default?

I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

@lalitb
Member

lalitb commented Aug 26, 2024

Are there any potential downsides to returning a statically allocated span as the default?

I believe not. It should be fine to return a static allocation for an invalid span. I just don't understand how the flow can possibly reach there. Feel free to make it ready for review.

@michaelvanstraten michaelvanstraten marked this pull request as ready for review August 27, 2024 14:32
@michaelvanstraten michaelvanstraten requested a review from a team August 27, 2024 14:32
@michaelvanstraten
Author

michaelvanstraten commented Aug 27, 2024

I just don't understand how the flow can possibly reach there. Feel free to make it ready for review.

If you don't have a root span in main, then this will happen (since we primarily use contexts to control when tracing is active).

void foo();

int main()
{
    for (int i = 0; i < 100000; i++) {
        foo(); // Will result in 100,000 allocations
    }

    auto provider = opentelemetry::trace::Provider::GetTracerProvider();
    auto tracer = provider->GetTracer("foo_library", "1.0.0");

    auto span = tracer->StartSpan("foo");
    auto scope = tracer->WithActiveSpan(span);

    for (int i = 0; i < 100000; i++) {
        foo(); // Will result in 0 allocations
    }
}

void foo()
{
    auto active_span = opentelemetry::trace::GetSpan(
        opentelemetry::context::RuntimeContext::GetCurrent());
    active_span->AddEvent("something in foo");
}

@michaelvanstraten michaelvanstraten marked this pull request as draft August 28, 2024 16:22
@michaelvanstraten michaelvanstraten marked this pull request as ready for review August 31, 2024 21:45
@michaelvanstraten
Author

Sorry for the confusion; my colleague mentioned some potential race conditions, which turned out not to be applicable here.

@marcalff
Member

marcalff commented Sep 4, 2024

Sorry for the confusion; my colleague mentioned some potential race conditions, which turned out not to be applicable here.

Just to clarify, should we continue with this PR, or should it be closed (because of the race found)?

As a side note, if the main concern is performance, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non-nominal use cases.


[Edit 2024-09-27]

As a side note, if the main concern is performance, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non-nominal use cases.

It turns out this is a nominal case, when linking an instrumented library into a non-instrumented application, where by definition adding a top-level span is not an option.

@michaelvanstraten
Author

Yes, I had marked the PR as a draft while the race condition was unclear, but it does not apply here, so please continue.

The reason for this PR is that for users not familiar with the library this might not be obvious.

If this change is too invasive, we can close this PR and add a note in the docs.

@marcalff
Member

Use case

About the use case itself, i.e., to understand when this code path is involved:

Assume a library libL, instrumented with OpenTelemetry.

This library instruments some spans, so it invokes opentelemetry::trace::GetSpan() as reported.

The library does not create a top level span, because it does not have a main() function.

Now, assume an application that links with libL.

Instrumented application

When an application is instrumented for opentelemetry-cpp, the application creates a top-level span, in main() or in a major entry point.

During execution, there will be an active span in the runtime context, so the code mentioned in this report will not be executed.

Non-instrumented application

When an application is not instrumented (no SDK, no exporters, nothing in the runtime trace context), all the calls from the instrumented library libL will use noop code.

Because there is no current span, the instrumentation code in the library will hit the code path reported.

This kind of deployment:

  • instrumented library (making API calls)
  • non-instrumented application (no SDK installed)

is valid and will happen a lot, so performance in this case is a valid concern.

Code change

What the code change proposes is to trade a per-call memory allocation for a shared singleton object.

Note, however, that this global object is referred to with a nostd::shared_ptr<Span>.

Each thread will use its own copy of nostd::shared_ptr<Span>, pointing to a global, shared object.

My concern is that the fix is not as simple as it seems, and it may actually make performance worse.

Behind a shared_ptr, there is a control block object that keeps the reference count and manages the object.

This control block object is shared between all instances of shared_ptr, so that for a global span singleton, there will be an associated global control block.

Now, on each creation of a shared_ptr object, the reference count in the control block will be incremented, say with an atomic operation.
For each object destruction, there will be an atomic decrement.

All the atomic increment and decrement, for all threads in the process, will all operate on the same memory, which is the globally shared control block.

This can lead to stalls for all threads, and can affect any part of the application code, in every place where a span is instrumented.

See issue #3010: the machine used there has 132 CPU cores.

On a machine with 132 CPU cores, every thread performing atomic operations, from every place in the code base that instruments a span, will make concurrent accesses to the same global, unique span control block.

While I don't have measurements to compare the two implementations (malloc+free per thread, versus atomic increment and decrement on the same global), the concern is that the proposed solution will not scale and will cause bottlenecks.
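
To make the contention pattern concrete, here is a minimal, self-contained illustration (a plain std::shared_ptr as a stand-in for the span, not opentelemetry-cpp code): every thread copying the same global shared pointer increments and decrements the same atomic reference count.

#include <memory>
#include <thread>
#include <vector>

// Stand-in for the proposed global span singleton.
std::shared_ptr<int> g_singleton = std::make_shared<int>(0);

int main()
{
  std::vector<std::thread> workers;
  for (int t = 0; t < 8; ++t)
  {
    workers.emplace_back([] {
      for (int i = 0; i < 1000000; ++i)
      {
        auto local = g_singleton;  // atomic increment on the shared control block
        (void)local;
      }                            // atomic decrement when local goes out of scope
    });
  }
  for (auto &worker : workers)
  {
    worker.join();
  }
  return 0;
}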

Member

@marcalff marcalff left a comment


Keeping the PR open for discussion.
Blocking the merge due to the scalability concern with many CPU cores.

@michaelvanstraten
Author

Okay, I get the concern; making this a thread_local should not be a problem.
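
Something along these lines, perhaps (assumed shape, not the exact patch; GetInvalidDefaultSpan is a hypothetical helper name): the invalid span becomes a per-thread singleton, so reference-count traffic never crosses threads.

// Hypothetical helper, sketch only: one invalid span per thread.
inline nostd::shared_ptr<Span> GetInvalidDefaultSpan()
{
  static thread_local nostd::shared_ptr<Span> invalid_span{
      new DefaultSpan(SpanContext::GetInvalid())};
  return invalid_span;
}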

@michaelvanstraten
Author

Another question then: if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

I will check on Compiler Explorer, maybe.

@michaelvanstraten michaelvanstraten force-pushed the make-get_span-return-static-invalid branch from ebb0bca to 93bfdb0 Compare October 7, 2024 14:47
@michaelvanstraten michaelvanstraten requested a review from a team as a code owner October 7, 2024 14:47

netlify bot commented Oct 7, 2024

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name                  Link
🔨 Latest commit      1de2123
🔍 Latest deploy log  https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/6725483eb01dfc00085b1103

@michaelvanstraten
Author

@marcalff This should alleviate the concerns about parallelism, since each thread gets its own instance.

Another question then: if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

Apparently, the answer is no; the static variable is not duplicated and remains a single instance shared across all inlinings.

Source: StackOverflow answer
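
A tiny illustration of that rule (hypothetical header code, not from this PR): a local static in an inline function is still a single instance per program, even when the function is inlined into many translation units.

// header.h (hypothetical)
inline int &counter()
{
  static int value = 0;  // exactly one instance per program, regardless of inlining
  return value;
}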

@owent
Member

owent commented Oct 8, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

@michaelvanstraten
Author

If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

Can you suggest an alternative?

@lalitb
Member

lalitb commented Oct 9, 2024

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context.
Are there any potential downsides to returning a statically allocated span as the default?
I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

@michaelvanstraten Sorry for bringing this up again. Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely treating this as the bottleneck without more context about the impact, especially since the changes are in API code.

@owent
Member

owent commented Oct 10, 2024

If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

Can you suggest an alternative?

I have no idea about this problem. In my understanding, copying the instance may be a better solution if the cost of the copy is not a bottleneck.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

@michaelvanstraten
Author

Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely treating this as the bottleneck without more context about the impact, especially since the changes are in API code.

@lalitb I think that @marcalff summarized the issue well in his #3037 (comment).

@owent
Member

owent commented Oct 22, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

@owent
Member

owent commented Oct 24, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL? Could we make this a lazy static?

@lalitb
Member

lalitb commented Nov 1, 2024

@lalitb I think that @marcalff summarized the issue well in his #3037 (comment).

The summary assumes that GetSpan() is called while creating spans, which is not entirely correct. Again, @michaelvanstraten, please provide the exact scenario where you see performance overhead because of GetSpan() being called in the hot path. The one place I know of is while using trace propagators, and the user can always create custom propagators to avoid that.

Member

@lalitb lalitb left a comment


Blocking the PR to get the actual use-case: #3037 (comment)

@owent
Member

owent commented Nov 4, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL? Could we make this a lazy static?

Most modules free objects in the same module where they were created. The old implementation copies the memory and never has this "shared" object freed in another module, but if we use a static or thread_local here, that may happen.

@marcalff
Member

marcalff commented Nov 5, 2024

@michaelvanstraten, All,

To summarize the state of this discussion:

  • There are some remaining questions, on the use case itself, to clarify whether there is an issue on a main code path (affecting a lot of users), or on degraded test cases (with less impact, and possible mitigations). This is mostly the comments from @lalitb.
  • There are some concerns about whether the fix, as written, actually improves things, or if there is a risk of performance degradation because of scaling issues when using a shared pointer. This is mostly my own (@marcalff) comments.
  • There are also some concerns on how the patch can be changed to alleviate this, using thread-local storage and per-thread singletons, all of this in a header-only include file. This area of the API is extremely sensitive, and there is no proposed solution as of now, just a discussion.

Overall, my personal opinion, shared also by other maintainers, is that it is better to put this issue to rest.

Optimizing the opentelemetry-cpp code for performance is a very valuable goal. There are very likely other areas that can be improved, possibly giving a better return in terms of performance, and at a much lower cost in terms of development effort or risk, compared to touching GetSpan() in the API, even when it appears to be trivial at 4 lines.

Having this discussion was valuable, but I don't think the outcome can result in a fix that can be merged, therefore this PR will now be closed.

What is sorely missing in opentelemetry-cpp is a performance benchmark to start with, to help make fact-based decisions grounded in measurements, instead of assumptions about which part of the code is responsible for overhead.

https://wiki.c2.com/?PrematureOptimization
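
For illustration only, here is a sketch of what such a micro-benchmark could look like, using Google Benchmark as the harness and a plain std::shared_ptr<int> as a stand-in for the span (both the harness and the stand-in are assumptions, not existing code in this repository):

#include <benchmark/benchmark.h>
#include <memory>

// Stand-in for the current behaviour: allocate a fresh object on every call.
static void BM_AllocatePerCall(benchmark::State &state)
{
  for (auto _ : state)
  {
    auto span = std::make_shared<int>(0);
    benchmark::DoNotOptimize(span);
  }
}
BENCHMARK(BM_AllocatePerCall)->Threads(1)->Threads(8)->Threads(64);

// Stand-in for the proposed behaviour: copy one global shared pointer,
// so every call touches the same atomic reference count.
static std::shared_ptr<int> g_singleton = std::make_shared<int>(0);

static void BM_CopyGlobalSingleton(benchmark::State &state)
{
  for (auto _ : state)
  {
    auto span = g_singleton;
    benchmark::DoNotOptimize(span);
  }
}
BENCHMARK(BM_CopyGlobalSingleton)->Threads(1)->Threads(8)->Threads(64);

BENCHMARK_MAIN();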

@marcalff marcalff closed this Nov 5, 2024