
Conversation

michaelvanstraten

I’ve observed that the current implementation of GetSpan performs a new allocation each time it is called if no active span is found. Since DefaultSpan is immutable (or at least intended to be), it seems reasonable to return a statically allocated invalid span instead.

This approach could reduce unnecessary heap allocations and improve performance. However, it also introduces the possibility of someone inadvertently replacing the default invalid span with a valid one.

I’m not certain whether this change aligns with the intended design or if the allocation was deliberately chosen to provide more isolation for the default span. An alternative could be to initialize an invalid span at the start of the program if the goal is to eliminate the allocation overhead in production. Nonetheless, there might be other implications to consider.

Please consider this PR as a suggestion and an opportunity for discussion.
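
For concreteness, here is a minimal sketch of the direction this change proposes (not the exact diff; the names follow api/include/opentelemetry/trace/context.h):

// Sketch only: return one lazily constructed invalid span instead of
// allocating a new DefaultSpan on every miss.
inline nostd::shared_ptr<Span> GetSpan(const context::Context &context) noexcept
{
  context::ContextValue active_span = context.GetValue(kSpanKey);
  if (nostd::holds_alternative<nostd::shared_ptr<Span>>(active_span))
  {
    return nostd::get<nostd::shared_ptr<Span>>(active_span);
  }
  static const nostd::shared_ptr<Span> invalid_span{
      new DefaultSpan(SpanContext::GetInvalid())};
  return invalid_span;
}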


linux-foundation-easycla bot commented Aug 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@michaelvanstraten
Author

michaelvanstraten commented Aug 19, 2024

By initializing the invalid span at process start time, I mean setting it up like this:

int main()
{
    // Create one invalid span up front and attach it to the runtime context,
    // so later GetSpan() calls find an active span and never allocate.
    opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> invalid_span(
        new opentelemetry::trace::DefaultSpan(opentelemetry::trace::SpanContext::GetInvalid()));
    auto current = opentelemetry::context::RuntimeContext::GetCurrent();
    auto token   = opentelemetry::context::RuntimeContext::Attach(
        opentelemetry::trace::SetSpan(current, invalid_span));
    // ...
}


codecov bot commented Aug 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.87%. Comparing base (497eaf4) to head (1de2123).
Report is 152 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3037      +/-   ##
==========================================
+ Coverage   87.12%   87.87%   +0.75%     
==========================================
  Files         200      195       -5     
  Lines        6109     6138      +29     
==========================================
+ Hits         5322     5393      +71     
+ Misses        787      745      -42     
Files with missing lines Coverage Δ
api/include/opentelemetry/trace/context.h 92.31% <100.00%> (+0.65%) ⬆️

... and 99 files with indirect coverage changes

@lalitb
Member

lalitb commented Aug 19, 2024

Just curious, where are you running into an issue with the usage of GetSpan?

  • One place it is used internally is during the creation of a new span in Tracer::StartSpan(), where usually an object of type SpanContext (and not Context) is passed as the options.parent argument. This eliminates the internal call to GetSpan (a sketch of this follows below).
  • Another place it is used is in the TextMapPropagator::Inject() implementation. The application can maintain a static invalid span and pass it to the Inject method whenever the current context is invalid.

Just a couple of thoughts that came to mind as potential solutions/workarounds to avoid any changes to GetSpan.
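
For the first point, a minimal sketch of passing an explicit SpanContext as options.parent (the function name start_child and the tracer name are just for illustration):

#include <opentelemetry/trace/provider.h>

void start_child(const opentelemetry::trace::SpanContext &parent_context)
{
  auto provider = opentelemetry::trace::Provider::GetTracerProvider();
  auto tracer   = provider->GetTracer("example_library", "1.0.0");

  opentelemetry::trace::StartSpanOptions options;
  options.parent = parent_context;  // a SpanContext, not a Context: skips the internal GetSpan call

  auto span = tracer->StartSpan("child_operation", options);
  span->End();
}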

@michaelvanstraten
Author

Just curious, where are you running into an issue with the usage of GetSpan?

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context.

Are there any potential downsides to returning a statically allocated span as the default?

I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

@lalitb
Member

lalitb commented Aug 26, 2024

Are there any potential downsides to returning a statically allocated span as the default?

I believe not. It should be fine to return a static allocation for an invalid span. I just don't understand how the flow can possibly reach there. Feel free to make it ready for review.

@michaelvanstraten michaelvanstraten marked this pull request as ready for review August 27, 2024 14:32
@michaelvanstraten michaelvanstraten requested a review from a team August 27, 2024 14:32
@michaelvanstraten
Author

michaelvanstraten commented Aug 27, 2024

I just don't understand how the flow can possibly reach there. Feel free to make it ready for review.

If you don't have a root span in main, then this will happen (since we primarily use contexts to control when tracing is active).

void foo();

int main()
{
    for (int i = 0; i < 100000; i++) {
        foo(); // Will result in 100,000 allocations
    }

    auto provider = opentelemetry::trace::Provider::GetTracerProvider();
    auto tracer = provider->GetTracer("foo_library", "1.0.0");

    auto span = tracer->StartSpan("foo");
    auto scope = tracer->WithActiveSpan(span);

    for (int i = 0; i < 100000; i++) {
        foo(); // Will result in 0 allocations
    }
}

void foo()
{
    auto active_span = opentelemetry::trace::GetSpan(
        opentelemetry::context::RuntimeContext::GetCurrent());
    active_span->AddEvent("something in foo");
}

@michaelvanstraten michaelvanstraten marked this pull request as draft August 28, 2024 16:22
@michaelvanstraten michaelvanstraten marked this pull request as ready for review August 31, 2024 21:45
@michaelvanstraten
Author

Sorry for the confusion; my colleague mentioned some potential race conditions, which turned out not to be applicable here.

@marcalff
Member

marcalff commented Sep 4, 2024

Sorry for the confusion; my colleague mentioned some potential race conditions, which turned out not to be applicable here.

Just to clarify, should we continue with this PR, or should it be closed (because of the race found)?

As a side note, if the main concern is performance, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non-nominal use cases.


[Edit 2024-09-27]

As a side note, if the main concern is performance, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non-nominal use cases.

It turns out this is a nominal case, when linking an instrumented library into a non-instrumented application, where by definition adding a top-level span is not an option.

@michaelvanstraten
Author

Yes, I had marked the PR as a draft while the race condition was unclear, but it does not apply here, so please continue.

The reason for this PR is that for users not familiar with the library this might not be obvious.

If this change is too invasive, we can close this PR and add a note in the docs.

@marcalff
Member

Use case

About the use case itself, i.e., to understand when this code path is involved:

Assume a library libL, instrumented with OpenTelemetry.

This library instruments some spans, so it invokes opentelemetry::trace::GetSpan() as reported.

The library does not create a top level span, because it does not have a main() function.

Now, assume an application that links with libL.

Instrumented application

When an application is instrumented for opentelemetry-cpp, the application creates a top-level span, in main() or in a major entry point.

During execution, there will be an active span in the runtime context, so the code mentioned in this report will not be executed.

Non-instrumented application

When an application is not instrumented (no SDK, no exporters, nothing in the runtime trace context), all the calls from the instrumented library libL will use noop code.

Because there is no current span, the instrumentation code in the library will hit the code path reported.

This kind of deployment:

  • instrumented library (making API calls)
  • non-instrumented application (no SDK installed)

is valid and will happen a lot, so performance in this case is a valid concern.

Code change

What the code change proposes is to trade a per-call memory allocation for a shared singleton object.

Note, however, that this global object is referred to with a nostd::shared_ptr<Span>.

Each thread will use its own copy of nostd::shared_ptr<Span>, pointing to a global, shared object.

My concern is that the fix is not as simple as it seems, and it may actually make performance worse.

Behind a shared_ptr, there is a control block object that keeps the reference count and manages the object.

This control block object is shared between all instances of shared_ptr, so that for a global span singleton, there will be an associated global control block.

Now, on each creation of a shared_ptr object, the reference count in the control block will be incremented, say with an atomic operation.
For each object destruction, there will be an atomic decrement.

All the atomic increment and decrement, for all threads in the process, will all operate on the same memory, which is the globally shared control block.

This can lead to stalls for all threads, and can affect any part of the application code, in every place where a span is instrumented.

See issue #3010: the machine used there has 132 CPU cores.

On a machine with 132 CPU cores, every thread performing atomic operations, from every place in the code base that instruments a span, will make concurrent accesses to the same global, unique span control block.

While I don't have measurements to compare the two implementations (malloc+free per thread, versus atomic increment and decrement on the same global), the concern is that the proposed solution will not scale and will cause bottlenecks.
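
To make the contention pattern concrete, here is a minimal, self-contained illustration (a plain std::shared_ptr as a stand-in for the span, not opentelemetry-cpp code): every thread copying the same global shared pointer increments and decrements the same atomic reference count.

#include <memory>
#include <thread>
#include <vector>

// Stand-in for the proposed global span singleton.
std::shared_ptr<int> g_singleton = std::make_shared<int>(0);

int main()
{
  std::vector<std::thread> workers;
  for (int t = 0; t < 8; ++t)
  {
    workers.emplace_back([] {
      for (int i = 0; i < 1000000; ++i)
      {
        auto local = g_singleton;  // atomic increment on the shared control block
        (void)local;
      }                            // atomic decrement when local goes out of scope
    });
  }
  for (auto &worker : workers)
  {
    worker.join();
  }
  return 0;
}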

Member

@marcalff marcalff left a comment


Keeping the PR open for discussion.
Blocking the merge due to the scalability concern with many CPU cores.

@michaelvanstraten
Author

Okay, I get the concern; making this a thread_local should not be a problem.
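
Something along these lines, perhaps (assumed shape, not the exact patch; GetInvalidDefaultSpan is a hypothetical helper name): the invalid span becomes a per-thread singleton, so reference-count traffic never crosses threads.

// Hypothetical helper, sketch only: one invalid span per thread.
inline nostd::shared_ptr<Span> GetInvalidDefaultSpan()
{
  static thread_local nostd::shared_ptr<Span> invalid_span{
      new DefaultSpan(SpanContext::GetInvalid())};
  return invalid_span;
}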

@michaelvanstraten
Author

Another question then: if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

I will check on Compiler Explorer, maybe.

@michaelvanstraten michaelvanstraten force-pushed the make-get_span-return-static-invalid branch from ebb0bca to 93bfdb0 Compare October 7, 2024 14:47
@michaelvanstraten michaelvanstraten requested a review from a team as a code owner October 7, 2024 14:47

netlify bot commented Oct 7, 2024

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name                  Link
🔨 Latest commit      1de2123
🔍 Latest deploy log  https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/6725483eb01dfc00085b1103

@michaelvanstraten
Author

@marcalff This should alleviate the concerns about parallelism, since each thread gets its own instance.

Another question then: if we use a static here in the method and the method gets inlined, do we have separate statics for each inlining?

Apparently, the answer is no; the static variable is not duplicated and remains a single instance shared across all inlinings.

Source: StackOverflow answer
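
A tiny illustration of that rule (hypothetical header code, not from this PR): a local static in an inline function is still a single instance per program, even when the function is inlined into many translation units.

// header.h (hypothetical)
inline int &counter()
{
  static int value = 0;  // exactly one instance per program, regardless of inlining
  return value;
}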

@owent
Member

owent commented Oct 8, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

@michaelvanstraten
Author

If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

Can you suggest an alternative?

@lalitb
Member

lalitb commented Oct 9, 2024

I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context.
Are there any potential downsides to returning a statically allocated span as the default?
I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance.

@michaelvanstraten Sorry for bringing this up again. Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely treating this as the bottleneck without more context about the impact, especially since the changes are in API code.

@owent
Member

owent commented Oct 10, 2024

If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

Can you suggest an alternative?

I have no idea about this problem. In my understanding, copying the instance may be a better solution if the cost of the copy is not a bottleneck.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

@michaelvanstraten
Author

Could you please clarify in which specific hot-path scenario this method is becoming a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely treating this as the bottleneck without more context about the impact, especially since the changes are in API code.

@lalitb I think that @marcalff summarized the issue well in his #3037 (comment).

@owent
Member

owent commented Oct 22, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

@owent
Member

owent commented Oct 24, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

@michaelvanstraten
Author

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL? Could we make this a lazy static?

@lalitb
Member

lalitb commented Nov 1, 2024

@lalitb I think that @marcalff summarized the issue well in his #3037 (comment).

The summary assumes that GetSpan() is called while creating spans, which is not entirely correct. Again, @michaelvanstraten, please provide the exact scenario where you see performance overhead because of GetSpan() being called in the hot path. The one place I know of is while using trace propagators, and the user can always create custom propagators to avoid that.

Member

@lalitb lalitb left a comment


Blocking the PR to get the actual use-case: #3037 (comment)

@owent
Member

owent commented Nov 4, 2024

In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object as a thread_local and share it between threads and modules, and we dynamically load and unload DLLs, there may be problems.

This should not pose an issue because the object in the thread_local storage is wrapped in a shared_ptr. Even if the DLL is unloaded, as long as there are still references to the object elsewhere (via shared_ptr), the object will not be deallocated.

The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows.

Do you mean the stack memory containing the thread_local that holds the pointer?

It's not related to thread-local storage. It's related to the pointer's target living under another module's heap management, which may be destroyed when the DLL is unloaded.

Wouldn't that be a general problem, since in the other case we also allocate within the DLL? Could we make this a lazy static?

Most modules free objects in the same module where they were created. The old implementation copies the memory and never has this "shared" object freed in another module, but if we use a static or thread_local here, that may happen.

@marcalff
Member

marcalff commented Nov 5, 2024

@michaelvanstraten, All,

To summarize the state of this discussion:

  • There are some remaining questions, on the use case itself, to clarify whether there is an issue on a main code path (affecting a lot of users), or on degraded test cases (with less impact, and possible mitigations). This is mostly the comments from @lalitb.
  • There are some concerns about whether the fix, as written, actually improves things, or if there is a risk of performance degradation because of scaling issues when using a shared pointer. This is mostly my own (@marcalff) comments.
  • There are also some concerns on how the patch can be changed to alleviate this, using thread-local storage and per-thread singletons, all of this in a header-only include file. This area of the API is extremely sensitive, and there is no proposed solution as of now, just a discussion.

Overall, my personal opinion, shared also by other maintainers, is that it is better to put this issue to rest.

Optimizing the opentelemetry-cpp code for performance is a very valuable goal. There are very likely other areas that can be improved, possibly giving a better return in terms of performance, and at a much lower cost in terms of development effort or risk, compared to touching GetSpan() in the API, even when it appears to be trivial at 4 lines.

Having this discussion was valuable, but I don't think the outcome can result in a fix that can be merged, therefore this PR will now be closed.

What is sorely missing in opentelemetry-cpp is a performance benchmark to start with, to help make fact-based decisions grounded in measurements, instead of assumptions about which part of the code is responsible for overhead.

https://wiki.c2.com/?PrematureOptimization
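
For illustration only, here is a sketch of what such a micro-benchmark could look like, using Google Benchmark as the harness and a plain std::shared_ptr<int> as a stand-in for the span (both the harness and the stand-in are assumptions, not existing code in this repository):

#include <benchmark/benchmark.h>
#include <memory>

// Stand-in for the current behaviour: allocate a fresh object on every call.
static void BM_AllocatePerCall(benchmark::State &state)
{
  for (auto _ : state)
  {
    auto span = std::make_shared<int>(0);
    benchmark::DoNotOptimize(span);
  }
}
BENCHMARK(BM_AllocatePerCall)->Threads(1)->Threads(8)->Threads(64);

// Stand-in for the proposed behaviour: copy one global shared pointer,
// so every call touches the same atomic reference count.
static std::shared_ptr<int> g_singleton = std::make_shared<int>(0);

static void BM_CopyGlobalSingleton(benchmark::State &state)
{
  for (auto _ : state)
  {
    auto span = g_singleton;
    benchmark::DoNotOptimize(span);
  }
}
BENCHMARK(BM_CopyGlobalSingleton)->Threads(1)->Threads(8)->Threads(64);

BENCHMARK_MAIN();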

@marcalff marcalff closed this Nov 5, 2024