Make GetSpan return a statically allocated invalid span #3037
By initializing the invalid span at process start time, I mean setting it up like this:
int main() {
  std::shared_ptr<Span> invalid_span =
      std::make_shared<DefaultSpan>(SpanContext::GetInvalid());
  opentelemetry::trace::SetSpan(invalid_span);
  // ...
} |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #3037 +/- ##
==========================================
+ Coverage 87.12% 87.87% +0.75%
==========================================
Files 200 195 -5
Lines 6109 6138 +29
==========================================
+ Hits 5322 5393 +71
+ Misses 787 745 -42
|
Just curious, where are you running into an issue in the usage of
Just a couple of thoughts that came to mind as potential solutions/workarounds to avoid any changes in |
I wouldn't necessarily call it an "issue," but it seems suboptimal for GetSpan to allocate a new span each time a default span is needed. Ideally, it would return a statically allocated span when no active span is found in the current context. Are there any potential downsides to returning a statically allocated span as the default? I'm currently working on integrating OpenTelemetry into parts of Gecko (Firefox), and depending on the level of instrumentation, this could potentially impact performance. |
I believe not. It should be fine to return a statically allocated invalid span. I just don't understand how the flow can possibly reach that point. Feel free to make it ready for review. |
If you don't have a root span in main, then this will happen (since we primarily use contexts to control when tracing is active).
#include "opentelemetry/context/runtime_context.h"
#include "opentelemetry/trace/context.h"
#include "opentelemetry/trace/provider.h"

void foo() {
  // With no active span in the current context, GetSpan falls back to an invalid span.
  auto active_span = opentelemetry::trace::GetSpan(
      opentelemetry::context::RuntimeContext::GetCurrent());
  active_span->AddEvent("something in foo");
}

int main() {
  for (int i = 0; i < 100000; i++) {
    foo();  // Will result in 100,000 allocations
  }
  auto provider = opentelemetry::trace::Provider::GetTracerProvider();
  auto tracer = provider->GetTracer("foo_library", "1.0.0");
  auto span = tracer->StartSpan("foo");
  auto scope = tracer->WithActiveSpan(span);
  for (int i = 0; i < 100000; i++) {
    foo();  // Will result in 0 allocations
  }
} |
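To make the two fallback strategies under discussion concrete, here is a minimal sketch. It does not use the real OpenTelemetry types: `Span`, `InvalidSpan`, `GetSpanAllocating`, `GetSpanStatic` and `CheckFallbacks` are hypothetical stand-ins, and `std::shared_ptr` is assumed in place of the API's own smart pointer.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the real span types, for illustration only.
struct Span {
  virtual ~Span() = default;
  virtual bool IsValid() const = 0;
};

struct InvalidSpan final : Span {
  bool IsValid() const override { return false; }
};

// Fallback that builds a fresh invalid span on every call without an active span.
inline std::shared_ptr<Span> GetSpanAllocating(const std::shared_ptr<Span> &active) {
  if (active) return active;
  return std::make_shared<InvalidSpan>();  // one heap allocation per call
}

// Fallback that reuses a single process-wide invalid span.
inline std::shared_ptr<Span> GetSpanStatic(const std::shared_ptr<Span> &active) {
  if (active) return active;
  static const std::shared_ptr<Span> invalid = std::make_shared<InvalidSpan>();
  return invalid;  // no allocation, but the copy bumps a shared reference count
}

// Small self-check: the allocating variant hands out distinct objects,
// the static variant always hands out the same one.
inline void CheckFallbacks() {
  auto a = GetSpanAllocating(nullptr);
  auto b = GetSpanAllocating(nullptr);
  assert(a.get() != b.get());

  auto c = GetSpanStatic(nullptr);
  auto d = GetSpanStatic(nullptr);
  assert(c.get() == d.get());
}
```

The static variant avoids the allocation, but each returned copy still touches the shared reference count, which is the trade-off raised later in this thread.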
Sorry for the confusion, my colleague mentioned some potential race conditions which turned out not to be applicable here. |
Just to clarify, should we continue with this PR, or should it be closed (because of the race found)? As a side note, if the main concern is performance, the best course of action could be to avoid reaching this case in the first place (i.e., provide a root span in main), instead of optimizing non-nominal use cases. [Edit 2024-09-27]
It turns out this is a nominal case: when linking an instrumented library into a non-instrumented application, adding a top-level span is by definition not an option. |
Yes, I marked the PR as a draft while the race was unclear, but it does not apply here, so please continue. The reason for this PR is that this behavior might not be obvious to users unfamiliar with the library. If this change is too invasive, we can close this PR and add a note in the docs. |
Use case
About the use case itself, i.e., to understand when this code path is involved: Assume a library libL, instrumented with OpenTelemetry. This library instruments some spans, so it invokes The library does not create a top level span, because it does not have a main() function. Now, assume an application that links with libL.
Instrumented application
When an application is instrumented for opentelemetry-cpp, the application creates a top level span, in main() or in a major entry point. During execution, there will be a trace in the runtime context, so the code mentioned in this report will not be executed.
Non instrumented application
When an application is not instrumented (no SDK, no exporters, nothing in the runtime trace context), all the calls from the instrumented library libL will use noop code. Because there is no current span, the instrumentation code in the library will hit the code path reported. This kind of deployment:
is valid, and will happen a lot, so performance in this case is a valid concern.
Code change
What the code change proposes is to trade memory allocation for a shared singleton object. Note however that this global object is referred to with a Each thread will use its own copy of My concern is that the fix is not as simple as it seems, and it may actually make performance worse. Behind a This control block object is shared between all instances of shared_ptr, so that for a global span singleton, there will be an associated global control block.
Now, on each creation of a shared_ptr object, the reference count in the control block will be incremented, say with an atomic operation. All the atomic increments and decrements, for all threads in the process, will operate on the same memory, which is the globally shared control block. This can lead to stalls for all threads, and can affect any part of the application code, in every place where a span is instrumented.
See issue #3010, the number of CPU cores for the machine used is 132. On a machine with 132 CPU cores, every thread performing atomic operations, from every place in the code base that instruments a span, will make concurrent accesses to the same, global, unique, span control block. While I don't have measurements to compare the two implementations (malloc+free per thread, versus atomic increment and decrement on the same global), the concern is that the proposed solution will not scale and will cause bottlenecks. |
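The shared control block described above can be observed with plain `std::shared_ptr`. The sketch below is only an illustration: `NoopSpan`, `GlobalInvalidSpan` and `UseCountWithCopies` are hypothetical names, and `std::shared_ptr` is assumed to behave like the pointer type used by the API. It shows that every copy of the singleton updates one and the same counter, which is the memory all threads would contend on.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct NoopSpan {};  // hypothetical stand-in for the invalid span type

// One process-wide invalid span, hence one process-wide control block.
inline const std::shared_ptr<NoopSpan> &GlobalInvalidSpan() {
  static const std::shared_ptr<NoopSpan> instance = std::make_shared<NoopSpan>();
  return instance;
}

// Holds `copies` extra shared_ptr copies and reports the reference count
// stored in the single shared control block while they are alive.
inline long UseCountWithCopies(int copies) {
  assert(copies >= 0);
  std::vector<std::shared_ptr<NoopSpan>> held(copies, GlobalInvalidSpan());
  return GlobalInvalidSpan().use_count();
}
```

With three copies held, the count reads four (the singleton plus three copies), and it drops back to one once they are released. In a multi-threaded process each of those increments and decrements would be an atomic operation on the same cache line, whichever thread performs it.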
Keeping the PR open for discussion.
Blocking the merge due to the scalability concern with many CPU cores.
Okay, I get the concern; making this a thread local should not be a problem. |
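A thread-local variant could look roughly like the sketch below. The names `NoopSpan`, `ThreadLocalInvalidSpan` and `NewThreadGetsOwnInstance` are hypothetical, and `std::shared_ptr` is assumed instead of the API's pointer type. The point is that each thread owns its own span and its own control block, so reference-count updates stay local to the thread.

```cpp
#include <cassert>
#include <memory>
#include <thread>

struct NoopSpan {};  // hypothetical stand-in for the invalid span type

// Each thread lazily creates its own invalid span on first use.
inline const std::shared_ptr<NoopSpan> &ThreadLocalInvalidSpan() {
  thread_local const std::shared_ptr<NoopSpan> instance = std::make_shared<NoopSpan>();
  return instance;
}

// Compares the calling thread's instance with the one seen by a new thread,
// while both instances are still alive.
inline bool NewThreadGetsOwnInstance() {
  const NoopSpan *mine = ThreadLocalInvalidSpan().get();
  assert(mine != nullptr);
  bool distinct = false;
  std::thread worker([&] { distinct = ThreadLocalInvalidSpan().get() != mine; });
  worker.join();
  return distinct;
}
```

This trades one allocation per call for one allocation per thread. It does not address the DLL unloading question raised further down in this thread.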
Another question then: if we use a static here in the method and the method gets inlined, do we get a separate static for each inlining? I will check Compiler Explorer, maybe. |
ebb0bca to 93bfdb0
✅ Deploy Preview for opentelemetry-cpp-api-docs canceled.
|
@marcalff This should alleviate the concerns with parallelism, since each thread gets its own instance.
Apparently, the answer is no; the static variable is not monomorphized and remains a single instance shared across all inlinings. Source: StackOverflow answer |
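The inlining answer can be checked with a small sketch. `NoopSpan`, `InvalidSpanAddress`, `CallSiteA`, `CallSiteB` and `CheckSingleInstance` are hypothetical names used only for illustration. Within one program, a static local variable of an inline function with external linkage refers to a single object, regardless of how many call sites the compiler inlines it into.

```cpp
#include <cassert>

struct NoopSpan {};  // hypothetical stand-in for the invalid span type

// Inline function with a function-local static.
inline const NoopSpan *InvalidSpanAddress() {
  static const NoopSpan instance{};
  return &instance;
}

// Two separate call sites that the compiler is free to inline into.
inline const NoopSpan *CallSiteA() { return InvalidSpanAddress(); }
inline const NoopSpan *CallSiteB() { return InvalidSpanAddress(); }

// Both call sites must observe the same object.
inline void CheckSingleInstance() {
  assert(CallSiteA() == CallSiteB());
}
```

This guarantee covers a single program image. Whether it also holds across shared library or DLL boundaries depends on how symbols are exported, which this sketch does not exercise.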
In my understanding, if we allocate an object in a DLL on Windows, we should also destroy it when the DLL is unloaded. If we create this object by |
Can you suggest an alternative? |
@michaelvanstraten Sorry for bringing this up again. Could you please clarify in which specific hot-path scenario this method becomes a bottleneck? While I agree that returning a statically allocated default span would be ideal, I believe it's important to avoid prematurely fixing this as a potential bottleneck without more context about the impact, especially when the changes are in the API code. |
I have no idea about this problem. In my understanding, copying the instance may be a better solution if the cost of the copy is not a bottleneck. |
This should not pose an issue because the object in the |
@lalitb I think that @marcalff summarized the issue well in his #3037 (comment). |
The reference counter will not decrease, but the memory segments may be unloaded according to the PE ABI on Windows. |
Do you mean the stack memory containing the thread local containing a pointer? |
It's not related to thread local storage. It's related to the address pointer being under another heap management, which may be destroyed when unloading DLLs. |
Wouldn't that be a general problem, since in the other case we also allocate within the DLL? Could we make this a lazy static? |
The summarization assumes that |
Blocking the PR to get the actual use-case: #3037 (comment)
Most modules free objects in the same module where they were created. The old implementation copies the memory and never lets this "shared" object be freed in another module, but if we use static or thread_local here, that may happen. |
@michaelvanstraten , All, To summarize the state of this discussion:
Overall, my personal opinion, shared by other maintainers, is that it is better to put this issue to rest. Optimizing the opentelemetry-cpp code for performance is a very valuable goal. There are very likely other areas that can be improved, possibly giving a better return in terms of performance, and at a much lower cost in terms of development effort or risk, compared to touching GetSpan() in the API, even when the change appears to be trivial at 4 lines. Having this discussion was valuable, but I don't think the outcome can result in a fix that can be merged, therefore this PR will now be closed. What is direly missing in opentelemetry-cpp is a performance benchmark to start with, to help make fact-based decisions grounded in measurements, instead of making assumptions about which part of the code is responsible for overhead. |
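As a rough starting point for such measurements, a timing harness could be sketched as follows. Everything here is hypothetical (`NoopSpan`, `AverageNanosPerCall`, `AllocateInvalidSpan`, `CopySharedInvalidSpan`), it uses only the standard library, and it measures a single thread, so it says nothing about contention across many cores. A dedicated benchmarking framework would give more trustworthy numbers.

```cpp
#include <cassert>
#include <chrono>
#include <memory>

struct NoopSpan {};  // hypothetical stand-in for the invalid span type

// Runs `fn` `iterations` times and returns the average wall-clock cost per call.
template <typename Fn>
double AverageNanosPerCall(Fn &&fn, int iterations) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}

// Candidate 1: allocate a fresh invalid span on every call.
inline void AllocateInvalidSpan() {
  std::shared_ptr<NoopSpan> span = std::make_shared<NoopSpan>();
  volatile bool sink = static_cast<bool>(span);  // keep the result observable
  (void)sink;
}

// Candidate 2: copy a shared, statically allocated invalid span.
inline void CopySharedInvalidSpan() {
  static const std::shared_ptr<NoopSpan> instance = std::make_shared<NoopSpan>();
  std::shared_ptr<NoopSpan> copy = instance;
  volatile bool sink = static_cast<bool>(copy);  // keep the result observable
  (void)sink;
}
```

Calling `AverageNanosPerCall(AllocateInvalidSpan, 100000)` and `AverageNanosPerCall(CopySharedInvalidSpan, 100000)` gives two numbers to compare. The values depend on the machine, the compiler, the optimization level and the allocator, and the compiler may still optimize parts of such a loop, so they should be read as indicative only.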
I’ve observed that the current implementation of GetSpan performs a new allocation each time it is called if no active span is found. Since DefaultSpan is immutable (or at least intended to be), it seems reasonable to return a statically allocated invalid span instead.
This approach could reduce unnecessary heap allocations and improve performance. However, it also introduces the possibility of someone inadvertently replacing the default invalid span with a valid one.
I’m not certain whether this change aligns with the intended design or if the allocation was deliberately chosen to provide more isolation for the default span. An alternative could be to initialize an invalid span at the start of the program if the goal is to eliminate the allocation overhead in production. Nonetheless, there might be other implications to consider.
Please consider this PR as a suggestion and an opportunity for discussion.