Skip to content

feat: OpenTelemetry context activation POC #202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 12 commits into
base: v0.1.x
Choose a base branch
from

Conversation

bantonsson
Copy link

TL;DR

This PR is opened as a basis for discussion around providing a more seamless interoperability for users of the OpenTelemetry APIs when using frameworks and libraries that use tracing at their core.

An OpenTelemetry Context that mirrors the active tracing Span is now active on the executing thread while in a tracing Span. Having proper OpenTelemetry context activation will not only make interoperability easier for users of the OpenTelemetry API, but will also open up the possibility for simpler log/trace/baggage correlation and in the future profiling/trace correlation among other things.

Motivation

The aim is to provide users of the OpenTelemetry API a more seamless and consistent experience. One such thing is that calling the OpenTelemetry API Context::current() did not give you a Context that contained an OpenTelemetry Span that represented the current tracing Span. Another thing is propagation of Baggage or user defined types that were not picked up either.

A lot of effort has been put into making the existing OpenTelemetrySpanExt API work the same way to maintain maximum backwards compatibility and to avoid fragmentation by building a separate OpenTelemetry and Tokio Tracing bridge.

Long discussion about OpenTelemetry and Tokio Tracing interoperability, and an issue specifically discussing this POC.

Solution

This solution adds a new feature activate_context that will activate/deactivate an OpenTelemetry Context that mirrors the current tracing Span. This feature is currently on by default, but can be turned off to avoid any performance overhead incurred by the context activation and bookkeeping.

The second big change is that the OpenTelemetry SpanBuilder is lazily consumed and a real OpenTelemetry Span is created when necessary, so there is no longer any need for the PreSampledTracer and the special code in OpenTelemetry that it relies on. This has some implications, like you can no longer set the parent_context after you have entered a Span or called the context method on the Span.

Benchmark numbers are available in this comment, and when the activate_context feature is turned off, the performance stays the same. (There has been one more optimization since the last run there, fixing the regression in the many_events benchmarks)

@djc
Copy link
Collaborator

djc commented May 6, 2025

I have been passively maintaining tracing-opentelemetry (mostly by reviewing relatively small PRs) for a few months, but I don't currently have a use case for this crate. Unless some funding for doing this work materializes, I'm not motivated to review larger efforts like this, so I suggest the OpenTelemetry and/or tracing organizations find some other people to move this effort forward.

@bantonsson bantonsson force-pushed the ban/tracing-otel-context-updated branch from ce5421a to 51e471b Compare May 7, 2025 10:56
Copy link
Collaborator

@mladedav mladedav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a closer look at everything (including the previous discussions we've had in the other repo) later, but for now, I think the most useful thing would be to get rid of the feature and possibly other unrelated changes (maybe the patched dependencies?) to make it easier to understand which parts here are left because they are needed and which parts are here just for compatibility with the old way.

Generally, I think this would be a step in the right direction, though I'd still want to explore the possibility of providing parents at span creation time and hopefully starting the context then, which would be closer to what opentelemetry specifies.

src/tracer.rs Outdated
Comment on lines 66 to 77
impl PreSampledTracer for SdkTracer {
fn sampled_context(&self, data: &mut crate::OtelData) -> OtelContext {
let parent_cx = &data.parent_cx;
let builder = &mut data.builder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no longer any need for the PreSampledTracer and the special code in OpenTelemetry that it relies on.

If it's not needed anymore, are these changes here because you want to support this as well as the old behavior through the feature?

I think we should really get rid of the feature. For the purpose of this PoC it just adds unnecessary changes and it's not clear whether for example this is or is not needed.

Even outside of the PoC I don't think we'd want the feature, because it adds maintenance burden with the justification being avoiding performance, but later you wrote there's no performance difference between the feature being on and off, so I'd say the feature might just cause confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With regards to the performance, what I meant to say with

Benchmark numbers are available in this comment, and when the activate_context feature is turned off, the performance stays the same. (There has been one more optimization since the last run there, fixing the regression in the many_events benchmarks)

Is that when the activate_context feature is turned off, the performance is the same as before the POC.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not removed the PreSampledTracer yet, but it is not needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does sampling work when the feature is turned off? That would have to use the same mechanism as before wouldn't it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So regardless of the activate_context feature, the OTel SpanBuilder is consumed when necessary and the created OTel Span is inserted into an OTel Context that is stored with the tracing Span.

src/layer.rs Outdated
} else {
data.builder.links = Some(vec![follows_link]);
data.parent_cx.span().add_link(follows_context, vec![]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding the link to the parent context instead of the current context?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the names are confusing. This field is the parent_context while the SpanBuilder is being used, and then it becomes the current_context. The name should be changed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just change the struct into an enum so that we can also get rid of the options and make it clear when are things expected to be present and what they are. Now there's for example also an end time inside the builder, but also one in the struct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end_time in the builder API should be removed open-telemetry/opentelemetry-rust#2746

src/span_ext.rs Outdated
/// A span's parent should only be set _once_, for the purpose described above.
/// Additionally, once a span has been fully built - and the SpanBuilder has been consumed -
/// the parent _cannot_ be mutated.
///
fn set_parent(&self, cx: Context) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should return a Result to signal whether this actually did something or not. I get that it's technically a breaking change and you tried to keep the api identical, but unless someone actually did let () = span.set_parent(...); which I highly doubt, or someone implemented this trait for their own type (which I doubt and they probably should not have done so either), users will get at most a warning about an unused Result.

Other methods that behave differently based on whether the context was already started or not should do the same thing, but I don't think anything else has this kind of behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's a valid point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking at this I realized that not only this one but any of the setters in the OpenTelemetrySpanExt can fail silently if there is no OtelData associated with the Span. Adding a Result here will change a lot of code, and I'm not sure if there is any benefit to this since there is nothing the end user can do about it (except maybe log something).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end user can at least unwrap on it in their prototype phase. The alternative being set all of tracing and propagation up, export their traces and then manually check that all their traces have correct parents.

Ignoring the Result is easy enough for users that I think that giving them the option to at least log the issue (and the parent context they tried to use) even if they can't handle the failure properly is worth it. And I do believe it's too easy to misuse this API that I'd rather have everyone explicitly ignore the Results, because I believe that this will happen.

I guess we can ignore the other methods for now then. However, I think that here it would be worth it to cover all cases because if setting the parent fails for any reason, the whole trace will be broken as even the trace ID will not be propagated anywhere else and every child of this broken span will be a root span of its own new trace. I get there probably isn't much the code can do if that happens in production, but I still believe we should not just silently ignore it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated set_parent to return a result.

@bantonsson
Copy link
Author

Thanks for the comments @mladedav. I think that getting rid of the feature is a bit risky, since these changes do have performance implications, and it would be good to allow current users to continue.

I'll rebase the PR and try to clean it up a bit.

bantonsson and others added 7 commits June 17, 2025 17:28
Right now this is completely focused on the Spans but there could be a
feature that only propagates the context as well.
Ensure that the we materialize the span and activate the context when accessing the span. This ensures the correct parent child relationships.
@bantonsson bantonsson force-pushed the ban/tracing-otel-context-updated branch from 51e471b to f5c70c3 Compare June 17, 2025 16:31
@bantonsson bantonsson force-pushed the ban/tracing-otel-context-updated branch from 68d210f to 133ab73 Compare June 24, 2025 11:49
@bantonsson
Copy link
Author

I've updated the PR with the proposed changes except returning the Result from set_parent, since I think it needs more discussion.

@mladedav
Copy link
Collaborator

Regarding the feature, would it make sense to have it instead as a config on the layer? Features are notoriously problematic for changing behavior due to feature unification, especially turning off default features since crates tend not to opt out of them. I wouldn't expect this to be a transitive dependency though so it might be fine, but I still think normal configuration would suit it more than a feature.

@bantonsson
Copy link
Author

@mladedav I've change the code from a feature to a configuration.

@cijothomas
Copy link

@mladedav Would you have bandwidth to continue reviewing this PR? We have delayed cleaning up OTel Tracing API as it'll break tracing-opentelemetry bridge, without this PR merged.

@@ -214,20 +219,48 @@ pub trait OpenTelemetrySpanExt {
}

impl OpenTelemetrySpanExt for tracing::Span {
fn set_parent(&self, cx: Context) {
///
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leading and trailing doccomment empty line should be deleted.

Regardless, I don't think we want to have the doc comments here at all. It's pretty hard to find them on the implementation itself. The only relevant thing added here as opposed to what's already written on the trait itself is that this should be called just once before the span is entered for the first time so I'd say let's just put that information there and let's not have documentation on the implementation.

fn set_parent(&self, cx: Context) {
///
/// Allows us to set the parent context of this span. This method exists primarily to allow
/// us to pull in distributed_ incoming context - e.g. span IDs, etc - that have been read
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move the comments (or want to keep them), you have broken formatting here, I guess you wanted to put an _ also before distributed.

/// Additionally, once a span has been fully built - and the SpanBuilder has been consumed -
/// the parent _cannot_ be mutated.
///
fn set_parent(&self, cx: Context) -> Result<(), &'static str> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not return string errors, create an enum with two cases please.

We already depend on thiserror through opentelemetry, so I'd say use that.

let mut cx = Some(cx);
let mut result = Ok(());
let result_ref = &mut result;

self.with_subscriber(move |(id, subscriber)| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another failure mode hidden here -- if the current span is disabled, with_subscriber does not call the provided closure at all. Otherwise, it calls it and returns the value returned by the closure.

So I think it would be better to move the result variable inside this closure, then mutate it inside with_context and then return it, getting Option<Result<_, _>> from the call where we can just unwrap_or(Err(SetParentError::SpanDisabled)). The first *result_ref = ... call can be changed to direct return Err(..).

It will be better to have the captured variables to be closer to where they're needed and it will be harder to break this in the future.

Comment on lines 327 to 328
let mut key = Some(key.into());
let mut value = Some(value.into());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut key = Some(key.into());
let mut value = Some(value.into());
let mut key_value = Some(KeyValue::new(key.into(), value.into()));

and then use that directly.

Arguably, it should have already been written that way but now that there are two places where KeyValue is constructed, let's just extract it.

}

#[cfg(test)]
pub(super) fn len(&self) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need pub(super), it's used just from within mod tests and that module already has visibility here.

What's more, it also has visibility inside of this struct, so this method does not need to exist at all, the tests can just call stack.stack.len() directly so let's not clutter the API with this.

@@ -37,7 +37,7 @@ fn trace_with_assigned_otel_context() {

tracing::subscriber::with_default(subscriber, || {
let child = tracing::debug_span!("child");
child.set_parent(cx);
let _ = child.set_parent(cx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer .unwrap() in tests please. Or .unwrap_err() where applicable of course. But the behavior should be checked here.

Comment on lines +125 to 129
// Observation: if you force the _child_ to materialize before the parent, e.g.,
// if you swap these two lines - bad things will happen, and we shouldn't support
// this.
let _ = child.set_parent(Context::current_with_span(root_span));
child.context(); // force a sampling decision
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that would happen would be that the parent would not be set and the return value of set_parent would be Err, right?

I think we can do without that comment here and maybe we can have another test for this.

@mladedav
Copy link
Collaborator

mladedav commented Aug 5, 2025

Sorry, I've created some threads some 23 days ago but did not send the review 🤦

@mladedav
Copy link
Collaborator

mladedav commented Aug 5, 2025

It's just some minor things so if @bantonsson doesn't have time right now, I can fix those things.

@cijothomas
Copy link

Sorry, I've created some threads some 23 days ago but did not send the review 🤦

np! (has happened to me too!). Thanks for your time.

@bantonsson
Copy link
Author

@mladedav Just got back from vacation. I'll go through your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants