fix(context): use a Fiber attribute for Context #1807
Hey 👋. I saw this after Ariel pinged on #1766. Amazing work on enumerating and contrasting the alternatives :) I work at Datadog on our Ruby profiler. One of the things we currently support is correlating the profiling data we gather with OTEL traces. The intuition here is that you can look at the "profiles for this span/trace". Right now the way we do this with otel-ruby is... complicated and error-prone. But it does work :) TL;DR:
The semi-sad news is that I prototyped changing our code to read from (We often slightly cheat by using internal Ruby headers to get access to things that we wouldn't otherwise have, but fibers actually are defined in So this would complicate our life quite a bit. Obviously, an important question is: how much should you care about this? Having the otel tracing context being in an accessible location is something that is also relevant for the OTEL profiling support (Datadog is also participating in standardizing this effort). Right now this whole stack and contexts structure per-fiber is very flexible at the Ruby level, but it makes it quite complicated to answer the query "what's active in this thread/fiber right now".
@ivoanjo what's the correct way to fix this?
Yeah, something like that is missing and could be used. In fact, it wouldn't even need to be exposed on the Ruby side -- e.g. could only exist as a C-level API. And indeed it would be racy, but for profilers that would probably be ok, since they often interrupt Ruby so have ~some control over the GVL. And it probably wouldn't be more racy than other things involving threads? On the other hand, "getting context for connecting to other things" could be somewhat decoupled from the internal management of otel-ruby state. E.g. if the library exposed the latest Alternatively alternatively, having some kind of very lightweight callback when a span was activated/deactivated on a fiber could allow 3rd parties to build their own tracking. That would probably work well enough for profilers that work inside the Ruby process, but doesn't actually solve the issue for tools that work from outside the process, such as https://github.com/open-telemetry/opentelemetry-ebpf-profiler or even something like rbspy.
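To make the callback idea in the comment above concrete, here is a minimal sketch of what an activate/deactivate hook could look like. Everything in it is an assumption for illustration: `SketchContextHooks`, `subscribe`, and `notify` are hypothetical names and do not exist in opentelemetry-ruby, and the context is a plain Hash standing in for a real context object.

```ruby
require 'fiber'

# Hypothetical module: nothing here exists in opentelemetry-ruby today.
module SketchContextHooks
  @listeners = []

  class << self
    # Register a block that will be called as (fiber, active_context).
    def subscribe(&listener)
      @listeners << listener
    end

    # A library would call this from its attach/detach paths, passing whatever
    # is now active on the Fiber (nil when nothing is active).
    def notify(fiber, active_context)
      @listeners.each { |listener| listener.call(fiber, active_context) }
    end
  end
end

# Third-party side (for example an in-process profiler): remember the latest
# context per Fiber, keyed by object identity.
ACTIVE_CONTEXT_BY_FIBER = {}.compare_by_identity

SketchContextHooks.subscribe do |fiber, active_context|
  if active_context.nil?
    ACTIVE_CONTEXT_BY_FIBER.delete(fiber)
  else
    ACTIVE_CONTEXT_BY_FIBER[fiber] = active_context
  end
end

# Simulated activation and deactivation on the current Fiber.
SketchContextHooks.notify(Fiber.current, { span_id: 'abc123' })
DURING = ACTIVE_CONTEXT_BY_FIBER[Fiber.current]
SketchContextHooks.notify(Fiber.current, nil)
AFTER = ACTIVE_CONTEXT_BY_FIBER[Fiber.current]
```

This sketch is not thread-safe, and a Hash with strong keys keeps every tracked Fiber alive until it is removed, so a real tracker would need to handle both. As the comment notes, a hook like this only helps tools running inside the Ruby process.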
That all makes sense. I am okay with introducing a C API but it might be good to get feedback from @eregon about it as he has some strong opinions about these issues/design. He may have some other insights into the problem too.

Ideas for implementation:

    // Safe to call from any thread provided you have the GVL?
    VALUE rb_thread_current_fiber(VALUE self) {
        rb_thread_t *thread = rb_thread_ptr(self);
        if (thread->ec->fiber_ptr) {
            return rb_fiberptr_self(thread->ec->fiber_ptr);
        }
        else {
            return Qnil;
        }
    }

If I understand correctly, your use case would be given a list of threads, getting the current fiber, and extracting fiber-instance attributes, like the ones introduced in this PR. I realise this may be getting off topic for this issue, so perhaps we should move this discussion elsewhere - even if tangentially related.
I appreciate everyone's feedback here. We have been working with @ivoanjo to get the Datadog profiler working with this distribution and I hope to be able to implement correlation with OTel profiler later this year. Any changes to Ruby to help profiler agents would be greatly appreciated!
Thanks for the ping. I don't have time to read all the context, but assuming we want to associate profiling information to Fibers, the API to get the profiling information should tell us which Fiber was executing at the time of the sample. There is no way to recover the current Fiber from a Thread after the fact.
For the sake of the profiler, it's probably good enough?
Yeah, it would probably be fine. This is already a challenge that sampling profilers have -- they observe only periodically, so they may "blame" cpu-time or another resource on a tiny once-in-a-lifetime-method. The usual approach is that with enough samples, it evens out and thus these errors end up not impacting the overall picture (think lossy video/image/audio codecs). Having said that...
...this would be nice as well. In fact, the Go runtime does have a built-in profiler, and has a mechanism where you can attach context (simple key/value pairs) to goroutines and these get captured by the profiler as well. But I don't want to get too far outside the scope of this PR! Apologies to @fbogsany for the noise. The TL;DR is that this change makes it a bit harder to observe otel Ruby apps, through no fault of your amazing work. And at least for me there's not a clear suggestion that gives us no trade-offs :|
I think not, the profiler should not misattribute time/backtrace to the wrong Fiber, that should be considered a significant bug in the profiler.
Hey all, appreciate the conversation that came up around this change. We're still in a position where the latest release of Rails breaks Otel instrumentation for LiveController instances. I'd like to move ahead with merging and releasing this change. @ivoanjo I don't necessarily want us to commit to supporting what we intended to be a private interface, but I realize this has implications for the work you have been doing with profiling. I think it would be beneficial if we saw some examples of what a more formalized solution might look like.
One good thing about this currently being an internal detail is that evolving/changing minds seems like a low-cost thing. (The only cost is to external profilers that are reaching into internal APIs 😉 )

I've quickly prototyped an alternative to

    def stack
      current_stack, owner = Thread.current[STACK_KEY]
      if owner != Fiber.current
        current_stack = []
        Thread.current[STACK_KEY] = [current_stack, Fiber.current]
      end
      current_stack
    end

Benchmarks show this is only slightly more costly than

Having said that, I've also spent more time looking at the CRuby VM and while there are no public ways of getting the current Fiber for a thread, it's not impossible, so we can support the current PR as-is if needed with some acrobatics on our side.
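As a rough illustration of why the owner check in the `stack` prototype above matters, the following sketch simulates a bulk copy of the Fiber-local entry into a new Fiber and shows that the new Fiber still ends up with its own array. The value of `STACK_KEY` is a made-up placeholder, and the "bulk copy" is simulated by hand rather than performed by any framework.

```ruby
require 'fiber'

STACK_KEY = :__sketch_context_stack__ # hypothetical key, for illustration only

# Same shape as the prototype: the stored pair is [stack, owning_fiber].
def stack
  current_stack, owner = Thread.current[STACK_KEY]
  if owner != Fiber.current
    # Either nothing is stored yet, or the pair was created by another Fiber.
    current_stack = []
    Thread.current[STACK_KEY] = [current_stack, Fiber.current]
  end
  current_stack
end

stack << :outer
copied_pair = Thread.current[STACK_KEY] # what a bulk copy would carry along

INNER_STACK = Fiber.new do
  # Simulate the copied entry arriving in the new Fiber's locals.
  Thread.current[STACK_KEY] = copied_pair
  stack << :inner # owner mismatch, so a fresh array is allocated first
  stack.dup
end.resume

OUTER_STACK = stack.dup
```

The new Fiber never writes into the array it inherited, because the recorded owner is not `Fiber.current`. Whether that is sufficient against every form of manipulation discussed in this thread is not something this sketch demonstrates.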
**What does this PR do?** This PR adds a new otel appraisal variant for profiling that uses `opentelemetry-api` >= 1.5. This is because 1.5 includes open-telemetry/opentelemetry-ruby#1807 and we'll need to add special support for it. **Motivation:** I'm already adding the appraisal version (disabled for now) to avoid any future conflicts with changes to the `Matrixfile`. **Additional Notes:** N/A **How to test the change?** This new group is not yet used -- green CI is enough for this PR.
This will be used to access the OTEL context after open-telemetry/opentelemetry-ruby#1807 .
**What does this PR do?** This PR adds support for correlating profiling with otel-api 1.5+. Context storage was moved in open-telemetry/opentelemetry-ruby#1807 and we needed to update the profiler to account for this. **Motivation:** Keep profiling + otel correlation working. **Additional Notes:** N/A **How to test the change?** In #4421 I had already bootstrapped the new appraisal groups needed for testing this. This PR enables them, and our existing test coverage will cover the new code path when used with otel-api 1.5+.
Alternative to #1760
This PR implements `Context` storage using an attribute added to `Fiber`. This is inspired by Rails' implementation of `IsolatedExecutionState`. It adds 3 tests that fail with alternative implementations of `Context` storage. The tests represent the sort of Fiber-local storage and Thread-based Fiber-local variable manipulation performed by things like `ActionController::Live`. The approach used in this PR is a form of "security via obscurity" - side-stepping this kind of brute-force manipulation of Fiber-scoped data by stashing our data elsewhere. It won't prevent malicious manipulation of `Fiber.current.opentelemetry_context`, of course.

A benchmark is included with this PR exploring the design space with 8 different implementations of `Context`. Most of these, including the most performant implementations, are not safe in the presence of `ActionController::Live`, but they're included for comparison.

Results with Ruby 3.4.1 (note that this PR, `FiberAttributeContext`, is a performance increase over the existing implementation, `ArrayContext`):

The recursive benchmark measures 10 nested calls to `Context.with_value`, similar to a typical use of nested spans (`OpenTelemetry.tracer.in_span('foo') { ... }`), which utilize `Context.with_value`.

Implementations:
- `FiberLocalArrayContext` - mutable array held in a `Fiber[STACK_KEY]` that corrects for shared ownership after `Fiber.new` (see below). Unfortunately, this trickery permits the owning Fiber to mutate the shared array after `Fiber.new` before the new `Fiber` uses it. It was a nice try, though. This is unsafe in the presence of bulk manipulation of `Fiber#storage`.
- `FiberLocalLinkedListContext` - linked list held in `Fiber[STACK_KEY]`. This is unsafe in the presence of bulk manipulation of `Fiber#storage`.
- `FiberAttributeContext` - mutable array held in `Fiber.current.opentelemetry_context`. This is the implementation proposed in this PR and is the most performant "safe" implementation.
- `ArrayContext` - mutable array held in `Thread.current[STACK_KEY]`. This is unsafe in the presence of bulk manipulation of `Thread#[]`. This is the existing implementation.
- `LinkedListContext` - linked list held in `Thread.current[STACK_KEY]`. This is unsafe in the presence of bulk manipulation of `Thread#[]`.
- `ImmutableArrayContext` - immutable array held in `Thread.current[STACK_KEY]`. This is unsafe in the presence of bulk manipulation of `Thread#[]`. This is the implementation in fix(context): do not modify stack array directly when attach and detach #1760.
- `FiberLocalImmutableArrayContext` - immutable array held in a `Fiber[STACK_KEY]`. This is unsafe in the presence of bulk manipulation of `Fiber#storage`.
- `FiberLocalVarContext` - mutable array held in a `Concurrent::FiberLocalVar`. This is unsafe in the presence of bulk manipulation of `Fiber#storage`.
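To illustrate the difference between the `ArrayContext` and `FiberAttributeContext` entries in the list above, here is a minimal sketch, not the library's actual code. It assumes a placeholder `STACK_KEY`, two hypothetical helper methods, and a hand-written copy of `Thread#[]` locals into a new thread as a stand-in for the kind of bulk manipulation attributed to things like `ActionController::Live`.

```ruby
require 'fiber'

STACK_KEY = :__sketch_context_stack__ # hypothetical key, for illustration only

# Attribute on Fiber, named as in the PR description.
class Fiber
  attr_accessor :opentelemetry_context
end

# ArrayContext shape: mutable array stored via Thread#[].
def thread_local_stack
  Thread.current[STACK_KEY] ||= []
end

# FiberAttributeContext shape: mutable array stored on the Fiber itself.
def fiber_attribute_stack
  Fiber.current.opentelemetry_context ||= []
end

thread_local_stack << :request
fiber_attribute_stack << :request

# Simulated bulk copy: snapshot every Thread#[] local of the parent...
parent = Thread.current
copied_locals = parent.keys.map { |key| [key, parent[key]] }

Thread.new do
  # ...and install the same objects in the child thread.
  copied_locals.each { |key, value| Thread.current[key] = value }
  thread_local_stack << :child    # appends to the parent's array
  fiber_attribute_stack << :child # appends to a fresh array on the child's Fiber
end.join

THREAD_LOCAL_AFTER = thread_local_stack.dup
FIBER_ATTRIBUTE_AFTER = fiber_attribute_stack.dup
```

After the child thread finishes, the parent's `Thread#[]`-backed array contains the child's entry, while the Fiber-attribute array is unchanged, because the copy never touches the attribute. As the PR description says, this is obscurity rather than protection: code that deliberately writes to `Fiber.current.opentelemetry_context` can still interfere.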
Tricky implementation of Fiber-local `Stack` that isn't quite safe enough: