Skip to content

Conversation

fbogsany
Copy link
Contributor

@fbogsany fbogsany commented Feb 5, 2025

Alternative to #1760

This PR implements Context storage using an attribute added to Fiber. This is inspired by Rails' implementation of IsolatedExecutionState. It adds 3 tests that fail with alternative implementations of Context storage. The tests represent the sort of Fiber-local storage and Thread-based Fiber-local variable manipulation performed by things like ActionController::Live. The approach used in this PR is a form of "security via obscurity" - side-stepping this kind of brute-force manipulation of Fiber-scoped data by stashing our data elsewhere. It won't prevent malicious manipulation of Fiber.current.opentelemetry_context, of course.

A benchmark is included with this PR exploring the design space with 8 different implementations of Context. Most of these, including the most performant implementations, are not safe in the presence of ActionController::Live, but they're included for comparison.

Results with Ruby 3.4.1 (note that this PR, FiberAttributeContext, is a performance increase over the existing implementation, ArrayContext):

Comparison:
FiberLocalArrayContext.with_value:  2697865.6 i/s
FiberLocalLinkedListContext.with_value:  2287481.6 i/s - 1.18x slower
FiberAttributeContext.with_value:  2241011.1 i/s - 1.20x slower
ArrayContext.with_value:  2187952.5 i/s - 1.23x slower
LinkedListContext.with_value:  1992425.8 i/s - 1.35x slower
ImmutableArrayContext.with_value:  1769365.2 i/s - 1.52x slower
FiberLocalImmutableArrayContext.with_value:  1761293.6 i/s - 1.53x slower
FiberLocalVarContext.with_value:  1517009.7 i/s - 1.78x slower

Comparison:
FiberLocalArrayContext.with_value recursive:   296797.6 i/s
FiberAttributeContext.with_value recursive:   236727.5 i/s - 1.25x slower
FiberLocalLinkedListContext.with_value recursive:   235673.7 i/s - 1.26x slower
ArrayContext.with_value recursive:   230727.3 i/s - 1.29x slower
LinkedListContext.with_value recursive:   205255.5 i/s - 1.45x slower
FiberLocalImmutableArrayContext.with_value recursive:   180077.8 i/s - 1.65x slower
ImmutableArrayContext.with_value recursive:   177865.2 i/s - 1.67x slower
FiberLocalVarContext.with_value recursive:   158603.8 i/s - 1.87x slower

The recursive benchmark measures 10 nested calls to Context.with_value, similar to a typical uses of nested spans (OpenTelemetry.tracer.in_span('foo') { ... }), which utilize Context.with_value.

Implementations:

  1. FiberLocalArrayContext - mutable array held in a Fiber[STACK_KEY] that corrects for shared ownership after Fiber.new (see below). Unfortunately, this trickery permits the owning Fiber to mutate the shared array after Fiber.new before the new Fiber uses it. It was a nice try, though. This is unsafe in the presence of bulk manipulation of Fiber#storage.
  2. FiberLocalLinkedListContext - linked list held in Fiber[STACK_KEY]. This is unsafe in the presence of bulk manipulation of Fiber#storage.
  3. FiberAttributeContext - mutable array held in Fiber.current.opentelemetry_context. This is the implementation proposed in this PR and is the most performant "safe" implementation.
  4. ArrayContext - mutable array held in Thread.current[STACK_KEY]. This is unsafe in the presence of bulk manipulation of Thread#[]. This is the existing implementation.
  5. LinkedListContext - linked list held in Thread.current[STACK_KEY]. This is unsafe in the presence of bulk manipulation of Thread#[].
  6. ImmutableArrayContext - immutable array held in Thread.current[STACK_KEY]. This is unsafe in the presence of bulk manipulation of Thread#[]. This is the implementation in fix(context): do not modify stack array directly when attach and detach #1760.
  7. FiberLocalImmutableArrayContext - immutable array held in a Fiber[STACK_KEY]. This is unsafe in the presence of bulk manipulation of Fiber#storage.
  8. FiberLocalVarContext - mutable array held in a Concurrent::FiberLocalVar. This is unsafe in the presence of bulk manipulation of Fiber#storage.

Tricky implementation of Fiber-local Stack that isn't quite safe enough:

  # NOTE: This is cool, but is isn't safe for concurrent use because it allows the
  # owner to modify the stack after it has been shared with another fiber.
  class Stack < Array
    def self.current
      s = Fiber[STACK_KEY] ||= new
      s.correct_owner!
    end

    def initialize
      super
      @owner = Fiber.current
    end

    def correct_owner!
      if @owner != Fiber.current
        Fiber[STACK_KEY] = self.class.new.replace(self)
      else
        self
      end
    end
  end

@fbogsany fbogsany changed the title Use a Fiber attribute for Context fix(context): use a Fiber attribute for Context Feb 6, 2025
@ivoanjo
Copy link

ivoanjo commented Feb 10, 2025

Hey 👋. I saw this after Ariel pinged on #1766 . Amazing work on enumerating and contrasting the alternatives :)

I work at the Datadog on our Ruby profiler. One of the things we currently support is correlating the profiling data we gather with OTEL traces. The intuition here is you can look at the "profiles for this span/trace".

Right now the way we do this with otel-ruby... complicated and error-prone. But it does work :)

TL;DR:

  • The profiler is implemented with native code because there are things that can only be gathered from the C side, for instance profiling allocations and the Global VM Lock.
  • The profiler samples across threads. E.g. it reads this context from different threads, not just from the current thread.
  • If we want to read the current span in the middle of an allocation, or GVL event, we can't call Ruby code at that point. Thus, we're "barging in" an directly reading some_thread[STACK_KEY] using C code.

The semi-sad news is that I prototyped changing our code to read from Fiber.current... and discovered that Ruby doesn't allow us to get "the current fiber object for a given thread".

(We often slightly cheat by using internal Ruby headers to get access to things that we wouldn't otherwise have, but fibers actually are defined in cont.c so no headers involved there.)

So this would complicate our life quite a bit.

Obviously, an important question is: how much should you care about this? Having the otel tracing context being in an accessible location is something that is also relevant for the OTEL profiling support (Datadog is also participating in standardizing this effort).

Right now this whole stack and contexts structure per-fiber is very flexible at the Ruby level, but it makes it quite complicated to answer the query "what's active in this thread/fiber right now".

@ioquatix
Copy link

@ivoanjo what's the correct way to fix this? Thread#fiber? It seems like it would be extremely racy.

@ivoanjo
Copy link

ivoanjo commented Feb 10, 2025

@ivoanjo what's the correct way to fix this? Thread#fiber? It seems like it would be extremely racy.

Yeah, something like that is missing and could be used. In fact, it wouldn't even need to be exposed on the Ruby side -- e.g. could only exist as a C-level API.

And indeed it would be racy, but for profilers that would probably be ok, since they often interrupt Ruby so have ~some control over the GVL. And it probably wouldn't be more racy than other things involving threads?


On the other hand, "getting context for connecting to other things" could be somewhat decoupled from the internal management of otel-ruby state. E.g. if the library exposed the latest span_id/trace_id somewhere easily accessible (e.g. maybe keep that in Thread#[] but then use the Fiber as a source of truth) it would make it easier for external tools to get this info while still keeping the state consistent.

Alternatively alternatively, having some kind of very lightweight callback when a span was activated/deactivated on a fiber could allow 3rd parties to build their own tracking. That would probably well enough for profilers that work inside the Ruby process, but doesn't actually solve the issue for tools that work from outside the process, such as https://github.com/open-telemetry/opentelemetry-ebpf-profiler or even something like rbspy.

@ioquatix
Copy link

ioquatix commented Feb 11, 2025

That all makes sense.

I am okay with introducing a C API but it might be good to get feedback from @eregon about it as he has some strong opinions about these issues/design. He may have some other insights into the problem too.

Ideas for implementation:

// Safe to call from any thread provided you have the GVL?
VALUE rb_thread_current_fiber(VALUE self) {
    rb_thread_t *thread = rb_thread_ptr(self);

    if (thread->ec->fiber_ptr) {
        return rb_fiberptr_self(thread->ec->fiber_ptr);
    }
    else {
        return Qnil;
    }
}

If I understand correctly, your use case would be given a list of threads, getting the current fiber, and extracting fiber-instance attributes, like the ones introduced in this PR.

I realise this may be getting off topic for this issue, so perhaps we should move this discussion elsewhere - even if tangentially related.

@arielvalentin
Copy link
Contributor

arielvalentin commented Feb 11, 2025

I appreciate everyone's feedback here.

We have been working with @ivoanjo to get the Datadog profiler working with this distribution and I hope to be able to implement correlation with OTel profiler later this year.

Any changes to Ruby to help profiler agents would be grately appreciated!

@eregon
Copy link

eregon commented Feb 11, 2025

Thanks for the ping.
That does seem problematic for me, this rb_thread_current_fiber() cannot work without a GVL, e.g. on TruffleRuby it would always return a stale Fiber which once upon a time was (and maybe still is or not) the current Fiber of that Thread.
Even with the GVL it's unsafe if there any Ruby code executing in between, or any rb_* function called (as those can release the GVL).

I don't have time to read all the context, but assuming we want to associate profiling information to Fibers, the API to get the profiling information should tell us which Fiber was executing at the time of the sample. There is no way to recover the current Fiber from a Thread after the fact.

@ioquatix
Copy link

always return a stale Fiber which once upon a time was (and maybe still is or not) the current Fiber of that Thread.

For the sake of the profiler, it's probably good enough?

@ivoanjo
Copy link

ivoanjo commented Feb 12, 2025

always return a stale Fiber which once upon a time was (and maybe still is or not) the current Fiber of that Thread.

For the sake of the profiler, it's probably good enough?

Yeah, it would probably be fine. This is already a challenge that sampling profilers have -- they observe only periodically, so they may "blame" cpu-time or another resource on a tiny once-in-a-lifetime-method. The usual approach is that with enough samples, it evens out and thus these errors end up not impacting the overall picture (think lossy video/image/audio codecs).

Having said that...

I don't have time to read all the context, but assuming we want to associate profiling information to Fibers, the API to get the profiling information should tell us which Fiber was executing at the time of the sample.

...this would be nice as well. In fact, the GO VM does have a built-in profiler, and has a mechanism where you can attach context (simple key/value pairs) to goroutines and these get captured by the profiler as well.


But I don't want to get too outside the scope from this PR! Apologies to @fbogsany for the noise. The TL;DR is that this change makes it a bit harder to observe otel Ruby apps, due to no fault of your amazing work. And at least for me there's not a clear suggestion that gives us no trade-offs :|

@eregon
Copy link

eregon commented Feb 12, 2025

For the sake of the profiler, it's probably good enough?

I think not, the profiler should not misattribute time/backtrace to the wrong Fiber, that should be considered a significant bug in the profiler.
Whatever is used for sampling ought to tell you which Fiber was active at the time of the sample, that's the API that needs fixing if it's not already the case.

@robertlaurin
Copy link
Contributor

Hey all, appreciate the conversation that came up around this change. We're still in a position where the latest release of Rails breaks Otel instrumentation for LiveController instances. I'd like to move ahead with merging and releasing this change.

@ivoanjo I don't necessarily want us to commit to supporting what we intended to be a private interface, but I realize this has implication on work you have been doing with profiling. I think it would be beneficial if we saw some examples of what a more formalized solution might look like.

@ivoanjo
Copy link

ivoanjo commented Feb 20, 2025

One good thing from this being an internal detail currently, is that evolving/changing minds seems like a low-cost thing. (The only cost is to external profilers that are reaching to internal APIs 😉 )

I've quickly prototyped an alternative to ArrayContext that uses Thread.current in this branch. The intuition is -- what if the array context was defensive and checked if it was copied:

    def stack
      current_stack, owner = Thread.current[STACK_KEY]

      if owner != Fiber.current
        current_stack = []
        Thread.current[STACK_KEY] = [current_stack, Fiber.current]
      end

      current_stack
    end

Benchmarks show this is only slightly more costly than ArrayContext.

Having said that, I've also spent more time looking at the CRuby VM and while there's no public ways of getting the current Fiber for a thread, it's not impossible, so we can support the current PR as-is if needed with some acrobatics on our side.

@robertlaurin robertlaurin merged commit 34b050c into main Feb 20, 2025
68 checks passed
@robertlaurin robertlaurin deleted the fiber-attribute-for-context branch February 20, 2025 19:11
This was referenced Feb 20, 2025
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 21, 2025
**What does this PR do?**

This PR adds a new otel appraisal variant for profiling that uses
`opentelemetry-api` >= 1.5 .

This is because 1.5 includes
open-telemetry/opentelemetry-ruby#1807
and we'll need to add special support for it.

**Motivation:**

I'm adding already the appraisal version (disabled for now) to
avoid any future conflicts with changes to the `Matrixfile`.

**Additional Notes:**

N/A

**How to test the change?**

This new group is not yet used -- green CI is enough for this PR.
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 21, 2025
This will be used to access the OTEL context after
open-telemetry/opentelemetry-ruby#1807 .
ivoanjo added a commit to DataDog/dd-trace-rb that referenced this pull request Feb 21, 2025
**What does this PR do?**

This PR adds support for correlating profiling wih otel-api 1.5+.

Context storage was moved in
open-telemetry/opentelemetry-ruby#1807
and we needed to update the profiler to account for this.

**Motivation:**

Keep profiling + otel correlation working.

**Additional Notes:**

N/A

**How to test the change?**

In #4421 I had already bootstrapped the new appraisal groups needed
for testing this. This PR enables them, and our existing test
coverage will cover the new code path when used with otel-api 1.5+.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants