-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix OpenTelemetry context resolution when HTTP Vert.x proactive authentication is enabled #50310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix OpenTelemetry context resolution when HTTP Vert.x proactive authentication is enabled #50310
Conversation
This comment has been minimized.
This comment has been minimized.
I actually have fix proposal (not sure if acceptable) that touches same code, I'll just re-use this PR and switch reviewers. It will require Julian or Clement. |
c7b1533
to
22599de
Compare
22599de
to
89fc260
Compare
Recap: this definitely fixes the issue (cannot reproduce it), but I doubt this will get accepted. Nevertheless, I am not sure what else to do, so I'll wait for CI and then point to this PR in the linked issue and open discussion. I will disable one test because can't figure what is expected behavior, the fact that multi operator demand has risen seems like a legit issue, but I lack context and I'd like to see if there are other tests failing. |
89fc260
to
e69022f
Compare
CI looked fine but it starts from scratch now that I marked this PR as ready :-). Hello @jponge and @cescoffier, with changes in this PR, the #49468 cannot be reproducer. I think that the moment when
I tried to verify this theory by replacing following code:
with something like this:
and the linked issue was gone, but it had other problems (the Could you kindly look at the code of this PR, linked description and suggest if you can, what is core issue, how far off is this PR and what would be proper way to deal with this situation? I understand it could take a long time, just put me in the right direction. Thanks Also cc @geoand as it concerns Quarkus REST. |
Status for workflow
|
I actually think this one is also related: #40918 Because I use Quarkus Security as well and have seen both the error mentioned in this ticket and the other one |
It would be nice if we had a test for the reported condition. |
You are right. I don't think it will be a problem to create a test that reproduces the behavior for my workstation, I just wasn't sure how reliable it will be for slower machines etc. Anyway, no worry, once I have signal that this could be even remotely acceptable, I'll add test. My guess is that re-scheduling same HTTP request when the Quarkus Security is present just because of this will not be acceptable. I kinda hope @jponge will put me in a direction how to combine contexts (one empty without SR Ctx propagation and other with it).... |
@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer |
Not really, but maybe I evaluated wrongly what I need? I thought the easiest way to fix this concurrency issue is to simply call To put it simple, I don't see a way to avoid |
I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week |
Thanks, I'd really appreciate to hear your thoughts, you don't need to solve it, just put me in the right direction 🌌 |
I eventually found this PR from @michalvavrik, while trying to find out if there are some security issues opened related to closing the OpenTelemetry scope warnings, so this PR looks relevant :-) |
I checked today how SR Ctx propagation works and I think we could apply wrappers that only use correct context if some flag is set to
But if you think it is the context propagation issue, I can ask around. @FroMage and @manovotn worked on it, maybe you can give me a hint on what should I look into if I want to avoid the context propagation only for certain mutiny callbacks (those that use |
It looks like you're saying that the problem is that we can call If yes, then indeed this has nothing to do with MP-CP, which does not deal with Vert.x context switching. This is a Vert.x question. Probably @cescoffier will have your answer. |
@FroMage I know there is a lot of description in this PR so let me just highlight what I personally think is an issue:
As far as the context propagation contexts is concerned, the Hope it is bit clearer, let me know if it isn't. |
And yeah, re-scheduling |
I'm not entirely sure to be honest. I suppose Here's another lead: the way it works is that Mutiny, when you call If the lambda you pass to You can try that, see how things work. Or (probably better) you can build a lambda to pass to // you can save this one around for every time you need it
ThreadContext cleanThreadContext = ThreadContext.builder().propagated(ThreadContext.NONE).cleared(ThreadContext.ALL_REMAINING).build();
uni.onItem(cleanThreadContext.contextualCallable(() -> { rc.next(); })); |
This is indeed how it works, it goes to the next handler in the chain, and there's not thread hop. |
I spend whole weekend debugging it, I realize that, but maybe I didn't stress out that Quarkus REST endpoint is executed on a worker thread and there is a switch between IO thread and the worker thread. Maybe I am just not able explain this easily, sorry 🤷♂️
That is super helpful @FroMage , thanks, I am going to re-read and try it tomorrow. That sounds exactly like what I need. |
What this PR does
Tests (or a reproducer?)
You can use the reproducer attached to the linked issue to test this PR, because it only happens with certain level of concurrency and takes time. What I used is this:
Explanation of the change
In #49468, OpenTelemetry contextual data are lost because previous context is incorrectly determined. There is a race, caused by the fact that
quarkus/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java
Line 32 in 2ce3255
io.smallrye.context.impl.SlowActiveContextState.close(SlowActiveContextState.java:41)
is called when other code is using the same duplicated context (you can see in in the text filekey-diff
that I put into my reproducer). I think it happens in the moment when threads are switched to the worker thread, but it is very hard to debug, I am not sure where in Quarkus REST it happens. However this PR fixes principle, and not one scenario:quarkus/extensions/vertx-http/runtime/src/main/java/io/quarkus/vertx/http/runtime/security/HttpSecurityRecorder.java
Line 406 in 2ce3255
onItem
wrapper by SR Context Mutiny interceptorquarkus-mutiny
which is transitive dep ofquarkus-vertx
ThreadContextProvider
called https://github.com/quarkusio/quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.javaOpenTelemetryMpContextPropagationProvider
endContext
can be called while Quarkus REST endpoint blocking method is executed and it is all about timing, when it happens, it switches to "prior" OTel context stored on the Vert.x duplicated context is used concurrentlyMy conclusion is that the fact that the same Vert.x duplicated context can be used concurrently is a bug, but we can easily avoid that as far as Vert.x HTTP Security goes by scheduling
routingContext.next()
once the thread local is not used. But I'd like to find more efficient way to avoid the "last" callback which needs to happen without SR Ctx propagation, something like switching context for one subscriber.Test added in this PR to reproduce the issue
Without the fix, the test fails on my laptop most of the time (only way how to make it always fail would be to raise concurrency and number of requests, which however would stretch CI). However if the added fix, following command succeeds:
Which should verify the fix added in this PR.