Skip to content

Conversation

michalvavrik
Copy link
Member

@michalvavrik michalvavrik commented Sep 27, 2025

What this PR does

  • fixes quarkus-security dependency causes OpenTelemetry context errors #49468
  • tiny refactoring: I mentioned that anonymous identity is not memoized and when we use deferred identity elsewhere in a code, this code every time calls identity manager, resolves the anonymous identity providers and apply augmentors. Now, the resolved anonymous identity is used instead of calling the identity manager multiple times.

Tests (or a reproducer?)

You can use the reproducer attached to the linked issue to test this PR, because it only happens with certain level of concurrency and takes time. What I used is this:

[email protected]:michalvavrik/eager-security-slow-context-activation-reproducer.git
cd eager-security-slow-context-activation-reproducer
mvn clean install
java -jar target/quarkus-app/quarkus-run.jar  2>&1 | tee log
# other window
seq 1 100000 | xargs -n1 -P1000 -I{} curl -v http://localhost:8080/reproducer
# look for "Baggage is missing first value" log message

Explanation of the change

In #49468, OpenTelemetry contextual data are lost because previous context is incorrectly determined. There is a race, caused by the fact that

invoked by io.smallrye.context.impl.SlowActiveContextState.close(SlowActiveContextState.java:41) is called when other code is using the same duplicated context (you can see in in the text file key-diff that I put into my reproducer). I think it happens in the moment when threads are switched to the worker thread, but it is very hard to debug, I am not sure where in Quarkus REST it happens. However this PR fixes principle, and not one scenario:

  1. is called from a subscriber onItem wrapper by SR Context Mutiny interceptor
  2. other route handlers run until synchronous code ends and there is a context switch
  3. Uni subscriber callbacks like onItem, OnFailure (onCompletion for publisher etc.) are wrapped by Mutiny interceptor https://github.com/smallrye/smallrye-mutiny/blob/main/context-propagation/src/main/java/io/smallrye/mutiny/context/DefaultContextPropagationInterceptor.java
  4. SR Context propagation provides https://github.com/smallrye/smallrye-context-propagation/blob/main/core/src/main/java/io/smallrye/context/SmallRyeThreadContext.java
  5. SR Ctx propagation is transitive dep of quarkus-mutiny which is transitive dep of quarkus-vertx
  6. Otel defines its own ThreadContextProvider called https://github.com/quarkusio/quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/propagation/OpenTelemetryMpContextPropagationProvider.java
  7. Then OpenTelemetryMpContextPropagationProvider endContext can be called while Quarkus REST endpoint blocking method is executed and it is all about timing, when it happens, it switches to "prior" OTel context stored on the Vert.x duplicated context is used concurrently

My conclusion is that the fact that the same Vert.x duplicated context can be used concurrently is a bug, but we can easily avoid that as far as Vert.x HTTP Security goes by scheduling routingContext.next() once the thread local is not used. But I'd like to find more efficient way to avoid the "last" callback which needs to happen without SR Ctx propagation, something like switching context for one subscriber.

Test added in this PR to reproduce the issue

Without the fix, the test fails on my laptop most of the time (only way how to make it always fail would be to raise concurrency and number of requests, which however would stretch CI). However if the added fix, following command succeeds:

for i in {1..100}; do
  echo "--- Run $i of 100 ---"
  mvn clean verify -Dtest=OpenTelemetryContextPropagationTest
  if [ $? -ne 0 ]; then
    echo "Command failed on run $i. Stopping loop."
    break
  fi
done

Which should verify the fix added in this PR.

This comment has been minimized.

@michalvavrik
Copy link
Member Author

I actually have fix proposal (not sure if acceptable) that touches same code, I'll just re-use this PR and switch reviewers. It will require Julian or Clement.

@michalvavrik michalvavrik force-pushed the feature/refactory-eager-auth-deffered-identity branch from c7b1533 to 22599de Compare September 28, 2025 09:40
@michalvavrik michalvavrik removed the request for review from sberyozkin September 28, 2025 09:43
@michalvavrik michalvavrik changed the title Resolve deferred anonymous identity only once when proactive authentication is enabled Fix OpenTelemetry context resolution when HTTP Vert.X proactive security is enabled Sep 28, 2025
@michalvavrik michalvavrik changed the title Fix OpenTelemetry context resolution when HTTP Vert.X proactive security is enabled Fix OpenTelemetry context resolution when HTTP Vert.x proactive authentication is enabled Sep 28, 2025
@michalvavrik michalvavrik marked this pull request as draft September 28, 2025 10:25
@michalvavrik michalvavrik marked this pull request as ready for review September 28, 2025 10:53
@michalvavrik michalvavrik force-pushed the feature/refactory-eager-auth-deffered-identity branch from 22599de to 89fc260 Compare September 28, 2025 10:53
@michalvavrik michalvavrik marked this pull request as draft September 28, 2025 11:37
@michalvavrik
Copy link
Member Author

michalvavrik commented Sep 28, 2025

Recap: this definitely fixes the issue (cannot reproduce it), but I doubt this will get accepted. Nevertheless, I am not sure what else to do, so I'll wait for CI and then point to this PR in the linked issue and open discussion.

I will disable one test because can't figure what is expected behavior, the fact that multi operator demand has risen seems like a legit issue, but I lack context and I'd like to see if there are other tests failing.

@michalvavrik michalvavrik force-pushed the feature/refactory-eager-auth-deffered-identity branch from 89fc260 to e69022f Compare September 28, 2025 13:27
@michalvavrik michalvavrik marked this pull request as ready for review September 28, 2025 13:27
@michalvavrik michalvavrik marked this pull request as draft September 28, 2025 13:53
@michalvavrik michalvavrik marked this pull request as ready for review September 28, 2025 16:10
@michalvavrik
Copy link
Member Author

CI looked fine but it starts from scratch now that I marked this PR as ready :-).

Hello @jponge and @cescoffier,

with changes in this PR, the #49468 cannot be reproducer. I think that the moment when onItem Mutiny interceptors (decorators wrapping the onItem callback) continue executing while the same duplicated context is used by other thread is when there is a switch to the worker thread here:

I tried to verify this theory by replacing following code:

with something like this:

public static final Supplier<Executor> EXECUTOR_SUPPLIER = new Supplier<>() {
        @Override
        public Executor get() {
            return new Executor() {
                @Override
                public void execute(@NotNull Runnable command) {
                    if (!BlockingOperationControl.isBlockingAllowed() && VertxContext.isOnDuplicatedContext()) {
                        var ctx = Vertx.currentContext();
                        ctx.runOnContext(new Handler<Void>() {
                            @Override
                            public void handle(Void unused) {
                                ctx.executeBlocking(new Callable<Object>() {
                                    @Override
                                    public Object call() throws Exception {
                                        command.run();
                                        return null;
                                    }
                                });
                            }
                        });
                    } else {
                        ExecutorRecorder.getCurrent().execute(command);
                    }
                }
            };
        }
    };

and the linked issue was gone, but it had other problems (the io.quarkus.resteasy.reactive.server.test.RequestLeakDetectionTest#testWithConcurrentBlockingCalls failed and I did not investigate further).

Could you kindly look at the code of this PR, linked description and suggest if you can, what is core issue, how far off is this PR and what would be proper way to deal with this situation? I understand it could take a long time, just put me in the right direction. Thanks

Also cc @geoand as it concerns Quarkus REST.

Copy link

quarkus-bot bot commented Sep 28, 2025

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit e69022f.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

⚙️ JVM Tests - JDK 17

📦 extensions/smallrye-reactive-messaging/deployment

io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector - History

  • Expecting actual: ["-6","-7","-9","-10","-11","-12","-13","-14"] to start with: ["-6", "-7", "-8", "-9"] - java.lang.AssertionError
java.lang.AssertionError: 

Expecting actual:
  ["-6","-7","-9","-10","-11","-12","-13","-14"]
to start with:
  ["-6", "-7", "-8", "-9"]

	at io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector(ConnectorChangeTest.java:41)

@melloware
Copy link
Contributor

I actually think this one is also related: #40918

Because I use Quarkus Security as well and have seen both the error mentioned in this ticket and the other one

@brunobat
Copy link
Contributor

It would be nice if we had a test for the reported condition.

@michalvavrik
Copy link
Member Author

It would be nice if we had a test for the reported condition.

You are right. I don't think it will be a problem to create a test that reproduces the behavior for my workstation, I just wasn't sure how reliable it will be for slower machines etc. Anyway, no worry, once I have signal that this could be even remotely acceptable, I'll add test.

My guess is that re-scheduling same HTTP request when the Quarkus Security is present just because of this will not be acceptable. I kinda hope @jponge will put me in a direction how to combine contexts (one empty without SR Ctx propagation and other with it)....

@jponge
Copy link
Member

jponge commented Sep 30, 2025

@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer

@michalvavrik
Copy link
Member Author

michalvavrik commented Sep 30, 2025

@michalvavrik you might confuse me with @vietj who is the Vert.x maintainer

Not really, but maybe I evaluated wrongly what I need? I thought the easiest way to fix this concurrency issue is to simply call routingContext.next() after Mutiny subscriber has completed because if the callback only does rc.next() I don't really need a propagation context for it, just to run on the right duplicated context. But the Mutiny subscriber callbacks are being intercepted by Mutiny interceptor provided by a SmallRye Context Propagation.

To put it simple, I don't see a way to avoid io.smallrye.mutiny.infrastructure.Infrastructure#UNI_INTERCEPTORS being applied on a Mutiny completion event. I thought you may have a thoughts because what we do in HttpSecurityRecorder must be a common use case. If not, no problem and thank you, I'll find another way how to fix it.

@jponge
Copy link
Member

jponge commented Sep 30, 2025

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

@michalvavrik
Copy link
Member Author

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

Thanks, I'd really appreciate to hear your thoughts, you don't need to solve it, just put me in the right direction 🌌

@sberyozkin
Copy link
Member

sberyozkin commented Oct 8, 2025

I eventually found this PR from @michalvavrik, while trying to find out if there are some security issues opened related to closing the OpenTelemetry scope warnings, so this PR looks relevant :-)

@michalvavrik
Copy link
Member Author

michalvavrik commented Oct 11, 2025

I'm not the best expert around that particular area (SR context propagation) but I'll try to have a look next week

I checked today how SR Ctx propagation works and I think we could apply wrappers that only use correct context if some flag is set to true and leverage try(CleanAutoCloseable ac = SmallRyeThreadContext.withThreadContext(threadContext)) and ThreadContext.ALL_REMAINING, but this situation seem to me like a Mutiny limitation:

  1. we need the context propagation for normal callbacks like onItem transformation (that is, on the same Uni)
  2. but if the only thing that is happening in the subscriber onItem (I am referring to the last onItem) is routingContext.next(), I need no CDI or Otel for that callback, the only thing that I need is that routingContext.next() is invoked on the correct duplicated context.

But if you think it is the context propagation issue, I can ask around. @FroMage and @manovotn worked on it, maybe you can give me a hint on what should I look into if I want to avoid the context propagation only for certain mutiny callbacks (those that use routingContext.next()), please?

@FroMage
Copy link
Member

FroMage commented Oct 13, 2025

It looks like you're saying that the problem is that we can call routingContext.next() from the wrong duplicated Vert.x context?

If yes, then indeed this has nothing to do with MP-CP, which does not deal with Vert.x context switching. This is a Vert.x question. Probably @cescoffier will have your answer.

@michalvavrik
Copy link
Member Author

michalvavrik commented Oct 13, 2025

It looks like you're saying that the problem is that we can call routingContext.next() from the wrong duplicated Vert.x context?

@FroMage I know there is a lot of description in this PR so let me just highlight what I personally think is an issue:

  1. Vert.x HTTP authentication handler needs to subscribe to Uni to resolve SecurityIdentity and then call routingContext.next() in the subscribed onItem
  2. the onItem callback is wrapper by MutinyInterceptors which wraps all callbacks with SmallRye Context propagation, so org.eclipse.microprofile.context.spi.ThreadContextSnapshot#begin is called before onItem is invoked and org.eclipse.microprofile.context.spi.ThreadContextController#endContext is called after the onItem callbck is executed
  3. the onItem callback calls routingContext.next() (so that is inside, before the endContext() invocation)
  4. when there is switch between an IO thread and the worker thread [this part I am least sure about, it is really hard to determine the exact point], the routingContext.next() invocation ends and the org.eclipse.microprofile.context.spi.ThreadContextController#endContext is executed
  5. hence, there are 2 codes reusing the same local data, one from the org.eclipse.microprofile.context.spi.ThreadContextController#endContext and one from the REST endpoint user code

As far as the context propagation contexts is concerned, the org.eclipse.microprofile.context.spi.ThreadContextController#endContext cannot be executed while other code is executed because it will cause random behavior.

Hope it is bit clearer, let me know if it isn't.

@michalvavrik
Copy link
Member Author

And yeah, re-scheduling routingContext.next() fixes the issue, but I don't think it is the right thing to do as it probably has performance implications? I just wanted to show it so that we can find proper solution.

@FroMage
Copy link
Member

FroMage commented Oct 13, 2025

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

Here's another lead: the way it works is that Mutiny, when you call onItem(Callback) will invoke BaseContextPropagationInterceptor.decorate(Callback) which will call context.contextualXXX(Callback) if the lambda is not already contextualised. This would capture the current context right now, and restore it when the callback is called.

If the lambda you pass to onItem() implements Contextualized then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.

You can try that, see how things work.

Or (probably better) you can build a lambda to pass to onItem() that is properly contextualised for all cleared contexts (clean thread locals):

// you can save this one around for every time you need it
ThreadContext cleanThreadContext = ThreadContext.builder().propagated(ThreadContext.NONE).cleared(ThreadContext.ALL_REMAINING).build();
uni.onItem(cleanThreadContext.contextualCallable(() -> { rc.next(); }));

@jponge
Copy link
Member

jponge commented Oct 13, 2025

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

This is indeed how it works, it goes to the next handler in the chain, and there's not thread hop.

@michalvavrik
Copy link
Member Author

michalvavrik commented Oct 13, 2025

I'm not entirely sure to be honest. I suppose routingContext.next() actually continues the current request, rather than switch to a new request, so they should both have the same vertx duplicated context, no ? Unless the rest of the request is fired on a different thread pool?

This is indeed how it works, it goes to the next handler in the chain, and there's not thread hop.

I spend whole weekend debugging it, I realize that, but maybe I didn't stress out that Quarkus REST endpoint is executed on a worker thread and there is a switch between IO thread and the worker thread. Maybe I am just not able explain this easily, sorry 🤷‍♂️

If the lambda you pass to onItem() implements Contextualized then it won't be contextualised, it won't capture anything, and nothing will be restored. Which means it will inherit whatever thread locals are lying around. This could be what you're after if the next thread is new, or horrible if it has random left-over thread locals.

That is super helpful @FroMage , thanks, I am going to re-read and try it tomorrow. That sounds exactly like what I need.

@michalvavrik
Copy link
Member Author

It definitely fixed the reproducer, thank you @FroMage (and @jponge )! I'll have to write a test as @brunobat reminded me and re-run CI if something else broken, but I'll ask you for a review if CI passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

quarkus-security dependency causes OpenTelemetry context errors

6 participants