
[PERFORMANCE IMPROVEMENT] Only one line! 🤯 WebClient should publish response on boundedElastic #34199

@mateusz-nalepa

Description

ETA: 5 minutes

Hello! 👋

TL;DR

Just add:

.publishOn(Schedulers.boundedElastic()) // or maybe Schedulers.parallel()?

Somewhere in WebClient internals to improve app performance 😄

Context

Recently I was deep diving into a Spring WebFlux + Spring WebClient app to figure out which parts of the code are executed by which thread. It turns out that CPU-bound operations, even ones as simple as encoding an object to JSON, are executed on WebClient threads by default. I've created an MVCE app for this. Basically the whole app looks like this:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<AppResponse>> =
    webClient
        .get()
        .uri("http://some-external-service/endpoint")
        .retrieve()
        .bodyToMono(MockServerResponse::class.java)
        // comment this line if needed
        .publishOn(Schedulers.parallel())
        .map {
            heavyCpuOperation()
            it
        }
        .map { ResponseEntity.ok(AppResponse(it.data)) }

private fun heavyCpuOperation() {
    var bigInteger = BigInteger.ZERO
    for (i in 0..500_000) {
        bigInteger = bigInteger.add(BigInteger.valueOf(i.toLong()))
    }
}

Here are the results:

  • Logs without the .publishOn() operator
reactor-http-nio-3 ### com.nalepa.publishon.AppEndpoint ###
reactor-http-nio-3 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Start processing request
http-client-nio-2 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonDecoder ### Decoding webClient response
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### WEBCLIENT: I have a response from external service
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Started heavy operation
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Ended heavy operation
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonEncoder ### Encoding endpoint response
reactor-http-nio-3 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Ended processing request
  • Logs with the .publishOn(Schedulers.parallel()) operator
reactor-http-nio-4 ### com.nalepa.publishon.AppEndpoint ###
reactor-http-nio-4 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Start processing request
http-client-nio-2 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonDecoder ### Decoding webClient response
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### WEBCLIENT: I have a response from external service
parallel-1 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Started heavy operation
parallel-1 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Ended heavy operation
parallel-1 ### org.springframework.http.codec.json.Jackson2JsonEncoder ### Encoding endpoint response
parallel-1 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Ended processing request
reactor-http-nio-4 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket

As you can see, the CPU operation is executed on an HTTP client thread when there is no .publishOn(). I've decided to run some tests related to this.
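The thread hand-off that .publishOn() performs can be modeled with plain executors. This is just a sketch of the idea in Java (the executor and thread names are made up; it is not how Reactor is implemented internally):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PublishOnSketch {
    // Rough model of what .publishOn() does: the upstream value is produced
    // on one thread (standing in here for the Netty event loop) and handed
    // off to a second executor before the downstream map/encode work runs.
    public static String[] threadsUsed() throws Exception {
        ExecutorService eventLoop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "http-client-nio-sim"));
        ExecutorService worker =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "bounded-elastic-sim"));
        try {
            // "retrieve" the response on the event-loop thread
            String producedOn = eventLoop.submit(() -> Thread.currentThread().getName()).get();
            // after the hand-off, the heavy map() runs on the worker thread
            String mappedOn = worker.submit(() -> Thread.currentThread().getName()).get();
            return new String[] { producedOn, mappedOn };
        } finally {
            eventLoop.shutdown();
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] t = threadsUsed();
        System.out.println("produced on: " + t[0] + ", mapped on: " + t[1]);
    }
}
```

Without the hand-off, both steps run on the "event loop" thread, which is exactly what the first log listing above shows.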

Few words about testing

Dependencies used:

  • Spring Boot 3.4.1
  • Java 21

Platform:

  • MacBook Air M2, 16GB RAM

Testing App:

I know that for testing it's good to test many scenarios, many schedulers, etc. I've done that. In this issue I'm including results for the so-called Base and Complex scenarios. Also, maybe I'm missing something and these tests simply don't make any sense; if so, please let me know!

In the tests, I was also ignoring the first 2-3 minutes of results, due to JVM warmup.

Performance Testing

I've decided to run some tests to find out whether there is any performance improvement from publishing the WebClient response on another scheduler.

Base Scenario

I've started with the simplest one:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<String>> =
    webClient
        .get()
        .uri("http://some-external-service/endpoint")
        .retrieve()
        .bodyToMono(String::class.java)
        .map { ResponseEntity.ok(it) }

Here's the architecture diagram:
[screenshot: architecture diagram, 2025-01-06 15:41:53]

So basically the flow is something like:

1. Send `only one` request to the app.
2. The app gets data from the mock server using `only one` WebClient.
3. Go to step 1.

I've run this scenario in 3 variants:

  • without .publishOn()
  • with .publishOn(Schedulers.parallel())
  • with .publishOn(Schedulers.boundedElastic())

For every one of them the results were similar, so I will post only one screenshot from Grafana.

  • About 8K RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 5% CPU usage (`max by (instance) (process_cpu_usage)`)

[screenshot: Grafana results, 2025-01-06 15:51:04]

So it's good to know that adding .publishOn() did not have any impact on the simplest app.

Complex Scenario

I've added:

  • decoding the response from the Mock Server:
data class MockServerResponse(
    val value: String,
)
  • encoding the response from the TestApp by simply returning List<String>

So now the app looks like this:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<List<String>>> =
    Flux
        .fromIterable(webClients)
        .flatMap { 
            it
                .getResponseFromWebClient()
                // comment if needed
                .publishOn(Schedulers.boundedElastic()) 
        }
        .collectList()
        .map { ResponseEntity.ok(it) }

I've also changed the architecture a little bit. Here's the diagram:
[screenshot: architecture diagram, 2025-01-06 16:06:43]

So basically the flow is something like:

1. Send `N` requests to the app.
2. For every request, the app gets data from the mock server using `M` WebClients.
3. Go to step 1.

I've run this scenario in 3 variants:

  • without .publishOn()
  • with .publishOn(Schedulers.parallel())
  • with .publishOn(Schedulers.boundedElastic())

Results with no .publishOn() and with .publishOn(Schedulers.parallel()) were similar:

  • About 240 RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 33% CPU usage (`max by (instance) (process_cpu_usage)`)
  • About 260 ms response times (`max by (instance) (http_server_requests_seconds{uri="/dummy/{id}", quantile="0.999"})`)

[screenshot: Grafana results, 2025-01-06 16:08:26]

Results for .publishOn(Schedulers.boundedElastic()) were better:

  • About 300 RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 53% CPU usage (`max by (instance) (process_cpu_usage)`)
  • About 185 ms response times (`max by (instance) (http_server_requests_seconds{uri="/dummy/{id}", quantile="0.999"})`)

[screenshot: Grafana results, 2025-01-06 16:13:05]

So adding .publishOn(Schedulers.boundedElastic()) brings performance benefits! ❤️

  • RPS: ~240 -> ~300
  • CPU Usage: ~33% -> ~53%
  • Response times: ~260 ms -> ~185 ms

Based on my tests I would say that when all WebClient threads are executing CPU-bound operations, .boundedElastic() shines ❤️

Few words about Schedulers

As far as I know:

  • parallel - every thread has its own task queue
  • boundedElastic - all threads share one task queue
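To illustrate why a shared queue can win under uneven load, here is a plain-Java sketch (my own hypothetical model with made-up task durations, not Reactor's actual scheduler implementation): with per-thread queues and round-robin assignment, all the heavy tasks can pile up on one worker while another sits idle; with a shared queue, any free worker takes the next task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueComparison {
    // Shared-queue model: all workers pull from one queue, so a
    // backed-up heavy task never blocks an idle worker.
    static long sharedQueueMs(int workers, Runnable[] tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        for (Runnable t : tasks) pool.submit(t);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Per-thread-queue model: tasks are pinned round-robin to workers,
    // so one worker's queue can stall behind heavy tasks.
    static long perThreadQueuesMs(int workers, Runnable[] tasks) throws Exception {
        ExecutorService[] pools = new ExecutorService[workers];
        for (int i = 0; i < workers; i++) pools[i] = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        for (int i = 0; i < tasks.length; i++) pools[i % workers].submit(tasks[i]);
        for (ExecutorService p : pools) {
            p.shutdown();
            p.awaitTermination(1, TimeUnit.MINUTES);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        Runnable slow = () -> { try { Thread.sleep(200); } catch (InterruptedException e) {} };
        Runnable fast = () -> { try { Thread.sleep(10); } catch (InterruptedException e) {} };
        // Unlucky mix: with 2 workers and round-robin, every slow task
        // lands on the same per-thread queue.
        Runnable[] tasks = { slow, fast, slow, fast, slow, fast, slow, fast };
        System.out.println("shared queue:      " + sharedQueueMs(2, tasks) + " ms");
        System.out.println("per-thread queues: " + perThreadQueuesMs(2, tasks) + " ms");
    }
}
```

With this mix, the per-thread variant takes roughly the sum of all slow tasks on one worker (~800 ms), while the shared queue spreads them across both workers (~430 ms). This is only one possible explanation for the boundedElastic results above, not a confirmed one.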

I also did a comparison:

Schedulers.newBoundedElastic(
    Runtime.getRuntime().availableProcessors(), 100_000, "customBounded"
)
vs
Schedulers.newParallel(
    "customParallel", Runtime.getRuntime().availableProcessors(), 
)

BoundedElastic was also better in that case.
I've tried to find out why, but I have no answer.

Maybe:

  • is it related to the task queue?
  • are those threads blocking/synchronizing somewhere? The Project Reactor docs on Schedulers say that .parallel() should not execute blocking code

Question

What do you think about publishing the response from WebClient on .boundedElastic() by default?
Also, with the possibility to override or disable it.

Proposal

I've focused only on the WebClient.Builder API, because IMO it matters more for the programming experience than the internals do.

fun create(number: Int, size: String): WebClient =
    webClientBuilder
        // disable this new option
        .disablePublishResponseOnAnotherThread()
        // publish response on another scheduler
        .publishResponseOn(schedulerProvidedByProgrammer)
        .build()
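For what it's worth, something close to this is already expressible today with an ExchangeFilterFunction on the builder. A sketch in Java against the current Spring API (note this hands off after the ClientResponse signal arrives, so body decoding is not necessarily moved off the event loop):

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.scheduler.Schedulers;

public class BoundedElasticWebClient {
    // A filter applies to every exchange; publishOn moves everything
    // downstream of the response signal onto boundedElastic.
    public static WebClient create() {
        return WebClient.builder()
                .filter((request, next) -> next.exchange(request)
                        .publishOn(Schedulers.boundedElastic()))
                .build();
    }
}
```

A built-in builder option like the one proposed above would make this behavior discoverable instead of requiring every team to rediscover the filter trick.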

Alternatives

In my tests I was just adding .publishOn() after getting the response from the WebClient. But WebClient threads also decode the response from the downstream service. Maybe we should apply .publishOn() even before that deserialization?

Also, if these tests make sense, maybe you could run some additional tests, just to double-confirm the results?

I didn't check database clients, but maybe they work in the same way?

Summary

Publishing the WebClient response on .boundedElastic() brings a performance improvement. It leads to:

  • shorter response times
  • higher RPS
  • higher CPU usage

Please let me know what you think about all of this 😄

Metadata

Labels: in: web (issues in web modules: web, webmvc, webflux, websocket); status: declined (a suggestion or change that we don't feel we should currently apply)