
[PERFORMANCE IMPROVEMENT] Only one line! 🤯 WebClient should publish response on boundedElastic #34199

@mateusz-nalepa

Description

ETA: 5 minutes

Hello! 👋

TL;DR

Just add:

.publishOn(Schedulers.boundedElastic()) // or maybe Schedulers.parallel()?

Somewhere in WebClient internals to improve app performance 😄

Context

Recently I was deep diving into a Spring WebFlux + Spring WebClient app to figure out which parts of the code are executed by which thread. It turns out that CPU-bound operations, even ones as simple as encoding an object to JSON, are executed on WebClient threads by default. I've created an MVCE app for this. Basically the whole app looks like this:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<AppResponse>> =
    webClient
        .get()
        .uri("http://some-external-service/endpoint")
        .retrieve()
        .bodyToMono(MockServerResponse::class.java)
        // comment this line if needed
        .publishOn(Schedulers.parallel())
        .map {
            heavyCpuOperation()
            it
        }
        .map { ResponseEntity.ok(AppResponse(it.data)) }

private fun heavyCpuOperation() {
    var bigInteger = BigInteger.ZERO
    for (i in 0..500_000) {
        bigInteger = bigInteger.add(BigInteger.valueOf(i.toLong()))
    }
}

Here are the results:

  • Logs without the .publishOn() operator
reactor-http-nio-3 ### com.nalepa.publishon.AppEndpoint ###
reactor-http-nio-3 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Start processing request
http-client-nio-2 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonDecoder ### Decoding webClient response
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### WEBCLIENT: I have a response from external service
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Started heavy operation
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Ended heavy operation
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonEncoder ### Encoding endpoint response
reactor-http-nio-3 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Ended processing request
  • Logs with the .publishOn(Schedulers.parallel()) operator
reactor-http-nio-4 ### com.nalepa.publishon.AppEndpoint ###
reactor-http-nio-4 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Start processing request
http-client-nio-2 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket
http-client-nio-2 ### org.springframework.http.codec.json.Jackson2JsonDecoder ### Decoding webClient response
http-client-nio-2 ### com.nalepa.publishon.AppEndpoint ### WEBCLIENT: I have a response from external service
parallel-1 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Started heavy operation
parallel-1 ### com.nalepa.publishon.AppEndpoint ### CPU OPERATION: Ended heavy operation
parallel-1 ### org.springframework.http.codec.json.Jackson2JsonEncoder ### Encoding endpoint response
parallel-1 ### com.nalepa.publishon.AppEndpoint ### ENDPOINT: Ended processing request
reactor-http-nio-4 ### io.netty.channel.DefaultChannelPipeline$HeadContext ### Writing data to socket

As you can see, the CPU operation is executed on an HTTP client thread when there is no .publishOn(). I've decided to run some tests related to this.
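The thread hand-off that .publishOn() performs can be modeled with plain executors. This is just a sketch of the idea in Java (the executor and thread names are made up; it is not how Reactor is implemented internally):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PublishOnSketch {
    // Rough model of what .publishOn() does: the upstream value is produced
    // on one thread (standing in here for the Netty event loop) and handed
    // off to a second executor before the downstream map/encode work runs.
    public static String[] threadsUsed() throws Exception {
        ExecutorService eventLoop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "http-client-nio-sim"));
        ExecutorService worker =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "bounded-elastic-sim"));
        try {
            // "retrieve" the response on the event-loop thread
            String producedOn = eventLoop.submit(() -> Thread.currentThread().getName()).get();
            // after the hand-off, the heavy map() runs on the worker thread
            String mappedOn = worker.submit(() -> Thread.currentThread().getName()).get();
            return new String[] { producedOn, mappedOn };
        } finally {
            eventLoop.shutdown();
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] t = threadsUsed();
        System.out.println("produced on: " + t[0] + ", mapped on: " + t[1]);
    }
}
```

Without the hand-off, both steps run on the "event loop" thread, which is exactly what the first log listing above shows.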

Few words about testing

Dependencies used:

  • Spring Boot 3.4.1
  • Java 21

Platform:

  • MacBook Air M2, 16GB RAM

Testing App:

I know that for testing it's good to test many scenarios, many schedulers, etc. I've done that. In this issue I'm including results for the so-called Base and Complex scenarios. Also, maybe I'm missing something and these tests simply don't make any sense; if so, please let me know!

In the tests, I was also ignoring the first 2-3 minutes of results, due to JVM warmup.

Performance Testing

I've decided to run some tests to find out whether there is any performance improvement from publishing the WebClient response on another scheduler.

Base Scenario

I've started with the simplest one:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<String>> =
    webClient
        .get()
        .uri("http://some-external-service/endpoint")
        .retrieve()
        .bodyToMono(String::class.java)
        .map { ResponseEntity.ok(it) }

Here's the architecture diagram:
[screenshot: architecture diagram, 2025-01-06 15:41:53]

So basically the flow is something like:

1. Send `only one` request to the app.
2. The app gets data from the mock server using `only one` WebClient.
3. Go to step 1.

I've run this scenario in 3 variants:

  • without .publishOn()
  • with .publishOn(Schedulers.parallel())
  • with .publishOn(Schedulers.boundedElastic())

For every one of them the results were similar, so I will post only one screenshot from Grafana.

  • About 8K RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 5% CPU usage (`max by (instance) (process_cpu_usage)`)

[screenshot: Grafana results, 2025-01-06 15:51:04]

So it's good to know that adding .publishOn() did not have any impact on the simplest app.

Complex Scenario

I've added:

  • decoding the response from the Mock Server:
data class MockServerResponse(
    val value: String,
)
  • encoding the response from the TestApp by simply returning List<String>

So now the app looks like this:

@GetMapping("/endpoint")
fun endpoint(): Mono<ResponseEntity<List<String>>> =
    Flux
        .fromIterable(webClients)
        .flatMap { 
            it
                .getResponseFromWebClient()
                // comment if needed
                .publishOn(Schedulers.boundedElastic()) 
        }
        .collectList()
        .map { ResponseEntity.ok(it) }

I've also changed the architecture a little bit. Here's the diagram:
[screenshot: architecture diagram, 2025-01-06 16:06:43]

So basically the flow is something like:

1. Send `N` requests to the app.
2. For every request, the app gets data from the mock server using `M` WebClients.
3. Go to step 1.

I've run this scenario in 3 variants:

  • without .publishOn()
  • with .publishOn(Schedulers.parallel())
  • with .publishOn(Schedulers.boundedElastic())

Results with no .publishOn() and with .publishOn(Schedulers.parallel()) were similar:

  • About 240 RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 33% CPU usage (`max by (instance) (process_cpu_usage)`)
  • About 260 ms response times (`max by (instance) (http_server_requests_seconds{uri="/dummy/{id}", quantile="0.999"})`)

[screenshot: Grafana results, 2025-01-06 16:08:26]

Results for .publishOn(Schedulers.boundedElastic()) were better:

  • About 300 RPS (`sum by (instance) (irate(http_server_requests_seconds_count[15s]))`)
  • About 53% CPU usage (`max by (instance) (process_cpu_usage)`)
  • About 185 ms response times (`max by (instance) (http_server_requests_seconds{uri="/dummy/{id}", quantile="0.999"})`)

[screenshot: Grafana results, 2025-01-06 16:13:05]

So adding .publishOn(Schedulers.boundedElastic()) brings performance benefits! ❤️

  • RPS: ~240 -> ~300
  • CPU Usage: ~33% -> ~53%
  • Response times: ~260 ms -> ~185 ms

Based on my tests I would say that when all WebClient threads are executing CPU-bound operations, .boundedElastic() shines ❤️

Few words about Schedulers

As far as I know:

  • parallel - every thread has its own task queue
  • boundedElastic - all threads share one task queue
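To illustrate why a shared queue can win under uneven load, here is a plain-Java sketch (my own hypothetical model with made-up task durations, not Reactor's actual scheduler implementation): with per-thread queues and round-robin assignment, all the heavy tasks can pile up on one worker while another sits idle; with a shared queue, any free worker takes the next task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueComparison {
    // Shared-queue model: all workers pull from one queue, so a
    // backed-up heavy task never blocks an idle worker.
    static long sharedQueueMs(int workers, Runnable[] tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        for (Runnable t : tasks) pool.submit(t);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Per-thread-queue model: tasks are pinned round-robin to workers,
    // so one worker's queue can stall behind heavy tasks.
    static long perThreadQueuesMs(int workers, Runnable[] tasks) throws Exception {
        ExecutorService[] pools = new ExecutorService[workers];
        for (int i = 0; i < workers; i++) pools[i] = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        for (int i = 0; i < tasks.length; i++) pools[i % workers].submit(tasks[i]);
        for (ExecutorService p : pools) {
            p.shutdown();
            p.awaitTermination(1, TimeUnit.MINUTES);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        Runnable slow = () -> { try { Thread.sleep(200); } catch (InterruptedException e) {} };
        Runnable fast = () -> { try { Thread.sleep(10); } catch (InterruptedException e) {} };
        // Unlucky mix: with 2 workers and round-robin, every slow task
        // lands on the same per-thread queue.
        Runnable[] tasks = { slow, fast, slow, fast, slow, fast, slow, fast };
        System.out.println("shared queue:      " + sharedQueueMs(2, tasks) + " ms");
        System.out.println("per-thread queues: " + perThreadQueuesMs(2, tasks) + " ms");
    }
}
```

With this mix, the per-thread variant takes roughly the sum of all slow tasks on one worker (~800 ms), while the shared queue spreads them across both workers (~430 ms). This is only one possible explanation for the boundedElastic results above, not a confirmed one.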

I also did a comparison:

Schedulers.newBoundedElastic(
    Runtime.getRuntime().availableProcessors(), 100_000, "customBounded"
)
vs
Schedulers.newParallel(
    "customParallel", Runtime.getRuntime().availableProcessors(), 
)

BoundedElastic was also better in that case.
I've tried to find out why, but I have no answer.

Maybe:

  • is it related to the task queue?
  • are those threads blocking/synchronizing somewhere? The Project Reactor docs on Schedulers say that .parallel() should not execute blocking code

Question

What do you think about publishing the response from WebClient on .boundedElastic() by default?
Also, with the possibility to override or disable it.

Proposal

I've focused only on the WebClient.Builder API, because IMO it matters more for the programming experience than the internals do.

fun create(number: Int, size: String): WebClient =
    webClientBuilder
        // disable this new option
        .disablePublishResponseOnAnotherThread()
        // publish response on another scheduler
        .publishResponseOn(schedulerProvidedByProgrammer)
        .build()
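For what it's worth, something close to this is already expressible today with an ExchangeFilterFunction on the builder. A sketch in Java against the current Spring API (note this hands off after the ClientResponse signal arrives, so body decoding is not necessarily moved off the event loop):

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.scheduler.Schedulers;

public class BoundedElasticWebClient {
    // A filter applies to every exchange; publishOn moves everything
    // downstream of the response signal onto boundedElastic.
    public static WebClient create() {
        return WebClient.builder()
                .filter((request, next) -> next.exchange(request)
                        .publishOn(Schedulers.boundedElastic()))
                .build();
    }
}
```

A built-in builder option like the one proposed above would make this behavior discoverable instead of requiring every team to rediscover the filter trick.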

Alternatives

In my tests I was just adding .publishOn() after getting the response from the WebClient. But WebClient threads also decode the response from the downstream service. Maybe we should apply .publishOn() even before that deserialization?

Also, if these tests make sense, maybe you could run some additional tests, just to double-confirm the results?

I didn't check database clients, but maybe they work in the same way?

Summary

Publishing the WebClient response on .boundedElastic() brings a performance improvement. It leads to:

  • shorter response times
  • higher RPS
  • higher CPU usage

Please let me know what you think about all of this 😄

Metadata

Labels: in: web (issues in web modules: web, webmvc, webflux, websocket); status: declined (a suggestion or change that we don't feel we should currently apply)