Description
We have a backend service, referred to as 'Client,' written in Spring WebFlux and Java. The Client interacts with another backend service, 'Server,' which is developed using Spring Boot and Java.
Currently, the Client communicates with the Server using an ingress URL: http://server.example.com:80. However, we want to introduce an AWS Application Load Balancer (ALB) between the Client and the Server to enable canary deployments. The goal is to gradually shift traffic to canary pods during the Server's deployment and, once 100% of traffic is directed to the canary pods, switch it over to the main pods.
The ALB URL for the Server is: https://server-alb.example.com:443. When calling the Server through this ALB, we are observing a high number of 4xx (460 HTTP code) errors on the load balancer, along with frequent timeouts between the Client and the Server.
We have configured the WebClient timeouts in the Client as follows:
- Response Timeout: 100ms
- Connect Timeout: 50ms
- Read Timeout: 10ms
- Write Timeout: 10ms

Additionally, we are using Mono.timeout() in Spring WebFlux with a 100ms timeout.
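
For reference, here is a minimal sketch of how timeouts like these are commonly wired into a Reactor Netty backed WebClient. It is a configuration fragment, not the Client's actual code: the class name `ServerClientConfig`, the method names, and the `/hypothetical-endpoint` path are assumptions for illustration, and it assumes the read and write timeouts are implemented with Netty's `ReadTimeoutHandler` and `WriteTimeoutHandler`.

```java
import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;
import reactor.netty.http.client.HttpClient;

// Hypothetical configuration class; names are illustrative only.
class ServerClientConfig {

    static WebClient serverWebClient() {
        HttpClient httpClient = HttpClient.create()
            // Connect Timeout: 50ms
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 50)
            // Response Timeout: 100ms
            .responseTimeout(Duration.ofMillis(100))
            // Read/Write Timeout: 10ms each, as Netty channel handlers (assumed wiring)
            .doOnConnected(conn -> conn
                .addHandlerLast(new ReadTimeoutHandler(10, TimeUnit.MILLISECONDS))
                .addHandlerLast(new WriteTimeoutHandler(10, TimeUnit.MILLISECONDS)));

        return WebClient.builder()
            .baseUrl("https://server-alb.example.com:443")
            .clientConnector(new ReactorClientHttpConnector(httpClient))
            .build();
    }

    // Reactive-level timeout on top of the transport-level ones.
    static Mono<String> callServer(WebClient client) {
        return client.get()
            .uri("/hypothetical-endpoint") // placeholder path, not from the issue
            .retrieve()
            .bodyToMono(String.class)
            .timeout(Duration.ofMillis(100)); // Mono.timeout(): 100ms
    }
}
```

If the real configuration differs from this sketch (for example, the timeouts are set per request rather than on the shared `HttpClient`), that detail would be worth adding to the description.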
Our system is handling 1000 TPS with 10 pods (4Gi, 8 CPU cores per pod). The Server is receiving the same load at the same TPS rate, with 10 pods configured with 8Gi, 8 CPU cores per pod. The P999 of the Server's API response time is under 30ms.
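
As a quick sanity check on the numbers above, the sketch below compares the timeouts that bound waiting for a response against the stated 30 ms upper bound on the Server's P999. The helper class `TimeoutHeadroom` is hypothetical, and the comparison assumes the read timeout applies while the Client waits for response data. The Server-side P999 also excludes the ALB hop and any TLS handshake, so client-observed latency may be higher.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Client: flags timeouts that are
// shorter than a given latency bound.
class TimeoutHeadroom {

    // Returns the names of timeouts strictly shorter than latencyBound,
    // in insertion order.
    static List<String> shorterThan(Map<String, Duration> timeouts, Duration latencyBound) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Duration> entry : timeouts.entrySet()) {
            if (entry.getValue().compareTo(latencyBound) < 0) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    // Timeouts from the description that can fire while waiting for a response.
    static Map<String, Duration> reportedTimeouts() {
        Map<String, Duration> timeouts = new LinkedHashMap<>();
        timeouts.put("response", Duration.ofMillis(100));
        timeouts.put("read", Duration.ofMillis(10));
        timeouts.put("Mono.timeout", Duration.ofMillis(100));
        return timeouts;
    }

    public static void main(String[] args) {
        Duration p999Bound = Duration.ofMillis(30); // "P999 ... under 30ms"
        System.out.println(shorterThan(reportedTimeouts(), p999Bound)); // prints [read]
    }
}
```

Only the 10 ms read timeout falls below the 30 ms bound. A P999 under 30 ms does not rule out responses slower than 10 ms, so under the stated assumption this timeout could fire on requests the Server would otherwise answer in time.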
Things we have already investigated:
- Some requests are not reaching the Server when the 460 HTTP code is returned.
- After consulting with the AWS team, they indicated that the issue appears to be on the Client's side, as the 460 error suggests that the Client is closing the connection.
- A FIN signal has been observed from the Client to the Server in the TCP dump.
- We suspect that Mono.timeout() could be causing the Client to close the connections prematurely.
If anyone has experience working with Spring WebFlux and AWS ALB, could you please share potential reasons for the high ELB 4xx (460 HTTP code) errors and timeouts when calling the Server via the load balancer? Any insights would be greatly appreciated.