BUG/MEDIUM: sock: Remove FD_POLL_HUP during connect() if FD_POLL_ERR is not set

capflam · capflam · commit 7262433183f5 · 2024-11-27T12:16:25.000+01:00
epoll_wait() may return EPOLLUP and/or EPOLLRDHUP after an asynchronous connect(), to indicate that the peer accepted the connection then immediately closed before epoll_wait() returned. When this happens, sock_conn_check() is called to check whether or not the connection correctly established, and after that the receive channel of the socket is assumed to already be closed. This lets haproxy send the request at best (if RDHUP and not HUP) then immediately close. Over the last two years, there were a few reports about this spuriously happening on connections where network captures proved that the server had not closed at all (and sometimes even received the request and responded to it after haproxy had closed). The logs show that a successful connection is immediately reported on error after the request was sent. After investigations, it appeared that a EPOLLUP, or eventually a EPOLLRDHUP, can be reported by epool_wait() during the connect() but in sock_conn_check(), the connect() reports a success. So the connection is validated but the HUP is handled on the first receive and an error is reported. The same behavior could be observed on health-checks, leading HAProxy to consider the server as DOWN while it is not. The only explanation at this point is that it is a kernel bug, notably because it does not even match the documentation for connect() nor epoll. In addition for now it was only observed with Ubuntu kernels 5.4 and 5.15 and was never reproduced on any other one. We have no reproducer but here is the typical strace observed: socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 114 fcntl(114, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 setsockopt(114, SOL_TCP, TCP_NODELAY, [1], 4) = 0 connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = -1 EINPROGRESS (Operation now in progress) epoll_ctl(19, EPOLL_CTL_ADD, 114, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP, data={u32=114, u64=114}}) = 0 epoll_wait(19, [{events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=151, u64=151}}, {events=EPOLLIN, data={u32=59, u64=59}}, {events=EPOLLIN|EPOLLRDHUP, data={u32=114, u64=114}}], 200, 0) = 4 epoll_ctl(19, EPOLL_CTL_MOD, 114, {events=EPOLLOUT, data={u32=114, u64=114}}) = 0 epoll_wait(19, [{events=EPOLLOUT, data={u32=114, u64=114}}, {events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=10, u64=10}}, {events=EPOLLIN, data={u32=165, u64=165}}], 200, 0) = 4 connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = 0 sendto(114, "POST "..., 1009, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 1009 close(114) = 0 Some ressources about this issue: - https://www.spinics.net/lists/netdev/msg876470.html - haproxy/haproxy#1863 - haproxy/haproxy#2368 So, to workaround the issue, we have decided to remove FD_POLL_HUP flag on the FD during the connection establishement if FD_POLL_ERR is not reported too in sock_conn_check(). This way, the call to connect() is able to validate or reject the connection. At the end, if the HUP or RDHUP flags were valid, either connect() would report the error itself, or the next recv() would return 0 confirming the closure that the poller tried to report. EPOLL_RDHUP is only an optimization to save a syscall anyway, and this pattern is so rare that nobody will ever notice the extra call to recv(). Please note that at least one reporter confirmed that using poll() instead of epoll() also addressed the problem, so that can also be a temporary workaround for those discovering the problem without the ability to immediately upgrade. The event is accounted via a COUNT_IF(), to be able to spot it in future issue. Just in case. This patch should fix the issue #1863 and #2368. It may be related to #2751. It should be backported as far as 2.4. In 3.0 and below, the COUNT_IF() must be removed.
diff --git a/src/sock.c b/src/sock.c
@@ -945,6 +945,31 @@ int sock_conn_check(struct connection *conn)
 			goto done;
 		if (!(fdtab[fd].state & (FD_POLL_ERR|FD_POLL_HUP)))
 			goto wait;
+
+		/* Removing HUP if there is no ERR reported.
+		 *
+		 * After a first connect() returning EINPROGRESS, it seems
+		 * possible to have EPOLLHUP or EPOLLRDHUP reported by
+		 * epoll_wait() and turned to an error while the following
+		 * connect() will return a success. So the connection is
+		 * validated but the error is saved and reported on the first
+		 * subsequent read.
+		 *
+		 * We have no explanation for now. Why epoll report the
+		 * connection is closed while the connect() it able to validate
+		 * it ? no idea. But, it seems reasonnable in this case, and if
+		 * no error was reported, to remove the the HUP flag. At worst, if
+		 * the connection is really closed, this will be reported later.
+		 *
+		 * Only observed on Ubuntu kernel (5.4/5.15). See:
+		 *   - https://github.com/haproxy/haproxy/issues/1863
+		 *   - https://www.spinics.net/lists/netdev/msg876470.html
+		 */
+		if (unlikely((fdtab[fd].state & (FD_POLL_HUP|FD_POLL_ERR)) == FD_POLL_HUP)) {
+			COUNT_IF(1, "Removing FD_POLL_HUP if no FD_POLL_ERR to let connect() decide");
+			fdtab[fd].state &= ~FD_POLL_HUP;
+		}
+
 		/* error present, fall through common error check path */
 	}