-
Notifications
You must be signed in to change notification settings - Fork 917
Description
We have a hidden issue in handling the fatal connection errors with epoll.
If EReadStatus CChannel::recvfrom(sockaddr* addr, CPacket& packet) const returns RST_ERROR, the epoll never gets notified about the error on that socket.
The test condition can be easily reproduced by returning RST_ERROR from the recvfrom function.
Also reverting changes of #468 (i.e. treating WSAECONNRESET as a fatal error on Windows) make the issue reproduceable on Windows (see the ConnectionTimeoutTest in the test folder).
What is happening in case recvfrom DOES NOT return an error?
The worker thread CRcvQueue::worker is in the while() loop. The socket, that attempts to connect, is in the list m_lRendezvousID. The CRcvQueue::worker thread calls void CRendezvousQueue::updateConnStatus() on every loop iteration. The function updates the socket status, in particular, when TTL is expired and the socket has to be closed.
// partial pseudocode
void* CRcvQueue::worker(void* param)
{
// (...)
while (!self->m_bClosing)
{
EReadStatus rst = self->worker_RetrieveUnit(Ref(id), Ref(unit), &sa);
if (rst == RST_ERROR)
{
// !!! This break will skip updateConnStatus on the socket
// and epoll is never notified
break;
}
self->m_pRendezvousQueue->updateConnStatus(rst, cst, unit->m_Packet);
}
return;
}
One of the following calls to void CRendezvousQueue::updateConnStatus() triggers an opened socket to be closed by timeout.
After this the epoll gets notified about the error on that socket. And then all the threads are closed as expected.
What is happening in case recvfrom DOES return an error?
If the recvfrom return an error, the worker thread CRcvQueue::worker loop breaks and thread function returns. This way the next call to the updateConnStatus() is skipped and never happens. Therefore the epoll is never notified about the error on this socket. srt_epoll_wait can only return on a wait timeout, instead of handling the error event on the socket.
This is probably a minor issue, because in general we still have the epoll wait timeout, that handles these conditions.