Thread WzPendingWrites going in infinite loop with 100% CPU usage

@ManManson Hello, I have a reproducible issue with the new PendingWritesManager code when hosting a game on lobby.wz2100.net (I can't reproduce it on my own lobby).
When a game has been sitting in the lobby for a while (could be hours), the pending write thread is spinning with 100% CPU usage.
From my little understanding of the context, it happens when no players have joined the game, so the only write that has to be done is the "keep" packet sent to the lobby server every 25 seconds. 

Below is the part of the code that leads to the loop (version 4.6.3):

```
// lib/netplay/pending_writes_manager.cpp
void PendingWritesManager::threadImplFunction()
{
	wzMutexLock(mtx_);
	while (!stopRequested_)
	{
		if (checkWritableRes.has_value() && checkWritableRes.value() != 0)
		{
			for (auto connIt = pendingWrites_.begin(); connIt != pendingWrites_.end();)
			{
				// Write data.
				const auto retSent = conn->sendImpl(writeQueue);
				if (retSent.has_value())
				{
					// Erase as much data as written.
					writeQueue.erase(writeQueue.begin(), writeQueue.begin() + retSent.value());
					if (writeQueue.empty())
					{
						pendingWrites_.erase(currentIt);  // Nothing left to write, delete from pending list.
						if (conn->deleteLaterRequested())
						{
							delete conn;
						}
					}
				}
				else
				{
					if (retSent.error() == std::errc::interrupted)
					{
						// Not an actual error, just try to send again later
						continue;
					}
					const auto connStatus = conn->connectionStatus();
						// ends up calling lib/netplay/tcp/poll_descriptor_set.h
						virtual net::result<int> poll(std::chrono::milliseconds timeout) override
						{
							int ret;
							int sockErr = 0;
							do
							{
								ret = pollImpl(timeout);
								if (ret == SOCKET_ERROR)
								{
									sockErr = getSockErr();
								}
							} while (ret == SOCKET_ERROR && (sockErr == EINTR || sockErr == EAGAIN));

							if (ret == SOCKET_ERROR)
							{
								return tl::make_unexpected(make_network_error_code(sockErr));
							}

							return ret;
						}
```
If I understand the logic of the code, since the `send` system call (i.e. `sendImpl` above) returned an error, but that wasn't `std::errc::interrupted`, we try to determine whether the socket is still connected via a call to `connectionStatus()`, that function end up usingon the `poll` system call (i.e. `pollImpl` above), to check whether there are data ready to read, with a timeout of zero. But I fail to understand why we would have data to read in this case (the lobby doesn't respond to "keep" packets), so it seems to me it just returns 0, indicating there's nothing ready.
So we keep the connection and the thread tries again to send data, leading to the loop never reaching a state of error.
I wonder why we don't initially check for other error codes from the `send` system call, especially `std::errc::connection_reset`.
If you could shed some light on the logic here or if you have an idea to fix something, I would very much appreciate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Thread WzPendingWrites going in infinite loop with 100% CPU usage #4836

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Thread WzPendingWrites going in infinite loop with 100% CPU usage #4836

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions