Skip to content

Thread WzPendingWrites going in infinite loop with 100% CPU usage #4836

@johan-boule

Description

@johan-boule

@ManManson Hello, I have a reproducible issue with the new PendingWritesManager code when hosting a game on lobby.wz2100.net (I can't reproduce it on my own lobby).
When a game has been sitting in the lobby for a while (could be hours), the pending write thread is spinning with 100% CPU usage.
From my little understanding of the context, it happens when no players have joined the game, so the only write that has to be done is the "keep" packet sent to the lobby server every 25 seconds.

Below is the part of the code that leads to the loop (version 4.6.3):

// lib/netplay/pending_writes_manager.cpp
void PendingWritesManager::threadImplFunction()
{
	wzMutexLock(mtx_);
	while (!stopRequested_)
	{
		if (checkWritableRes.has_value() && checkWritableRes.value() != 0)
		{
			for (auto connIt = pendingWrites_.begin(); connIt != pendingWrites_.end();)
			{
				// Write data.
				const auto retSent = conn->sendImpl(writeQueue);
				if (retSent.has_value())
				{
					// Erase as much data as written.
					writeQueue.erase(writeQueue.begin(), writeQueue.begin() + retSent.value());
					if (writeQueue.empty())
					{
						pendingWrites_.erase(currentIt);  // Nothing left to write, delete from pending list.
						if (conn->deleteLaterRequested())
						{
							delete conn;
						}
					}
				}
				else
				{
					if (retSent.error() == std::errc::interrupted)
					{
						// Not an actual error, just try to send again later
						continue;
					}
					const auto connStatus = conn->connectionStatus();
						// ends up calling lib/netplay/tcp/poll_descriptor_set.h
						virtual net::result<int> poll(std::chrono::milliseconds timeout) override
						{
							int ret;
							int sockErr = 0;
							do
							{
								ret = pollImpl(timeout);
								if (ret == SOCKET_ERROR)
								{
									sockErr = getSockErr();
								}
							} while (ret == SOCKET_ERROR && (sockErr == EINTR || sockErr == EAGAIN));

							if (ret == SOCKET_ERROR)
							{
								return tl::make_unexpected(make_network_error_code(sockErr));
							}

							return ret;
						}

If I understand the logic of the code, since the send system call (i.e. sendImpl above) returned an error, but that wasn't std::errc::interrupted, we try to determine whether the socket is still connected via a call to connectionStatus(), that function end up usingon the poll system call (i.e. pollImpl above), to check whether there are data ready to read, with a timeout of zero. But I fail to understand why we would have data to read in this case (the lobby doesn't respond to "keep" packets), so it seems to me it just returns 0, indicating there's nothing ready.
So we keep the connection and the thread tries again to send data, leading to the loop never reaching a state of error.
I wonder why we don't initially check for other error codes from the send system call, especially std::errc::connection_reset.
If you could shed some light on the logic here or if you have an idea to fix something, I would very much appreciate.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions