
Conversation

@CuriousGeorgiy (Member) commented Sep 21, 2025

This patchset performs a major refactoring of the client code to replace all internal Connection usage with ConnectionImpl, and fixes several related bugs.

Closes #51
Closes #132
Closes #140
Closes #142
Closes #147
Closes #150
Closes #151

CuriousGeorgiy force-pushed the wait-method-fixes-2 branch 3 times, most recently from 214c640 to aa5e506 on September 23, 2025, 13:34
@drewdzzz (Collaborator) left a comment:

Thank you for the patch! I believe it improves connector usability quite a lot.

LOG_DEBUG("waitAny() called on connector without connections");
return std::nullopt;
}
for (auto *conn : m_Connections) {
@drewdzzz (Collaborator):

I think waitAny should still trigger the net provider to send pending requests (m_NetProvider.wait(0)).

@CuriousGeorgiy (Member, Author):

We don't have this logic in the other wait methods, so why should we have it in this one?

@drewdzzz (Collaborator):

Hm, you are right. I need to think about it.

The problem is:

  1. You've sent some requests and received 10 responses.
  2. While processing these responses, you are sending new requests.
  3. Until you've processed all 10 responses, no new requests will be sent, so you'll observe a latency spike after you're done with those 10 responses.

I'm not sure whether other methods are affected by this problem. But, IMHO, all wait methods should trigger the connector to send pending requests (or we should send them right when the operation is issued).
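
For concreteness, here is a minimal sketch of the suggestion using toy stand-in types (the classes and the `wait()` stub below are illustrative only, not the connector's real API): `waitAny` would poke the net provider with a zero timeout so that queued requests are flushed before it starts waiting.

```cpp
#include <optional>
#include <vector>

// Toy stand-ins for illustration only; the real connector classes differ.
struct NetProvider {
	// In the real connector, wait(0) is assumed to flush pending sends and
	// poll readiness without blocking; here it is a stub.
	int wait(int /*timeout_ms*/) { return 0; }
};

struct Conn {
	bool ready = false;
	bool hasReadyResponse() const { return ready; }
};

struct Connector {
	NetProvider m_NetProvider;
	std::vector<Conn *> m_Connections;

	std::optional<Conn *> waitAny(int timeout_ms)
	{
		// The suggestion: poke the provider with a zero timeout so
		// that queued requests are sent before we start waiting.
		m_NetProvider.wait(0);
		for (Conn *conn : m_Connections)
			if (conn->hasReadyResponse())
				return conn;
		(void)timeout_ms; // a real implementation would now enter its waiting loop
		return std::nullopt;
	}
};
```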

@CuriousGeorgiy (Member, Author) commented Oct 20, 2025:

> But, IMHO, all wait methods should trigger the connector to send pending requests (or we should send them right when the operation is issued).

I don't think you are right here. We should try to minimize the number of system calls where possible, because system calls cause context switches and kernel-to-user-space transitions. This is exactly what this patch is trying to do.

Instead, you propose to add an unconditional system call to each wait method call, even when it is not necessary and the user already has something to process.

Moreover, this system call may be useless when the user does not have any other pending requests.

> Until you've processed all 10 responses, no new requests will be sent, so you'll observe a latency spike after you're done with those 10 responses.

I think the problem you describe should be solved the other way around: APIs that create requests should actively send them rather than postponing the sends to wait method calls.
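
A minimal sketch of that alternative, using a toy type rather than the connector's real API: the request-creating call flushes the output buffer eagerly, so no wait method has to trigger the send.

```cpp
#include <cstdint>
#include <string>

using rid_t = std::uint64_t;

// Toy sketch of the alternative proposed above: the request-creating call
// flushes the output buffer right away instead of deferring the send to a
// wait method. None of this mirrors the connector's real API.
struct ToyConnection {
	std::string out_buf;
	rid_t next_id = 0;

	void flush()
	{
		// A real connector would issue a non-blocking send(2) here;
		// the toy just drops the buffered bytes.
		out_buf.clear();
	}

	rid_t call(const std::string &encoded_request)
	{
		out_buf += encoded_request; // enqueue the request
		flush();                    // eager send: no wait*() call needed
		return next_id++;
	}
};
```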

drewdzzz assigned CuriousGeorgiy and unassigned drewdzzz on Oct 8, 2025
CuriousGeorgiy force-pushed the wait-method-fixes-2 branch 3 times, most recently from 14767b4 to 9a9401c on October 13, 2025, 17:05
`Connection`s model shared pointers to `ConnectionImpl` and are intended
for external use by library consumers. Using them internally in the
Connector can cause bugs. For instance, we internally store `Connection`
objects in the LibevNetProvider, which adds an extra reference to them
and prevents them from being deleted, even when all user objects are dead.
We also leak connections in `close`, since we do not erase them from the
`m_ReadyToSend` and `m_ReadyToDecode` sets.

To fix this, let's internally use raw `ConnectionImpl` pointers, which model weak pointers.
We rely on the fact that the lifetime of these weak pointers is tied to the
lifetime of the shared user objects.

The ideal solution would be to remove the `Connection`<-`ConnectionImpl`
constructor, but we still need to return new `Connection` objects from
methods like `waitAny`, so let's just enforce the rule that we must not use
`Connection` objects internally.

While we are here, let's also fix all the formatting issues that the linter
reports for the refactored code.

Closes tarantool#140
Closes tarantool#147
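
For illustration, a toy model of the ownership scheme this commit describes (the member names mirror the commit message, but the types are simplified stand-ins): the user-facing handle owns the implementation, while the connector keeps only non-owning pointers that it erases from all of its sets on close.

```cpp
#include <memory>
#include <set>

// Toy model of the ownership scheme described above; the names mirror the
// connector's classes, but the layout is simplified for illustration.
struct ConnImpl {
	int fd = -1;
};

// User-facing handle: shared ownership of the implementation.
struct Conn {
	std::shared_ptr<ConnImpl> impl = std::make_shared<ConnImpl>();
};

// The connector keeps only non-owning (weak) pointers, so it never extends a
// connection's lifetime past the last user handle, and close() can erase the
// pointer from every internal set.
struct ToyConnector {
	std::set<ConnImpl *> m_Connections;
	std::set<ConnImpl *> m_ReadyToSend;
	std::set<ConnImpl *> m_ReadyToDecode;

	void registerConnection(ConnImpl *impl) { m_Connections.insert(impl); }

	void close(ConnImpl *impl)
	{
		m_Connections.erase(impl);
		m_ReadyToSend.erase(impl);   // the leak fixed by this commit
		m_ReadyToDecode.erase(impl);
	}
};
```
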
Currently, to avoid a double-close of a connection, we check for the `SS_DEAD`
status on its stream. However, the `SS_DEAD` status can be set for a
multitude of reasons without the connection being closed or even connected.
For instance, we set it if a send/recv fails unrecoverably.

To fix this, let's rely on the fact that a connection is open iff the file
descriptor of its stream is valid (i.e., not -1). This approach seems to
be portable to the Windows platform too.

Closes tarantool#142
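
A minimal POSIX-only sketch of the check this commit describes (toy type, not the connector's real stream class): openness is derived from the descriptor itself, and close() becomes idempotent.

```cpp
#include <unistd.h>

// Toy sketch of the "open iff the descriptor is valid" check described above.
struct ToyStream {
	int fd = -1;

	// The connection is considered open exactly when its stream holds a
	// valid descriptor; status flags such as SS_DEAD are not consulted.
	bool isOpen() const { return fd >= 0; }

	void close()
	{
		if (!isOpen())
			return; // guards against a double-close
		::close(fd);
		fd = -1;
	}
};
```
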
Currently, all connection bookkeeping is done opaquely through the network
providers which, in turn, also do this bookkeeping opaquely using system
interfaces (e.g., libev, epoll). Because of this, we cannot handle the cases
when waitAny is called and there are no connections (tarantool#51) or when a
connection already has ready responses (tarantool#132). In order to improve
`waitAny` robustness, we need to add connection bookkeeping to the Connector.

We should also move the timer start to the beginning of the waiting loop,
since the overhead of the preceding checks should not be counted toward the
waiting time.

Closes tarantool#51
Needed for tarantool#132

Currently, `waitAny` does not account for connections that already have
ready responses. Let's handle this by going through the set of owned
connections and checking them for ready responses. We should do this before
starting the timer, since the checking overhead should not be counted toward
the waiting time.

Closes tarantool#132
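
Taken together, the two commits above give `waitAny` roughly the following shape; this is a simplified sketch with stand-in types, not the actual implementation.

```cpp
#include <chrono>
#include <optional>
#include <set>

// Toy sketch of the waitAny() flow described by the two commits above; the
// types are simplified stand-ins, not the connector's real classes.
struct ConnImpl {
	bool hasReadyResponse() const { return false; }
};

struct ToyConnector {
	std::set<ConnImpl *> m_Connections; // the bookkeeping added for gh-51

	std::optional<ConnImpl *> waitAny(std::chrono::milliseconds timeout)
	{
		// gh-51: with no connections there is nothing to wait for.
		if (m_Connections.empty())
			return std::nullopt;
		// gh-132: a connection may already have a ready response.
		for (ConnImpl *conn : m_Connections)
			if (conn->hasReadyResponse())
				return conn;
		// The timer starts only here, so the checks above are not
		// billed to the caller's timeout.
		auto deadline = std::chrono::steady_clock::now() + timeout;
		while (std::chrono::steady_clock::now() < deadline) {
			/* poll the net provider for new events here */
			break;
		}
		return std::nullopt;
	}
};
```
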
@CuriousGeorgiy (Member, Author) commented Oct 13, 2025:

@mkostoevr @drewdzzz I got pissed off by #150 and discovered bug #151 while fixing it. Please take a look at the two new commits I have added to fix those, and at the additional bugfix for #147 that I added in the scope of `client: replace all internal Connection usage with ConnectionImpl` (namely, erasing the closed connection from the `m_ReadyToSend` and `m_ReadyToDecode` sets in `Connector::close`).

Currently, if we time out waiting for `connect` using `poll`, we set
`connect_errno` to `ETIMEDOUT`. However, after we exit the `connect`
interruption loop, we unconditionally set it to `errno`, which we do not
even clear after checking the `connect` result. As a result, the timeout
error gets masked by the in-progress connection error.

To fix this, let's:
1. Clear `errno` after checking the `connect` result;
2. Check the value of `connect_errno` after exiting the `connect`
   interruption loop, before setting it to `errno`;
3. Zero out `socket_errno` and `connect_errno` on each `addr_info` loop
   iteration.

Closes tarantool#151
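
A simplified sketch of the errno handling this commit describes; the variable names follow the commit message, but the `addr_info` loop and `EINTR` handling are omitted, so this is illustrative rather than the actual code.

```cpp
#include <cerrno>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Toy sketch of the errno handling described above.
static int
toy_connect(const struct sockaddr *addr, socklen_t len, int timeout_ms)
{
	// (3) both error slots are reset for every addr_info entry.
	int socket_errno = 0, connect_errno = 0;
	int sock = socket(addr->sa_family, SOCK_STREAM, 0);
	if (sock < 0) {
		socket_errno = errno;
		return -socket_errno;
	}
	if (connect(sock, addr, len) != 0 && errno != EINPROGRESS) {
		connect_errno = errno;
		close(sock);
		return -connect_errno;
	}
	errno = 0; // (1) clear errno after checking the connect result
	struct pollfd pfd = {sock, POLLOUT, 0};
	if (poll(&pfd, 1, timeout_ms) == 0)
		connect_errno = ETIMEDOUT; // a timeout must not be masked...
	if (connect_errno == 0)            // (2) ...so errno is only a fallback
		connect_errno = errno;
	if (connect_errno != 0) {
		close(sock);
		return -connect_errno;
	}
	return sock;
}
```
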
`ClientMultithreadTest` fails on macOS because the `connect`s are timing
out. To fix this, let's increase the `connect_timeout` for
`ClientMultithreadTest` to 10 seconds.

Closes tarantool#150
mkostoevr self-requested a review on October 14, 2025, 07:48
 template <class BUFFER, class NetProvider>
 void
-hasSentBytes(Connection<BUFFER, NetProvider> &conn, size_t bytes)
+hasSentBytes(ConnectionImpl<BUFFER, NetProvider> *conn, size_t bytes)
A collaborator commented:

What do you think about moving all these functions to ConnectionImpl?

  1. It's weird to see non-method functions here at all.
  2. These utility functions will be hidden in ConnectionImpl, so only advanced users will see them (only an implementor of their own NetProvider will work with ConnectionImpl directly).
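
For illustration, a toy sketch of how such a helper could look as a member; only the idea of moving the free function into the class is taken from the comment above, the field and body are made up.

```cpp
#include <cstddef>

// Toy sketch of the suggestion: make the helper a ConnectionImpl method
// rather than a free function.
struct ToyConnectionImpl {
	std::size_t pending_bytes = 0;

	// As a member, the helper lives next to the data it mutates and is
	// only visible to code that already works with ConnectionImpl
	// (i.e., custom NetProvider implementations).
	void hasSentBytes(std::size_t bytes)
	{
		pending_bytes = bytes < pending_bytes ? pending_bytes - bytes : 0;
	}
};
```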

LOG_DEBUG("waitAny() called on connector without connections");
return std::nullopt;
}
for (auto *conn : m_Connections) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, you are right. I need to think about it.

The problem is:

  1. You've sent some requests and received 10 responses.
  2. While processing these responses, you are sending new requests.
  3. Until you've processed all 10 responses, no new requests will be sent, so you'll observe latency spike after you're done with those 10 responses.

I'm not sure other methods are affected by this problem. But, IMHO, all wait methods should trigger connector to send pending requests (or we should send them right on operation).

drewdzzz assigned mkostoevr and unassigned drewdzzz on Oct 14, 2025