libboostasio fully refactored#550
libboostasio fully refactored#550paolopas wants to merge 12 commits intoCopernicaMarketingSoftware:masterfrom
Conversation
|
As far as heartbeat management is concerned, in my opinion there is still one thing missing.
I haven't been able to find where this feature is actually implemented. If it isn't implemented in the library, I can tell you that AMQP-CPP/include/amqpcpp/libev.h Lines 286 to 299 in 839f74f Maybe a line of code like if (_timeout != 0 && (events & EV_WRITE)) _next = ev_now(_loop) + std::max(_timeout / 2, 1);or a better code layout that avoids duplication ( The same feature can obviously also be implemented in the |
|
I should also point out that issue #479 is easily fixable, although the reason I'm mentioning it here may not seem obvious. With my proposed implementation (and even the current one) there is no apparent reason why a Well, releasing the socket certainly prevents the race condition described in the issue, and it also eliminates any callbacks related to pending operations on the socket, see posix::basic_stream_descriptor::release documentation So when you want to destroy the Watcher, you might need to release the socket and also cancel the timer, which is essentially equivalent to calling the destructor. Since both of these operations can be repeated without causing problems (subsequent releases do nothing), you can write a method that can also be called again by the destructor. All this to finally say that clearing the |
|
As expected, the problem of the delayed destruction of the I discuss the details in issue #479, but here I'll simply say that in my tests, the implementation proposed with this PR behaves exactly like the current version (except for being faster), so the proposed version also needs the same fix. #479 demonstrates how important it is for callbacks to be as simple as possible. My version (which eliminates the need to include a |
|
@EmielBruijntjes, |
|
Ok, sorry for the mess with the comments but I wanted the code to be clear. I think that's all for now, it's time to do more testing. Even though expose
|
|
I know, this PR is getting too long... |
|
I'm still not sure that the condition that should solve #424 is 100% reliable. |
|
Thanks for this contribution. This PR covers a significant amount of ground. Because this project is used in production, we need to be extremely cautious about introducing regressions. Reviewing a change of this magnitude in one go makes it difficult to spot potential edge cases. Could you please split this into smaller, atomic PRs? I recommend an iterative approach: start with the code cleanup/refactoring first, and then add the functional changes in subsequent PRs. I would be happy to prioritize reviewing those smaller PRs to get them merged quickly. |
|
I think now #424 is really fixed. @zerodefect I understand well, but in the meantime here's the whole story. I myself want to do several more tests. Meanwhile, for those who want to have a comprehensive idea, here's where we are Important |
|
But why did you ask me to do this if you wanted to do it yourself?
|
|
Your response came across as dismissive. You responded with:
You understand, but you don't indicate that you are going to do what I suggested. It's not my repo. I don't make the decisions around here. |
|
then it means I have to thank @EmielBruijntjes for this... I'm still happy that in the end you understood the value of what was already given away. |
|
Yes, @zerodefect, okay, we didn't understand each other. But it's the reason for all this haste that bothers me. Considering that @devg (#479) waited more than 3 years without a response (despite having understood and explained the problem better, as well as having served up the solution on a silver platter). And let's not talk about @svyatogor (#424) who has been waiting for more than 4 years for someone to pay attention to him. But these are the facts:
Now the simple question is @EmielBruijntjes should I continue to produce PR like #553 or have you decided to do it yourself? |
|
Let me be clear - I understood you. I didn't like your approach. If there's an issue spotted, does Emiel revert all your changes? We're back to square one. Effectively, nothing has then been updated. I have an interest in this project. If it goes wrong, it costs me. As I said, there is some good stuff in there like the heartbeat(ing). Can your changes be used? Sure. Let's layer changes. I have other changes waiting in the wings. It's always been difficult to get changes merged into this repo. |
|
I don't understand what the problem is with the revert. I didn't like your approach either. 1-1. |
|
Hi Paolo, I'm not here to score points nor to seek public appreciation or approval. I was just sharing my concerns. Go well. |
slow down the heartbeat emission if the client has already sent data within the negotiated timeout, a similar improvement was suggested for LivEvHandler in CopernicaMarketingSoftware#550
|
With the #558, I've finished factorizing this work, at least up to this point. The code is basically the same (except for one condition I improved), but it looks significantly better. I'm also adding the current version here for anyone who wants to try it. Important In the meantime, I've done all the testing I could, and this has led me to the following conclusions, some of which I've partially expressed in a note I left at the bottom of the header, but I have to update that note to talk about the elephant in the china shop. But let's save that for last.
I have often encountered processes that would not terminate because zombie watchers remained in the reactor queue but were attached to an unresponsive filedescriptor. And judging by the issues from several users, it is a frequent case, which pushes many (wrongly) towards a multi-threaded solution (the elephant.) This can happen under different conditions and ensuring that the reactor (boost For example, in one of the tests I use a publisher that does not implement the I have some doubts about what the interface of the single method to be added could be, considering that the user of the protected:
/**
* Stop handling connection(s).
*
* Note that you cannot continue using the connection(s) after calling this method.
* @param pconn The TcpConnection* to release (all by default).
* @return bool If there was a release.
*/
bool release(const TcpConnection *pconn = nullptr)
{
if (pconn == nullptr)
{
// close all watchers
for (auto &wp : _watchers) wp.second->close();
auto b = _watchers.begin();
return b != _watchers.erase(b, _watchers.end());
}
else
{
int fd = pconn->fileno();
// is the connection already closed?
if (fd != -1) {
// close matching watcher
auto iter = _watchers.find(fd);
if (iter != _watchers.end())
{
iter->second->close();
_watchers.erase(iter);
return true;
}
}
return false;
}
}Which obviously should also be called during the destruction phase, always with the aim of preventing unwanted deadlocks. We finally got to the elephant. about libboostasio and thread safetyAMQP-CPP is not thread safe, we cannot repeat this too many times. But I think we should definitely give credit to Gavin, the original author of this handler, for the intuition that with the help of the boost reactor it could get pretty close. The reactor ensures that queued callbacks are executed exclusively by threads that are running the context ( So if all the threads involved are running the context then all the calls to AMQP-CPP originate from callbacks (at least as far as this handler is concerned) and therefore there are no problems This is why among the proposed improvements for the future I said that the handler should expose the |
Why we need to eliminate managed ptrsIn #562 (ddc1662) I removed all the managed ptrs that pointed to the strand from I just started throwing away shared ptr... |
|
I'm done with #566, I promise. Also because I fixed everything that was wrong except for one thing that, however, there's nothing that can be done. The handler and its children now finally have the minimal architecture that allows them to function efficiently. Globally we have gone from this classDiagram
LibBoostAsioHandler : SharedPtr~Strand~ _strand
LibBoostAsioHandler : Map~int SharedPtr~Watcher~~ _watchers
LibBoostAsioHandler *-- Watcher : _watchers
class enable_shared_from_this~Watcher~
Watcher --|> enable_shared_from_this~Watcher~
Watcher : WeakPtr~Strand~ _wpstrand
Watcher o-- LibBoostAsioHandler : _wpstrand
to this classDiagram
LibBoostAsioHandler : Strand _strand
LibBoostAsioHandler : Map~int UniquePtr~Watcher~~ _watchers
LibBoostAsioHandler *-- Watcher : _watchers
Watcher : Strand & _strand
Watcher --> LibBoostAsioHandler : _strand
It may seem like a small change but it has important effects on how it works. Important |
The reasons that pushed me to review the implementation of libboostasio are the following:
boost.asiocode, e.g.deadline_timer,null_buffersstrandis nowboost/asio/io_context_strand.hppLibBoostAsioHandlerandLibBoostAsioHandler::Watcher, an unnecessary complicationlibev, in fact the provided implementation solves issue LibBoostAsioHandler does not detect connection timeouts when heartbeat is negotiated #464 by copying everything fromLibEvHandlerboost::bindandboost::functionwith their standard library equivalentsMost of the work done is related to the callback code (i.e.
asiocompletion handler):make_handlermethodWatcherinstance) and the strand (io_context::strandinstance) equallyboost::system::errc::operation_canceledif the strand has been destroyed as this serves no purposeboost::asio::error::would_block, see boost::asio::posix::stream_descriptor documentationNo changes have been made to the public interface of
LibBoostAsioHandlerexcept for the parameter (uint16_t connection_timeout = 60) added to the constructor (which has a default value anyway.)I just added a couple of
asserts to handle dubious cases in themonitorandonNegotiatemethods, I think the comments in the relevant lines are pretty explanatory. Perhaps the assert inmonitorcan be useful to prematurely intercept many of the problems related to an unsupported concurrency scheme or other weird issues.Need more testing, of course...