-
Notifications
You must be signed in to change notification settings - Fork 27
RSDK-10842 - shut down cpp modules properly when viam-server is hard killed #448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-10842 - shut down cpp modules properly when viam-server is hard killed #448
Conversation
| } | ||
| if (!connected) { | ||
| // NOLINTNEXTLINE | ||
| close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lia-viam here's somewhere that your C++ expertise would be greatly appreciated! close seemed better than just exiting to ensure graceful shutdown (though we do ultimately want to exit if we've reached this point), but I'm not totally confident that calling close here, from within this thread, is safe behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue calling close from here is that joinable will return false because it checks thread.get_id() == this_thread::id(). we should probably be storing false in the should_check_connection_ thing if we establish that we've been disconnected, so that even if we're not calling join we can be sure that the thread will have exited
another thought would be to have a flag variable that says if we've been disconnected, and have impl::client_helper check that the same way it always checks if impl is non-null. That might exit with an exception throw, but I'm not sure if we're too pressed about "graceful" shutdown in the highly ungraceful case of the process being orphaned by a kill -9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also maybe cc @acmorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably be storing false in the should_check_connection_ thing if we establish that we've been disconnected, so that even if we're not calling join we can be sure that the thread will have exited
I'm not sure I entirely understand this, but I'm a little concerned about it. We'd need to at some point reset the value to true (or else we stop checking connection after the first failure), but I could imagine a case where someone triggers a close (which sets should_check_connection_ to false) and then immediately after, this loop resets it to true and so we just continue this loop forever and fail to shutdown properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, the python SDK at this point just calls sys.exit(), so if we're not too concerned about gracefulness (and this is a case where the module is an orphan process anyway and nothing it logs or does will be trivially inspectable), we could just exit(0) and call it a day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per offline discussion--this is actually fine bc when the check_connection thread calls close it will not join itself, but the later destructor call will. for transparency it may be good to add break; after close(), but also close() sets the while loop flag variable to false so the function will return immediately after close() returns
| should_check_connection_.store(false); | ||
| } | ||
| bool connected(true); | ||
| while (should_check_connection_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lia-viam another ignorant C++ question, should_check_connection_ is set to false when we want to shutdown (makes sense!) but also above when check_every and reconnect_every are set to zero. In such a case this thread will return pretty much immediately, but the client code in such a case could be running for an indefinite amount of time. Are there any safety concerns here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless i'm missing something that should be fine--the thread would return and we would just have the same behavior as pre-this PR
src/viam/sdk/rpc/dial.hpp
Outdated
| /// @brief How often to verify connectivity to the robot, in seconds. If set to 0, will not | ||
| /// check, will default to the `reconnect_every_interval_` value. Defaults to 0. | ||
| /// @note Setting to a non-zero value is useful in modules but may result in delays shutting | ||
| /// down client code | ||
| unsigned int check_every_interval_ = 0; | ||
|
|
||
| /// @brief How often to attempt to reconnect to the robot when disconnected. If set to 0, | ||
| /// will not attempt to reconnect. Defaults to 0. | ||
| /// @note Setting to a non-zero value is useful in modules but may result in delays shutting | ||
| /// down client code | ||
| unsigned int reconnect_every_interval_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted to set these both to 0 by default. It leads to no meaningful behavioral change for existing clients, and sidesteps the issue of having to wait for the check_connection loop to finish before shutting down client code. Also if a client is having issues connecting, they probably can figure out what they want to do and stop the process as necessary by interacting directly, without us making any assumptions.
We could put these into constructors, but I chose not to because I didn't want any breaking changes or constructor bloat (what if a user only wants to set one?), and at any rate maintaining the existing constructor to avoid breaking existing code means that we still have to be opinionated about a default value.
the default is overridden for modules in modules/service.cpp, to ensure that modules don't get orphaned.
src/viam/sdk/rpc/dial.hpp
Outdated
| /// check, will default to the `reconnect_every_interval_` value. Defaults to 0. | ||
| /// @note Setting to a non-zero value is useful in modules but may result in delays shutting | ||
| /// down client code | ||
| unsigned int check_every_interval_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are new let's change them to std::chrono::seconds which is what I had done to refresh_interval_ on the dial direct branch
src/viam/sdk/rpc/dial.cpp
Outdated
| const char* ViamChannel::get_channel_addr() const { | ||
| return uri_; | ||
| } | ||
| Options& Options::set_check_every_interval(unsigned int interval) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with the other comment let's have these be std::chrono::seconds
| should_check_connection_.store(false); | ||
| } | ||
| bool connected(true); | ||
| while (should_check_connection_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless i'm missing something that should be fine--the thread would return and we would just have the same behavior as pre-this PR
src/viam/sdk/robot/client.cpp
Outdated
| } | ||
| bool connected(true); | ||
| while (should_check_connection_) { | ||
| std::exception connection_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll want to do this with std::exception_ptr otherwise the polymorphic exception object could get sliced when it's copied into std::exception
src/viam/sdk/robot/client.cpp
Outdated
| auto impl = | ||
| std::make_unique<RobotClient::impl>(RobotService::NewStub(channel.channel())); | ||
| impl_.reset(); | ||
| impl_.swap(impl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can all just be impl_ = std::make_unique<...>(...)
| } | ||
| if (!connected) { | ||
| // NOLINTNEXTLINE | ||
| close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue calling close from here is that joinable will return false because it checks thread.get_id() == this_thread::id(). we should probably be storing false in the should_check_connection_ thing if we establish that we've been disconnected, so that even if we're not calling join we can be sure that the thread will have exited
another thought would be to have a flag variable that says if we've been disconnected, and have impl::client_helper check that the same way it always checks if impl is non-null. That might exit with an exception throw, but I'm not sure if we're too pressed about "graceful" shutdown in the highly ungraceful case of the process being orphaned by a kill -9
| } | ||
| if (!connected) { | ||
| // NOLINTNEXTLINE | ||
| close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also maybe cc @acmorrow
🔗 Link your GitHub account to AtlassianTo enable Code Reviewer, please link your GitHub account to your Atlassian account. Click here to connect your accounts This is a one-time setup that takes less than a minute. |
No description provided.