-
Notifications
You must be signed in to change notification settings - Fork 27
Fix sigabrt in module integration tests #421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/viam/sdk/rpc/dial.hpp
Outdated
| RustDialData(const char* path_, void* runtime); | ||
|
|
||
| RustDialData(const RustDialData&) = delete; | ||
| RustDialData(RustDialData&&) noexcept; | ||
|
|
||
| RustDialData& operator=(const RustDialData&) = delete; | ||
| RustDialData& operator=(RustDialData&&) noexcept; | ||
|
|
||
| ~RustDialData(); | ||
|
|
||
| const char* path; | ||
| void* rust_runtime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(q) is there some C++ reason I don't understand why we aren't using the pimpl idiom here? It seems like this is all private implementation details that we don't really need to tell users about at all, since they have no way of accessing the dial data anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does feel like an implementation detail that could be hidden. As written it will affect the ABI since is held "by size" inside the optional and therefore inside ViamChannel, so if it grew new fields it would be an ABI break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope this does seem like a good candidate for pimpl and i considered doing it that way, only didn't do it bc it wasn't that way already but this seems like a good candidate for a drive-by
stuqdog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor things and we'll want to figure out why tests are failing, otherwise lgtm!
src/viam/sdk/rpc/dial.cpp
Outdated
| ViamChannel::RustDialData::RustDialData(RustDialData&& other) noexcept | ||
| : path(std::exchange(other.path, nullptr)), | ||
| rust_runtime(std::exchange(other.rust_runtime, nullptr)) {} | ||
|
|
||
| ViamChannel::RustDialData& ViamChannel::RustDialData::operator=(RustDialData&& other) noexcept { | ||
| path = std::exchange(other.path, nullptr); | ||
| rust_runtime = std::exchange(other.rust_runtime, nullptr); | ||
|
|
||
| return *this; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(q) when/how do these get used? It's not clear to me that we do any assignment of a RustDialData object except when creating a new one in the ViamChannel constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I had mentioned to you wanting to change instances of shared_ptr<ViamChannel> to ViamChannel and for that we would want correctly operating special member functions, otherwise we could recreate anew the issue with double deletes or frees. even without that though it's generally a good idea to program defensively with resource-managing classes, see eg https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c21-if-you-define-or-delete-any-copy-move-or-destructor-function-define-or-delete-them-all
src/viam/sdk/rpc/dial.hpp
Outdated
| void close(); | ||
|
|
||
| private: | ||
| struct RustDialData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor gotcha, I think we've typically used snake_case for structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐍
| for (const std::shared_ptr<std::thread>& t : threads_) { | ||
| t->~thread(); | ||
| } | ||
| threads_.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah. This was also a double destroy for all these std::thread objects, right? I'm sort of amazed this went unnoticed for so long. Are we doing ASAN / UBSAN builds anywhere in CI? If not, I think it might be time to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our unit tests in CI run with santized builds for shared libraries, but our sanitizer seems to only be ubsan, not asan. i'm pretty surprised this hasn't been caught before either, and same with the other main issue being resolved in this PR which is that we had an instance of rust-utils free_string being called on a c_str managed by a std::string. perhaps a blessing in disguise that logging directed us to these!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASAN and UBSAN work together just fine so it might be pretty easy to light it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc turning on ASAN causes compilation to fail because it modifies the underlying memory shape but gRPC is not being compiled with ASAN, leading to ABI errors (I might have the details here somewhat wrong). Definitely fixable but it means compiling gRPC by hand in tests, which I believe is why we haven't done it thus far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. That's a nuisance, and one I'd forgotten about since most of my time with ASAN was in a world where all third party C++ dependencies were vendored (in large part, exactly to solve this). I wonder if conan can be leveraged to solve this for us since it will build dependencies from source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realizing I encountered this too yesterday when trying to debug the original issue in this PR. As for conan, here is a discussion from 1.x but the answer is 'kinda sorta' conan-io/conan#4754
src/viam/sdk/robot/client.cpp
Outdated
| // UB sanitizer | ||
| t->detach(); | ||
| robot->threads_.push_back(t); | ||
| t.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is why the double delete didn't really do anything harmful: the thread was detached. But, how do they terminate then?
I think if you call ~RobotClient you could have real problems: close doesn't wait for the threads, and then the should_refresh variable will could get destroyed out from under.
I have a real distrust of detached threads for reasons exactly like this. Perhaps this review is an opportunity to do the necessary refactoring to bound the lifetime of the refresh threads so they are a subset of the lifetime of the RobotClient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose our alternative would be having all the threads join in close? This is a little dicey for throwing-in-dtor reasons, but I think in practice it may be fairly common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think joining the threads in close is the right thing to do.
src/viam/sdk/robot/client.cpp
Outdated
| auto viam_channel = std::make_shared<ViamChannel>(channel, address.c_str(), nullptr); | ||
| std::shared_ptr<RobotClient> robot = RobotClient::with_channel(viam_channel, options); | ||
| sdk::impl::create_viam_channel(addr, grpc::InsecureChannelCredentials()); | ||
| std::shared_ptr<RobotClient> robot = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto robot =
src/viam/sdk/rpc/dial.hpp
Outdated
| RustDialData(const char* path_, void* runtime); | ||
|
|
||
| RustDialData(const RustDialData&) = delete; | ||
| RustDialData(RustDialData&&) noexcept; | ||
|
|
||
| RustDialData& operator=(const RustDialData&) = delete; | ||
| RustDialData& operator=(RustDialData&&) noexcept; | ||
|
|
||
| ~RustDialData(); | ||
|
|
||
| const char* path; | ||
| void* rust_runtime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does feel like an implementation detail that could be hidden. As written it will affect the ABI since is held "by size" inside the optional and therefore inside ViamChannel, so if it grew new fields it would be an ABI break.
| }; | ||
| } | ||
|
|
||
| const std::shared_ptr<grpc::Channel>& ViamChannel::channel() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this just unimplemented before? If so, maybe that means nobody was using it, and therefore it could be renamed to something like grpc::Channel? Or, simply deleted if we don't want gRPC types in our ABI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry weird diff, it was implemented before but I rearranged the method definition order in the .cpp to conform with the sequencing in the .hpp. I will do a check to see if it's used anywhere but I think yes because I probably touched this when I was defining grpc aliases and doing abi insulation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yes this was used, but I have taken this opportunity to remove the data member shared_ptr<grpc::Channel> in RobotClient which was redundant and replace all its uses with this method
src/viam/sdk/rpc/dial.hpp
Outdated
|
|
||
| private: | ||
| struct RustDialData { | ||
| RustDialData(const char* path_, void* runtime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the trailing _ on path_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could just as well be a_path or similar, but I think ctor arg with the same name as member may cause a variable shadow warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly sure it doesn't actually
acmorrow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with two optional comments.
src/viam/sdk/robot/client.hpp
Outdated
| static std::shared_ptr<RobotClient> with_channel(ViamChannel channel, const Options& options); | ||
|
|
||
| RobotClient(std::shared_ptr<ViamChannel> channel); | ||
| RobotClient(ViamChannel channel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this actually be public? When would it be the right thing to use over with_channel or one of the other static factory methods above? If it should stay public, it should probably be explicit, and also move up to line 66 or so, before ~RobotClient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be private, good callout!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally agreed but this ends up breaking make_shared bc it needs a public ctor ... ugh...going to just make it explicit and public for now, although in the future it may be that we can un-share RobotClient as we did with ViamChannel, tbd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right. That always gets me. You could always do the thing with a private dummy struct argument if you wanted to lock it down so it was still public but not callable from outside the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now we can probably merge it as is if the builds pass so as to unblock this week's release, and I will futz around in another branch to see if we can clean this up further
| void ViamChannel::close() { | ||
| if (closed_) { | ||
| return; | ||
| impl(const impl&) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because you are declaring the move ops that the copy ops are implicitly deleted, but if you don't trust that and want to make it explicit I don't mind either.
stuqdog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
There was a sigabrt in recent module integration tests that revealed some UB where we sometimes accidentally double-free the memory managed by a
std::string. Zero clue as to why logging brought this up but it is indeed something we'd want to fix anyway.Contains some drive-by changes as well