Skip to content

Conversation

@lia-viam
Copy link
Collaborator

There was a sigabrt in recent module integration tests that revealed some UB where we sometimes accidentally double-free the memory managed by a std::string. Zero clue as to why logging brought this up but it is indeed something we'd want to fix anyway.

Contains some drive-by changes as well

@lia-viam lia-viam requested a review from a team as a code owner April 22, 2025 19:50
@lia-viam lia-viam requested review from njooma and stuqdog and removed request for a team April 22, 2025 19:50
Comment on lines 48 to 59
RustDialData(const char* path_, void* runtime);

RustDialData(const RustDialData&) = delete;
RustDialData(RustDialData&&) noexcept;

RustDialData& operator=(const RustDialData&) = delete;
RustDialData& operator=(RustDialData&&) noexcept;

~RustDialData();

const char* path;
void* rust_runtime;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q) is there some C++ reason I don't understand why we aren't using the pimpl idiom here? It seems like this is all private implementation details that we don't really need to tell users about at all, since they have no way of accessing the dial data anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel like an implementation detail that could be hidden. As written it will affect the ABI since is held "by size" inside the optional and therefore inside ViamChannel, so if it grew new fields it would be an ABI break.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope this does seem like a good candidate for pimpl and i considered doing it that way, only didn't do it bc it wasn't that way already but this seems like a good candidate for a drive-by

Copy link
Member

@stuqdog stuqdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor things and we'll want to figure out why tests are failing, otherwise lgtm!

Comment on lines 24 to 33
ViamChannel::RustDialData::RustDialData(RustDialData&& other) noexcept
: path(std::exchange(other.path, nullptr)),
rust_runtime(std::exchange(other.rust_runtime, nullptr)) {}

ViamChannel::RustDialData& ViamChannel::RustDialData::operator=(RustDialData&& other) noexcept {
path = std::exchange(other.path, nullptr);
rust_runtime = std::exchange(other.rust_runtime, nullptr);

return *this;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q) when/how do these get used? It's not clear to me that we do any assignment of a RustDialData object except when creating a new one in the ViamChannel constructor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I had mentioned to you wanting to change instances of shared_ptr<ViamChannel> to ViamChannel and for that we would want correctly operating special member functions, otherwise we could recreate anew the issue with double deletes or frees. even without that though it's generally a good idea to program defensively with resource-managing classes, see eg https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c21-if-you-define-or-delete-any-copy-move-or-destructor-function-define-or-delete-them-all

void close();

private:
struct RustDialData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor gotcha, I think we've typically used snake_case for structs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐍

for (const std::shared_ptr<std::thread>& t : threads_) {
t->~thread();
}
threads_.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah. This was also a double destroy for all these std::thread objects, right? I'm sort of amazed this went unnoticed for so long. Are we doing ASAN / UBSAN builds anywhere in CI? If not, I think it might be time to do so.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our unit tests in CI run with santized builds for shared libraries, but our sanitizer seems to only be ubsan, not asan. i'm pretty surprised this hasn't been caught before either, and same with the other main issue being resolved in this PR which is that we had an instance of rust-utils free_string being called on a c_str managed by a std::string. perhaps a blessing in disguise that logging directed us to these!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASAN and UBSAN work together just fine so it might be pretty easy to light it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc turning on ASAN causes compilation to fail because it modifies the underlying memory shape but gRPC is not being compiled with ASAN, leading to ABI errors (I might have the details here somewhat wrong). Definitely fixable but it means compiling gRPC by hand in tests, which I believe is why we haven't done it thus far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. That's a nuisance, and one I'd forgotten about since most of my time with ASAN was in a world where all third party C++ dependencies were vendored (in large part, exactly to solve this). I wonder if conan can be leveraged to solve this for us since it will build dependencies from source.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing I encountered this too yesterday when trying to debug the original issue in this PR. As for conan, here is a discussion from 1.x but the answer is 'kinda sorta' conan-io/conan#4754

// UB sanitizer
t->detach();
robot->threads_.push_back(t);
t.detach();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is why the double delete didn't really do anything harmful: the thread was detached. But, how do they terminate then?

I think if you call ~RobotClient you could have real problems: close doesn't wait for the threads, and then the should_refresh variable will could get destroyed out from under.

I have a real distrust of detached threads for reasons exactly like this. Perhaps this review is an opportunity to do the necessary refactoring to bound the lifetime of the refresh threads so they are a subset of the lifetime of the RobotClient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose our alternative would be having all the threads join in close? This is a little dicey for throwing-in-dtor reasons, but I think in practice it may be fairly common

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think joining the threads in close is the right thing to do.

auto viam_channel = std::make_shared<ViamChannel>(channel, address.c_str(), nullptr);
std::shared_ptr<RobotClient> robot = RobotClient::with_channel(viam_channel, options);
sdk::impl::create_viam_channel(addr, grpc::InsecureChannelCredentials());
std::shared_ptr<RobotClient> robot =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto robot =

Comment on lines 48 to 59
RustDialData(const char* path_, void* runtime);

RustDialData(const RustDialData&) = delete;
RustDialData(RustDialData&&) noexcept;

RustDialData& operator=(const RustDialData&) = delete;
RustDialData& operator=(RustDialData&&) noexcept;

~RustDialData();

const char* path;
void* rust_runtime;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel like an implementation detail that could be hidden. As written it will affect the ABI since is held "by size" inside the optional and therefore inside ViamChannel, so if it grew new fields it would be an ABI break.

};
}

const std::shared_ptr<grpc::Channel>& ViamChannel::channel() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this just unimplemented before? If so, maybe that means nobody was using it, and therefore it could be renamed to something like grpc::Channel? Or, simply deleted if we don't want gRPC types in our ABI?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry weird diff, it was implemented before but I rearranged the method definition order in the .cpp to conform with the sequencing in the .hpp. I will do a check to see if it's used anywhere but I think yes because I probably touched this when I was defining grpc aliases and doing abi insulation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes this was used, but I have taken this opportunity to remove the data member shared_ptr<grpc::Channel> in RobotClient which was redundant and replace all its uses with this method


private:
struct RustDialData {
RustDialData(const char* path_, void* runtime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the trailing _ on path_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could just as well be a_path or similar, but I think ctor arg with the same name as member may cause a variable shadow warning

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly sure it doesn't actually

@lia-viam lia-viam requested review from acmorrow and stuqdog April 23, 2025 16:40
Copy link
Member

@acmorrow acmorrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with two optional comments.

static std::shared_ptr<RobotClient> with_channel(ViamChannel channel, const Options& options);

RobotClient(std::shared_ptr<ViamChannel> channel);
RobotClient(ViamChannel channel);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this actually be public? When would it be the right thing to use over with_channel or one of the other static factory methods above? If it should stay public, it should probably be explicit, and also move up to line 66 or so, before ~RobotClient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably be private, good callout!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally agreed but this ends up breaking make_shared bc it needs a public ctor ... ugh...going to just make it explicit and public for now, although in the future it may be that we can un-share RobotClient as we did with ViamChannel, tbd

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right. That always gets me. You could always do the thing with a private dummy struct argument if you wanted to lock it down so it was still public but not callable from outside the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now we can probably merge it as is if the builds pass so as to unblock this week's release, and I will futz around in another branch to see if we can clean this up further

void ViamChannel::close() {
if (closed_) {
return;
impl(const impl&) = delete;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because you are declaring the move ops that the copy ops are implicitly deleted, but if you don't trust that and want to make it explicit I don't mind either.

Copy link
Member

@stuqdog stuqdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@stuqdog stuqdog merged commit faa73b6 into viamrobotics:main Apr 23, 2025
4 checks passed
@lia-viam lia-viam deleted the abort-fix branch April 23, 2025 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants