Skip to content
18 changes: 7 additions & 11 deletions src/viam/sdk/robot/client.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -129,9 +129,7 @@ RobotClient::~RobotClient() {

void RobotClient::close() {
should_refresh_.store(false);
for (const std::shared_ptr<std::thread>& t : threads_) {
t->~thread();
}
threads_.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah. This was also a double destroy for all these std::thread objects, right? I'm sort of amazed this went unnoticed for so long. Are we doing ASAN / UBSAN builds anywhere in CI? If not, I think it might be time to do so.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our unit tests in CI run with santized builds for shared libraries, but our sanitizer seems to only be ubsan, not asan. i'm pretty surprised this hasn't been caught before either, and same with the other main issue being resolved in this PR which is that we had an instance of rust-utils free_string being called on a c_str managed by a std::string. perhaps a blessing in disguise that logging directed us to these!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASAN and UBSAN work together just fine so it might be pretty easy to light it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc turning on ASAN causes compilation to fail because it modifies the underlying memory shape but gRPC is not being compiled with ASAN, leading to ABI errors (I might have the details here somewhat wrong). Definitely fixable but it means compiling gRPC by hand in tests, which I believe is why we haven't done it thus far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. That's a nuisance, and one I'd forgotten about since most of my time with ASAN was in a world where all third party C++ dependencies were vendored (in large part, exactly to solve this). I wonder if conan can be leveraged to solve this for us since it will build dependencies from source.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing I encountered this too yesterday when trying to debug the original issue in this PR. As for conan, here is a discussion from 1.x but the answer is 'kinda sorta' conan-io/conan#4754

stop_all();
viam_channel_->close();
}
Expand Down Expand Up @@ -292,13 +290,12 @@ std::shared_ptr<RobotClient> RobotClient::with_channel(std::shared_ptr<ViamChann
robot->refresh_interval_ = options.refresh_interval();
robot->should_refresh_ = (robot->refresh_interval_ > 0);
if (robot->should_refresh_) {
const std::shared_ptr<std::thread> t =
std::make_shared<std::thread>(&RobotClient::refresh_every, robot);
auto t = std::thread(&RobotClient::refresh_every, robot);
// TODO(RSDK-1743): this was leaking, confirm that adding thread catching in
// close/destructor lets us shutdown gracefully. See also address sanitizer,
// UB sanitizer
t->detach();
robot->threads_.push_back(t);
t.detach();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is why the double delete didn't really do anything harmful: the thread was detached. But, how do they terminate then?

I think if you call ~RobotClient you could have real problems: close doesn't wait for the threads, and then the should_refresh variable will could get destroyed out from under.

I have a real distrust of detached threads for reasons exactly like this. Perhaps this review is an opportunity to do the necessary refactoring to bound the lifetime of the refresh threads so they are a subset of the lifetime of the RobotClient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose our alternative would be having all the threads join in close? This is a little dicey for throwing-in-dtor reasons, but I think in practice it may be fairly common

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think joining the threads in close is the right thing to do.

robot->threads_.push_back(std::move(t));
};

robot->refresh();
Expand All @@ -318,11 +315,10 @@ std::shared_ptr<RobotClient> RobotClient::at_address(const std::string& address,
std::shared_ptr<RobotClient> RobotClient::at_local_socket(const std::string& address,
const Options& options) {
const std::string addr = "unix://" + address;
const char* uri = addr.c_str();
const std::shared_ptr<grpc::Channel> channel =
sdk::impl::create_viam_channel(uri, grpc::InsecureChannelCredentials());
auto viam_channel = std::make_shared<ViamChannel>(channel, address.c_str(), nullptr);
std::shared_ptr<RobotClient> robot = RobotClient::with_channel(viam_channel, options);
sdk::impl::create_viam_channel(addr, grpc::InsecureChannelCredentials());
std::shared_ptr<RobotClient> robot =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto robot =

RobotClient::with_channel(std::make_shared<ViamChannel>(channel), options);
robot->should_close_channel_ = true;

return robot;
Expand Down
2 changes: 1 addition & 1 deletion src/viam/sdk/robot/client.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,7 @@ class RobotClient {

void refresh_every();

std::vector<std::shared_ptr<std::thread>> threads_;
std::vector<std::thread> threads_;

std::atomic<bool> should_refresh_;
unsigned int refresh_interval_;
Expand Down
48 changes: 34 additions & 14 deletions src/viam/sdk/rpc/dial.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -18,18 +18,33 @@
namespace viam {
namespace sdk {

const std::shared_ptr<grpc::Channel>& ViamChannel::channel() const {
return channel_;
ViamChannel::RustDialData::RustDialData(const char* path_, void* runtime)
: path(path_), rust_runtime(runtime) {}

ViamChannel::RustDialData::RustDialData(RustDialData&& other) noexcept
: path(std::exchange(other.path, nullptr)),
rust_runtime(std::exchange(other.rust_runtime, nullptr)) {}

ViamChannel::RustDialData& ViamChannel::RustDialData::operator=(RustDialData&& other) noexcept {
path = std::exchange(other.path, nullptr);
rust_runtime = std::exchange(other.rust_runtime, nullptr);

return *this;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q) when/how do these get used? It's not clear to me that we do any assignment of a RustDialData object except when creating a new one in the ViamChannel constructor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I had mentioned to you wanting to change instances of shared_ptr<ViamChannel> to ViamChannel and for that we would want correctly operating special member functions, otherwise we could recreate anew the issue with double deletes or frees. even without that though it's generally a good idea to program defensively with resource-managing classes, see eg https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c21-if-you-define-or-delete-any-copy-move-or-destructor-function-define-or-delete-them-all


void ViamChannel::close() {
if (closed_) {
return;
}
closed_ = true;
free_string(path_);
free_rust_runtime(rust_runtime_);
};
ViamChannel::RustDialData::~RustDialData() {
free_string(path);
free_rust_runtime(rust_runtime);
}

ViamChannel::ViamChannel(std::shared_ptr<grpc::Channel> channel, const char* path, void* runtime)
: channel_(std::move(channel)), rust_data_(RustDialData(path, runtime)) {}

ViamChannel::ViamChannel(std::shared_ptr<grpc::Channel> channel) : channel_(std::move(channel)) {}

ViamChannel::~ViamChannel() {
close();
}

const std::string& Credentials::type() const {
return type_;
Expand All @@ -39,9 +54,6 @@ const std::string& Credentials::payload() const {
return payload_;
}

ViamChannel::ViamChannel(std::shared_ptr<grpc::Channel> channel, const char* path, void* runtime)
: channel_(std::move(channel)), path_(path), closed_(false), rust_runtime_(runtime) {}

DialOptions::DialOptions() = default;

DialOptions& DialOptions::set_credentials(boost::optional<Credentials> creds) {
Expand Down Expand Up @@ -162,7 +174,15 @@ std::shared_ptr<ViamChannel> ViamChannel::dial(const char* uri,
const std::unique_ptr<viam::robot::v1::RobotService::Stub> st =
viam::robot::v1::RobotService::NewStub(channel);
return std::make_shared<ViamChannel>(channel, socket_path, ptr);
};
}

const std::shared_ptr<grpc::Channel>& ViamChannel::channel() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this just unimplemented before? If so, maybe that means nobody was using it, and therefore it could be renamed to something like grpc::Channel? Or, simply deleted if we don't want gRPC types in our ABI?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry weird diff, it was implemented before but I rearranged the method definition order in the .cpp to conform with the sequencing in the .hpp. I will do a check to see if it's used anywhere but I think yes because I probably touched this when I was defining grpc aliases and doing abi insulation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes this was used, but I have taken this opportunity to remove the data member shared_ptr<grpc::Channel> in RobotClient which was redundant and replace all its uses with this method

return channel_;
}

void ViamChannel::close() {
rust_data_.reset();
}

unsigned int Options::refresh_interval() const {
return refresh_interval_;
Expand Down
27 changes: 23 additions & 4 deletions src/viam/sdk/rpc/dial.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,12 @@ namespace sdk {
class DialOptions;
class ViamChannel {
public:
void close();
explicit ViamChannel(std::shared_ptr<GrpcChannel> channel);

ViamChannel(std::shared_ptr<GrpcChannel> channel, const char* path, void* runtime);

~ViamChannel();

/// @brief Connects to a robot at the given URI address, using the provided dial options (or
/// default options is none are provided). Ignores initial connection options specifying
/// how many times to attempt to connect and with what timeout.
Expand All @@ -38,11 +41,27 @@ class ViamChannel {

const std::shared_ptr<GrpcChannel>& channel() const;

void close();

private:
struct RustDialData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor gotcha, I think we've typically used snake_case for structs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐍

RustDialData(const char* path_, void* runtime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the trailing _ on path_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could just as well be a_path or similar, but I think ctor arg with the same name as member may cause a variable shadow warning

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly sure it doesn't actually


RustDialData(const RustDialData&) = delete;
RustDialData(RustDialData&&) noexcept;

RustDialData& operator=(const RustDialData&) = delete;
RustDialData& operator=(RustDialData&&) noexcept;

~RustDialData();

const char* path;
void* rust_runtime;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q) is there some C++ reason I don't understand why we aren't using the pimpl idiom here? It seems like this is all private implementation details that we don't really need to tell users about at all, since they have no way of accessing the dial data anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel like an implementation detail that could be hidden. As written it will affect the ABI since is held "by size" inside the optional and therefore inside ViamChannel, so if it grew new fields it would be an ABI break.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope this does seem like a good candidate for pimpl and i considered doing it that way, only didn't do it bc it wasn't that way already but this seems like a good candidate for a drive-by

};

std::shared_ptr<GrpcChannel> channel_;
const char* path_;
bool closed_;
void* rust_runtime_;

boost::optional<RustDialData> rust_data_;
};

class Credentials {
Expand Down
Loading