Conversation
Add shared `CoordinatorState` struct for keeping state. Use `DashMap` for concurrent maps. Not all messages use the proper tarpc request handling yet. This will be fixed in a follow-up commit. Same goes for the manual (de)serialization.
tarpc service and server to coordinator
No longer needed, replaced by fine-grained result types
@claude Can you look into the failure of the CI/Examples test? It looks like the multiple daemons example is not finishing properly. It works locally for me though.
Claude finished @phil-opp's task in 5m 14s — View job

CI/Examples Test Failure - Fixed ✅

Summary: The multiple daemons example was failing in CI due to a panic at the FFI boundary when using …

Root Cause

The …
The Problem Code

```rust
// Creates current_thread runtime that leaves thread-local state
let rt = tokio::runtime::Builder::new_current_thread()
    .enable_all()
    .build()?;
rt.block_on(download_file(source, target_path))
// rt is dropped, but thread-local state remains!

// Later, this panics because Tokio detects runtime context
self.events_tx.blocking_send(event) // <-- PANICS
```

The Fix

Changed all …

Changes made: …
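As a minimal sketch of one common way to avoid this class of panic (not necessarily the change that was committed), the blocking send can be guarded by a runtime-context check, falling back to an async send when a runtime is already active on the thread:

```rust
use tokio::runtime::Handle;
use tokio::sync::mpsc::Sender;

/// Send an event without panicking, whether or not the current thread is
/// inside a Tokio runtime context (sketch only).
fn send_event<T: Send + 'static>(events_tx: &Sender<T>, event: T) {
    match Handle::try_current() {
        // No runtime on this thread: blocking_send is safe.
        Err(_) => {
            let _ = events_tx.blocking_send(event);
        }
        // A runtime context is active: blocking_send would panic, so hand
        // the send off to the runtime instead.
        Ok(handle) => {
            let tx = events_tx.clone();
            handle.spawn(async move {
                let _ = tx.send(event).await;
            });
        }
    }
}
```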
Commit

Pushed fix in commit ---

Branch: `tarpc`
Tokio channels panic on blocking operations when they believe that they are called from a runtime context.
Instead of the default 10-second deadline
More robust than the UDP socket trick, which returns the IP address from the CLI's point of view.
haixuanTao left a comment
Thanks for the solid refactor! The DashMap lock discipline is well done and the insert-before-spawn pattern is excellent. I tested the full CLI lifecycle (up/check/list/start/stop/logs/build/destroy) on macOS and everything works correctly — including log streaming via both dora start and dora run.
Left a few comments on things that caught my eye.
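For readers unfamiliar with the insert-before-spawn pattern mentioned above, here is a minimal illustrative sketch over a `DashMap` (the `BuildState` type and `start_build` function are assumptions, not the coordinator's actual code):

```rust
use std::sync::Arc;

use dashmap::DashMap;
use uuid::Uuid;

/// Illustrative per-build state; not the coordinator's real type.
struct BuildState;

/// Insert-before-spawn: the entry goes into the concurrent map *before* the
/// worker task is spawned, so any other handler that receives the id can
/// always find the entry.
async fn start_build(builds: Arc<DashMap<Uuid, BuildState>>) -> Uuid {
    let build_id = Uuid::new_v4();
    // 1. Insert first, while no other task knows this id yet.
    builds.insert(build_id, BuildState);
    // 2. Only then spawn the worker that operates on the entry.
    let builds = builds.clone();
    tokio::spawn(async move {
        // ... run the build ...
        builds.remove(&build_id); // clean up when done
    });
    build_id
}
```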
```rust
    addr: IpAddr,
    _control_port: u16,
) -> eyre::Result<CliControlClient> {
    let rpc_port = DORA_COORDINATOR_PORT_RPC_DEFAULT;
```
The _control_port parameter is accepted but silently ignored — the RPC port is always hardcoded to DORA_COORDINATOR_PORT_RPC_DEFAULT (6013). This means any user passing --coordinator-port will have it silently discarded for all tarpc operations.
Suggestion — either derive the RPC port from the control port:
```rust
pub(crate) async fn connect_to_coordinator_rpc(
    addr: IpAddr,
    control_port: u16,
) -> eyre::Result<CliControlClient> {
    // RPC port is control port + 1 by convention
    let rpc_port = if control_port == DORA_COORDINATOR_PORT_CONTROL_DEFAULT {
        DORA_COORDINATOR_PORT_RPC_DEFAULT
    } else {
        control_port + 1
    };
    // ...
}
```

Or add a dedicated `--coordinator-rpc-port` CLI argument to `CoordinatorOptions` so users can configure both ports independently.
```diff
 tracing::info!("dataflow build finished: `{build_id}`");
-let mut build = running_builds.remove(&build_id).unwrap();
+let (build_id, mut build) =
+    coordinator_state.running_builds.remove(&build_id).unwrap();
```
This .unwrap() can panic and crash the entire coordinator. While the preceding get_mut() succeeded, the RefMut guard was dropped when we entered this if block, so another concurrent path could theoretically remove the entry in between (the tarpc server spawns handlers concurrently).
Even if that's unlikely today, a panic here takes down all running dataflows. Suggestion:
```rust
let Some((build_id, mut build)) =
    coordinator_state.running_builds.remove(&build_id)
else {
    tracing::warn!("build {build_id} was removed while finalizing results");
    continue;
};
```

```rust
)
let client = connect_to_coordinator_rpc(self.coordinator_addr, self.coordinator_port)
    .await
    .map_err(|_| eyre!("Failed to connect to coordinator"))?;
```
.map_err(|_| ...) discards the underlying connection error, making it hard to debug why a connection failed (refused? timeout? DNS?). Other commands like stop.rs and logs.rs use .wrap_err() which preserves the cause chain.
Suggestion:
```rust
let client = connect_to_coordinator_rpc(self.coordinator_addr, self.coordinator_port)
    .await
    .wrap_err("failed to connect to dora coordinator")?;
```

```rust
future
    .await
    .context("RPC transport error")?
    .map_err(|e| eyre::eyre!(e))
```
Two issues with error reporting here:

1. Transport errors lose detail: `.context("RPC transport error")` wraps the error but the message is generic — when the coordinator is down you get "RPC transport error" without knowing if it was connection refused, timeout, or network unreachable.
2. Application errors lack operation context: `.map_err(|e| eyre::eyre!(e))` converts the coordinator's error String into an eyre Report but doesn't say which operation failed. Callers that don't add `.wrap_err()` produce unhelpful messages.

Some callers do add context (e.g. stop.rs uses `.wrap_err("...")`), but others don't (e.g. list.rs:81 — `rpc(client.get_node_info(...))`).
One option is to accept an operation label:
```rust
pub(crate) async fn rpc<T, E: std::error::Error + Send + Sync + 'static>(
    future: impl Future<Output = Result<Result<T, String>, E>>,
    operation: &str,
) -> eyre::Result<T> {
    future
        .await
        .wrap_err_with(|| format!("RPC transport error during {operation}"))?
        .map_err(|e| eyre::eyre!("{operation} failed: {e}"))
}
```

Alternatively, just ensure every call site wraps with `.wrap_err()`.
Add shared `CoordinatorState` struct for keeping state. Use `DashMap` for concurrent maps.

Not all messages use the proper tarpc request handling yet. This will be fixed in a follow-up commit. Same goes for the manual (de)serialization.
Alternative to #1288
Implementation notes:
Message Format
- `CliControl` trait in the `message` crate (`cli_to_coordinator`)
- `ControlRequestReply` enum is no longer needed, so it is removed
- `ControlRequest` variants are no longer needed as they are part of the `CliControl` trait now. I removed these variants and renamed the enum to `LegacyControlRequest` (in the `dora-message` crate)
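A rough sketch of what such a tarpc service definition can look like; the method names, signatures, and reply type below are illustrative assumptions, not the actual `CliControl` trait:

```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Illustrative reply type; the real types live in the dora-message crate.
#[derive(Debug, Serialize, Deserialize)]
pub struct DataflowListEntry {
    pub id: Uuid,
    pub name: Option<String>,
}

// The #[tarpc::service] macro expands this trait into the request/response
// plumbing, a generated `CliControlClient`, and a server-side trait that the
// coordinator implements.
#[tarpc::service]
pub trait CliControl {
    /// List the currently registered dataflows.
    async fn list_dataflows() -> Result<Vec<DataflowListEntry>, String>;
    /// Stop a running dataflow by id.
    async fn stop_dataflow(dataflow_id: Uuid) -> Result<(), String>;
}
```

Returning `Result<_, String>` from every method matches the `rpc` helper shown earlier, which turns the coordinator's error `String` into an `eyre` report on the CLI side.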
Coordinator

- `LogSubscribe` and `BuildLogSubscribe` messages, which are both streaming-based and cannot be represented well in tarpc
- `CoordinatorState` struct, which uses `DashMap` fields that provide safe concurrent access
- `impl CliControl for ControlServer` uses the same code for the different requests as the previous event loop (with minor modifications)
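A rough sketch of the shape of that struct; `running_builds` appears in the diff above, while the other field name and the key/value types are assumptions:

```rust
use dashmap::DashMap;
use uuid::Uuid;

/// Placeholder value types, for illustration only.
pub struct RunningBuild;
pub struct RunningDataflow;

/// Shared coordinator state. DashMap locks per entry, so concurrently running
/// request handlers can work on different builds/dataflows without taking a
/// single global lock.
#[derive(Default)]
pub struct CoordinatorState {
    pub running_builds: DashMap<Uuid, RunningBuild>,
    pub running_dataflows: DashMap<Uuid, RunningDataflow>,
}
```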
CLI

- command handling is `async`
- commands talk to the coordinator through the generated `CliControlClient` (a rough connection sketch follows below)
- `LegacyControlRequest` is still used for the streaming-based log subscription messages

Note: I used AI to generate some of the code, but I reviewed all generated code manually. I also improved and simplified the generated code.
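For illustration, connecting such a client over TCP with tarpc's serde transport could look roughly like the following; the helper name, address handling, and the JSON codec choice are assumptions, and `CliControlClient` refers to the client generated from the service sketch above:

```rust
use std::net::{IpAddr, SocketAddr};

use tarpc::{client, tokio_serde::formats::Json};

/// Hypothetical helper: build a client for the coordinator's RPC port using
/// tarpc's TCP transport with a JSON codec.
async fn connect(addr: IpAddr, rpc_port: u16) -> eyre::Result<CliControlClient> {
    let transport =
        tarpc::serde_transport::tcp::connect(SocketAddr::new(addr, rpc_port), Json::default)
            .await?;
    Ok(CliControlClient::new(client::Config::default(), transport).spawn())
}

// Usage from a command handler; `tarpc::context::current()` carries the deadline:
// let dataflows = rpc(client.list_dataflows(tarpc::context::current())).await?;
```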