
Start adding tarpc service and server to coordinator #1345

Open
phil-opp wants to merge 40 commits into main from tarpc

Conversation

phil-opp (Collaborator) commented on Feb 10, 2026:

Add shared CoordinatorState struct for keeping state. Use DashMap for concurrent maps.

Not all messages use the proper tarpc request handling yet. This will be fixed in a follow-up commit. Same goes for the manual (de)serialization.

Alternative to #1288

Implementation notes:

Message Format

  • The message format is defined by the new CliControl trait in the message crate (cli_to_coordinator); a sketch of the general shape of such a trait follows this list
  • The ControlRequestReply enum is no longer needed, so it is removed
  • Most of the ControlRequest variants are no longer needed because they are now part of the CliControl trait. I removed these variants and renamed the enum to LegacyControlRequest
  • Since the message format has changed, this PR will require a minor version bump of the dora-message crate
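
For readers unfamiliar with tarpc, a service trait of this kind looks roughly like the sketch below. This is not the actual CliControl definition from this PR: the method names, argument types, and the DataflowListEntry type are placeholder assumptions; only the #[tarpc::service] mechanism itself comes from tarpc.

// Rough sketch of a tarpc service definition; the real `CliControl` trait
// in dora-message has different methods and types.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DataflowListEntry {
    pub uuid: uuid::Uuid,
    pub name: Option<String>,
}

// `#[tarpc::service]` generates a `CliControlClient` type plus the
// server-side trait that the coordinator implements.
#[tarpc::service]
pub trait CliControl {
    /// List all dataflows known to the coordinator.
    async fn list() -> Result<Vec<DataflowListEntry>, String>;
    /// Stop a running dataflow.
    async fn stop(uuid: uuid::Uuid) -> Result<(), String>;
}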

Coordinator

  • The coordinator now provides a tarpc server at port DORA_COORDINATOR_PORT_RPC_DEFAULT
    • Most messages are handled by this server
  • The existing control server still exists at port DORA_COORDINATOR_PORT_CONTROL_DEFAULT
    • It's only used for LogSubscribe and BuildLogSubscribe messages, which are both streaming-based and cannot be represented well in tarpc
    • IMO, the proper solution to this would be using zenoh for these subscriptions. This way, we would no longer need to involve the coordinator
  • Instead of handling all requests in a single event loop, the coordinator now spawns a tokio task for each request and handles them in parallel
    • State is managed in a CoordinatorState struct, which uses DashMap fields that provide safe concurrent access; a sketch of this pattern follows this list
  • The impl CliControl for ControlServer uses the same code for the different requests as the previous event loop (with minor modifications)
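
To make the shared-state pattern concrete, here is a minimal sketch of the idea. The field names, value types, and the handle_request helper are illustrative assumptions, not the code from this PR; only CoordinatorState, DashMap, and the task-per-request approach come from the description above.

// Sketch only: a shared state struct with DashMap fields, accessed from
// per-request tokio tasks through an Arc.
use std::sync::Arc;

use dashmap::DashMap;
use uuid::Uuid;

#[derive(Default)]
pub struct CoordinatorState {
    // Each DashMap entry is locked individually, so concurrent request
    // handlers only contend when they touch the same dataflow or build.
    pub running_dataflows: DashMap<Uuid, RunningDataflow>,
    pub running_builds: DashMap<Uuid, RunningBuild>,
}

pub struct RunningDataflow { /* ... */ }
pub struct RunningBuild { /* ... */ }

// Instead of one sequential event loop, each request runs in its own task.
pub fn handle_request(state: Arc<CoordinatorState>, dataflow_id: Uuid) {
    tokio::spawn(async move {
        // `get` locks only this entry; other entries stay available to
        // other handlers. Drop the guard before any long `.await`.
        if let Some(dataflow) = state.running_dataflows.get(&dataflow_id) {
            // ... handle the request using `dataflow` ...
            let _ = &*dataflow;
        }
    });
}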

CLI

  • The CLI's main function and all execute calls are now async
  • Instead of manual serialization/deserialization and message type specifications, we can now simply call a function on the CliControlClient; a sketch of the new call pattern follows this list
  • For (build) log subscriptions, the LegacyControlRequest is used
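
As an illustration of the new call pattern, here is a sketch of what an async CLI command could look like. Only connect_to_coordinator_rpc and CliControlClient appear in this PR; the list method, its return type, and the argument names are placeholder assumptions.

// Sketch of an async CLI command using the generated tarpc client.
use eyre::WrapErr;

async fn list_command(addr: std::net::IpAddr, port: u16) -> eyre::Result<()> {
    let client = connect_to_coordinator_rpc(addr, port)
        .await
        .wrap_err("failed to connect to dora coordinator")?;

    // One typed method call replaces manual serialize/send/receive/deserialize.
    let dataflows = client
        .list(tarpc::context::current())
        .await
        .wrap_err("RPC transport error")? // transport-level failure
        .map_err(|e| eyre::eyre!(e))?; // coordinator-side error string

    for dataflow in dataflows {
        println!("{dataflow:?}");
    }
    Ok(())
}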

Note: I used AI to generate some of the code, but I reviewed all generated code manually. I also improved and simplified the generated code.

phil-opp changed the title from "Start adding tarpc service and server" to "Start adding tarpc service and server to coordinator" on Feb 10, 2026
phil-opp (Collaborator, Author) commented:

@claude Can you look into the failure of the CI/Examples test? It looks like the multiple daemons example is not finishing properly. It works locally for me though.

claude bot commented on Feb 11, 2026:

Claude finished @phil-opp's task in 5m 14s.


CI/Examples Test Failure - Fixed ✅

Summary: The multiple daemons example was failing in CI due to a panic at the FFI boundary when using blocking_send() on a thread with lingering Tokio runtime context.

Root Cause

The runtime-node was crashing with Signal(6) - SIGABRT due to "panic in a function that cannot unwind" at the FFI boundary. Here's the chain of events:

  1. In binaries/runtime/src/operator/shared_lib.rs:40-44, a current_thread Tokio runtime is created to download shared libraries
  2. Even after this runtime is dropped, Tokio's thread-local state persists on the thread
  3. When the operator callback tries to send output via events_tx.blocking_send() (line 155), Tokio detects it's being called from a thread with runtime context
  4. blocking_send() panics when called from within a Tokio runtime context
  5. This panic propagates through the FFI boundary (ArcDynFn1 callback), which is extern "C" and cannot unwind
  6. Rust aborts the process with SIGABRT

The Problem Code

// Creates current_thread runtime that leaves thread-local state
let rt = tokio::runtime::Builder::new_current_thread()
    .enable_all()
    .build()?;
rt.block_on(download_file(source, target_path))
// rt is dropped, but thread-local state remains!

// Later, this panics because Tokio detects runtime context
self.events_tx.blocking_send(event)  // <-- PANICS

The Fix

Changed all blocking_send() calls to try_send() in binaries/runtime/src/operator/shared_lib.rs. The try_send() method doesn't require blocking and works correctly regardless of Tokio runtime context.

Changes made:

  • Line 155: blocking_send → try_send (in output closure called from FFI)
  • Line 79: blocking_send → try_send (Finished event)
  • Line 82: blocking_send → try_send (Error event)
  • Line 85: blocking_send → try_send (Panic event)
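
For reference, the replacement pattern looks roughly like this (a sketch, not the exact code from the fix commit). Note that with try_send a full channel surfaces as an error to handle rather than as backpressure.

// Sketch: `try_send` never blocks, so it cannot trigger Tokio's panic
// about blocking from within a runtime context.
if let Err(err) = self.events_tx.try_send(event) {
    // A full or closed channel is reported instead of blocking the thread.
    tracing::warn!("failed to forward operator event: {err}");
}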

Commit

Pushed fix in commit a4d3ec1. CI should now pass for the multiple daemons example.


phil-opp marked this pull request as ready for review on February 11, 2026, 17:05

haixuanTao (Collaborator) commented:

@sjfhsjfh @Mivik, could you guys take a look at this? :)

haixuanTao (Collaborator) left a review:

Thanks for the solid refactor! The DashMap lock discipline is well done and the insert-before-spawn pattern is excellent. I tested the full CLI lifecycle (up/check/list/start/stop/logs/build/destroy) on macOS and everything works correctly — including log streaming via both dora start and dora run.

Left a few comments on things that caught my eye.

    addr: IpAddr,
    _control_port: u16,
) -> eyre::Result<CliControlClient> {
    let rpc_port = DORA_COORDINATOR_PORT_RPC_DEFAULT;

The _control_port parameter is accepted but silently ignored — the RPC port is always hardcoded to DORA_COORDINATOR_PORT_RPC_DEFAULT (6013). This means any user passing --coordinator-port will have it silently discarded for all tarpc operations.

Suggestion — either derive the RPC port from the control port:

pub(crate) async fn connect_to_coordinator_rpc(
    addr: IpAddr,
    control_port: u16,
) -> eyre::Result<CliControlClient> {
    // RPC port is control port + 1 by convention
    let rpc_port = if control_port == DORA_COORDINATOR_PORT_CONTROL_DEFAULT {
        DORA_COORDINATOR_PORT_RPC_DEFAULT
    } else {
        control_port + 1
    };
    // ...
}

Or add a dedicated --coordinator-rpc-port CLI argument to CoordinatorOptions so users can configure both ports independently.
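
A sketch of that second option, assuming the two port constants are in scope; the exact field set of CoordinatorOptions and the doc comments here are placeholder assumptions:

// Sketch: expose the tarpc port as its own CLI argument.
#[derive(clap::Args)]
pub struct CoordinatorOptions {
    /// Address of the dora coordinator.
    #[clap(long)]
    pub coordinator_addr: std::net::IpAddr,
    /// Port of the coordinator's legacy control server.
    #[clap(long, default_value_t = DORA_COORDINATOR_PORT_CONTROL_DEFAULT)]
    pub coordinator_port: u16,
    /// Port of the coordinator's tarpc server.
    #[clap(long, default_value_t = DORA_COORDINATOR_PORT_RPC_DEFAULT)]
    pub coordinator_rpc_port: u16,
}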

  tracing::info!("dataflow build finished: `{build_id}`");
- let mut build = running_builds.remove(&build_id).unwrap();
+ let (build_id, mut build) =
+     coordinator_state.running_builds.remove(&build_id).unwrap();

This .unwrap() can panic and crash the entire coordinator. While the preceding get_mut() succeeded, the RefMut guard was dropped when we entered this if block, so another concurrent path could theoretically remove the entry in between (the tarpc server spawns handlers concurrently).

Even if that's unlikely today, a panic here takes down all running dataflows. Suggestion:

let Some((build_id, mut build)) =
    coordinator_state.running_builds.remove(&build_id)
else {
    tracing::warn!("build {build_id} was removed while finalizing results");
    continue;
};

let client = connect_to_coordinator_rpc(self.coordinator_addr, self.coordinator_port)
    .await
    .map_err(|_| eyre!("Failed to connect to coordinator"))?;

.map_err(|_| ...) discards the underlying connection error, making it hard to debug why a connection failed (refused? timeout? DNS?). Other commands like stop.rs and logs.rs use .wrap_err() which preserves the cause chain.

Suggestion:

let client = connect_to_coordinator_rpc(self.coordinator_addr, self.coordinator_port)
    .await
    .wrap_err("failed to connect to dora coordinator")?;

future
    .await
    .context("RPC transport error")?
    .map_err(|e| eyre::eyre!(e))

Two issues with error reporting here:

  1. Transport errors lose detail: .context("RPC transport error") wraps the error but the message is generic — when the coordinator is down you get "RPC transport error" without knowing if it was connection refused, timeout, or network unreachable.

  2. Application errors lack operation context: .map_err(|e| eyre::eyre!(e)) converts the coordinator's error String into an eyre Report but doesn't say which operation failed. Callers that don't add .wrap_err() produce unhelpful messages.

Some callers do add context (e.g. stop.rs uses .wrap_err("...")), but others don't (e.g. list.rs:81: rpc(client.get_node_info(...))).

One option is to accept an operation label:

pub(crate) async fn rpc<T, E: std::error::Error + Send + Sync + 'static>(
    future: impl Future<Output = Result<Result<T, String>, E>>,
    operation: &str,
) -> eyre::Result<T> {
    future
        .await
        .wrap_err_with(|| format!("RPC transport error during {operation}"))?
        .map_err(|e| eyre::eyre!("{operation} failed: {e}"))
}

Alternatively, just ensure every call site wraps with .wrap_err().
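
For the first variant, a call site would then look something like this. This is a sketch: get_node_info comes from the list.rs example above, but its arguments are elided there, so the node_id parameter here is a hypothetical placeholder.

// Sketch: pass an operation label so both transport and application
// errors say what the CLI was trying to do.
let info = rpc(
    // `node_id` is a hypothetical placeholder argument.
    client.get_node_info(tarpc::context::current(), node_id),
    "get node info",
)
.await?;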
