feat: add error event propagation to downstream nodes #1252

guptapratykshh · 2025-12-06T07:35:17Z

This PR implements error event propagation (#1232) that allows nodes to send errors to downstream nodes instead of crashing. The feature provides node.send_error(output_id, error) API method that propagates error events to all affected downstream consumers, who receive InputError events containing error message, source node ID, and affected input ID for graceful error handling.

The implementation extends the message protocol with NodeEvent::InputError and DaemonRequest::SendError types, adds daemon-side error propagation logic that finds and notifies all downstream receivers, and includes complete working example demonstrating producer/consumer error handling patterns. All changes are backward compatible with no breaking changes to existing functionality.

Previous PR #1250

phil-opp

Thanks for the PR!

We discussed this and would like to propose the following change: Instead of providing an explicit send_error function in the API, the error should be created by the daemon when a node exits with a non-zero exit code. This way, we can notify downstream nodes of failed nodes.

If nodes want to send errors they can still do that using normal outputs.

So please remove the send_error method from the API (and also the modifications to node_to_daemon.

…ation

phil-opp · 2025-12-12T08:18:58Z

apis/rust/node/src/event_stream/event.rs

    },
+    /// An error occurred in an upstream node.
+    ///
+    /// Instead of crashing, the upstream node sent an error event to allow


The sending crashed because of this error. (It might restart once we merge #1245)

phil-opp · 2025-12-12T08:20:10Z

apis/rust/node/src/event_stream/event.rs

+    /// Instead of crashing, the upstream node sent an error event to allow
+    /// downstream nodes to handle the error gracefully (e.g., use cached data,
+    /// switch to backup source, log and continue).
+    InputError {


Let's rename this variant to NodeFailed or similar.

phil-opp · 2025-12-12T08:20:52Z

apis/rust/node/src/event_stream/event.rs

+    /// switch to backup source, log and continue).
+    InputError {
+        /// The ID of the input that had an error, as specified in the YAML file.
+        id: DataId,


The node error might affect multiple errors, so let's change this to affected_input_ids: Vec<DataId>.

binaries/cli/src/common.rs

phil-opp · 2025-12-12T08:24:53Z

binaries/coordinator/src/lib.rs

+    if let Err(err) = tcp_send(&mut daemon_connection.stream, &message).await {
+        // If we can't send the message, the daemon might have already closed the connection
+        // This can happen during shutdown. Log a debug message but don't fail.
+        tracing::debug!(


This should not happen typically, so I would prefer to keep this an error. The code in destroy_daemons will still proceed to send destroy commands to the other daemons after an error.

Or is there any specific reason why this can now happen more commonly?

phil-opp · 2025-12-12T08:27:32Z

binaries/coordinator/src/lib.rs

    // wait for reply
-    let reply_raw = tcp_receive(&mut daemon_connection.stream)
-        .await
-        .wrap_err("failed to receive destroy reply from daemon")?;
-    match serde_json::from_slice(&reply_raw)
-        .wrap_err("failed to deserialize destroy reply from daemon")?
-    {
-        DaemonCoordinatorReply::DestroyResult { result, .. } => result
+    let reply_raw = match tcp_receive(&mut daemon_connection.stream).await {
+        Ok(raw) => raw,
+        Err(err) => {
+            // If we can't receive the reply, the daemon might have already closed the connection
+            // This can happen during shutdown. Log a debug message but don't fail.
+            tracing::debug!(
+                "failed to receive destroy reply from daemon `{daemon_id}`: {err}, \
+                 assuming daemon is already shut down"
+            );
+            return Ok(());
+        }
+    };
+
+    // Try to deserialize the reply. If it fails, treat it as a shutdown.
+    match serde_json::from_slice(&reply_raw) {
+        Ok(DaemonCoordinatorReply::DestroyResult { result, .. }) => result
            .map_err(|e| eyre!(e))
            .wrap_err("failed to destroy dataflow")?,
-        other => bail!("unexpected reply after sending `destroy`: {other:?}"),
+        Ok(other) => {
+            tracing::debug!(
+                "unexpected reply after sending `destroy` to daemon `{daemon_id}`: {other:?}, \
+                 assuming daemon is already shut down"
+            );
+            return Ok(());
+        }
+        Err(err) => {
+            // If deserialization fails, the daemon might have sent a partial reply before closing
+            // This can happen during shutdown. Log a debug message but don't fail.
+            tracing::debug!(
+                "failed to deserialize destroy reply from daemon `{daemon_id}`: {err}, \
+                 assuming daemon is already shut down"
+            );
+            return Ok(());
+        }


same as above: Please keep these as errors, unless there is a good reason to change this.

phil-opp · 2025-12-12T08:30:43Z

binaries/daemon/src/node_communication/tcp.rs

-        bincode::deserialize(&raw)
-            .wrap_err("failed to deserialize DaemonRequest")
-            .map(Some)
+        match bincode::deserialize(&raw) {
+            Ok(message) => Ok(Some(message)),
+            Err(err) => {
+                // If deserialization fails, treat it as a connection error
+                // This can happen if the connection was closed mid-transmission
+                // or if there's a version mismatch
+                // Log the raw data length and first few bytes to help diagnose the issue
+                let preview = if raw.len() > 16 {
+                    format!("{:?}", &raw[..16])
+                } else {
+                    format!("{:?}", raw)
+                };
+                // Log hex dump of first 32 bytes to help diagnose the issue
+                let hex_preview = if raw.len() > 32 {
+                    format!("{:02x?}", &raw[..32])
+                } else {
+                    format!("{:02x?}", raw)
+                };
+                // Try to identify what variant this might be
+                // Note: socket_stream_receive returns the message body (without length prefix)
+                // So the first 4 bytes should be the bincode enum variant index
+                let variant_hint = if raw.len() >= 4 {
+                    let variant_idx = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
+                    let variant_name = match variant_idx {
+                        0 => "Register",
+                        1 => "Subscribe",
+                        2 => "SendMessage",
+                        3 => "CloseOutputs",
+                        4 => "OutputsDone",
+                        5 => "NextEvent",
+                        6 => "ReportDropTokens",
+                        7 => "SubscribeDrop",
+                        8 => "NextFinishedDropTokens",
+                        9 => "EventStreamDropped",
+                        10 => "NodeConfig",
+                        _ => "Unknown",
+                    };
+                    format!(
+                        " (variant index: {} = {}, expected 8 for SubscribeDrop)",
+                        variant_idx, variant_name
+                    )
+                } else {
+                    String::new()
+                };
+
+                // Workaround for sync/async I/O mismatch: if we receive variant 7 (ReportDropTokens)
+                // with a 28-byte message (which matches SubscribeDrop size), it's likely a corrupted
+                // SubscribeDrop message due to sync/async I/O mismatch. Try to fix it.
+                let mut fixed_raw = raw.clone();
+                if raw.len() == 28 && raw.len() >= 4 {
+                    let variant_idx = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
+                    if variant_idx == 7 {
+                        // Change variant index from 7 (ReportDropTokens) to 8 (SubscribeDrop)
+                        tracing::debug!(
+                            "attempting to fix corrupted SubscribeDrop message (variant 7 -> 8) due to sync/async I/O mismatch"
+                        );
+                        fixed_raw[0] = 8;
+                        fixed_raw[1] = 0;
+                        fixed_raw[2] = 0;
+                        fixed_raw[3] = 0;
+                        // Try to deserialize the fixed message
+                        if let Ok(message) =
+                            bincode::deserialize::<Timestamped<DaemonRequest>>(&fixed_raw)
+                        {
+                            if matches!(message.inner, DaemonRequest::SubscribeDrop) {
+                                tracing::debug!(
+                                    "successfully fixed corrupted SubscribeDrop message"
+                                );
+                                return Ok(Some(message));
+                            }
+                        }
+                    }
+                }
+
+                tracing::warn!(
+                    "failed to deserialize DaemonRequest: {err}, raw data length: {}, preview: {}, hex: {}{}, treating as disconnect",
+                    raw.len(),
+                    preview,
+                    hex_preview,
+                    variant_hint
+                );
+                Ok(None)
+            }
+        }
    }



In general, there should not be any deserialization errors because we use reliable channels. Could you clarify why you think that this change is needed?

Either way, this change it not really related to this PR. So let's move it to a separate one if there is good motivation to do this.

phil-opp · 2025-12-12T08:30:58Z

binaries/daemon/src/node_communication/unix_domain.rs

-        bincode::deserialize(&raw)
-            .wrap_err("failed to deserialize DaemonRequest")
-            .map(Some)
+        match bincode::deserialize(&raw) {
+            Ok(message) => Ok(Some(message)),
+            Err(err) => {
+                // If deserialization fails, treat it as a connection error
+                // This can happen if the connection was closed mid-transmission
+                // or if there's a version mismatch
+                tracing::debug!(
+                    "failed to deserialize DaemonRequest: {err}, treating as disconnect"
+                );
+                Ok(None)
+            }
+        }


same as above (change unrelated to this PR)

phil-opp · 2025-12-12T08:31:24Z

binaries/daemon/src/local_listener.rs

+    match bincode::deserialize(&raw) {
+        Ok(message) => Ok(Some(message)),
+        Err(err) => {
+            // If deserialization fails, treat it as a connection error
+            // This can happen if the connection was closed mid-transmission
+            // or if there's a version mismatch
+            tracing::debug!("failed to deserialize DaemonRequest: {err}, treating as disconnect");
+            Ok(None)
+        }
+    }


Same as above (change unrelated to this PR)

phil-opp · 2025-12-12T08:31:55Z

libraries/message/src/daemon_to_node.rs

    InputClosed {
        id: DataId,
    },
+    InputError {


Please apply the same adjustments as for the other InputError variant here.

All requested changes have been addressed.

…ed inputs

phil-opp · 2025-12-16T09:42:57Z

binaries/cli/src/command/stop.rs

-            handle_dataflow_result(result, Some(uuid))
+            // For stop commands, we consider it successful if we successfully stopped the dataflow,
+            // even if the dataflow was already in a failed state
+            handle_dataflow_result_with_options(result, Some(uuid), true)


I don't think that we should hide errors here. A stop command should lead to normal stopping of all nodes. If that doesn't work, we should report it.

phil-opp · 2025-12-16T09:43:14Z

binaries/cli/src/command/stop.rs

-            handle_dataflow_result(result, Some(uuid))
+            // For stop commands, we consider it successful if we successfully stopped the dataflow,
+            // even if the dataflow was already in a failed state
+            handle_dataflow_result_with_options(result, Some(uuid), true)


same as above (don't hide error)

phil-opp · 2025-12-16T09:43:43Z

binaries/cli/src/common.rs

+    } else if ignore_failures {
+        // For stop commands, we consider it successful if we successfully stopped the dataflow,
+        // even if the dataflow was already in a failed state
+        Ok(())


Why should we ignore errors on stop?

phil-opp · 2025-12-16T09:44:02Z

binaries/coordinator/src/lib.rs

-        .wrap_err(format!(
-            "failed to send destroy message to daemon `{daemon_id}`"
-        ))?;
+        .wrap_err("failed to send destroy message to daemon")?;


What is this change about?

phil-opp · 2025-12-16T09:46:38Z

binaries/daemon/src/node_communication/mod.rs

+                    // Deserialization failed or connection closed
+                    // The node has likely sent a request and is waiting for a reply
+                    // Try to send an error reply before closing so the node doesn't hang
+                    // We check subscribed_drop_events to determine if it's likely a SubscribeDrop request
+                    let error_msg = if self.subscribed_drop_events.is_none() {
+                        // Node hasn't subscribed to drop stream yet, likely waiting for SubscribeDrop reply
+                        "failed to deserialize SubscribeDrop request: connection error or version mismatch"
+                    } else {
+                        // Node has already subscribed, might be a different request
+                        "failed to deserialize request: connection error or version mismatch"
+                    };
+                    let error_reply = DaemonReply::Result(Err(error_msg.into()));
+                    if let Err(err) = self.send_reply(error_reply, &mut connection).await {
+                        tracing::debug!(
+                            "failed to send error reply after deserialization failure: {err:?}"
+                        );
+                    }


What is this change about? I does not seem related to this PR?

phil-opp · 2025-12-16T09:48:43Z

binaries/daemon/src/lib.rs

                            break;
                        }
                    }
+                    // Exit when all dataflows are finished after Ctrl-C


We normally don't exit the daemon on ctrl-c. The exit_when_all_finished flag is set only in some specific scenarios, e.g. when running dora run.

phil-opp · 2025-12-16T09:49:37Z

binaries/daemon/src/lib.rs

+                    // Also exit in example mode when all dataflows finish, even if exit_when_done
+                    // tracking didn't catch all nodes (e.g., if nodes finish before being tracked)
+                    if self.exit_when_done.is_some() && self.running.is_empty() {
+                        tracing::info!(
+                            "exiting daemon because all dataflows are finished (example mode)"
+                        );
                        break;


What is this change about? Is it related to this PR?

phil-opp · 2025-12-16T09:55:11Z

binaries/daemon/src/lib.rs

+                        // Nodes killed by grace duration during stop are not errors
+                        if grace_duration_kill {
+                            logger
+                                .log(
+                                    LogLevel::Debug,
+                                    Some("daemon".into()),
+                                    format!("node `{node_id}` was stopped by grace duration (not an error)"),
+                                )
+                                .await;
+                            Ok(())


Why not? If nodes are unresponsive and need to be killed that seems like an error. For example, there might be some important cleanup logic in the node that was skipped because of the kill. So I think it's better to report this error than to hide it silently.

phil-opp · 2025-12-16T09:57:34Z

binaries/daemon/src/lib.rs

+                // Always send NodeStopped event, even if handle_node_stop fails
+                // This ensures the daemon can exit properly
+                let stop_result = self
+                    .handle_node_stop(dataflow_id, &node_id, dynamic_node)
+                    .await;
+                if let Err(err) = &stop_result {
+                    tracing::warn!(
+                        "Error handling node stop for {}/{}: {err:?}, but NodeStopped event should have been sent",
+                        dataflow_id,
+                        node_id
+                    );
+                }
+                // Don't propagate the error - NodeStopped event is more important for shutdown


These comments and warn message does not seem to match the code.

binaries/daemon/src/lib.rs

We added a new `NodeEvent` variant, which is not backwards compatible (yet, see dora-rs#1273). We also added a new `InterDaemonEvent` variant that is not backwards compatible either.

phil-opp · 2025-12-20T13:46:28Z

libraries/message/src/daemon_to_daemon.rs

+        receiver_node_id: NodeId,
+        affected_input_ids: Vec<DataId>,


We don't really need these two fields in the message, the receiver can figure it out on its own. This has the advantage that we only need to send a single message.

phil-opp · 2025-12-20T13:47:56Z

binaries/daemon/src/lib.rs

+        // Now send to remote receivers (after releasing the borrow on dataflow)
+        for (receiver_node_id, affected_input_ids) in remote_receivers {
+            // Remote receiver - send through coordinator via zenoh
+            // The coordinator will forward this to the appropriate daemon
+            let output_id = OutputId(source_node_id.clone(), affected_input_ids[0].clone());
+            let event = InterDaemonEvent::NodeFailed {
+                dataflow_id,
+                source_node_id: source_node_id.clone(),
+                receiver_node_id: receiver_node_id.clone(),
+                affected_input_ids,
+                error: error_message.clone(),
+            };
+
+            // Send to remote receivers using the same mechanism as regular outputs
+            if let Err(err) = self.send_to_remote_receivers(dataflow_id, &output_id, event).await {
+                tracing::warn!(
+                    "Failed to send error event to remote receiver {}: {err:?}",
+                    receiver_node_id
+                );


We can simplify this by removing the receiver_node_id and affected_input_ids fields of InterDaemonEvent::NodeFailed. Then we only need to send a single message here.

phil-opp · 2025-12-20T13:49:23Z

I pushed a commit to bump the dora-message version so that we don't forget about it.

…r determines affected nodes locally

…-propagation

phil-opp · 2025-12-23T06:46:56Z

binaries/daemon/src/lib.rs

+                    // Also exit in example mode when all dataflows finish, even if exit_when_done
+                    // tracking didn't catch all nodes (e.g., if nodes finish before being tracked)
+                    if self.exit_when_done.is_some() && self.running.is_empty() {
+                        tracing::info!(
+                            "exiting daemon because all dataflows are finished (example mode)"
+                        );


Why is this needed?

phil-opp · 2025-12-23T06:51:02Z

binaries/daemon/src/lib.rs

+        // Get all outputs of the failed node
+        let outputs: Vec<DataId> = dataflow
+            .mappings
+            .keys()
+            .filter(|m| &m.0 == &source_node_id)
+            .map(|m| m.1.clone())
+            .collect();
+
+        // Group affected inputs by receiver node
+        let mut affected_by_receiver: BTreeMap<NodeId, Vec<DataId>> = BTreeMap::new();
+
+        for output_id in &outputs {
+            let output_key = OutputId(source_node_id.clone(), output_id.clone());
+            if let Some(receivers) = dataflow.mappings.get(&output_key) {
+                for (receiver_node_id, input_id) in receivers {
+                    affected_by_receiver
+                        .entry(receiver_node_id.clone())
+                        .or_insert_with(Vec::new)
+                        .push(input_id.clone());
+                }
+            }
+        }
+
+        // Check if there are any remote receivers (nodes not on this daemon)
+        let has_remote_receivers = affected_by_receiver
+            .keys()
+            .any(|receiver_node_id| !dataflow.subscribe_channels.contains_key(receiver_node_id));
+
+        // Send NodeFailed event to each local receiver
+        for (receiver_node_id, affected_input_ids) in affected_by_receiver {
+            if let Some(channel) = dataflow.subscribe_channels.get(&receiver_node_id) {
+                // Local receiver - send directly
+                let event = NodeEvent::NodeFailed {
+                    affected_input_ids,
+                    error: error_message.clone(),
+                    source_node_id: source_node_id.clone(),
+                };
+                let _ = send_with_timestamp(channel, event, &self.clock);
+            }
+        }


This looks like the same code as when handling InterDaemonEvent::NodeFailed. Why not move it to a separate function?

phil-opp · 2025-12-23T06:52:37Z

binaries/daemon/src/lib.rs

+        if has_remote_receivers {
+            // Use the first output as the routing key (any output will work)
+            let first_output = outputs.first().ok_or_else(|| {
+                eyre!("Failed node has no outputs, cannot send error to remote receivers")


Instead of erroring, we could just do nothing (if node has no outputs, no remote receivers are affected)

phil-opp · 2025-12-23T12:40:26Z

binaries/daemon/src/lib.rs

+
+            // Send to remote receivers using the same mechanism as regular outputs
+            if let Err(err) = self
+                .send_to_remote_receivers(dataflow_id, &output_id, event)


I forgot that daemons subscribe to a specific output topic. So just sending the message to the first output is not enough unfortunately (as some daemon might only be subscribed to the second or third output).

So I guess we have two options:

send the message to all outputs and deduplicate it on the receiver side (i.e. remember which nodes failed already to avoid sending duplicated events to nodes)

add a new node-level subscription topic for such messages that all affected daemons subscribe to

What do you prefer?

guptapratykshh · 2025-12-29T19:21:07Z

Can you please review this PR. The changes are done , the ci is failing through the different version

guptapratykshh · 2026-01-15T09:05:43Z

any updated on this, it is ready for review @phil-opp

guptapratykshh added 5 commits December 6, 2025 12:35

feat: add error event propagation to downstream nodes

454cea3

fix: treat nodes killed by grace duration as successful stops

359bca6

fix: gracefully handle connection errors during daemon shutdown

7d7aad7

fixed the formatting

49707c0

Fixed CLI command failures (dora stop/destroy) that were occurring

fb05aa4

phil-opp reviewed Dec 11, 2025

View reviewed changes

guptapratykshh added 3 commits December 11, 2025 18:40

feature/error-event-propagation

21d8546

refactor: replace explicit send_error API with automatic error propag…

321590c

…ation

fixed the formatting issues

52e6be1

phil-opp reviewed Dec 12, 2025

View reviewed changes

guptapratykshh added 3 commits December 12, 2025 23:04

refactor: rename InputError to NodeFailed and support multiple affect…

a4f8435

…ed inputs

refactor: rename InputError to NodeFailed and support multiple affect…

e61e585

…ed inputs

refactor: rename InputError to NodeFailed and support multiple affect…

5440b99

…ed inputs

guptapratykshh requested a review from phil-opp December 13, 2025 10:54

phil-opp reviewed Dec 16, 2025

View reviewed changes

guptapratykshh added 2 commits December 16, 2025 22:34

fix: address mentor feedback on error propagation PR

e97523e

fix: apply cargo fmt formatting fixes

c1b0445

guptapratykshh requested a review from phil-opp December 16, 2025 19:15

Bump dora-message version to 0.7.0-alpha

615788b

We added a new `NodeEvent` variant, which is not backwards compatible (yet, see dora-rs#1273). We also added a new `InterDaemonEvent` variant that is not backwards compatible either.

phil-opp force-pushed the feature/error-event-propagation branch from 466f384 to 615788b Compare December 20, 2025 13:43

phil-opp reviewed Dec 20, 2025

View reviewed changes

guptapratykshh added 3 commits December 20, 2025 23:47

Refactor error propagation to send single NodeFailed message, receive…

295d048

…r determines affected nodes locally

Merge remote-tracking branch 'upstream/main' into feature/error-event…

51dfa4e

…-propagation

Add NodeFailed pattern match to topic info command

2f0979a

guptapratykshh requested a review from phil-opp December 22, 2025 12:44

phil-opp reviewed Dec 23, 2025

View reviewed changes

fix: address reviewer feedback on error propagation PR (dora-rs#1252)

170c69c

guptapratykshh requested a review from phil-opp December 27, 2025 17:12

Merge branch 'main' into feature/error-event-propagation

afa9c14

feat: add error event propagation to downstream nodes #1252

Are you sure you want to change the base?

feat: add error event propagation to downstream nodes #1252

Uh oh!

Conversation

guptapratykshh commented Dec 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phil-opp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phil-opp commented Dec 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guptapratykshh commented Dec 29, 2025

Uh oh!

guptapratykshh commented Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

guptapratykshh commented Dec 6, 2025 •

edited

Loading