Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
|
the link to the issue that PR created for |
feo/src/agent/relayed/primary.rs
Outdated
| where | ||
| <InterChannelTcp as IsChannel>::MultiReceiver: Send, | ||
| <InterChannelUnix as IsChannel>::MultiReceiver: Send, | ||
| <IntraChannel as IsChannel>::Sender: Send, |
There was a problem hiding this comment.
In our original implementation we avoided making senders and receivers "Send", because this is a limitation that not all communication implementations do actually fulfill. As an example, we found iceoryx2. For that reason, we used builders that are passed to the target thread, and the target thread uses them to build it senders and receivers directly.
There was a problem hiding this comment.
You're right, thank you for the detailed clarification. I wasn't familiar with using the builder pattern to avoid Send bounds, and your explanation was very helpful.
I'll remove the unnecessary Send trait bounds, as the builder pattern already ensures thread safety by creating the channel components within the target thread. I'll ensure this pattern is followed consistently.
|
|
||
| let thread = thread::spawn(move || { | ||
| Self::thread_main(inter_receiver_builder, intra_sender_builder, timeout) | ||
| Self::thread_main(inter_receiver, intra_sender, timeout) |
There was a problem hiding this comment.
As mentioned above, we used the call pattern with builders on purpose to avoid having to implement the Send trait for senders and receivers.
There was a problem hiding this comment.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| .connect_receiver(timeout) | ||
| .expect("failed to connect intra-process sender"); | ||
| trace!("PrimaryReceiveRelay connected"); | ||
| loop { |
There was a problem hiding this comment.
This is where the senders and receivers should be built directly in the target thread, so that they do not need to implement the Send trait.
There was a problem hiding this comment.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| where | ||
| Inter::MultiReceiver: Send, | ||
| Intra::Sender: Send, | ||
| { |
There was a problem hiding this comment.
Same here, please do not enforce implementation of the Send trait.
There was a problem hiding this comment.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| where | ||
| Inter::MultiReceiver: Send, | ||
| Intra::Sender: Send, | ||
| { |
There was a problem hiding this comment.
Same here, please do not enforce implementation of the Send trait.
There was a problem hiding this comment.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| _ => { | ||
| return Err(Error::UnexpectedProtocolSignal); | ||
| other => { | ||
| warn!("Received unexpected signal {:?} from unknown token {:?} during connect", other, token); |
There was a problem hiding this comment.
Why isn't this an error anymore?
There was a problem hiding this comment.
I've changed this from an Error to a warn! to make the connection process more robust.
The previous implementation would cause the entire agent to fail if it received any signal other than ChannelHello during startup. This could happen due to network noise or a race condition where a fast client sends a valid signal immediately after connecting.
The new behavior simply logs the unexpected signal for debugging purposes and continues waiting for the required ChannelHello messages. This makes the startup process resilient to transient issues and ensures it only fails for the correct reason: a timeout because not all peers connected in time.
There was a problem hiding this comment.
Ok, I have no objection to the change, although I think in our current implementation of the startup-handshake, the error should never occur.
|
Hi @armin-acn, could you please review the PR again after I made the changes based on your last feedback? |
|
Sure, I will, but need to find some time. |
armin-acn
left a comment
There was a problem hiding this comment.
Sorry, still two minor comments. I hope that will be all.
Did you test all the available signaling variants, though?
I did not but assume you did ...
| let mut missing_activities: HashSet<ActivityId> = | ||
| self.all_activities.iter().cloned().collect(); | ||
| let mut missing_recorders: HashSet<AgentId> = self.all_recorders.iter().cloned().collect(); | ||
| let start_time = std::time::Instant::now(); |
There was a problem hiding this comment.
Please add a use-statement for Instant from feo_time and just use Instant::now() here. (As you already did in relayed/connectors/relays.rs.)
There was a problem hiding this comment.
Thanks! Updated this to match the implementation in relayed/connectors/relays.rs using Instant::now().
feo/src/signalling/relayed/mod.rs
Outdated
|
|
||
| mod connectors; | ||
| mod interface; | ||
| pub(crate) mod interface; |
There was a problem hiding this comment.
Why does this have to be public now?
There was a problem hiding this comment.
Sorry about that, I forgot to revert it.
|
Hello @armin-acn, the branch has been successfully rebased on top of main. The PR is now ready for merge. |
|
@MohamedRady-99, are you going to merge? |
|
Hi @MohamedRady-99, would it be a problem for you, if the very first Release of FEO v0.5 (to be prepared until tomorrow the latest) would not include this PR? We can probably create a minor release update very soon, which could then include your changes. |
|
Hi @armin-acn, No problem at all, that’s fine. It can be included in the next minor release. |
|
Anyway, you can go ahead and merge your PR, if you want.
|
|
Hi @armin-acn, It seems I don’t have permission to merge, so I’d appreciate your support with that. |
This pull request introduces a connection_timeout to the agent configuration. During the startup phase, agents (primary, secondaries, and recorders) attempt to establish communication channels with each other. Previously, if an expected agent failed to connect, the process could hang indefinitely.
With this change:
A connection_timeout can be specified in the PrimaryConfig.
This timeout dictates the maximum duration the primary agent will wait for all secondary agents and recorders to connect.
If any agent fails to establish a connection within this period, the primary agent will time out, log an error, and exit, preventing the system from getting stuck in a non-operational state.
This improves the robustness and predictability of the system's startup sequence.