
Auto Reconnect not working on ROUTER-ROUTER pattern #4788

@Jeducious


Description of problem

Background

I am an author and maintainer of a distributed computing app (crowdrender, distributed rendering for Blender).
I have recently been moving the core of this app to a different architecture based on event-driven patterns. Two components are essential: an event bus, implemented using PUB/SUB, and a channel for streaming binary data (files, binary blobs, streams etc.), which uses the ROUTER-ROUTER pattern for peer-to-peer streaming.

Problem

One of the common problems I want the app to handle is a peer going to sleep. When this happens, the PUB/SUB channel notices after a timeout that the sleeping peer has dropped its connection and begins automatic recovery. Here is what I see on that channel for the PUB and SUB sockets of the connecting peer (the sleeping peer is the one that binds in this example):

2025-04-26 22:39:16,445 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: DISCONNECTED, endpoint: tcp://192.168.1.145:9691, value 3872 }
2025-04-26 22:39:18,366 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: DISCONNECTED, endpoint: tcp://192.168.1.145:9697, value 3852 }
2025-04-26 22:42:52,792 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: CLOSED, endpoint: tcp://[::ffff:192.168.1.145]:9691, value 3812 }
2025-04-26 22:42:52,792 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: CONNECT_RETRIED, endpoint: tcp://[::ffff:192.168.1.145]:9691, value 5000 }
2025-04-26 22:42:54,720 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: CLOSED, endpoint: tcp://[::ffff:192.168.1.145]:9697, value 4196 }
2025-04-26 22:42:54,721 ; Client EventBusLogger ; INFO ;None; EVENT:{ b'clients'.b'eventbusbridge.socket_event'; body: ZMQEvent: CONNECT_RETRIED, endpoint: tcp://[::ffff:192.168.1.145]:9697, value 5000 }

The attempts to reconnect continue forever on the PUB/SUB channel. On the ROUTER-ROUTER channel, however, although the socket monitor reports a DISCONNECTED event, there are NO attempts at all to reconnect.

Environment and configuration

Environment

The connecting peer runs Windows 11 and the binding peer runs macOS. I am using the pyzmq bindings, version 26.4, which runs libzmq 4.3.5, on both machines. I am using Python 3.11.11 as installed in Blender.

Socket configuration

For the ROUTER-ROUTER sockets I am using the following options:

# Connecting peer 
_out_sock.setsockopt(zmq.IPV6, 1)
_out_sock.setsockopt(zmq.CONNECT_TIMEOUT, 10 * 1000)
_out_sock.setsockopt(zmq.RECONNECT_IVL_MAX, 5000)
_out_sock.setsockopt(zmq.ROUTER_MANDATORY, 1)
_out_sock.setsockopt(zmq.ROUTER_HANDOVER, 1)
_out_sock.curve_publickey = self.cr_auth_ctx.client_public_key
_out_sock.curve_secretkey = self.cr_auth_ctx.client_secret_key

# Binding Peer
_out_sock.setsockopt(zmq.IPV6, 1)
_out_sock.setsockopt(zmq.CONNECT_TIMEOUT, 10 * 1000)
_out_sock.setsockopt(zmq.RECONNECT_IVL_MAX, 5000)
_out_sock.setsockopt(zmq.ROUTER_MANDATORY, 1)
_out_sock.setsockopt(zmq.ROUTER_HANDOVER, 1)
_out_sock.curve_publickey = self.cr_auth_ctx.client_public_key
_out_sock.curve_secretkey = self.cr_auth_ctx.client_secret_key
_out_sock.curve_server = True

NOTE: These sockets connect and disconnect perfectly fine, with the expected socket events according to the protocol specification. The connecting and binding peers each set up their own keys, supplied by an auth object which guarantees a unique set of keys for each peer.
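To make the configuration easier to reproduce outside the app, here is a self-contained sketch that applies the same options to both sides. `make_router` is a hypothetical helper, and keys from `zmq.curve_keypair()` stand in for the ones supplied by `cr_auth_ctx`. Setting `curve_serverkey` on the connecting side is an assumption on top of the options listed above, since a CURVE client needs the server's public key.

```python
import zmq


def make_router(ctx, public_key, secret_key, server_key=None):
    """Create a ROUTER socket with the options listed above.

    With server_key=None the socket acts as the CURVE server (binding peer);
    otherwise it acts as a CURVE client (connecting peer).
    """
    sock = ctx.socket(zmq.ROUTER)
    sock.setsockopt(zmq.IPV6, 1)
    sock.setsockopt(zmq.CONNECT_TIMEOUT, 10 * 1000)  # milliseconds
    sock.setsockopt(zmq.RECONNECT_IVL_MAX, 5000)     # milliseconds
    sock.setsockopt(zmq.ROUTER_MANDATORY, 1)
    sock.setsockopt(zmq.ROUTER_HANDOVER, 1)
    sock.curve_publickey = public_key
    sock.curve_secretkey = secret_key
    if server_key is None:
        sock.curve_server = True
    else:
        # Assumption: not shown in the options listed above.
        sock.curve_serverkey = server_key
    return sock
```

A binding peer would be created with `make_router(ctx, pub, sec)` and a connecting peer with `make_router(ctx, pub, sec, server_key=binder_pub)`, where each `(pub, sec)` pair comes from its own `zmq.curve_keypair()` call.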

Expected vs Actual behaviour

Expected Behaviour

If the binding peer is put into sleep mode, or allowed to go to sleep, the connecting peer should detect the dropped connection, issue a ZMQ_EVENT_DISCONNECTED event, and then attempt to reconnect, as evidenced by ZMQ_EVENT_CONNECT_RETRIED events for the endpoint of the sleeping peer.

Actual Behaviour

When the binding side goes to sleep, there is sometimes a DISCONNECTED event and sometimes not, but there are NO ZMQ_EVENT_CONNECT_RETRIED events at all. The list of TCP/IP connections on the connecting peer shows no connections to the sleeping peer's endpoints for the ROUTER-ROUTER channel, whereas the connections to the sleeping peer's PUB/SUB endpoints are listed with SYN_SENT as their state.

Final Remarks

I have been using pyzmq for ten years, though I do not necessarily consider myself an expert. I would be glad of any corrections to my approach, or requests for more information, so that I can handle sleeping peers. If this is indeed a bug, I would be happy to contribute my time to helping with a fix. I am highly motivated to make this pattern work, or to substitute it with something more suitable.

Basically what I am saying is, sometimes, I'm an idiot, but I mean well, and I'm always grateful when someone helps me get past my own limitations of understanding 😄

Thanks!
