
Service Response Timeout - Race Condition #842

@hilary-luo

Description


Generated by Generative AI

No response

Operating System:

Ubuntu 22.04 and 24.04 (Server and Desktop), and likely others

ROS version or commit hash:

Humble & Jazzy & probably Rolling

RMW implementation (if applicable):

rmw_fastrtps_cpp

RMW Configuration (if applicable):

No response

Client library (if applicable):

No response

'ros2 doctor --report' output

Not provided; the issue occurs broadly across many different systems.

Steps to reproduce issue

Set up a system that launches a large number of nodes at the same time, including lifecycle nodes (for example, ros2_control and Nav2 nodes). The issue is exacerbated by simple discovery, or by a discovery server that is only reachable over a wireless network.

These are well-provisioned, reasonable systems (a robot platform running ros2_control to drive the base, plus Nav2, a lidar, and a camera or a manipulator) with adequate CPU and RAM. A sketch of the client-side interaction that fails is shown below.
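To make the failing interaction concrete, here is a minimal rclpy sketch of what a lifecycle manager effectively does at bringup: discover a lifecycle node and immediately request a transition. The node name /my_lifecycle_node and the timeout value are placeholders rather than anything from the actual systems; when many such requests are issued while discovery is still converging, the final branch is the symptom reported here.

```python
# Hypothetical minimal client mirroring what a lifecycle manager does at bringup:
# discover a lifecycle node and immediately request a transition.
# '/my_lifecycle_node' and the 10 s timeout are placeholders, not from the real setup.
import rclpy
from lifecycle_msgs.msg import Transition
from lifecycle_msgs.srv import ChangeState


def main():
    rclpy.init()
    node = rclpy.create_node('race_repro_client')
    client = node.create_client(ChangeState, '/my_lifecycle_node/change_state')

    # wait_for_service() reflects this client's local view of discovery only;
    # the server's reply writer may not yet have matched this client's reply
    # reader, which is exactly the race described in this report.
    client.wait_for_service()

    request = ChangeState.Request()
    request.transition.id = Transition.TRANSITION_CONFIGURE
    future = client.call_async(request)
    rclpy.spin_until_future_complete(node, future, timeout_sec=10.0)

    if future.done():
        node.get_logger().info(f'transition accepted: {future.result().success}')
    else:
        # With many nodes discovering each other at once, this branch is hit even
        # though the server side often reports that the transition completed.
        node.get_logger().error('no response received; the reply was likely lost')

    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```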

Expected behavior

Although discovery may take some time, all nodes should launch properly and lifecycle nodes should transition correctly between states; in particular, no service response should ever be missed.

Actual behavior

When the nodes all start up, some of them fail to transition because of timeout errors: "failed to send response (timeout): client will not receive response". Extending the timeout does not make these failures go away. In some instances the node was confirmed to have transitioned correctly even though the response was never received. Looking at the code, this is likely the known race condition referenced by the TODO comment found at:

// TODO(MiguelCompany) The following block is a workaround for the race on the
and described in the associated DDS spec.

To quote the DDS spec that is referenced:

Service discovery for the Basic Service Mapping is not robust because discovery race conditions can cause the service replies to be lost. The request-topic and reply-topic are two different RTPS sessions that are matched independently by the DDS discovery process. For this reason it is possible for the entities on the request topic to discover each other before the entities on the reply topic discover each other. In such a situation, if a client makes a request before the entities over the reply topic are fully discovered, the client may lose the corresponding replies.

The TODO notes that this should be re-implemented using the Enhanced Service Mapping.
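For context, the usual way this gets worked around on the user side today is to retry the transition and to re-query the node's state when a reply goes missing, since the transition often did succeed on the server. The sketch below is illustrative only: the helper name, attempt count, and timeouts are made up, and the GetState reply can be lost to the very same race, so it merely papers over the problem rather than fixing it.

```python
# Hedged sketch of a retry-style workaround; not part of any released API.
import rclpy
from lifecycle_msgs.msg import Transition
from lifecycle_msgs.srv import ChangeState, GetState


def change_state_with_retry(node, node_name, transition_id, attempts=3, timeout=10.0):
    change_cli = node.create_client(ChangeState, f'{node_name}/change_state')
    get_cli = node.create_client(GetState, f'{node_name}/get_state')
    change_cli.wait_for_service()
    get_cli.wait_for_service()

    for attempt in range(attempts):
        req = ChangeState.Request()
        req.transition.id = transition_id
        future = change_cli.call_async(req)
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout)
        if future.done() and future.result().success:
            return True

        # The reply may have been lost even though the transition happened,
        # so query the current state before retrying.
        state_future = get_cli.call_async(GetState.Request())
        rclpy.spin_until_future_complete(node, state_future, timeout_sec=timeout)
        if state_future.done():
            node.get_logger().warn(
                f'attempt {attempt + 1}: current state is '
                f'{state_future.result().current_state.label}')
    return False


# Example use (node name is a placeholder):
#   change_state_with_retry(node, '/my_lifecycle_node', Transition.TRANSITION_CONFIGURE)
```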

Additional information

Is there a reason that the Enhanced Service Mapping has not been implemented, or is it just a matter of time or of someone contributing it? In my eyes this is a major issue, so I would like to engage in a conversation about what it would take to get this fixed.

I have personally seen this issue affect Clearpath Robotics robots, TurtleBot 4s, and Nav2 users in general across Humble and Jazzy. I have seen a number of tickets across these repos about symptoms that are likely caused by this root issue, so I do believe it is affecting a lot of people.

Labels

bug (Something isn't working)