Skip to content

Conversation

retrohacker
Copy link
Contributor

Description

Previously, the ConcurrentDial would return the first connection that succeeded regardless of whether that peer had the expected PeerId.

This would cause a race condition between multiaddresses when an ip address was shared by two peers (i.e. a local network address). The first connection would win, and only then would the Pool check to see if it was the correct peer id. If it was not, the entire connection would fail with a WrongPeerId error.

This MR converts WrongPeerId to a TransportError
ErrorKind::PermissionDenied to allow all multiaddresses for a connection to be attempted prior to resolving the future.

Fixes #6105.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Previously, the ConcurrentDial would return the first connection
that succeeded _regardless_ of whether that peer had the expected
PeerId.

This would cause a race condition between multiaddresses when an
ip address was shared by two peers (i.e. a local network address).
The first connection would win, and only then would the Pool check
to see if it was the correct peer id. If it was not, the entire
connection would fail with a WrongPeerId error.

This MR converts WrongPeerId to a TransportError
ErrorKind::PermissionDenied to allow all multiaddresses for a
connection to be attempted prior to resolving the future.
Comment on lines +99 to +102
Some((addr, Ok(_))) => {
let e = TransportError::Other(std::io::ErrorKind::PermissionDenied.into());
self.errors.push((addr, e));
}
Copy link
Member

@elenaf9 elenaf9 Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PermissionDenied is a bit misleading since it implies that the user doesn't have sufficient OS permissions to perform the action. Unfortunately there isn't really a good matching ErrorKind in this scenario. Maybe HostUnreachable, or just ErrorKind::Other..

But either way, with this change we'd never surface a DialError::WrongPeerId error anymore, even if all addresses belong to the same wrong peer-id (e.g. because the remote changed their identity), right? I am not sure if we should really change that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - this is the patch we are floating on our fork to get connections flowing again. Connections to 127, 192, and 10 subnets were always winning the race against the relayed connection which gave a near 100% connection failure rate.

I was having a hard time finding a “proper” fix for this without a much bigger change because WrongPeerId is not a transport error and “Other” at the transport level made it difficult to determine the cause of the error higher up in the call stack.

The only solutions I can come up with:

  • give ConcurrentDial its own error enum
  • use the parent’s enum and return it from ConcurrentDial

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ConcurrentDial prematurely gives up on dials
3 participants