Skip to content

Conversation

@alogfans
Copy link
Collaborator

@alogfans alogfans commented Jul 15, 2025

Setting SO_REUSEPORT to 1, which avoids bind the same port twice.

@alogfans alogfans requested a review from stmatengss July 16, 2025 04:11
@alogfans alogfans changed the title [Transfer Engine] Force to auto-detect available port using P2P handshake mode [Transfer Engine] Prevent binding the same port for multiple instances Jul 16, 2025
@alogfans alogfans requested a review from doujiang24 July 22, 2025 06:37
@staryxchen
Copy link
Contributor

Hi @alogfans, I met the problem when the two process listen on the same port (e.g. 16474 is used by two process)

mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1544039:I0718 00:09:46.066166 1544039 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:15739
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1544888:I0718 00:09:46.066126 1544888 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16595
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1545923:I0718 00:09:46.065773 1545923 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16474
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1546858:I0718 00:09:46.058853 1546858 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16810
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1547707:I0718 00:09:46.066221 1547707 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16474
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1548555:I0718 00:09:46.066220 1548555 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16939
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1549372:I0718 00:09:46.066203 1549372 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:15385
mooncake-transfer-engine.VM-17-38-centos.root.log.INFO.20250718-000946.1550176:I0718 00:09:46.066221 1550176 transfer_engine.cpp:100] Transfer Engine RPC using new RPC mapping, listening on IP1:16354

and then the following error message appears.

rdma_endpoint.cpp:196] Invalid argument: peer nic path inconsistency, expect IP1:20007@mlx5_bond_3 + IP2:20007@mlx5_bond_1, while got IP1:20001@mlx5_bond_3 + IP2:20007@mlx5_bond_1

So I think it would be best not to allow multiple processes to bind to the same port, but does your PR increase the likelihood of this happening? If SO_REUSEPORT is set, multiple sockets are bound to an identical socket address. Therefore, multiple processes can listen to the same port. Is there a problem somewhere?

@stmatengss
Copy link
Collaborator

It seems reusing port cannot solve the problem. @alogfans

@alogfans
Copy link
Collaborator Author

@staryxchen Thanks for your feedback. We use two methods to avoid this problem:

  • Disable REUSEADDR when user specify a MC_DISABLE_REUSEADDR env var. This solves the double-binding problem.
  • We fix the random generator so that the output becomes more random.

@staryxchen
Copy link
Contributor

staryxchen commented Jul 24, 2025

@staryxchen Thanks for your feedback. We use two methods to avoid this problem:

  • Disable REUSEADDR when user specify a MC_DISABLE_REUSEADDR env var. This solves the double-binding problem.
  • We fix the random generator so that the output becomes more random.

@alogfans I think the easiest way to fix this is to reduce the likelihood of port conflicts by expanding the range of randomly selected ports. Here's an example of what I wrote, just for reference.

    const int min_port = 15000;
    const int max_port = 25000;
    static std::random_device rand_gen;
    std::uniform_int_distribution<int> rand_dist(min_port, max_port);
    int port = rand_dist(rand_gen);

@alogfans
Copy link
Collaborator Author

Enlarge port space is a good idea. Also we make REUSEADDR disabled by default. Thus the problem should be fixed.

@staryxchen
Copy link
Contributor

Enlarge port space is a good idea. Also we make REUSEADDR disabled by default. Thus the problem should be fixed.

@alogfans Nice work, but I have two more questions:

  1. It seems that max_port is same as before, which seems to be some minor issue.
  2. Is it necessary to change next() function? It does not used in findAvailableTcpPort(), so I think we do need to change it at this time.

@alogfans
Copy link
Collaborator Author

  1. solved. 2) We find the origin SimpleRandom has problem in our practice. We fix it as well.

@staryxchen
Copy link
Contributor

  1. solved. 2) We find the origin SimpleRandom has problem in our practice. We fix it as well.

Understood. LGTM

Comment on lines +1093 to 1100
if (getenv("MC_ENABLE_REUSEADDR")) {
int on = 1;
if (setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on))) {
close(sockfd);
sockfd = -1;
continue;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd better remove these lines? multiple TE reuse the same port is not expected, since only one TE could accept the connection from client.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants