Skip to content

Conversation

@hezhiqian01
Copy link

This PR fix the issue as below:
When using the Intel E810 RDMA NIC, calling the ibv_modify_qp function to change the QP state to INIT results in an 'Invalid argument (22)' error.

The error might be caused by the QP already being in the RESET state, and setting it to RESET again could lead to an internal state inconsistency.

The solution is to remove the step that sets the QP to RESET.

@stmatengss
Copy link
Collaborator

@hezhiqian01 Could you make it specific for Intel E810 RDMA NIC?
After evaluating devices running the standard RDMA protocol, we don't find similar issues.
THX for your contribution.

@alogfans
Copy link
Collaborator

alogfans commented Aug 7, 2025

@hezhiqian01 An alternative solution is to check whether current is in RESET state. Thanks for your feedback.

@hezhiqian01
Copy link
Author

@stmatengss My RDMA device is the Intel E810. I'm not sure whether similar issues occur on other RDMA devices, but what is certain is that the QP is already in the RESET state upon creation. Therefore, there's no need to set it to RESET again. If possible, this step can be removed to prevent such errors.

@stmatengss
Copy link
Collaborator

@stmatengss My RDMA device is the Intel E810. I'm not sure whether similar issues occur on other RDMA devices, but what is certain is that the QP is already in the RESET state upon creation. Therefore, there's no need to set it to RESET again. If possible, this step can be removed to prevent such errors.

After testing another RDMA device, I still believe we cannot skip this step (change to RESET status). As @alogfans mentioned, could you check the status first and then modify_qp?

@alogfans
Copy link
Collaborator

More specifically, you can check the current state using ibv_query_qp(). If it's already IBV_QPS_RESET, you can skip the ibv_modify_qp() call.

@stmatengss
Copy link
Collaborator

@hezhiqian01 Could you update the codes based on the suggestions from @alogfans? Thank you for your contribution!

Comment on lines +358 to +361
// Any state -> RESET (removed)
// Modifying the QP to RESET is not necessary here, as it is already in the
// RESET state. This avoids the 'invalid argument' issue on some RDMA
// devices, such as the Intel E810.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Any state -> RESET (removed)
// Modifying the QP to RESET is not necessary here, as it is already in the
// RESET state. This avoids the 'invalid argument' issue on some RDMA
// devices, such as the Intel E810.
// 1. Query QP State
ibv_qp_attr cur_attr;
ibv_qp_init_attr init_attr;
int cur_state = IBV_QPS_ERR;
if (ibv_query_qp(qp, &cur_attr, IBV_QP_STATE, &init_attr) == 0) {
cur_state = cur_attr.qp_state;
} else {
PLOG(WARNING) << "[Handshake] ibv_query_qp failed, proceed to RESET anyway";
}
// 2. If current state is not reset
if (cur_state != IBV_QPS_RESET) {
ibv_qp_attr attr{};
attr.qp_state = IBV_QPS_RESET;
int ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE);
if (ret) {
std::string message = "Failed to modify QP to RESET";
PLOG(ERROR) << "[Handshake] " << message;
if (reply_msg) *reply_msg = message + ": " + strerror(errno);
return ERR_ENDPOINT;
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants