[NBS] RDMA endpoint got stuck in disconnecting state after NIC got reconfigured into switchdev mode #5195

@tpashkin

Description

The endpoint was created successfully and worked for a while, until the device suddenly went down under our feet:

2026-02-06T11:40:03.708935Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] start endpoint node-1548.compute-testing.ik8s.kcs.nbhost.net [send_magic=8C4AB3B5 recv_magic=F05AAF5F]
2026-02-06T11:40:03.744804Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T11:40:04.756008Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T11:40:06.765503Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T18:28:13.882567Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error
2026-02-06T18:28:13.882569Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] disconnect
2026-02-06T18:28:13.882825Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error
2026-02-06T18:28:13.882976Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error

The kernel log confirms that eth0 was reconfigured into switchdev mode, which means RDMA was effectively turned off at the driver/firmware level. (The kernel log timestamps appear to be in local time, two hours ahead of the UTC timestamps above.)

2026-02-06 20:28:13.000	mlx5_core 0000:37:00.0: E-Switch: Disable: mode(LEGACY), nvfs(64), necvfs(0), active vports(65)
2026-02-06 20:28:13.000	mlx5_0/1: QP 297 error: local protection error (0x3a 0x0 0x93)
2026-02-06 20:28:16.000	mlx5_core 0000:37:00.0: E-Switch: Supported tc chains and prios offload
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0: E-Switch: MPFS/FDB active
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0 eth0: Link up
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0 eth0: Dropping C-tag vlan stripping offload due to S-tag vlan
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0 eth0: Disabling hw_tls_tx, not supported in switchdev mode
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0 eth0: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0 eth0: Disabling HW MACsec offload, not supported in switchdev mode
2026-02-06 20:28:17.000	mlx5_core 0000:37:00.0: mlx5e: IPSec ESP acceleration enabled

No sensible recovery of the RDMA endpoint is possible in such a case, but we could at least have restarted the volume in IC mode.
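To make the proposal concrete, here is a minimal sketch of how the error path could tell an unrecoverable device failure apart from an ordinary disconnect. All names (`TEndpoint`, `EEndpointState`, `ETransport`, `HandleCqEventError`) are hypothetical and not taken from the NBS code base. The rule that `EIO` from `ibv_get_cq_event` means the device is gone is an assumption based only on the log above.

```cpp
#include <cerrno>

// Hypothetical types for illustration only; not the actual NBS classes.
enum class EEndpointState { Connected, Disconnecting, Failed };
enum class ETransport { Rdma, Fallback };

struct TEndpoint {
    EEndpointState State = EEndpointState::Connected;
    int FatalErrors = 0;
};

// Assumption: EIO from the completion-channel wait (as seen in the log)
// indicates the device was taken away, so retrying cannot succeed.
inline bool IsDeviceFatal(int err) {
    return err == EIO;
}

// Decide what to do after the completion-channel wait fails with `err`.
// Returns the transport the volume should use from now on.
inline ETransport HandleCqEventError(TEndpoint& ep, int err) {
    if (!IsDeviceFatal(err)) {
        // Ordinary failure: go through the usual disconnect/reconnect path.
        ep.State = EEndpointState::Disconnecting;
        return ETransport::Rdma;
    }
    // Device is gone: mark the endpoint as failed instead of leaving it
    // in the disconnecting state, and ask the caller to switch transport.
    ep.State = EEndpointState::Failed;
    ++ep.FatalErrors;
    return ETransport::Fallback;
}
```

With this shape, the caller would restart the volume over the fallback transport (IC mode in our case) when `ETransport::Fallback` is returned, rather than polling a completion channel that keeps failing with the same error.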
