The endpoint had been successfully created and was working for a while, until suddenly the device went down under our feet:
```
2026-02-06T11:40:03.708935Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] start endpoint node-1548.compute-testing.ik8s.kcs.nbhost.net [send_magic=8C4AB3B5 recv_magic=F05AAF5F]
2026-02-06T11:40:03.744804Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T11:40:04.756008Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T11:40:06.765503Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] connect [private_data=0x000072D7B92167B8 private_data_len=16 responder_resources=255 initiator_depth=255 flow_control=1 retry_count=7 rnr_retry_count=7 srq=0 qp_num=0]
2026-02-06T18:28:13.882567Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error
2026-02-06T18:28:13.882569Z :BLOCKSTORE_RDMA INFO: [10109089907589427039] disconnect
2026-02-06T18:28:13.882825Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error
2026-02-06T18:28:13.882976Z :BLOCKSTORE_RDMA ERROR: [10109089907589427039] /opt/buildagent/work/4e4cd8cd8c11d728/nbs/cloud/blockstore/libs/rdma/impl/verbs.cpp:123: SEVERITY_ERROR | FACILITY_SYSTEM | 5 | ibv_get_cq_event failed with error 5: Input/output error
```
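The error above is errno 5 (`EIO`) coming back from `ibv_get_cq_event`, i.e. the device itself failed, so retrying on the same endpoint is pointless. A minimal sketch of how such errors could be classified (this helper and its names are hypothetical, not from the NBS codebase):

```cpp
#include <cerrno>

// Hypothetical helper: decide what an RDMA endpoint should do after a
// verbs call (e.g. ibv_get_cq_event) fails with the given errno value.
enum class EEndpointAction { Retry, Reconnect, Fatal };

EEndpointAction ClassifyVerbsError(int err) {
    switch (err) {
        case EAGAIN:      // non-blocking completion channel is empty
        case EINTR:       // interrupted by a signal, safe to retry
            return EEndpointAction::Retry;
        case ECONNRESET:  // peer went away, a fresh connection may succeed
            return EEndpointAction::Reconnect;
        case EIO:         // device-level failure (e.g. an e-switch mode flip)
        default:
            return EEndpointAction::Fatal;
    }
}
```

With `EIO` mapped to `Fatal`, the endpoint would stop burning retries and escalate instead of logging the same error in a loop.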
Looking into the kernel log confirms that eth0 was reconfigured into switchdev mode, which means RDMA was effectively turned off at the driver/firmware level:
```
2026-02-06 20:28:13.000 mlx5_core 0000:37:00.0: E-Switch: Disable: mode(LEGACY), nvfs(64), necvfs(0), active vports(65)
2026-02-06 20:28:13.000 mlx5_0/1: QP 297 error: local protection error (0x3a 0x0 0x93)
2026-02-06 20:28:16.000 mlx5_core 0000:37:00.0: E-Switch: Supported tc chains and prios offload
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0: E-Switch: MPFS/FDB active
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0 eth0: Link up
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0 eth0: Dropping C-tag vlan stripping offload due to S-tag vlan
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0 eth0: Disabling hw_tls_tx, not supported in switchdev mode
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0 eth0: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0 eth0: Disabling HW MACsec offload, not supported in switchdev mode
2026-02-06 20:28:17.000 mlx5_core 0000:37:00.0: mlx5e: IPSec ESP acceleration enabled
```
There can be no sensible recovery of the RDMA path in such a case, but at least we could have restarted the volume in IC mode instead of staying broken.
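A sketch of the proposed behavior, under the assumption that there is a single place that owns the volume's transport choice (the class and its names below are made up for illustration, not NBS API):

```cpp
#include <atomic>
#include <functional>

// Hypothetical supervisor: on the first unrecoverable RDMA error, stop
// retrying the RDMA path and restart the volume over the fallback (IC)
// transport, exactly once even if several queues fail simultaneously.
class TTransportSupervisor {
public:
    explicit TTransportSupervisor(std::function<void()> restartOverIc)
        : RestartOverIc(std::move(restartOverIc))
    {}

    // Called from the RDMA completion loop on every error.
    void OnRdmaError(bool fatal) {
        if (fatal && !FellBack.exchange(true)) {
            RestartOverIc();    // first fatal error wins, the rest are no-ops
        }
    }

    bool UsingFallback() const { return FellBack.load(); }

private:
    std::function<void()> RestartOverIc;
    std::atomic<bool> FellBack{false};
};
```

The `std::atomic<bool>::exchange` guard matters because in the log above three completion queues reported `EIO` within a millisecond of each other; only one of them should trigger the restart.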