-
Notifications
You must be signed in to change notification settings - Fork 3
Closed
Description
Background
During large-scale distributed training, the following NCCL warning frequently appears in pod logs:
[2026-01-04 21:33:04] test-ys-512-scaling-test-0104-1504-a7de2925-master-0:617:2799 [3] transport/net_ib.cc:2360 NCCL WARN NET/IB: Got completion from peer 172.16.234.184<53393> with status=12 opcode=1 len=5222 vendor err 129 (Send) localGid ::ffff:100.123.200.87 remoteGids::ffff:100.123.200.74 hca mlx5_26
This type of warning often indicates low-level InfiniBand / RoCE communication issues, which may lead to:
- Collective operation stalls or hangs
- NCCL timeouts and watchdog aborts
- Training job failures or degraded performance
Currently, this error pattern is not explicitly detected by the pod log anomaly system.
Objective
Add a new NCCL NET/IB communication anomaly rule to the pod log component to detect and classify this class of network-related NCCL errors.
Anomaly Rule Definition
- Matching method: Regular expression
- Keyword / Pattern:
NCCL WARN NET/IB: Got completion from peer
- Optional fields to extract:
- Peer IP / Port
statusvendor err- HCA device (e.g.
mlx5_xx)
- Classify the anomaly type as:
NCCL_NET_IB_ERROR - Attach extracted peer and HCA information if available
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels