Skip to content

[Task] Detect NCCL NET/IB peer completion errors in pod logs #44

@xlliu-scitix

Description

@xlliu-scitix

Background

During large-scale distributed training, the following NCCL warning frequently appears in pod logs:

[2026-01-04 21:33:04] test-ys-512-scaling-test-0104-1504-a7de2925-master-0:617:2799 [3] transport/net_ib.cc:2360 NCCL WARN NET/IB: Got completion from peer 172.16.234.184<53393> with status=12 opcode=1 len=5222 vendor err 129 (Send) localGid ::ffff:100.123.200.87 remoteGids::ffff:100.123.200.74 hca mlx5_26

This type of warning often indicates low-level InfiniBand / RoCE communication issues, which may lead to:

  • Collective operation stalls or hangs
  • NCCL timeouts and watchdog aborts
  • Training job failures or degraded performance

Currently, this error pattern is not explicitly detected by the pod log anomaly system.


Objective

Add a new NCCL NET/IB communication anomaly rule to the pod log component to detect and classify this class of network-related NCCL errors.


Anomaly Rule Definition

  • Matching method: Regular expression
  • Keyword / Pattern:
    • NCCL WARN NET/IB: Got completion from peer
  • Optional fields to extract:
    • Peer IP / Port
    • status
    • vendor err
    • HCA device (e.g. mlx5_xx)
  • Classify the anomaly type as: NCCL_NET_IB_ERROR
  • Attach extracted peer and HCA information if available

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions