
[Bug]: Worker: Process failed for slice (): transport retry counter exceeded #1450

@zx-ai


Bug Report

Worker: Process failed for slice (opcode: 1, source_addr: 0x14c97801f040, length: 65536, dest_addr: 0x1474af020000, local_nic: mlx5_7, peer_nic: 172.16.37.132:14127@mlx5_4, dest_rkey: 2359727, retry_cnt: 0): transport retry counter exceeded

This type of error occurs intermittently.
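Since the failure is intermittent, it may help to collect these lines and check whether they cluster on particular NIC pairs. Below is a minimal sketch that splits one such line into its fields. It assumes the log format is exactly what is quoted above; `parse_slice_failure` is a hypothetical helper name, not part of any library.

```python
import re

# Matches the "Process failed for slice (...): <reason>" line quoted in this report.
# Assumption: the fields inside the parentheses are "key: value" pairs joined by ", ".
SLICE_FAILURE = re.compile(
    r"Process failed for slice \((?P<fields>[^)]*)\): (?P<reason>.+?)\.?$"
)


def parse_slice_failure(line):
    """Return a dict of the slice fields plus 'reason', or None if the line does not match."""
    match = SLICE_FAILURE.search(line.strip())
    if match is None:
        return None
    fields = {}
    for part in match.group("fields").split(", "):
        if not part:
            continue  # empty parentheses, as in the issue title
        # Split on the first ": " only, so "172.16.37.132:14127@mlx5_4" stays intact.
        key, _, value = part.partition(": ")
        fields[key] = value
    fields["reason"] = match.group("reason")
    return fields
```

Grouping the parsed results by `local_nic` and `peer_nic` would show whether the errors are tied to specific device pairs such as mlx5_7 and mlx5_4.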

I tried export MC_MS_AUTO_DISC=0, and this helps avoid this type of error. But I’m curious: why does turning off auto_discover_ avoid this issue?
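For anyone who launches the worker from a Python script rather than a shell, the same workaround can be applied per process instead of exporting the variable globally. This is only a sketch: the variable name MC_MS_AUTO_DISC and the value 0 come from this report, while `env_without_auto_discovery`, `run_with_auto_discovery_disabled`, and the command passed in are hypothetical placeholders.

```python
import os
import subprocess


def env_without_auto_discovery(base=None):
    """Copy an environment mapping and set MC_MS_AUTO_DISC=0 in the copy."""
    env = dict(os.environ if base is None else base)
    env["MC_MS_AUTO_DISC"] = "0"  # same effect as `export MC_MS_AUTO_DISC=0` for the child
    return env


def run_with_auto_discovery_disabled(cmd):
    """Run `cmd` (a list of arguments) with the workaround applied to the child only."""
    return subprocess.run(cmd, env=env_without_auto_discovery(), check=True)
```

This keeps the parent environment unchanged, which makes it easier to compare runs with and without the setting.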

Before submitting...

  • Ensure you searched for relevant issues and read the [documentation]


Labels: bug
