Conversation

rajachan
Owner

No description provided.

hershys-aws and others added 3 commits April 29, 2025 09:19
The scheduler lived at the device level and used two locks: the freelist
lock and scheduler->rr_lock. When a device is shared by multiple threads,
these locks cause contention when posting sends. As a short-term solution,
move the scheduler to the domain so that multiple threads no longer share
the scheduler resources.

Signed-off-by: Se Wang Oh <[email protected]>
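
To illustrate the idea in the commit message above, here is a minimal sketch of a scheduler owned by each domain instead of by the shared device. Every name in it (`struct scheduler`, `struct domain`, `struct device`, `domain_init`, `domain_pick_rail`, `NUM_RAILS`) is hypothetical and is not taken from the plugin's source. It also assumes that each thread works on its own domain, and it models only the round-robin lock, not the freelist lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names throughout; not the plugin's actual types. */
#define NUM_RAILS 4

/* Round-robin scheduler state, protected by its own lock. */
struct scheduler {
    pthread_mutex_t rr_lock;
    unsigned int next_rail;
};

/* Assumption: one domain per thread, each owning a private scheduler. */
struct domain {
    struct scheduler sched;
};

/* The device is shared by threads but no longer holds a scheduler. */
struct device {
    struct domain domains[2];
};

static void domain_init(struct domain *d)
{
    assert(d != NULL);
    pthread_mutex_init(&d->sched.rr_lock, NULL);
    d->sched.next_rail = 0;
}

/* Pick the next rail for a send posted on this domain. */
static unsigned int domain_pick_rail(struct domain *d)
{
    unsigned int rail;

    assert(d != NULL);
    pthread_mutex_lock(&d->sched.rr_lock);
    rail = d->sched.next_rail;
    d->sched.next_rail = (rail + 1) % NUM_RAILS;
    pthread_mutex_unlock(&d->sched.rr_lock);
    return rail;
}
```

In this sketch, two threads posting sends on different domains take different `rr_lock` instances, so they do not contend with each other even though they share the device.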
This reverts commit 426acbc.

Reverting because the latest dev cluster has now been updated to git 2.43.0, which supports
the mailmap feature in the git for-each-ref command.

Signed-off-by: Ye Xiang <[email protected]>
@rajachan rajachan force-pushed the actions-improvements branch from 977a8d3 to 0d91376 Compare May 3, 2025 19:39
bhasunit and others added 2 commits May 5, 2025 14:42
Since G5 platforms did not support PCIe technologies properly,
the NIC generated by NCCL in the topology file reported a NUMA
domain of -1. NCCL then assigns the NIC the same system ID as the CPU.
CPUs with the same system ID are not connected to each other
during topology generation. As a result, GPUs cannot find a path
to the NIC, and topology path computation fails with a
segmentation fault.

Signed-off-by: Sunita Bhaskaran <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
@rajachan rajachan force-pushed the actions-improvements branch from 0d91376 to 3ae6e8f Compare May 7, 2025 05:48
5 participants