-
Notifications
You must be signed in to change notification settings - Fork 133
Description
We are using the 25.8.1 version and when MNNVL is not working in a workload deployed as a Deployment or StatefulSet, I noticed the following errors in compute-domain-daemon:
[ERROR] Failed to get new memory reference with rc: 87
[ERROR] Failed to allocate memory reference.
[ERROR] Failed to get export ref for UUID: <a UUID> for request from node id 0
[ERROR] Received import response from node id 0 with failure status 3
If the workload has 2 pods, deleting the recreating these 2 pods can mitigate, and during this process we don't need to recreate the ComputeDomain resource.
These multi-node NVLink workloads are running on GB200 and GB300 capacities using driver version 580.105.08
NCCL test results, and the fact that only restarting all workloads working suggest that it's unlikely to be related to hardware failure.
Is it possible to be a leaking of resource or process created by the ComputeDomain stack?
Please let me know if you need any additional information
Thanks!
Metadata
Metadata
Assignees
Labels
Type
Projects
Status