Multi-Node NVLink failure due to memory reference error in compute-domain-daemon

We are using the 25.8.1 version and when MNNVL is not working in a workload deployed as a Deployment or StatefulSet, I noticed the following errors in compute-domain-daemon:

> [ERROR] Failed to get new memory reference with rc: 87
> [ERROR] Failed to allocate memory reference.
> [ERROR] Failed to get export ref for UUID: \<a UUID\> for request from node id 0
> [ERROR] Received import response from node id 0 with failure status 3

If the workload has 2 pods, deleting the recreating these 2 pods can mitigate, and during this process we don't need to recreate the ComputeDomain resource.

These multi-node NVLink workloads are running on GB200 and GB300 capacities using driver version 580.105.08

NCCL test results, and the fact that only restarting all workloads working suggest that it's unlikely to be related to hardware failure.

Is it possible to be a leaking of resource or process created by the ComputeDomain stack?

Please let me know if you need any additional information

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-Node NVLink failure due to memory reference error in compute-domain-daemon #948

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-Node NVLink failure due to memory reference error in compute-domain-daemon #948

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions