Multi-Node NVLink failure due to memory reference error in compute-domain-daemon #948

@Wennn

Description

We are using version 25.8.1. When MNNVL stops working in a workload deployed as a Deployment or StatefulSet, I see the following errors in compute-domain-daemon:

[ERROR] Failed to get new memory reference with rc: 87
[ERROR] Failed to allocate memory reference.
[ERROR] Failed to get export ref for UUID: <a UUID> for request from node id 0
[ERROR] Received import response from node id 0 with failure status 3
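
To check how often these signatures show up in collected daemon logs, something like the following could be used. This is only a sketch: the regular expressions are copied from the four lines above, while `SIGNATURES`, `SAMPLE`, and `summarize` are hypothetical names made up for this example and are not part of the ComputeDomain stack.

```python
import re
from collections import Counter

# Error signatures taken from the compute-domain-daemon lines quoted above.
# The dictionary keys are illustrative labels, not official error names.
SIGNATURES = {
    "memref_rc": re.compile(r"Failed to get new memory reference with rc: (\d+)"),
    "memref_alloc": re.compile(r"Failed to allocate memory reference"),
    "export_ref": re.compile(
        r"Failed to get export ref for UUID: (.+?) for request from node id (\d+)"
    ),
    "import_fail": re.compile(
        r"Received import response from node id (\d+) with failure status (\d+)"
    ),
}

# The four lines from this report, used as sample input.
SAMPLE = """\
[ERROR] Failed to get new memory reference with rc: 87
[ERROR] Failed to allocate memory reference.
[ERROR] Failed to get export ref for UUID: <a UUID> for request from node id 0
[ERROR] Received import response from node id 0 with failure status 3
"""


def summarize(log_text):
    """Count how many [ERROR] lines match each known signature."""
    counts = Counter()
    for line in log_text.splitlines():
        if "[ERROR]" not in line:
            continue
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line matches at most one signature
    return counts


if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(name, count)
```

Running it on log text saved from an affected node and from a healthy node should make it easy to compare the two.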

If the workload has 2 pods, deleting and recreating both pods mitigates the issue, and we don't need to recreate the ComputeDomain resource during this process.

These multi-node NVLink workloads are running on GB200 and GB300 capacity with driver version 580.105.08.

NCCL test results, and the fact that only restarting all workloads works, suggest that this is unlikely to be related to a hardware failure.

Could this be a leak of a resource or process created by the ComputeDomain stack?

Please let me know if you need any additional information.

Thanks!

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
