
The Pod is sharing the GPU with other Pods when CUDA_DISABLE_CONTROL=true and nvidia.com/gpumem=XXX #1678

@mospany

Description

What happened:
When `CUDA_DISABLE_CONTROL=true` is set together with `nvidia.com/gpumem=XXX`, the Pod still shares the GPU with other Pods. This causes GPU contention among multiple Pods, resulting in performance degradation and Pod startup failures due to insufficient GPU memory.

What you expected to happen:
The Pod should NOT share the GPU with other Pods when `CUDA_DISABLE_CONTROL=true` and `nvidia.com/gpumem=XXX` are set.

How to reproduce it (as minimally and precisely as possible):
1. Add the following env var to the deployment:

```yaml
- name: CUDA_DISABLE_CONTROL
  value: "true"
```

2. Request a small `nvidia.com/gpumem` amount in `resources`:

```yaml
resources:
  limits:
    cpu: "8"
    memory: 40Gi
    nvidia.com/gpu: "1"
    nvidia.com/gpumem: "4096"
```
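
For clarity, the two steps above can be combined into one minimal Deployment sketch. The Deployment name, labels, container name, and image below are illustrative placeholders, not values from the original report:

```yaml
# Minimal sketch combining the two reproduction steps above.
# metadata.name, labels, container name, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpumem-repro            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpumem-repro
  template:
    metadata:
      labels:
        app: gpumem-repro
    spec:
      containers:
      - name: cuda-app          # placeholder container name
        image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
        env:
        - name: CUDA_DISABLE_CONTROL   # step 1: disable HAMi in-container control
          value: "true"
        resources:
          limits:               # step 2: small gpumem request alongside one GPU
            cpu: "8"
            memory: 40Gi
            nvidia.com/gpu: "1"
            nvidia.com/gpumem: "4096"
```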

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.7.0
  • nvidia driver or other AI device driver version: 580.82.07
  • Docker version from docker version: v26.0.0
  • Docker command, image and tag used
  • Kernel version from uname -a: Ubuntu 22.04.2 LTS 6.8.0-1037-nvidia
  • Others:

Labels: kind/bug (Something isn't working)