Labels: kind/bug (Something isn't working)
Description
What happened:
When CUDA_DISABLE_CONTROL=true is set and nvidia.com/gpumem=XXX is requested, the Pod shares the GPU with other Pods. This leads to GPU contention among multiple Pods, resulting in performance degradation and Pod startup failures due to insufficient GPU memory.
What you expected to happen:
The Pod should NOT share the GPU with other Pods when CUDA_DISABLE_CONTROL=true and nvidia.com/gpumem=XXX are set.
How to reproduce it (as minimally and precisely as possible):
1. Add the following env to the deployment:
   env:
   - name: CUDA_DISABLE_CONTROL
     value: "true"
2. Add a small nvidia.com/gpumem request in resources:
   resources:
     limits:
       cpu: "8"
       memory: 40Gi
       nvidia.com/gpu: "1"
       nvidia.com/gpumem: "4096"
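For reference, the two steps above can be combined into a minimal Deployment sketch. The name gpu-test, the container name, the image, and the labels are hypothetical placeholders; only the env entry and the resource limits come from this report:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test                  # hypothetical name, not from the report
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: cuda                # hypothetical container name
        image: nvidia/cuda:12.4.0-base-ubuntu22.04   # placeholder image
        env:
        - name: CUDA_DISABLE_CONTROL      # step 1 from the report
          value: "true"
        resources:
          limits:
            cpu: "8"
            memory: 40Gi
            nvidia.com/gpu: "1"           # step 2 from the report
            nvidia.com/gpumem: "4096"
```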
Anything else we need to know?:
- The output of nvidia-smi -a on your host
- Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version: v2.7.0
- nvidia driver or other AI device driver version: 580.82.07
- Docker version from docker version: v26.0.0
- Docker command, image and tag used:
- Kernel version from uname -a: Ubuntu 22.04.2 LTS 6.8.0-1037-nvidia
- Others: