
The Pod is sharing the GPU with other Pods when CUDA_DISABLE_CONTROL=true and nvidia.com/gpumem=XXX #1678

@mospany

Description

What happened:
When `CUDA_DISABLE_CONTROL=true` is set together with `nvidia.com/gpumem=XXX`, the Pod still shares the GPU with other Pods. This causes GPU contention among multiple Pods, resulting in performance degradation and Pod startup failures due to insufficient GPU memory.

What you expected to happen:
The Pod should NOT share the GPU with other Pods when `CUDA_DISABLE_CONTROL=true` and `nvidia.com/gpumem=XXX` are set.

How to reproduce it (as minimally and precisely as possible):
1. Add the following env var to the deployment:

```yaml
- name: CUDA_DISABLE_CONTROL
  value: "true"
```

2. Request a small `nvidia.com/gpumem` amount in `resources`:

```yaml
resources:
  limits:
    cpu: "8"
    memory: 40Gi
    nvidia.com/gpu: "1"
    nvidia.com/gpumem: "4096"
```
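
For clarity, the two steps above can be combined into one minimal Deployment sketch. The Deployment name, labels, container name, and image below are illustrative placeholders, not values from the original report:

```yaml
# Minimal sketch combining the two reproduction steps above.
# metadata.name, labels, container name, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpumem-repro            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpumem-repro
  template:
    metadata:
      labels:
        app: gpumem-repro
    spec:
      containers:
      - name: cuda-app          # placeholder container name
        image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
        env:
        - name: CUDA_DISABLE_CONTROL   # step 1: disable HAMi in-container control
          value: "true"
        resources:
          limits:               # step 2: small gpumem request alongside one GPU
            cpu: "8"
            memory: 40Gi
            nvidia.com/gpu: "1"
            nvidia.com/gpumem: "4096"
```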

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.7.0
  • nvidia driver or other AI device driver version: 580.82.07
  • Docker version from docker version: v26.0.0
  • Docker command, image and tag used
  • Kernel version from uname -a: Ubuntu 22.04.2 LTS 6.8.0-1037-nvidia
  • Others:

Labels: kind/bug (Something isn't working)