Description
Summary:
When MIG devices are removed, they are sometimes removed from arbitrary placements rather than in order (left to right, or the reverse). This can leave a GPU unable to create a MIG profile even though it has enough free resources, because the free slices are not in the right place.
For example:
# nvidia-smi mig -lgi -i 3
+---------------------------------------------------------+
| GPU instances:                                          |
| GPU   Name               Profile  Instance   Placement  |
|                            ID       ID       Start:Size |
|=========================================================|
|   3  MIG 1g.10gb           19        7          0:1     |
+---------------------------------------------------------+
|   3  MIG 1g.10gb           19        8          1:1     |
+---------------------------------------------------------+
|   3  MIG 1g.10gb           19       12          5:1     |
+---------------------------------------------------------+
# nvidia-smi mig -lgip -i 3
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   3  MIG 1g.10gb       19     4/7         9.75       No    14    1      0   |
|                                                             1    1      0   |
+-----------------------------------------------------------------------------+
|   3  MIG 1g.10gb+me    20     1/1         9.75       No    14    1      0   |
|                                                             1    1      1   |
+-----------------------------------------------------------------------------+
|   3  MIG 1g.20gb       15     2/4        19.62       No    14    1      0   |
|                                                             1    1      0   |
+-----------------------------------------------------------------------------+
|   3  MIG 2g.20gb       14     1/3        19.62       No    30    2      0   |
|                                                             2    2      0   |
+-----------------------------------------------------------------------------+
|   3  MIG 3g.40gb        9     0/2        39.50       No    46    3      0   |
|                                                             3    3      0   |
+-----------------------------------------------------------------------------+
|   3  MIG 4g.40gb        5     0/1        39.50       No    62    4      0   |
|                                                             4    4      0   |
+-----------------------------------------------------------------------------+
|   3  MIG 7g.80gb        0     0/1        79.25       No   114    7      0   |
|                                                             8    7      1   |
+-----------------------------------------------------------------------------+
Placing one of the 1g.10gb MIG devices at Placement:Start = 5 leaves only 3 free compute positions between it and the previous MIG device, which makes the 4g.40gb profile unusable (it could have been usable had the device been placed at Placement:Start = 2). Why the 3g.40gb profile is also unusable is unclear to me (maybe the memory slices cannot overlap?).
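To make the placement reasoning above easier to check, here is a minimal sketch that counts which placements are still free for each profile, given the occupied Start:Size pairs. The `PLACEMENTS` table is an assumption (it is not taken from this GPU's output and should be verified with `nvidia-smi mig -lgipp`), and the function names are purely illustrative. The 1g.10gb+me profile is left out because its limit is not only a matter of placement.

```python
# ASSUMPTION: hypothetical placement table, profile -> (allowed starts, size).
# Verify against `nvidia-smi mig -lgipp` on the actual hardware.
PLACEMENTS = {
    "1g.10gb": ((0, 1, 2, 3, 4, 5, 6), 1),
    "1g.20gb": ((0, 2, 4, 6), 2),
    "2g.20gb": ((0, 2, 4), 2),
    "3g.40gb": ((0, 4), 4),
    "4g.40gb": ((0,), 4),
    "7g.80gb": ((0,), 8),
}


def occupied_slots(instances):
    """Expand (start, size) pairs into the set of occupied slot indices."""
    slots = set()
    for start, size in instances:
        slots.update(range(start, start + size))
    return slots


def free_placements(instances):
    """Return, per profile, the allowed starts whose whole window is free."""
    used = occupied_slots(instances)
    return {
        profile: [s for s in starts if used.isdisjoint(range(s, s + size))]
        for profile, (starts, size) in PLACEMENTS.items()
    }


# Placements reported for GPU 3 in this issue, and a left-packed alternative.
reported = [(0, 1), (1, 1), (5, 1)]
packed = [(0, 1), (1, 1), (2, 1)]

for label, instances in (("reported", reported), ("packed", packed)):
    print(label)
    for profile, starts in free_placements(instances).items():
        print(f"  {profile}: {len(starts)} free, starts {starts}")
```

Under the assumed table, the free counts for the reported layout match the Free column above (4, 2, 1, 0, 0, 0). The table would also explain the 3g.40gb result: its two candidate windows each span 4 slots (0-3 and 4-7), and both contain an occupied slot. With the left-packed layout, the 3g.40gb window at start 4 becomes free, while 4g.40gb stays blocked by the instances at starts 0 and 1 because its only assumed placement begins at start 0.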
Steps to reproduce:
Create a Deployment with 28 pods, each requesting a 1g.10gb MIG profile:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: abstract-mig-claim
  namespace: nvidia-dra-driver-gpu
  labels:
    app: abstract-mig-claim
spec:
  replicas: 28
  selector:
    matchLabels:
      app: abstract-mig-claim
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: abstract-mig-claim
    spec:
      restartPolicy: Always
      containers:
      - name: abstract-mig-claiming-pod
        image: cuda:13.1.1-runtime-ubuntu24.04
        command: ["sleep", "6000"]
        resources:
          claims:
          - name: mig-device
            request: mig-10gb
      resourceClaims:
      - name: mig-device
        resourceClaimTemplateName: at-least-10gb-mig-template
Reduce spec.replicas to 14 and check the nvidia-smi mig -lgi output:
+---------------------------------------------------------+
| GPU instances:                                          |
| GPU   Name               Profile  Instance   Placement  |
|                            ID       ID       Start:Size |
|=========================================================|
|   0  MIG 1g.10gb           19        7          4:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19        8          5:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19        9          6:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19       11          0:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19       12          1:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19       13          2:1     |
+---------------------------------------------------------+
|   0  MIG 1g.10gb           19       14          3:1     |
+---------------------------------------------------------+
|   1  MIG 1g.10gb           19        8          5:1     |
+---------------------------------------------------------+
|   1  MIG 1g.10gb           19       14          3:1     |
+---------------------------------------------------------+
|   2  MIG 1g.10gb           19       11          0:1     |
+---------------------------------------------------------+
|   2  MIG 1g.10gb           19       12          1:1     |
+---------------------------------------------------------+
|   3  MIG 1g.10gb           19        7          0:1     |
+---------------------------------------------------------+
|   3  MIG 1g.10gb           19        8          1:1     |
+---------------------------------------------------------+
|   3  MIG 1g.10gb           19       12          5:1     |
+---------------------------------------------------------+
Multiple MIG devices are left at out-of-order Placement:Start positions (GPU 1 and GPU 3 in the output above).
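For anyone trying to reproduce this, the following is a rough sketch of how the `nvidia-smi mig -lgi` output could be checked for gaps automatically. It assumes the row layout shown in this issue (GPU, name, profile ID, instance ID, Start:Size); the regular expression and the helper names `occupied_by_gpu` and `is_contiguous` are illustrative only, and "contiguous" here simply means the occupied slots form one unbroken run.

```python
import re
from collections import defaultdict

# Matches data rows such as "| 3 MIG 1g.10gb 19 12 5:1 |"; header and border
# lines do not match because they lack a leading numeric GPU index.
ROW = re.compile(r"\|\s*(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+):(\d+)\s*\|")


def occupied_by_gpu(lgi_output):
    """Map each GPU index to the set of slots occupied by its GPU instances."""
    slots = defaultdict(set)
    for match in ROW.finditer(lgi_output):
        gpu = int(match.group(1))
        start, size = int(match.group(5)), int(match.group(6))
        slots[gpu].update(range(start, start + size))
    return dict(slots)


def is_contiguous(slots):
    """True when the occupied slots form a single unbroken run."""
    return max(slots) - min(slots) + 1 == len(slots)


# Rows for GPUs 1 to 3, copied from the output in this issue.
SAMPLE = """
| 1 MIG 1g.10gb 19 8 5:1 |
| 1 MIG 1g.10gb 19 14 3:1 |
| 2 MIG 1g.10gb 19 11 0:1 |
| 2 MIG 1g.10gb 19 12 1:1 |
| 3 MIG 1g.10gb 19 7 0:1 |
| 3 MIG 1g.10gb 19 8 1:1 |
| 3 MIG 1g.10gb 19 12 5:1 |
"""

for gpu, slots in sorted(occupied_by_gpu(SAMPLE).items()):
    state = "contiguous" if is_contiguous(slots) else "fragmented"
    print(f"GPU {gpu}: slots {sorted(slots)} -> {state}")
```

In practice the sample string would be replaced by the captured command output. A contiguous run is not necessarily aligned to a larger profile's placement window, so this only flags the gaps described in this issue; it does not say which profiles remain creatable.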