Kubelet UnexpectedAdmissionError after successful scheduling by hami-scheduler on non-MIG GPU #110
Open
Description
Summary
When using HAMi for fractional GPU sharing on a non-MIG NVIDIA GPU, a second pod requesting a fractional GPU fails to start with an UnexpectedAdmissionError. The hami-scheduler successfully binds the pod to the node, but the kubelet fails to allocate resources.
Steps to reproduce
- Start a Kubernetes cluster with a single node containing an NVIDIA RTX A4000 GPU (or a similar non-MIG GPU).
- Deploy the hami-scheduler and hami-device-plugin.
- Deploy a pod (gpu-workload-long) with a fractional GPU resource request, for example 30% of the SM cores.
- Immediately afterwards, deploy a second pod (gpu-workload-short) with a fractional GPU resource request, for example 70% of the SM cores.
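The two pod specs from the steps above could look roughly like the sketch below. It builds both manifests as JSON, which `kubectl apply -f -` accepts. The resource names `nvidia.com/gpu` and `nvidia.com/gpucores`, the explicit `schedulerName`, the container image, and the `sleep` command are assumptions made for illustration and are not taken from the original report. Adjust them to match the actual deployment.

```python
import json


def fractional_gpu_pod(name: str, sm_percent: int) -> dict:
    """Build a Pod manifest requesting one vGPU limited to sm_percent of SM cores.

    Assumptions (not from the report): HAMi-style resource names, an explicit
    schedulerName, and a placeholder CUDA image and command.
    """
    if not 0 < sm_percent <= 100:
        raise ValueError("sm_percent must be between 1 and 100")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed: route the pod through hami-scheduler explicitly.
            "schedulerName": "hami-scheduler",
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    # Placeholder image; replace with the real workload image.
                    "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",
                    "command": ["sleep", "3600"],
                    "resources": {
                        "limits": {
                            # One (virtual) GPU device per pod.
                            "nvidia.com/gpu": "1",
                            # Assumed resource name for the SM-core percentage.
                            "nvidia.com/gpucores": str(sm_percent),
                        }
                    },
                }
            ],
        },
    }


# The two pods from the reproduction steps: 30% and 70% of SM cores.
long_pod = fractional_gpu_pod("gpu-workload-long", 30)
short_pod = fractional_gpu_pod("gpu-workload-short", 70)

if __name__ == "__main__":
    # Emit a v1 List so both pods are created back to back by one apply.
    print(json.dumps({"apiVersion": "v1", "kind": "List",
                      "items": [long_pod, short_pod]}, indent=2))
```

Saved under a hypothetical file name such as `make_pods.py`, the output can be piped straight into the cluster with `python make_pods.py | kubectl apply -f -`, which submits the second pod immediately after the first.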
Expected Behavior
The second pod (gpu-workload-short) should be successfully scheduled and started on the same node, running concurrently with the first pod.
Actual Behavior
- The hami-scheduler successfully evaluates the second pod and binds it to the node.
- The kubelet on the node attempts to start the second pod but fails.
- The pod enters an UnexpectedAdmissionError state.
- The kubectl describe output for the failing pod shows Requested: 1, Available: 0, indicating that the hami-device-plugin reported zero available devices at the time of allocation.
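When collecting evidence across several failing pods, the requested and available device counts can be pulled out of the `kubectl describe` text with a small helper. This is a minimal sketch that assumes the message contains the fragment `Requested: N, Available: M` as quoted above. The function name `parse_device_counts` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "Requested: N, Available: M" fragment quoted in this report.
_COUNTS = re.compile(r"Requested:\s*(\d+),\s*Available:\s*(\d+)")


def parse_device_counts(message: str) -> Optional[Tuple[int, int]]:
    """Return (requested, available) from an admission error message, or None."""
    match = _COUNTS.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    counts = parse_device_counts("Requested: 1, Available: 0")
    if counts is not None:
        requested, available = counts
        print(f"requested={requested} available={available}")
```

A result where the requested count exceeds the available count matches the failure described here: the kubelet's device manager saw fewer devices than the pod asked for at admission time.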
Environment
Kubernetes Version:
- Client Version: v1.33.2
- Server Version: v1.21.14
GPU Model: NVIDIA RTX A4000
Driver Version: 575.57.08