Kubelet UnexpectedAdmissionError after successful scheduling by hami-scheduler on non-MIG GPU #110

@pradnyargaykar


Summary
When using HAMi for fractional GPU sharing on a non-MIG NVIDIA GPU, a second pod requesting a fractional GPU fails to start with an UnexpectedAdmissionError. The hami-scheduler successfully binds the pod to the node, but the kubelet fails to allocate resources.

Steps to reproduce

  1. Start a Kubernetes cluster with a single node containing an NVIDIA RTX A4000 GPU (or a similar non-MIG GPU).
  2. Deploy the hami-scheduler and hami-device-plugin.
  3. Deploy a pod (gpu-workload-long) that requests a fraction of the GPU, for example 30% of SM cores.
  4. Immediately afterwards, deploy a second pod (gpu-workload-short) that requests, for example, 70% of SM cores.
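
The pod manifests in the original report were attached as screenshots, so the following is only a sketch of what steps 3 and 4 might look like, not the reporter's actual YAML. It assumes HAMi's `nvidia.com/gpu` and `nvidia.com/gpucores` resource names. The helper name, the container image, and the `sleep` command are hypothetical placeholders.

```python
import json


def fractional_gpu_pod(name: str, sm_percent: int) -> dict:
    """Build a Pod manifest requesting one vGPU capped at sm_percent of SM cores."""
    if not 0 < sm_percent <= 100:
        raise ValueError("sm_percent must be in the range 1..100")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "workload",
                    # Hypothetical image: substitute the real workload image.
                    "image": "example.invalid/cuda-workload:latest",
                    "command": ["sleep", "3600"],
                    "resources": {
                        "limits": {
                            # Assumed HAMi resource names (not taken from the report).
                            "nvidia.com/gpu": "1",
                            "nvidia.com/gpucores": str(sm_percent),
                        }
                    },
                }
            ],
        },
    }


pods = [
    fractional_gpu_pod("gpu-workload-long", 30),
    fractional_gpu_pod("gpu-workload-short", 70),
]

if __name__ == "__main__":
    # kubectl also accepts JSON, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": pods}, indent=2))
```

Applying both pods back to back, as in step 4, is what exposes the failure described below.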

Expected Behavior
The second pod (gpu-workload-short) should be successfully scheduled and started on the same node, running concurrently with the first pod.

Actual Behavior

  1. The hami-scheduler successfully evaluates the second pod and binds it to the node.
  2. The kubelet on the node attempts to start the second pod but fails.
  3. The pod enters an UnexpectedAdmissionError state.
  4. The kubectl describe output for the failing pod shows Requested: 1, Available: 0, indicating that the hami-device-plugin reported zero available devices at the time of allocation.
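
As far as I understand, the `Requested: 1, Available: 0` text in item 4 is produced by the kubelet's device manager, which compares the request against the healthy devices a plugin has advertised minus those already allocated. The toy model below is a simplified sketch of that bookkeeping, not kubelet code, and all class and method names are hypothetical. It shows one way the symptom could arise: if only a single device is visible for the resource, the first pod consumes it and the second pod is rejected at admission even though the scheduler has already bound it. Whether this is the actual cause here is not established by the report.

```python
class AdmissionError(Exception):
    """Raised when fewer devices are free than the pod requests."""


class ToyDeviceManager:
    """Hypothetical, simplified stand-in for the kubelet's per-resource accounting."""

    def __init__(self, advertised_ids):
        self.healthy = set(advertised_ids)  # devices the plugin reported as healthy
        self.allocated = {}                 # pod name -> set of device IDs

    def available(self):
        in_use = set().union(*self.allocated.values())
        return self.healthy - in_use

    def admit(self, pod, requested):
        free = sorted(self.available())
        if len(free) < requested:
            raise AdmissionError(f"Requested: {requested}, Available: {len(free)}")
        self.allocated[pod] = set(free[:requested])


# Case 1: only one device is visible to the kubelet.
single = ToyDeviceManager(["GPU-0"])
single.admit("gpu-workload-long", 1)
try:
    single.admit("gpu-workload-short", 1)
    second_pod_result = "admitted"
except AdmissionError as err:
    second_pod_result = str(err)

# Case 2: the physical GPU is advertised as several virtual devices
# (a split count of 10 is an assumption for illustration).
shared = ToyDeviceManager([f"GPU-0-{i}" for i in range(10)])
shared.admit("gpu-workload-long", 1)
shared.admit("gpu-workload-short", 1)

print(second_pod_result)
print(len(shared.available()))
```

Comparing the node's advertised `nvidia.com/gpu` capacity and allocatable values against what the hami-scheduler believes is free would help tell these two cases apart.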

Relevant logs
(screenshot attached to the original issue)

POD YAML
(two screenshots attached to the original issue)

Environment
Kubernetes Version:

  • Client Version: v1.33.2
  • Server Version: v1.21.14
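
The client and server versions listed above are far apart. The sketch below quantifies the gap; the helper name is hypothetical. Kubernetes' version skew policy supports kubectl within one minor version of the API server, so a gap this large is worth ruling out, although nothing in this report shows that it contributes to the admission failure.

```python
import re


def minor_version(version: str) -> int:
    """Extract the minor component from a version string such as 'v1.33.2'."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.\d+)?.*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(2))


client = "v1.33.2"   # Client Version from the report
server = "v1.21.14"  # Server Version from the report

# Assumes both versions share the same major version (1).
skew = abs(minor_version(client) - minor_version(server))
print(f"minor-version skew: {skew}")
```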

GPU Model: NVIDIA RTX A4000

Driver Version: 575.57.08
