Conversation

Member

@karthikvetrivel karthikvetrivel commented Nov 17, 2025

Description

Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).

Changes

  • internal/kubernetes/claim_cache.go (new): Implements ResourceClaimCache, which watches ResourceClaim objects and maintains a map of pod UIDs that have allocated NVIDIA GPU claims. Uses informers with O(1) pod UID lookups (a rough sketch follows this change list).

  • internal/kubernetes/client.go:

    • Adds claimCache to the Client struct
    • Updates podUsesGPU() to check both traditional resources AND DRA ResourceClaims
    • Cache is started on client creation and synced before any operations
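
For reviewers skimming the diff, here is a minimal sketch of the idea behind the two changes above. It is illustrative only, not the PR's actual code: the type and function names, the gpu.nvidia.com driver constant, and the use of client-go's resource.k8s.io/v1 informers (available in client-go v0.34+) are assumptions, and a real cache would also handle claim deletions and shutdown.

package kubernetes

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	resourcev1 "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	k8sclient "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const nvidiaDRADriver = "gpu.nvidia.com"

// ResourceClaimCache watches ResourceClaims and remembers which pod UIDs
// currently reserve a claim allocated by the NVIDIA DRA driver.
type ResourceClaimCache struct {
	mu      sync.RWMutex
	podUIDs map[types.UID]struct{}
}

// NewResourceClaimCache starts a ResourceClaim informer and blocks until the
// initial sync completes, so lookups are valid before any eviction logic runs.
func NewResourceClaimCache(ctx context.Context, clientset k8sclient.Interface) (*ResourceClaimCache, error) {
	c := &ResourceClaimCache{podUIDs: map[types.UID]struct{}{}}

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	informer := factory.Resource().V1().ResourceClaims().Informer()
	if _, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    c.record,
		UpdateFunc: func(_, newObj interface{}) { c.record(newObj) },
		// A real implementation would also use DeleteFunc to drop stale pod UIDs.
	}); err != nil {
		return nil, err
	}

	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return nil, fmt.Errorf("ResourceClaim cache failed to sync")
	}
	return c, nil
}

// record stores the UID of every pod listed in reservedFor when the claim's
// allocation came from the NVIDIA DRA driver.
func (c *ResourceClaimCache) record(obj interface{}) {
	claim, ok := obj.(*resourcev1.ResourceClaim)
	if !ok || claim.Status.Allocation == nil {
		return
	}
	allocatedByNvidia := false
	for _, result := range claim.Status.Allocation.Devices.Results {
		if result.Driver == nvidiaDRADriver {
			allocatedByNvidia = true
			break
		}
	}
	if !allocatedByNvidia {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, ref := range claim.Status.ReservedFor {
		if ref.Resource == "pods" {
			c.podUIDs[ref.UID] = struct{}{}
		}
	}
}

// PodHasGPUClaim is the O(1) lookup used below.
func (c *ResourceClaimCache) PodHasGPUClaim(uid types.UID) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.podUIDs[uid]
	return ok
}

// podUsesGPU treats a pod as a GPU pod if it requests nvidia.com/gpu the
// traditional way or holds an allocated NVIDIA DRA ResourceClaim.
func podUsesGPU(pod *corev1.Pod, claims *ResourceClaimCache) bool {
	for _, ctr := range pod.Spec.Containers {
		if _, ok := ctr.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
	}
	return claims.PodHasGPUClaim(pod.UID)
}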

Testing

Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:

  1. Created a test workload:

    • DRA GPU pod with an allocated ResourceClaim (driver: gpu.nvidia.com)
  2. Verified ResourceClaim allocation:

    $ kubectl get resourceclaim -n default dra-gpu-claim -o yaml
    status:
      allocation:
        devices:
          results:
          - driver: gpu.nvidia.com
            device: gpu-0
            pool: ipp1-0744
      reservedFor:
      - name: dra-allocated-pod
        resource: pods
  3. Verified ResourceClaim cache synced:

    level=info msg=ResourceClaim cache synced successfully
    
  4. Triggered driver upgrade eviction:

    level=info msg=Identifying GPU pods to delete
    level=info msg=GPU pod - default/dra-allocated-pod
    level=info msg=Deleting GPU pods...
    evicting pod default/dra-allocated-pod
    
  5. Verified DRA pod evicted successfully:

    $ kubectl get pods -n default
    NAME                  READY   STATUS    RESTARTS   AGE

Contributor

@shivamerla shivamerla left a comment


LGTM

@karthikvetrivel karthikvetrivel marked this pull request as ready for review November 18, 2025 13:16
@rahulait

> I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?

Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 65a3f53 to 43d29cc on November 25, 2025 19:26
@karthikvetrivel
Member Author

> I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?
>
> Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.

I think you're right here. Good point, thanks for bringing it up.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch 2 times, most recently from 6e1a6fb to 0682513 on November 25, 2025 20:32
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch 2 times, most recently from a355d89 to 9c7ed23 on November 26, 2025 18:20
Collaborator

@cdesiniotis cdesiniotis left a comment


I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:

  1. Do we want to merge this change (and get it included into a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
  2. We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) is handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified gpu pod filter (one that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller.

@karthikvetrivel
Member Author

> I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:
>
>   1. Do we want to merge this change (and get it included into a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
>   2. We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) is handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified gpu pod filter (one that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller.

@cdesiniotis

  1. Yes, I believe we should for the reasons you mentioned.

  2. Good point--yeah, I see the same gpuPodSpecFilter in gpu-operator. Once we get this PR approved, I'll open another to make the same changes in that repo.
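
Just to make that concrete, a rough sketch of what the modified filter could look like, reusing the hypothetical ResourceClaimCache from the description above. The function name, the closure shape, and how the cache would be plumbed into the upgrade controller are all assumptions, not the gpu-operator's actual wiring.

// draAwareGPUPodFilter returns a pod filter that keeps the existing
// nvidia.com/gpu check and additionally matches pods holding an allocated
// NVIDIA DRA ResourceClaim. Illustrative only.
func draAwareGPUPodFilter(claims *ResourceClaimCache) func(pod corev1.Pod) bool {
	return func(pod corev1.Pod) bool {
		// Existing behavior: device-plugin style resource requests.
		for _, ctr := range pod.Spec.Containers {
			if _, ok := ctr.Resources.Limits["nvidia.com/gpu"]; ok {
				return true
			}
		}
		// New behavior: pods reserving an NVIDIA GPU via DRA.
		return claims.PodHasGPUClaim(pod.UID)
	}
}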


var claim *resourcev1.ResourceClaim
var lastError error
_ = wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
Contributor


I don't understand why we are not consuming the error returned by wait.PollUntilContextTimeout here?

Member Author


You're right, we should. The error handling here became a bit convoluted through the refactors. I will update this once we decide how to search and clean up the claims (claims --> pods vs. pods --> claims).
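
For what it's worth, one way the poll error could be surfaced, continuing the snippet above. This is a sketch, not the PR's code: the getResourceClaim name, the clientset field, the namespace/name parameters, and the ResourceV1() typed client (client-go v0.34+) are assumptions, and it relies on the surrounding file's wait, metav1, resourcev1, fmt, and time imports.

func (c *Client) getResourceClaim(namespace, name string, timeout time.Duration) (*resourcev1.ResourceClaim, error) {
	var claim *resourcev1.ResourceClaim
	var lastError error
	err := wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		got, getErr := c.clientset.ResourceV1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
		if getErr != nil {
			lastError = getErr // remember the most recent API error
			return false, nil  // keep polling on transient errors
		}
		claim = got
		return true, nil
	})
	if err != nil {
		// Propagate the poll error instead of discarding it.
		return nil, fmt.Errorf("waiting for ResourceClaim %s/%s: %w (last error: %v)", namespace, name, err, lastError)
	}
	return claim, nil
}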


guptaNswati commented Dec 2, 2025

There isn't much detail on best practices for cleaning up claims managed by a DRA driver beyond this two-liner. But what Kevin was saying in the meeting makes sense: iterate over all gpu.nvidia.com claims, identify the pods referencing them, and evict those pods, primarily because a claim can exist beyond the lifetime of a Pod and can be shared among multiple pods.

I was also curious to look into this from the extended resources perspective.
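
For illustration, a rough sketch of that claims-to-pods direction. The function name is made up, and the list call assumes the resource.k8s.io/v1 typed client in client-go v0.34+ plus the same context, metav1, types, and k8sclient imports used in the sketch in the description.

// podsReservingNvidiaClaims lists every ResourceClaim allocated by the
// gpu.nvidia.com driver and collects the UIDs of pods currently reserving it.
// The pod set comes from status.reservedFor because a claim can outlive a pod
// and be shared by several pods.
func podsReservingNvidiaClaims(ctx context.Context, clientset k8sclient.Interface) (map[types.UID]struct{}, error) {
	podUIDs := map[types.UID]struct{}{}
	claims, err := clientset.ResourceV1().ResourceClaims(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, claim := range claims.Items {
		if claim.Status.Allocation == nil {
			continue // ignore unallocated claims
		}
		nvidia := false
		for _, result := range claim.Status.Allocation.Devices.Results {
			if result.Driver == "gpu.nvidia.com" {
				nvidia = true
				break
			}
		}
		if !nvidia {
			continue
		}
		for _, ref := range claim.Status.ReservedFor {
			if ref.Resource == "pods" {
				podUIDs[ref.UID] = struct{}{}
			}
		}
	}
	return podUIDs, nil
}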

@karthikvetrivel karthikvetrivel marked this pull request as draft December 12, 2025 21:11
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from fc6bd1f to 9fc355a on December 12, 2025 21:13
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 17, 2025 23:21
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 9fc355a to b76086f on December 18, 2025 00:08
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from b76086f to d25e32a on December 18, 2025 16:57