Conversation

Member

@karthikvetrivel karthikvetrivel commented Nov 17, 2025

Description

Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).

Changes

  • internal/kubernetes/claim_cache.go (new): Implements ResourceClaimCache, which watches ResourceClaim objects and maintains a map of pod UIDs that have allocated NVIDIA GPU claims. Uses informers with O(1) pod UID lookups (a rough sketch follows this change list).

  • internal/kubernetes/client.go:

    • Adds claimCache to the Client struct
    • Updates podUsesGPU() to check both traditional resources AND DRA ResourceClaims
    • Cache is started on client creation and synced before any operations
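
For reviewers skimming the diff, here is a minimal sketch of the idea behind the two changes above. It is illustrative only, not the PR's actual code: the type and function names, the gpu.nvidia.com driver constant, and the use of client-go's resource.k8s.io/v1 informers (available in client-go v0.34+) are assumptions, and a real cache would also handle claim deletions and shutdown.

package kubernetes

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	resourcev1 "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	k8sclient "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const nvidiaDRADriver = "gpu.nvidia.com"

// ResourceClaimCache watches ResourceClaims and remembers which pod UIDs
// currently reserve a claim allocated by the NVIDIA DRA driver.
type ResourceClaimCache struct {
	mu      sync.RWMutex
	podUIDs map[types.UID]struct{}
}

// NewResourceClaimCache starts a ResourceClaim informer and blocks until the
// initial sync completes, so lookups are valid before any eviction logic runs.
func NewResourceClaimCache(ctx context.Context, clientset k8sclient.Interface) (*ResourceClaimCache, error) {
	c := &ResourceClaimCache{podUIDs: map[types.UID]struct{}{}}

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	informer := factory.Resource().V1().ResourceClaims().Informer()
	if _, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    c.record,
		UpdateFunc: func(_, newObj interface{}) { c.record(newObj) },
		// A real implementation would also use DeleteFunc to drop stale pod UIDs.
	}); err != nil {
		return nil, err
	}

	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return nil, fmt.Errorf("ResourceClaim cache failed to sync")
	}
	return c, nil
}

// record stores the UID of every pod listed in reservedFor when the claim's
// allocation came from the NVIDIA DRA driver.
func (c *ResourceClaimCache) record(obj interface{}) {
	claim, ok := obj.(*resourcev1.ResourceClaim)
	if !ok || claim.Status.Allocation == nil {
		return
	}
	allocatedByNvidia := false
	for _, result := range claim.Status.Allocation.Devices.Results {
		if result.Driver == nvidiaDRADriver {
			allocatedByNvidia = true
			break
		}
	}
	if !allocatedByNvidia {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, ref := range claim.Status.ReservedFor {
		if ref.Resource == "pods" {
			c.podUIDs[ref.UID] = struct{}{}
		}
	}
}

// PodHasGPUClaim is the O(1) lookup used below.
func (c *ResourceClaimCache) PodHasGPUClaim(uid types.UID) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.podUIDs[uid]
	return ok
}

// podUsesGPU treats a pod as a GPU pod if it requests nvidia.com/gpu the
// traditional way or holds an allocated NVIDIA DRA ResourceClaim.
func podUsesGPU(pod *corev1.Pod, claims *ResourceClaimCache) bool {
	for _, ctr := range pod.Spec.Containers {
		if _, ok := ctr.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
	}
	return claims.PodHasGPUClaim(pod.UID)
}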

Testing

Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:

  1. Created a test workload:

    • DRA GPU pod with an allocated ResourceClaim (driver: gpu.nvidia.com)
  2. Verified ResourceClaim allocation:

    $ kubectl get resourceclaim -n default dra-gpu-claim -o yaml
    status:
      allocation:
        devices:
          results:
          - driver: gpu.nvidia.com
            device: gpu-0
            pool: ipp1-0744
      reservedFor:
      - name: dra-allocated-pod
        resource: pods
  3. Verified ResourceClaim cache synced:

    level=info msg=ResourceClaim cache synced successfully
    
  4. Triggered driver upgrade eviction:

    level=info msg=Identifying GPU pods to delete
    level=info msg=GPU pod - default/dra-allocated-pod
    level=info msg=Deleting GPU pods...
    evicting pod default/dra-allocated-pod
    
  5. Verified DRA pod evicted successfully:

    $ kubectl get pods -n default
    NAME                  READY   STATUS    RESTARTS   AGE

Contributor

@shivamerla shivamerla left a comment


LGTM

@karthikvetrivel karthikvetrivel marked this pull request as ready for review November 18, 2025 13:16
@rahulait

> I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?

Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 65a3f53 to 43d29cc on November 25, 2025 19:26
@karthikvetrivel
Member Author

> I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?
>
> Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.

I think you're right here. Good point, thanks for bringing it up.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch 2 times, most recently from 6e1a6fb to 0682513 on November 25, 2025 20:32
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch 2 times, most recently from a355d89 to 9c7ed23 on November 26, 2025 18:20
Collaborator

@cdesiniotis cdesiniotis left a comment


I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:

  1. Do we want to merge this change (and get it included into a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
  2. We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) is handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified gpu pod filter (one that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller.

@karthikvetrivel
Member Author

> I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:
>
>   1. Do we want to merge this change (and get it included into a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
>   2. We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) is handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified gpu pod filter (one that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller.

@cdesiniotis

  1. Yes, I believe we should for the reasons you mentioned.

  2. Good point--yeah, I see the same gpuPodSpecFilter in gpu-operator. Once we get this PR approved, I'll open another to make the same changes in that repo.
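
Just to make that concrete, a rough sketch of what the modified filter could look like, reusing the hypothetical ResourceClaimCache from the description above. The function name, the closure shape, and how the cache would be plumbed into the upgrade controller are all assumptions, not the gpu-operator's actual wiring.

// draAwareGPUPodFilter returns a pod filter that keeps the existing
// nvidia.com/gpu check and additionally matches pods holding an allocated
// NVIDIA DRA ResourceClaim. Illustrative only.
func draAwareGPUPodFilter(claims *ResourceClaimCache) func(pod corev1.Pod) bool {
	return func(pod corev1.Pod) bool {
		// Existing behavior: device-plugin style resource requests.
		for _, ctr := range pod.Spec.Containers {
			if _, ok := ctr.Resources.Limits["nvidia.com/gpu"]; ok {
				return true
			}
		}
		// New behavior: pods reserving an NVIDIA GPU via DRA.
		return claims.PodHasGPUClaim(pod.UID)
	}
}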


var claim *resourcev1.ResourceClaim
var lastError error
_ = wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
Contributor


I don't understand why we are not consuming the error returned by wait.PollUntilContextTimeout here?

Member Author


You're right, we should. The error handling here became a bit convoluted through the refactors. I will update this once we decide how to search and clean up the claims (claims --> pods vs. pods --> claims).
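
For what it's worth, one way the poll error could be surfaced, continuing the snippet above. This is a sketch, not the PR's code: the getResourceClaim name, the clientset field, the namespace/name parameters, and the ResourceV1() typed client (client-go v0.34+) are assumptions, and it relies on the surrounding file's wait, metav1, resourcev1, fmt, and time imports.

func (c *Client) getResourceClaim(namespace, name string, timeout time.Duration) (*resourcev1.ResourceClaim, error) {
	var claim *resourcev1.ResourceClaim
	var lastError error
	err := wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		got, getErr := c.clientset.ResourceV1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
		if getErr != nil {
			lastError = getErr // remember the most recent API error
			return false, nil  // keep polling on transient errors
		}
		claim = got
		return true, nil
	})
	if err != nil {
		// Propagate the poll error instead of discarding it.
		return nil, fmt.Errorf("waiting for ResourceClaim %s/%s: %w (last error: %v)", namespace, name, err, lastError)
	}
	return claim, nil
}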


guptaNswati commented Dec 2, 2025

There isn't much detail on best practices for cleaning up claims managed by a DRA driver beyond this two-liner. But what Kevin was saying in the meeting makes sense: iterate over all gpu.nvidia.com claims, identify the pods referencing them, and evict those pods, primarily because a claim can exist beyond the lifetime of a Pod and can be shared among multiple pods.

I was also curious to look into this from the extended resources perspective.
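
For illustration, a rough sketch of that claims-to-pods direction. The function name is made up, and the list call assumes the resource.k8s.io/v1 typed client in client-go v0.34+ plus the same context, metav1, types, and k8sclient imports used in the sketch in the description.

// podsReservingNvidiaClaims lists every ResourceClaim allocated by the
// gpu.nvidia.com driver and collects the UIDs of pods currently reserving it.
// The pod set comes from status.reservedFor because a claim can outlive a pod
// and be shared by several pods.
func podsReservingNvidiaClaims(ctx context.Context, clientset k8sclient.Interface) (map[types.UID]struct{}, error) {
	podUIDs := map[types.UID]struct{}{}
	claims, err := clientset.ResourceV1().ResourceClaims(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, claim := range claims.Items {
		if claim.Status.Allocation == nil {
			continue // ignore unallocated claims
		}
		nvidia := false
		for _, result := range claim.Status.Allocation.Devices.Results {
			if result.Driver == "gpu.nvidia.com" {
				nvidia = true
				break
			}
		}
		if !nvidia {
			continue
		}
		for _, ref := range claim.Status.ReservedFor {
			if ref.Resource == "pods" {
				podUIDs[ref.UID] = struct{}{}
			}
		}
	}
	return podUIDs, nil
}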

@karthikvetrivel karthikvetrivel marked this pull request as draft December 12, 2025 21:11
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from fc6bd1f to 9fc355a on December 12, 2025 21:13
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 17, 2025 23:21
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 9fc355a to b76086f on December 18, 2025 00:08
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from b76086f to d25e32a on December 18, 2025 16:57