Add DRA support for GPU pod eviction during driver upgrades #129
base: main
Conversation
shivamerla left a comment
LGTM
Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.
Force-pushed from 65a3f53 to 43d29cc
I think you're right here. Good point, thanks for bringing it up.
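For reference, cordoning marks the node unschedulable so the scheduler places no new GPU pods on it for the duration of the upgrade. A minimal client-go sketch (not taken from this repository; how the clientset and node name are obtained is illustrative):

```go
// Minimal sketch (not from this repo): cordon a node with client-go so the
// scheduler places no new pods on it during the driver upgrade.
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cordonNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %q: %w", nodeName, err)
	}
	if node.Spec.Unschedulable {
		return nil // already cordoned
	}
	node.Spec.Unschedulable = true
	_, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```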
Force-pushed from 6e1a6fb to 0682513
Force-pushed from a355d89 to 9c7ed23
cdesiniotis left a comment
I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:
- Do we want to merge this change (and get it included into a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
- We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) are handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified gpu pod filter (that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller.
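A rough sketch of what such a DRA-aware filter could look like; the filter type and helper names below are illustrative, not the actual gpu-operator API:

```go
// Illustrative only: the real filter type used by the gpu-operator's upgrade
// controller may differ. The idea is to wrap the existing nvidia.com/gpu based
// filter with an additional DRA-aware check.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

type podFilter func(pod corev1.Pod) bool

// withDRAClaims returns a filter that matches pods selected by the existing
// filter OR pods that declare resource claims which a (hypothetical) lookup
// resolves to the NVIDIA DRA driver.
func withDRAClaims(base podFilter, podHasNvidiaClaim func(pod corev1.Pod) bool) podFilter {
	return func(pod corev1.Pod) bool {
		if base(pod) {
			return true
		}
		// Pods using DRA reference claims in spec.resourceClaims.
		return len(pod.Spec.ResourceClaims) > 0 && podHasNvidiaClaim(pod)
	}
}
```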
Force-pushed from 9c7ed23 to fc6bd1f
internal/kubernetes/client.go (outdated)
    var claim *resourcev1.ResourceClaim
    var lastError error
    _ = wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
I don't understand why we are not consuming the error returned by wait.PollUntilContextTimeout here?
You're right, we should. The error handling here became a bit convoluted through the refactors. I will update this once we decide how to search for and clean up the claims (claims --> pods vs. pods --> claims).
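For example, something along these lines; only wait.PollUntilContextTimeout and the claim/lastError variables mirror the snippet above, while the getClaim lookup, the function signature, and the resource.k8s.io/v1 import are assumptions for the sketch:

```go
// Sketch: propagate both the poll error (timeout or cancelled context) and the
// last error observed inside the poll function.
package sketch

import (
	"context"
	"fmt"
	"time"

	resourcev1 "k8s.io/api/resource/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForClaim(
	ctx context.Context,
	timeout time.Duration,
	getClaim func(context.Context) (*resourcev1.ResourceClaim, error),
) (*resourcev1.ResourceClaim, error) {
	var claim *resourcev1.ResourceClaim
	var lastError error

	pollErr := wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		c, err := getClaim(ctx)
		if err != nil {
			lastError = err // remember the underlying failure and keep polling
			return false, nil
		}
		claim = c
		return true, nil
	})
	if pollErr != nil {
		// Surface both the poll error and the last error seen while polling.
		return nil, fmt.Errorf("waiting for ResourceClaim: %w (last error: %v)", pollErr, lastError)
	}
	return claim, nil
}
```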
There is not much detail on best practices for cleaning up claims managed by a DRA driver other than this two-liner. But what Kevin was saying in the meeting makes sense in terms of iterating over all of them; I was curious to look into it from the extended resources perspective.
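For context, the claims --> pods direction could look roughly like this. This is a sketch that assumes the resource.k8s.io/v1 API (GA in Kubernetes 1.34) and a client-go clientset exposing ResourceV1(); none of it is taken from the PR itself:

```go
// Sketch of the claims --> pods direction: list ResourceClaims allocated by
// the NVIDIA DRA driver and collect the UIDs of the pods reserving them.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const nvidiaDRADriver = "gpu.nvidia.com"

func podsWithNvidiaClaims(ctx context.Context, clientset kubernetes.Interface) (map[types.UID]struct{}, error) {
	podUIDs := map[types.UID]struct{}{}

	claims, err := clientset.ResourceV1().ResourceClaims(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, claim := range claims.Items {
		if claim.Status.Allocation == nil {
			continue // not allocated, nothing holds a GPU
		}
		servedByNvidia := false
		for _, result := range claim.Status.Allocation.Devices.Results {
			if result.Driver == nvidiaDRADriver {
				servedByNvidia = true
				break
			}
		}
		if !servedByNvidia {
			continue
		}
		// status.reservedFor lists the consumers (pods) of this claim.
		for _, ref := range claim.Status.ReservedFor {
			if ref.Resource == "pods" {
				podUIDs[ref.UID] = struct{}{}
			}
		}
	}
	return podUIDs, nil
}
```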
Force-pushed from fc6bd1f to 9fc355a
Force-pushed from 9fc355a to b76086f
Signed-off-by: Karthik Vetrivel <[email protected]>
Force-pushed from b76086f to d25e32a
Description
Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional
nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).
Changes
- internal/kubernetes/claim_cache.go (new): Implements ResourceClaimCache, which watches ResourceClaim objects and maintains a map of pod UIDs with allocated NVIDIA GPU claims. Uses informers with O(1) pod UID lookups.
- internal/kubernetes/client.go: Adds claimCache to the Client struct and updates podUsesGPU() to check both traditional resources and DRA ResourceClaims (a simplified sketch follows below).
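A simplified sketch of the combined check described above; the interface and method names mirror the description rather than the exact code in this PR:

```go
// Simplified sketch of the combined check: a pod "uses a GPU" if it requests
// nvidia.com/gpu via container resources, or if the claim cache has seen an
// allocated NVIDIA ResourceClaim reserved for its UID.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ResourceClaimCache is a stand-in for the cache described above.
type ResourceClaimCache interface {
	// PodHasNvidiaClaim reports whether an allocated ResourceClaim backed by
	// the NVIDIA DRA driver is reserved for the given pod UID (O(1) lookup).
	PodHasNvidiaClaim(uid types.UID) bool
}

func podUsesGPU(pod *corev1.Pod, claimCache ResourceClaimCache) bool {
	// Traditional device-plugin path: nvidia.com/gpu in requests or limits.
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
			return true
		}
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
	}
	// DRA path: the cache maps pod UIDs to allocated NVIDIA GPU claims.
	return claimCache != nil && claimCache.PodHasNvidiaClaim(pod.UID)
}
```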
Testing
Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:
Created test workloads:
- Pod using a ResourceClaim (driver: gpu.nvidia.com)
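For reference, a sketch of how such a workload references the claim, written here as a Go pod spec; only the pod and claim names come from the output below, and the image and remaining fields are illustrative:

```go
// Go rendering of the DRA test workload; only the pod and claim names come
// from the verification output below, the rest is illustrative.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func draTestPod() *corev1.Pod {
	claimName := "dra-gpu-claim"
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "dra-allocated-pod", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:12.4.1-base-ubuntu22.04", // illustrative image
				// Consume the pod-level claim named "gpu" instead of requesting
				// nvidia.com/gpu through the device plugin.
				Resources: corev1.ResourceRequirements{
					Claims: []corev1.ResourceClaim{{Name: "gpu"}},
				},
			}},
			// Bind the pod-level claim "gpu" to the pre-created ResourceClaim.
			ResourceClaims: []corev1.PodResourceClaim{{
				Name:              "gpu",
				ResourceClaimName: &claimName,
			}},
		},
	}
}
```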
Verified ResourceClaim allocation:

    $ kubectl get resourceclaim -n default dra-gpu-claim -o yaml
    status:
      allocation:
        devices:
          results:
          - driver: gpu.nvidia.com
            device: gpu-0
            pool: ipp1-0744
      reservedFor:
      - name: dra-allocated-pod
        resource: pods

Verified ResourceClaim cache synced:
Triggered driver upgrade eviction:
Verified DRA pod evicted successfully: