Blog post for DRA updates in 1.36 #54567
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,177 @@ | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| --- | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| layout: blog | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| title: "Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA" | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| slug: dra-136-updates | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| draft: true | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| date: XXXX-XX-XX | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| author: > | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| The DRA team | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| --- | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Dynamic Resource Allocation (DRA) has fundamentally changed how we handle hardware | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Suggested change
If "we" means "all the contributors to DRA" then it's best not to have it also mean "anyone using Kubernetes", within one article. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| continues to mature, bringing a wave of feature graduations, critical usability | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| improvements, and new capabilities that extend the flexibility of DRA to native | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| resources like memory and CPU, and support for ResourceClaims in PodGroups. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| We have also seen significant momentum in driver availability. Both the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [NVIDIA GPU](https://github.com/NVIDIA/k8s-dra-driver-gpu) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
This link is stale, and the transfer happened before the v1.36 release. Let's update and switch to past tense. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| and Google TPU DRA drivers are being transferred to the Kubernetes project, joining the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [DRANET](https://github.com/kubernetes-sigs/dranet) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| driver that was added last year. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Contributor
Calling those out seems reasonable for a blog post because this is newsworthy. We could link to https://github.com/kubernetes-sigs/wg-device-management/tree/main/device-ecosystem but I'll defer to SIG Docs about that.
Member
You can link to https://www.kubernetes.dev/community/community-groups/wg/device-management/ If we want to publish something like https://github.com/kubernetes-sigs/wg-device-management/tree/main/device-ecosystem for end users, we can. Maybe a (separate) It's not a good fit for the official Kubernetes documentation; it has a bit too much about vendors and offerings. But a neutral blog article could be OK. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Whether you are managing massive fleets of GPUs, need better handling of failures, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| or are simply looking for better ways to define resource fallback options, the upgrades | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to DRA in 1.36 have something for you. Let's dive into the new features and graduations! | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| ## Feature graduations | ||||||||||||||||||||||||||||||||||||||||||||||||||||
mortent marked this conversation as resolved.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| several highly anticipated features have graduated to Beta and Stable. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Prioritized List (Stable)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
I would use an actual heading here; also, use sentence case, e.g. `### Prioritized list (stable) {#prioritized-list}`, and similarly for other headings. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Hardware heterogeneity is a reality in most clusters. With the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Prioritized List](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Idiomatically, don't use title case for the names of features. Outside of a hyperlink we use italics; within a hyperlink they are optional. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature, you can confidently define fallback preferences when requesting | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| devices. Instead of hardcoding a request for a specific device model, you can specify an | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to an A100"). The scheduler will evaluate these requests in order, drastically improving | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| scheduling flexibility and cluster utilization. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
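
The ordered-fallback idea described above can be illustrated with a small Python sketch. This is not the scheduler's implementation and does not use the ResourceClaim API: the function name `pick_first_available`, the model names, and the `free_devices` inventory are all hypothetical and exist only to show "try each preference in order, take the first that can be satisfied".

```python
from typing import Optional


def pick_first_available(preferences: list[str],
                         free_devices: dict[str, int]) -> Optional[str]:
    """Return the first preferred device model that still has a free device.

    `preferences` is ordered from most to least preferred.
    `free_devices` maps a (hypothetical) model name to its free device count.
    Returns None when no preference can be satisfied.
    """
    for model in preferences:
        if free_devices.get(model, 0) > 0:
            return model
    return None


if __name__ == "__main__":
    # Hypothetical inventory: no H100 left, three A100 free.
    inventory = {"h100": 0, "a100": 3}
    print(pick_first_available(["h100", "a100"], inventory))
```

With the hypothetical inventory shown, the first preference cannot be met, so the second one is chosen instead of the request staying unschedulable.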
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Extended Resource Support (Beta)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| As DRA becomes the standard for resource allocation, bridging the gap with legacy systems | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| is crucial. The DRA | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Extended Resource](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature allows users to request resources via traditional extended resources on a Pod. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| This allows for a gradual transition to DRA, meaning cluster operators can migrate clusters | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to DRA but let application developers adopt the ResourceClaim API on their own schedule. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Partitionable Devices (Beta)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Hardware accelerators are powerful, and sometimes a single workload doesn't need an | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| entire device. The | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Partitionable Devices](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature provides native DRA support for dynamically carving physical hardware into smaller, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| logical instances (such as Multi-Instance GPUs) based on workload demands. This allows | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| administrators to safely and efficiently share expensive accelerators across multiple Pods. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Device Taints (Beta)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Just as you can taint a Kubernetes Node, you can now apply taints directly to specific DRA | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| devices. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Device Taints and Tolerations](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| empower cluster administrators to manage hardware more effectively. You can taint faulty | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| devices to prevent them from being allocated to standard claims, or reserve specific hardware | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| matching tolerations are permitted to claim these tainted devices. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
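
As a rough illustration of the rule in the paragraph above (a tainted device is only usable when every taint is tolerated), here is a simplified Python sketch. It is an assumption-laden model, not the Kubernetes API: the `Taint` and `Toleration` classes, the exact-match comparison, and the example key `example.com/unhealthy` are hypothetical, and the real feature may support richer matching than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule"; treated as an opaque string here


@dataclass(frozen=True)
class Toleration:
    key: str
    value: str
    effect: str


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified assumption: a toleration matches only on exact equality."""
    return (toleration.key, toleration.value, toleration.effect) == (
        taint.key, taint.value, taint.effect)


def can_allocate(device_taints: list[Taint],
                 tolerations: list[Toleration]) -> bool:
    """A device is usable only if every one of its taints is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in device_taints
    )


if __name__ == "__main__":
    faulty = [Taint("example.com/unhealthy", "true", "NoSchedule")]
    print(can_allocate(faulty, []))  # no tolerations: device is skipped
    print(can_allocate(faulty, [Toleration("example.com/unhealthy", "true", "NoSchedule")]))
```

An untainted device passes trivially, which matches the intent that taints only restrict the devices they are applied to.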
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Device Binding Conditions (Beta)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| To improve scheduling reliability, the Kubernetes scheduler can now use the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Watch out for "now"; was this not possible in v1.35? If it was, even behind a feature gate, we should reword. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Binding Conditions](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-binding-conditions) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature to delay committing a Pod to a Node until its required external resources—such as attachable | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| prevents premature assignments that can lead to Pod failures, ensuring a much more robust | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| and predictable deployment process. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
mortent marked this conversation as resolved.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Contributor
I want to include kubernetes/enhancements#4680 in the feature blog. ref: docs PR is #54420
Suggested change
Member
Author
So I've added your proposal for now, but do you think we can shorten it a bit and make it just one paragraph? There is a large number of features and we don't want the blog post to be too long. Focus just on the benefits of this feature and what it enables and leave the details to the DRA docs which we link to. Also, including that it is graduating to beta in 1.36 is already given from the context.
Member
Author
Sorry I forgot to add this in the first draft, it is of course something we should include in the blog.
Member
Author
@harche Could you take a look at this? Currently the description of this feature gets into quite a bit more detail than the other descriptions and I think some of it can be left to the documentation.
Contributor
Thanks @mortent, does this look better? #54567 (comment) |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Resource Health Status (Beta)** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Knowing when a device has failed or become unhealthy is critical for workloads running on | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| specialized hardware. With | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring), | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Kubernetes now exposes device health information directly in the Pod Status through the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Suggested change
? |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| `allocatedResourcesStatus` field. When a DRA driver detects that an allocated device | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| has become unhealthy, it reports this back to the kubelet, which surfaces it in each | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| container's status. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message` | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
(for this article, only a nit) Remember that this is a post release blog:
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| field providing human-readable context about the health status, such as error details or | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| failure reasons. DRA drivers can also configure per-device health check timeouts, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| allowing different hardware types to use appropriate timeout values based on their | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| health reporting characteristics. This gives users and controllers crucial visibility | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to quickly identify and react to hardware failures. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines +79 to +94
Contributor
Suggested change
Member
I recommend writing the graduation as if it was in the past (which it will be). Also, watch out for implying that beta → enabled by default. Some things go to beta as initially opt-in. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| ## New Features | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Contributor
This enhancement was added in 1.36, I wonder if this section should contain a sub-section for it. cc @pohly
Contributor
Let's add it. Can you suggest something?
Contributor
Sure, added a suggestion here: https://github.com/kubernetes/website/pull/54567/changes#r3076822230 |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Beyond stabilizing existing capabilities, v1.36 introduces foundational new features | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| that expand what DRA can do. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **ResourceClaim Support for Workloads** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [ResourceClaim Support for Workloads](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resourceclaims) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Try to describe what is possible (rather than summarizing the enhancement). For example:

People using Kubernetes in their platform may have large AI/ML workloads
and rely on strict _topological scheduling_ (matching Pods to run across multiple nodes
with firm constraints, such as making an entire rack of compute available
along with interconnects).
DRA provides an [integration](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resourceclaims)
with the Workload API, so that you can get near seamless management of
infrastructure resources, even across very large sets of Pods.
By associating ResourceClaims or ResourceClaimTemplates with the PodGroup API
(also new in Kubernetes v1.35),
this integration eliminates previous scaling bottlenecks, such as the limit on the
number of Pods that can share a claim, and
removes the burden of custom or manual claim
management from specialized orchestrators.

Notice how I didn't mention the feature other than by hyperlinking to its docs. I would consider also mentioning the feature gate on this one. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature enables Kubernetes to seamlessly manage shared resources across massive sets | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| this feature eliminates previous scaling bottlenecks, such as the limit on the | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| number of Pods that can share a claim, and removes the burden of manual claim | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| management from specialized orchestrators. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **DRA for Native Resources** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Why should DRA only be for external accelerators? In v1.36, we are introducing the first | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| iteration of using the DRA API to manage Kubernetes native resources (like CPU and | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Suggested change
"Infrastructure resource" to disambiguate from HTTP resources, BTW. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Native Resources](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#node-allocatable-resources) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Contributor
Should this be called "Node Allocatable Resources" instead of "Native Resources"? cc @pravk03
Member
Author
Contributor
Yes, we have renamed this to node allocatable resources. Following the convention in the Node Allocatable and Resource Management documentation, we can omit the hyphen. I’ll update the KEP and docs to keep it consistent. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| semantics for standard compute resources, paving the way for incredibly fine-grained | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| performance tuning. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **DRA Resource Availability Visibility** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| One of the most requested features from cluster administrators has been better visibility | ||||||||||||||||||||||||||||||||||||||||||||||||||||
mortent marked this conversation as resolved.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| into hardware capacity. The new | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [DRAResourcePoolStatus](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resource-pool-status) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| feature allows you to query the availability of devices in DRA resource pools. By creating a | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| `ResourcePoolStatusRequest` object, you get a point-in-time snapshot of device counts | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| — total, allocated, available, and unavailable — for each pool managed by a given | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| driver. This enables better integration with dashboards and capacity planning tools. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **List Types for Attributes** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| With | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes), | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| DRA can represent device attributes as typed lists (`ints`, `bools`, `strings`, and `versions`), not | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| just scalar values. This helps model real hardware topology, such as devices that belong | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to multiple PCIe roots or NUMA domains. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| This feature also extends ResourceClaim constraint behavior to work naturally | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
I recommend: talk about what Kubernetes does more than what the feature does. Something like:

In v1.36, ResourceClaim constraint evaluation has changed (behind a feature gate)
to work better with scalar and list values:
`matchAttribute` now checks for a non-empty
intersection, and `distinctAttribute` checks for pairwise disjoint values.
Kubernetes v1.36 also introduces an `includes()` function in CEL, that lets device selectors keep working
more easily when an attribute changes between scalar and list representations.
(The `includes()` function is only available in DRA
contexts for expression evaluation).

I avoided "naturally". People reading this probably won't have a sense of the natural way for ResourceClaim evaluation to work; that's different from how people use "naturally" outside of a more jargon-y context. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| with both scalar and list values: `matchAttribute` now checks for a non-empty | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| intersection, and `distinctAttribute` checks for pairwise disjoint values. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| It also introduces an `includes()` function in DRA CEL, which lets device selectors keep working | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| more easily when an attribute changes between scalar and list representations. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
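
The two constraint checks described above can be sketched in Python as set operations. This is only an illustration of "non-empty intersection" and "pairwise disjoint", not the scheduler's code: the helper names, the choice to treat a scalar as a one-element set, the behaviour for an empty input, and the sample NUMA values are all assumptions.

```python
from itertools import combinations


def as_set(value) -> set:
    """Treat a scalar as a one-element set; convert a list-like value to a set."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def match_attribute(values: list) -> bool:
    """True when all devices share at least one common attribute value."""
    sets = [as_set(v) for v in values]
    if not sets:
        return True  # assumption: nothing to compare counts as satisfied
    return bool(set.intersection(*sets))


def distinct_attribute(values: list) -> bool:
    """True when no two devices share any attribute value."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


if __name__ == "__main__":
    # One device reports a list of NUMA domains, the other a scalar.
    print(match_attribute([["numa0", "numa1"], "numa1"]))
    print(distinct_attribute([["numa0", "numa1"], "numa1"]))
```

In the sample, the two devices share `numa1`, so the match check holds while the distinct check does not.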
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| **Deterministic Device Selection via Lexicographical Sorting** | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Just call this deterministic device selection; it's snappier. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| The Kubernetes scheduler has been updated to evaluate devices using lexicographical | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| ordering based on resource pool and ResourceSlice names. This change empowers drivers | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to proactively influence the scheduling process, leading to improved throughput and | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| more optimal scheduling decisions. To support this capability, the ResourceSlice | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| controller toolkit now automatically generates names that reflect the exact device | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| ordering specified by the driver author. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
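
A minimal Python sketch of lexicographical ordering by pool and slice name follows. It is illustrative only: the dictionary shape, the sample names, and the assumption that the pool name is compared before the ResourceSlice name are not taken from the scheduler source.

```python
def evaluation_order(slices: list[dict]) -> list[dict]:
    """Sort slice records lexicographically by (pool name, slice name).

    Each record is a hypothetical dict with "pool" and "name" keys.
    Python compares the tuples element by element, and strings
    lexicographically, so the result is stable and deterministic.
    """
    return sorted(slices, key=lambda s: (s["pool"], s["name"]))


if __name__ == "__main__":
    records = [
        {"pool": "pool-b", "name": "slice-0"},
        {"pool": "pool-a", "name": "slice-10"},
        {"pool": "pool-a", "name": "slice-2"},
    ]
    for record in evaluation_order(records):
        print(record["pool"], record["name"])
```

Note that plain lexicographical comparison places `slice-10` before `slice-2`, so names meant to encode an order generally need a fixed-width or otherwise sortable form.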
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Contributor
I want to include kubernetes/enhancements#5491 if it's worth putting in the feature blog. ref: docs PR is #54561
Suggested change
Member
Author
Sorry I forgot this one, it is definitely worth including. Added your suggestion.
Member
Author
Similar to my comment on #54567 (comment), do you think we could make it a bit more focused on just the benefits of the feature and leave some of the details to the DRA documentation? And see if we can keep it to a single paragraph?
Member
Author
@everpeace Could you take a look at updating the description to align a bit more with the other features, ref my previous comment?
Member
It's in now, right? |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| ## What’s next? | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Should we say that the big priority is to migrate the community to DRA? And also make it a call to action?
Member
Author
Good point. I've added a small section for this.
Contributor
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| This cycle introduced a wealth of new Dynamic Resource Allocation (DRA) features, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
I would call it a release, not a cycle: we're writing for the public. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| and the momentum is only building. As we look ahead, our roadmap focuses on maturing | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| existing features toward beta and stable releases while hardening DRA’s performance, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| scalability, and reliability. A key priority over the coming cycles will be deep | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| integration with Workload-Aware and Topology-Aware Scheduling. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
I might write:
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| A big goal for us is to migrate the entire community to DRA, and we want | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
This seems far fetched. Do we mean that we want everyone using accelerators to use DRA, or do we mean that behind even simple workloads, CPU and memory resources become allocated to nodes and Pods dynamically? Without being prescriptive, let's try to help readers guess what we really mean. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| you involved. Whether you are currently maintaining a driver or are just beginning | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| to explore the possibilities, your input is vital. Partner with us to shape the next | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| generation of resource management. Reach out today to collaborate on development, | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| share feedback, or start building your first DRA driver. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| ## Getting involved | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| A good starting point is joining the WG Device Management | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [Slack channel](https://kubernetes.slack.com/archives/C0409NGC1TK) and | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| [meetings](https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq), | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
Rather than those two links, consider linking to WG Device Management. Those existing links aren't useful to the public reading the article. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
| which happen at US/EU and EU/APAC friendly time slots. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Member
You might mean Americas rather than "US", and "EMEA" rather than EU. We like to look inclusive. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers. | ||||||||||||||||||||||||||||||||||||||||||||||||||||
It would be great to include some information on adoption and on the gaps that remain compared to Device Plugin, and maybe a few words on the available DRA drivers, so that end users can make sense of this blog post.
Added a little section about the availability of drivers. I'm a little worried that by mentioning some drivers here, we might be forgetting others that also should be included. But I can ask in the device management chat if someone knows about other drivers that should be included.
I need to think a bit more about the gaps vs Device Plugin.
I can't think of anything that can be done by a Device Plugin that cannot also be done with a DRA driver. Resource Health Status may have been the last remaining gap.
Is there auto-tainting of "broken" devices? Yes, it seems like health tracking is the last remaining gap.