11---
22layout : blog
3- title : " Kubernetes v1.36: DRA has graduated to GA "
3+ title : " Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA "
44slug : dra-136-updates
55draft : true
66date : XXXX-XX-XX
@@ -14,6 +14,12 @@ continues to mature, bringing a wave of feature graduations, critical usability
1414improvements, and new capabilities that extend the flexibility of DRA to native
1515resources like memory and CPU, and support for ResourceClaims in PodGroups.
1616
17+ We have also seen significant momentum in driver availability. Both the
18+ [ NVIDIA GPU] ( https://github.com/NVIDIA/k8s-dra-driver-gpu )
19+ and Google TPU DRA drivers are being transferred to the Kubernetes project, joining the
20+ [ DRANET] ( https://github.com/kubernetes-sigs/dranet )
21+ driver that was added last year.
22+
1723Whether you are managing massive fleets of GPUs, need better handling of failures,
1824or are simply looking for better ways to define resource fallback options, the upgrades
1925to DRA in 1.36 have something for you. Let's dive into the new features and graduations!
@@ -37,7 +43,7 @@ scheduling flexibility and cluster utilization.
3743
3844As DRA becomes the standard for resource allocation, bridging the gap with legacy systems
3945is crucial. The DRA
40- [ Extended Resource] ( https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations )
46+ [ Extended Resource] ( https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource )
4147feature allows users to request resources via traditional extended resources on a Pod.
4248This allows for a gradual transition to DRA, meaning application developers and
4349operators are not forced to immediately migrate their workloads to the ResourceClaim
@@ -71,6 +77,23 @@ devices or FPGAs—are fully prepared. By explicitly modeling resource readiness
7177prevents premature assignments that can lead to Pod failures, ensuring a much more robust
7278and predictable deployment process.
7379
80+ ** Resource Health Status (Beta)**
81+
82+ Knowing when a device has failed or become unhealthy is critical for workloads running on
83+ specialized hardware. With
84+ [ Resource Health Status] ( /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring ) ,
85+ Kubernetes now exposes device health information directly in the Pod Status through the
86+ ` allocatedResourcesStatus ` field. When a DRA driver detects that an allocated device
87+ has become unhealthy, it reports this back to the kubelet, which surfaces it in each
88+ container's status.
89+
90+ In 1.36, the feature graduates to beta (enabled by default) and adds an optional ` message `
91+ field providing human-readable context about the health status, such as error details or
92+ failure reasons. DRA drivers can also configure per-device health check timeouts,
93+ allowing different hardware types to use appropriate timeout values based on their
94+ health reporting characteristics. This gives users and controllers crucial visibility
95+ to quickly identify and react to hardware failures.
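
As a rough illustration of how a controller or script might consume this information, the sketch below walks a Pod object (already parsed into a Python dict, for example from `kubectl get pod -o json`) and collects devices that are not reported healthy. Only `allocatedResourcesStatus` and `message` are named in this post; the nested `resources`, `resourceID`, and `health` keys, the `"Healthy"` value, and the placement of `message` on each device entry are assumptions about the Pod status shape and should be checked against the API reference.

```python
def unhealthy_devices(pod):
    """Return (container, resource, device_id, health, message) tuples
    for every allocated device that is not reported as Healthy.

    Assumed layout (verify against the API reference):
    status.containerStatuses[].allocatedResourcesStatus[].resources[]
    with keys resourceID, health, and an optional message.
    """
    findings = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # A missing health value is treated as Unknown (assumption).
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name"),
                        resource.get("name"),
                        device.get("resourceID"),
                        health,
                        device.get("message", ""),
                    ))
    return findings
```

Treating anything other than `"Healthy"` as a finding is a deliberately conservative choice for this sketch; a real controller may want to handle unknown health differently from a confirmed failure.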
96+
7497## New Features
7598
7699Beyond stabilizing existing capabilities, v1.36 introduces foundational new features
@@ -100,10 +123,25 @@ performance tuning.
100123
101124One of the most requested features from cluster administrators has been better visibility
102125into hardware capacity. The new
103- [ Resource Availability Visibility] ( add_link_here )
104- feature introduces robust mechanisms to query and expose the total capacity, allocated
105- usage, and available pool of DRA resources across the cluster. This unlocks better
106- integration with dashboards and capacity planning tools.
126+ [ DRAResourcePoolStatus] ( /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resource-pool-status )
127+ feature allows you to query the availability of devices in DRA resource pools. By creating a
128+ ` ResourcePoolStatusRequest ` object, you get a point-in-time snapshot of device counts
129+ — total, allocated, available, and unavailable — for each pool managed by a given
130+ driver. This enables better integration with dashboards and capacity planning tools.
131+
132+ ** List Types for Attributes**
133+
134+ With
135+ [ List Types for Attributes] ( /docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes ) ,
136+ DRA can represent device attributes as typed lists (int, bool, string, and version), not
137+ just scalar values. This helps model real hardware topology, such as devices that belong
138+ to multiple PCIe roots or NUMA domains.
139+
140+ This feature also extends ` ResourceClaim ` constraint behavior to work naturally
141+ with both scalar and list values: ` matchAttribute ` now checks for a non-empty
142+ intersection, and ` distinctAttribute ` checks for pairwise disjoint values.
143+ It also introduces the ` includes() ` function in DRA CEL, which lets device selectors keep working
144+ more easily when an attribute changes between scalar and list representations.
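
The constraint semantics described above can be modeled in a few lines. The following is a minimal sketch, not the scheduler's implementation: it treats a scalar as a one-element set, then checks for a non-empty intersection (`matchAttribute`) or pairwise disjoint values (`distinctAttribute`). The helper names and the handling of an empty device list are assumptions made for illustration.

```python
from itertools import combinations


def as_set(value):
    """Normalize a scalar or list attribute value to a frozenset."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return frozenset(value)
    return frozenset([value])


def match_attribute(values):
    """True if all devices share at least one common attribute value."""
    sets = [as_set(v) for v in values]
    if not sets:
        return True  # assumption: no devices means nothing to violate
    return bool(frozenset.intersection(*sets))


def distinct_attribute(values):
    """True if no two devices share any attribute value."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))
```

For example, a device reporting the scalar `"pci0"` and another reporting the list `["pci0", "pci1"]` satisfy `match_attribute` because they share `"pci0"`, and for the same reason they fail `distinct_attribute`.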
107145
108146** Device Allocation Ordering through Lexicographical Ordering**
109147
@@ -130,5 +168,5 @@ A good starting point is joining the WG Device Management
130168[ meetings] ( https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq ) ,
131169which happen at US/EU and EU/APAC friendly time slots.
132170
133- Not all enhancement ideas are tracked as issues yet, so come talk to us if you wantto help or have some ideas yourself!
171+ Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself!
134172We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.