---
layout: blog
title: "Kubernetes v1.36: DRA has graduated to GA"
slug: dra-136-updates
draft: true
date: XXXX-XX-XX
author: >
  The DRA team
---

Dynamic Resource Allocation (DRA) has fundamentally changed how we handle hardware
accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA
continues to mature, bringing a wave of feature graduations, critical usability
improvements, and new capabilities that extend the flexibility of DRA to native
resources like memory and CPU, and add support for ResourceClaims in PodGroups.

Whether you are managing massive fleets of GPUs, need better handling of failures,
or are simply looking for better ways to define resource fallback options, the upgrades
to DRA in v1.36 have something for you. Let's dive into the new features and graduations!

## Feature graduations

The community has been hard at work stabilizing core DRA concepts. In Kubernetes v1.36,
several highly anticipated features have graduated to Beta and Stable.

**Prioritized List (Stable)**

Hardware heterogeneity is a reality in most clusters. With the
[Prioritized List](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list)
feature, you can confidently define fallback preferences when requesting
devices. Instead of hardcoding a request for a specific device model, you can specify an
ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back
to an A100"). The scheduler will evaluate these requests in order, drastically improving
scheduling flexibility and cluster utilization.
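
As a sketch of what this looks like in practice, a ResourceClaim lists subrequests in
order of preference under `firstAvailable`. The device class and attribute names below
are hypothetical; substitute the ones published by your driver:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: prioritized-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      # Subrequests are evaluated in order; the first one that can be
      # satisfied is the one that gets allocated.
      firstAvailable:
      - name: big-gpu
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.attributes["gpu.example.com"].model == "h100"
      - name: small-gpu
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.attributes["gpu.example.com"].model == "a100"
```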

**Extended Resource Support (Beta)**

As DRA becomes the standard for resource allocation, bridging the gap with legacy systems
is crucial. The DRA
[Extended Resource](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
feature lets users request resources via traditional extended resources on a Pod.
This enables a gradual transition to DRA, meaning application developers and
operators are not forced to immediately migrate their workloads to the ResourceClaim
API.
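
A rough sketch of how the mapping works, assuming a hypothetical `gpu.example.com`
driver: the DeviceClass advertises an extended resource name, and Pods keep using the
familiar `resources` syntax while DRA handles allocation behind the scenes:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Pod requests for this extended resource name are satisfied by
  # devices matching this class.
  extendedResourceName: example.com/gpu
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
---
apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      limits:
        # No ResourceClaim in the Pod spec; the claim is created for
        # the Pod automatically.
        example.com/gpu: "1"
```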

**Partitionable Devices (Beta)**

Hardware accelerators are powerful, and sometimes a single workload doesn't need an
entire device. The
[Partitionable Devices](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices)
feature provides native DRA support for carving physical hardware into smaller,
logical instances (such as Multi-Instance GPUs). This allows administrators to
safely and efficiently share expensive accelerators across multiple Pods.
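
A driver-side sketch of how partitions are modeled, with hypothetical names: the
ResourceSlice declares a shared counter set for the physical device, and each
advertised partition consumes a slice of those counters so the scheduler cannot
over-allocate the underlying hardware:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-1-gpus
spec:
  driver: gpu.example.com
  nodeName: node-1
  pool:
    name: node-1-gpus
    generation: 1
    resourceSliceCount: 1
  # The shared counters model the capacity of the physical GPU.
  sharedCounters:
  - name: gpu-0
    counters:
      memory:
        value: 40Gi
  devices:
  # A partition draws down the shared counters when allocated.
  - name: gpu-0-mig-1g-5gb
    consumesCounters:
    - counterSet: gpu-0
      counters:
        memory:
          value: 5Gi
```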

**Device Taints (Beta)**

Just as you can taint a Kubernetes Node, you can now apply taints directly to specific DRA
devices.
[Device Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
empower cluster administrators to manage hardware more effectively. You can taint faulty
devices to prevent them from being allocated to standard claims, or reserve specific hardware
for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with
matching tolerations are permitted to claim these tainted devices.
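
A sketch of the two halves, again with hypothetical driver and taint names: the driver
(or an administrator) taints a device in its ResourceSlice, and only claims that carry a
matching toleration can still be allocated that device:

```yaml
# The driver marks gpu-1 as unhealthy in the slice it publishes.
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-1-gpus
spec:
  driver: gpu.example.com
  nodeName: node-1
  pool:
    name: node-1-gpus
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-1
    taints:
    - key: example.com/unhealthy
      effect: NoSchedule
---
# A diagnostics workload opts in to using the tainted device.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: diagnostics-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        tolerations:
        - key: example.com/unhealthy
          operator: Exists
          effect: NoSchedule
```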

**Device Binding Conditions (Beta)**

To improve scheduling reliability, the Kubernetes scheduler can now use the
[Binding Conditions](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
feature to delay committing a Pod to a Node until its required external resources, such as attachable
devices or FPGAs, are fully prepared. By explicitly modeling resource readiness, this
prevents premature assignments that can lead to Pod failures, ensuring a much more robust
and predictable deployment process.
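
A driver-side sketch under assumed names, with field names as described in the feature's
design; check the current API reference before relying on them. The driver declares which
conditions must become true on the allocated device before the Pod is bound:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: fabric-fpgas
spec:
  driver: fpga.example.com
  allNodes: true
  pool:
    name: fabric-fpgas
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: fpga-0
    # Binding is delayed until the driver reports this condition true
    # on the allocation's status.
    bindingConditions:
    - example.com/fpga-attached
    # If this condition becomes true instead, the allocation is
    # rolled back and the Pod is rescheduled.
    bindingFailureConditions:
    - example.com/fpga-attach-failed
```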

## New features

Beyond stabilizing existing capabilities, v1.36 introduces foundational new features
that expand what DRA can do.

**ResourceClaim Support for Workloads**

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the
[ResourceClaim Support for Workloads](add_link_here)
feature enables Kubernetes to seamlessly manage shared resources across massive sets
of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups,
this feature eliminates previous scaling bottlenecks, such as the limit on the
number of pods that can share a claim, and removes the burden of manual claim
management from specialized orchestrators.

**DRA for Native Resources**

Why should DRA only be for external accelerators? In v1.36, we are introducing the first
iterations of using the DRA API to manage Kubernetes native resources (like CPU and
memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA
[Native Resources](add_link_here)
feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization
semantics for standard compute resources, paving the way for incredibly fine-grained
performance tuning.

**DRA Resource Availability Visibility**

One of the most requested features from cluster administrators has been better visibility
into hardware capacity. The new
[Resource Availability Visibility](add_link_here)
feature introduces robust mechanisms to query and expose the total capacity, allocated
usage, and available pool of DRA resources across the cluster. This unlocks better
integration with dashboards and capacity planning tools.

**Device Allocation Ordering through Lexicographical Ordering**

The Kubernetes scheduler has been updated to evaluate devices using lexicographical
ordering based on resource pool and ResourceSlice names. This change empowers drivers
to proactively influence the scheduling process, leading to improved throughput and
more optimal scheduling decisions. To support this capability, the ResourceSlice
controller toolkit now automatically generates names that reflect the exact device
ordering specified by the driver author.

## What’s next?

This cycle introduced a wealth of new DRA features, and the momentum continues.
Our focus remains on progressing existing features toward beta and stable releases
while enhancing DRA's performance, scalability, and reliability. Additionally,
integrating DRA with Workload-Aware and Topology-Aware Scheduling will be a key
priority over the coming releases.

## Getting involved

A good starting point is joining the WG Device Management
[Slack channel](https://kubernetes.slack.com/archives/C0409NGC1TK) and
[meetings](https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq),
which happen at US/EU and EU/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself!
We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.
