Enable Dynamic Resource Allocation (DRA) #144

poussa · 2025-10-21T12:33:29Z

Implements: #132 (Option 2)

DRA is similar to k8s extended resources (device plugins) but has more capabilities. In this first DRA PR the following capabilities are introduced:

- A new dra block in values.yaml. If dra.enabled=true, this block is used instead of the accelerator block.
- Generation of the decode pod's resources.claims and resourceClaims blocks.
- Generation of kind: ResourceClaimTemplate with deviceClassName, count, and selector fields.
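
As a rough illustration of the pod-side wiring described above, the sketch below shows how a pod typically references a claim template in Kubernetes. It is not taken from the chart's rendered output: the container name `vllm`, the claim name `accelerator`, and the template name `my-claim-template` are hypothetical placeholders.

```yaml
# Sketch only: fragment of a decode pod spec using DRA.
# All names below are hypothetical, not the chart's actual output.
spec:
  resourceClaims:
  - name: accelerator                       # pod-level claim name (assumed)
    resourceClaimTemplateName: my-claim-template
  containers:
  - name: vllm                              # assumed container name
    resources:
      claims:
      - name: accelerator                   # must match resourceClaims[].name
```

The container-level `resources.claims` entry only refers to a claim by name; the pod-level `resourceClaims` entry is what ties that name to a ResourceClaimTemplate.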

More DRA capabilities will be added once the direction is set (i.e., is this the right way to enable DRA?).

Thoughts?

poussa · 2025-10-21T12:36:17Z

/cc @kalantar @jgchn @yankay

yankay · 2025-10-22T03:24:21Z

Hi @poussa

Great PR 🎉🎉🎉

Would it be better to add examples/values-dra.yaml to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/hack/generate-example-output.sh so that its example output gets generated?
And add some docs to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/examples/README.md?

charts/llm-d-modelservice/templates/_dra.tpl

yankay · 2025-10-22T03:41:04Z

charts/llm-d-modelservice/values.yaml

+    class: gpu.nvidia.com
+    match: "exactly"
+    count: 1
+    selectors: {}


Is it necessary to enumerate all types? Could we just write them in the comments? :-)

This is a bit tricky. The default values need to be defined somewhere. I was hoping the JSON schema would be the place, but it is not: the schema's default values are not propagated to Helm templates. So the default values (e.g. match: "exactly") need to be defined either in values.yaml or in the template code in _dra.tpl.
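
One way to reconcile both points is to keep the defaults in values.yaml and document each field in a comment next to it. The fragment below is only a sketch built from the values in the diff above: the `enabled: false` default, the `name` key and its value, and the use of a list for `selectors` are assumptions, not confirmed chart defaults.

```yaml
# Sketch only: defaults kept in values.yaml, documented inline.
dra:
  enabled: false              # assumed default; when true, used instead of the accelerator block
  claimTemplates:
  - name: default-claim       # hypothetical name
    class: gpu.nvidia.com     # device class to request
    match: "exactly"          # how the device request is matched
    count: 1                  # number of devices requested
    selectors: []             # optional CEL selectors (list form assumed)
```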

poussa · 2025-10-22T06:44:59Z

> Hi @poussa
>
> Great PR 🎉🎉🎉
>
> Would it be better to add examples/values-dra.yaml to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/hack/generate-example-output.sh so that its example output gets generated? And add some docs to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/examples/README.md?

Good idea, will do.

DRA is similar to k8s extended resources (device plugins) but has more capabilities. Signed-off-by: Sakari Poussa <[email protected]>

jgchn

My apologies for my lack of knowledge in this field. The specs look right to me. However, there is an .extraObjects field at the top level for creating custom resources.

jgchn · 2025-10-23T01:51:06Z

examples/values-dra.yaml

+dra:
+  enabled: true
+  type: "intel-gaudi3-x2"
+  claimTemplates:
+  - name: intel-gaudi3-x2
+    class: gaudi.intel.com
+    match: "exactly"
+    count: 2
+    selectors:
+    - cel:
+        expression: device.attributes["gaudi.intel.com"].model == 'Gaudi3'


Apologies in advance for my lack of knowledge of DRA. The specs look right to me. However, there is an .extraObjects field at the top level for creating custom resources. It looks like there isn't much abstraction going on behind the scenes other than copying over the DRA definition from values.yaml. Could you clarify your use case?

I'm just wondering which is easier to maintain here: examples with DRA that use .extraObjects, or adding a .dra field. WDYT?

I would prefer a dra object, since we already have the accelerator object for extended resources. With a dra object we get JSON schema validation, which is not possible with extraObjects. Schema validation becomes important once we add more DRA features, since the dra object may become quite complex.
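
For context on how thin the mapping is, here is a sketch of the manifest that the intel-gaudi3-x2 entry in examples/values-dra.yaml would plausibly correspond to. It assumes the resource.k8s.io/v1 API with the `exactly` request form; the request name `accelerator` is hypothetical, and the chart's actual rendered output may differ.

```yaml
# Sketch only: a ResourceClaimTemplate matching the example values.
# Assumes resource.k8s.io/v1; the request name is hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: intel-gaudi3-x2
spec:
  spec:
    devices:
      requests:
      - name: accelerator
        exactly:
          deviceClassName: gaudi.intel.com
          allocationMode: ExactCount
          count: 2
          selectors:
          - cel:
              expression: device.attributes["gaudi.intel.com"].model == 'Gaudi3'
```

The same object could be supplied verbatim through extraObjects; the difference raised in the reply above is that a dedicated dra block can be validated against the chart's JSON schema.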

jgchn

LGTM

jgchn · 2025-10-28T14:01:33Z

examples/values-dra.yaml

+
+modelArtifacts:
+  name: meta-llama/Llama-3.3-70B-Instruc
+  uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruc"


Suggested change

-  uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruc"
+  uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruct"

poussa · 2025-10-28T14:40:33Z

@jgchn thanks for the approval, but please do not merge yet. I am still testing and fixing the PR.

Signed-off-by: Sakari Poussa <[email protected]>

jgchn · 2025-10-29T13:49:05Z

@poussa it looks like CI is failing. I think the following should fix the lint and pre-commit checks:

make bump-chart-version-minor
make generate

It also looks like there is a broken link somewhere.

github-actions bot requested review from jgchn, kalantar and yankay October 21, 2025 12:36

yankay reviewed Oct 22, 2025

Enable Dynamic Resource Allocation (DRA)

65314a5

DRA is similar to k8s extended resources (device plugins) but has more capabilities. Signed-off-by: Sakari Poussa <[email protected]>

poussa force-pushed the dra branch from ef9c0d3 to 65314a5 Compare October 22, 2025 07:35

jsonschema selector type object -> array

96b1c71

jgchn reviewed Oct 23, 2025

jgchn approved these changes Oct 28, 2025

poussa added 7 commits October 28, 2025 16:56

fix: claim name

a9d04f4

fix: json schema simplification

732b7f3

fix: clam selectort type to array

5803bc2

fix: add dra to example ouotput

3f481df

Signed-off-by: Sakari Poussa <[email protected]>

fix: simplify dra example

f55bb01

Signed-off-by: Sakari Poussa <[email protected]>

fix: add dra docs & exmaples

c4527cd

Signed-off-by: Sakari Poussa <[email protected]>

fix: dra example w/ gaudi

3e40514

Signed-off-by: Sakari Poussa <[email protected]>
