@poussa
Contributor

@poussa poussa commented Oct 21, 2025

Implements: #132 (Option 2)

DRA is similar to k8s extended resources (device plugins) but has more capabilities. In this first DRA PR the following capabilities are introduced:

  • A new dra block in values.yaml. If dra.enabled=true, this block is used instead of the accelerator block.
  • Generation of the decode pod's resources.claims and resourceClaims blocks.
  • Generation of kind: ResourceClaimTemplate with deviceClassName, count, and selectors fields.
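
As a rough illustration of the last two bullets, the rendered output might look something like the sketch below. It assumes the resource.k8s.io/v1 API; the PR does not state a version, and the earlier v1beta1 API places deviceClassName, count, and selectors directly on the request rather than under exactly. The object names, the container, and the image are hypothetical, and a bare Pod stands in for whatever workload the chart actually generates for decode. The device class and selector are borrowed from the Gaudi example discussed in this PR.

```yaml
# Sketch only: names and image are placeholders, not chart output.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: intel-gaudi3-x2
spec:
  spec:
    devices:
      requests:
      - name: accelerator            # hypothetical request name
        exactly:
          deviceClassName: gaudi.intel.com
          allocationMode: ExactCount
          count: 2
          selectors:
          - cel:
              expression: device.attributes["gaudi.intel.com"].model == 'Gaudi3'
---
apiVersion: v1
kind: Pod
metadata:
  name: decode-example               # hypothetical pod name
spec:
  # Pod-level block: creates a ResourceClaim from the template above.
  resourceClaims:
  - name: accelerator
    resourceClaimTemplateName: intel-gaudi3-x2
  containers:
  - name: vllm                       # hypothetical container
    image: example.invalid/vllm:latest
    resources:
      # Container-level block: refers to the pod-level claim by name.
      claims:
      - name: accelerator
```

The pod-level resourceClaims entry and the container-level resources.claims entry are linked only by the shared claim name, which is why both blocks have to be generated together.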

More DRA capabilities will be added once the direction is set (i.e., is this the right way to enable DRA?).

Thoughts?

@poussa
Contributor Author

poussa commented Oct 21, 2025

/cc @kalantar @jgchn @yankay

@github-actions github-actions bot requested review from jgchn, kalantar and yankay October 21, 2025 12:36
@yankay
Collaborator

yankay commented Oct 22, 2025

Hi @poussa

Great PR 🎉🎉🎉

Would it be better to include examples/values-dra.yaml in generate-example-output.sh (https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/hack/generate-example-output.sh)?
And could you add some docs to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/examples/README.md?

class: gpu.nvidia.com
match: "exactly"
count: 1
selectors: {}
Collaborator

Is it necessary to enumerate all types? Could we just write them in the comments? :-)

Contributor Author

This is a bit tricky. The default values need to be defined somewhere. I was hoping the JSON schema would be the place, but it is not: schema default values are not propagated to Helm templates. So the defaults (e.g. match: "exactly") need to be defined either in values.yaml or in the template code in _dra.tpl.

@poussa
Contributor Author

poussa commented Oct 22, 2025

> Hi @poussa
>
> Great PR 🎉🎉🎉
>
> Would it be better to include examples/values-dra.yaml in generate-example-output.sh (https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/hack/generate-example-output.sh)? And could you add some docs to https://github.com/llm-d-incubation/llm-d-modelservice/blob/main/examples/README.md?

Good idea, will do.

DRA is similar to k8s extended resources (device plugins) but has more capabilities.

Signed-off-by: Sakari Poussa <[email protected]>
Collaborator

@jgchn jgchn left a comment

My apologies for my lack of knowledge in this field. The specs look right to me. However, there is an .extraObjects field at the top level for creating custom resources.

Comment on lines 12 to 22
dra:
  enabled: true
  type: "intel-gaudi3-x2"
  claimTemplates:
  - name: intel-gaudi3-x2
    class: gaudi.intel.com
    match: "exactly"
    count: 2
    selectors:
    - cel:
        expression: device.attributes["gaudi.intel.com"].model == 'Gaudi3'
Collaborator

Apologies in advance for my lack of knowledge of DRA. The specs look right to me. However, there is an .extraObjects field at the top level for creating custom resources. It looks like there isn't much abstraction going on behind the scenes other than copying the DRA definition over from values.yaml. Could you clarify your use case?

Collaborator

I'm just wondering which is easier to maintain here: examples with DRA that use .extraObjects, or adding a .dra field. WDYT?

Contributor Author

I would prefer a dra object, since we already have an accelerator object for extended resources. With a dra object we get JSON schema validation, which is not possible with extraObjects. Schema validation becomes important once we add more DRA features, since the dra object may become quite complex.
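
For comparison, the .extraObjects alternative raised in this thread might look roughly like the sketch below. It assumes that .extraObjects accepts a list of complete Kubernetes manifests (the thread only says it exists at the top level for creating custom resources) and that the cluster serves resource.k8s.io/v1; the request name is hypothetical.

```yaml
# Sketch only: assumes extraObjects takes raw manifests verbatim.
extraObjects:
- apiVersion: resource.k8s.io/v1
  kind: ResourceClaimTemplate
  metadata:
    name: intel-gaudi3-x2
  spec:
    spec:
      devices:
        requests:
        - name: accelerator          # hypothetical request name
          exactly:
            deviceClassName: gaudi.intel.com
            allocationMode: ExactCount
            count: 2
            selectors:
            - cel:
                expression: device.attributes["gaudi.intel.com"].model == 'Gaudi3'
```

Written this way, the user supplies the full manifest, so the chart's schema cannot check individual fields such as count or the selectors, which is the trade-off described in the reply above.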

Collaborator

@jgchn jgchn left a comment

LGTM


modelArtifacts:
  name: meta-llama/Llama-3.3-70B-Instruc
  uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruc"
Collaborator

Suggested change
- uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruc"
+ uri: "pvc+hf://model-pvc/meta-llama/Llama-3.3-70B-Instruct"

@poussa
Contributor Author

poussa commented Oct 28, 2025

@jgchn thanks for the approval, but please do not merge yet. I am still testing and fixing the PR.

@jgchn
Collaborator

jgchn commented Oct 29, 2025

@poussa looks like CI is failing. I think the following should fix the lint and pre-commit:

make bump-chart-version-minor 
make generate 

and looks like there is a broken link somewhere.
