Investigate GPU allocation performance: device plugin vs. DRA #928

Approximate time to complete the test suite:
- 7 minutes with `nvidia.com/gpu` allocation done by the device plugin
- 12 minutes with `nvidia.com/gpu` allocation done via `DRAExtendedResource`

The difference (about 5 minutes, or roughly 70% longer with DRA) is significant and generally reproducible.

Example:
<img width="435" height="85" alt="Image" src="https://github.com/user-attachments/assets/293bd5e3-f56e-45a9-b81a-48fa3a7468ce" />

To a certain extent, _a_ difference is expected: the DRA flow involves more work, and more components to coordinate.

It may be that everything already flows as fast as it _can_ with DRA. Still, I think it would be very interesting to run a proper distributed profiling exercise at some point, to see precisely how much time is spent at each hop of the information flow. Some hops can probably be made snappier with better event propagation or a few targeted tweaks.
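
As a starting point for such a profiling exercise, here is a minimal sketch of how per-hop latencies could be derived once timestamps have been collected for each pod. It assumes the timestamps are already gathered by some other means. The stage names (`pod_created`, `claim_allocated`, and so on) and the sample values are hypothetical placeholders, not names or measurements from the actual components.

```python
from datetime import datetime
from statistics import median


def hop_durations(events):
    """Return seconds elapsed between consecutive (stage, timestamp) events.

    Events are sorted by timestamp first, so input order does not matter.
    """
    ordered = sorted(events, key=lambda e: e[1])
    return {
        f"{a[0]} -> {b[0]}": (b[1] - a[1]).total_seconds()
        for a, b in zip(ordered, ordered[1:])
    }


def median_per_hop(pods):
    """Aggregate hop durations across pods and return the median per hop."""
    samples = {}
    for events in pods.values():
        for hop, secs in hop_durations(events).items():
            samples.setdefault(hop, []).append(secs)
    return {hop: median(vals) for hop, vals in samples.items()}


if __name__ == "__main__":
    t = datetime.fromisoformat
    # Made-up timestamps and hypothetical stage names, for illustration only.
    pods = {
        "pod-a": [
            ("pod_created", t("2025-01-01T00:00:00")),
            ("claim_allocated", t("2025-01-01T00:00:02")),
            ("pod_scheduled", t("2025-01-01T00:00:03")),
            ("container_started", t("2025-01-01T00:00:07")),
        ],
    }
    for hop, secs in median_per_hop(pods).items():
        print(f"{hop}: {secs:.1f}s")
```

Running the same aggregation over a device plugin run and a DRA run would show which hops account for the extra time.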

Performance optimization is clearly less important than achieving correctness and doing simplification work. However, to support wide adoption of GPU allocation via DRA, we have to understand the performance implications for users migrating from the device plugin world to DRA.
