Investigate GPU allocation performance: device plugin vs. DRA #928

@jgehrcke

Description

Time it takes to complete the test suite, roughly:

  • 7 minutes with nvidia.com/gpu allocation done by the device plugin
  • 12 minutes with nvidia.com/gpu allocation done via DRAExtendedResource

The difference (about 5 minutes, or roughly 1.7x) is significant and generally reproducible.

Example: [image]

To a certain extent, a difference is expected: the DRA flow involves more work, and more components to coordinate.

It may be that everything already flows as fast as it can with DRA. Still, I think it would be very interesting to run a proper distributed profiling exercise at some point, to see precisely how much time we spend at each step of the information flow. There is probably a way to make some hops snappier, with better event propagation or some tweaks here and there.
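As a coarse first step before any real distributed profiling, one could compare per-pod timestamps between the two allocation modes. The following is only a sketch under assumptions: it takes a pod object as parsed from `kubectl get pod <name> -o json`, and reports the seconds from `metadata.creationTimestamp` to each pod condition becoming true. The function names (`parse_ts`, `pod_phase_latencies`) are made up for illustration. Pod timestamps have one-second resolution, and this does not look at the individual DRA hops at all, so it can only hint at whether the extra time sits before or after scheduling.

```python
from datetime import datetime

# Pod conditions of interest. PodScheduled reflects the scheduler side;
# the later ones reflect node-side progress.
CONDITIONS = ("PodScheduled", "Initialized", "ContainersReady", "Ready")


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as '2025-01-01T12:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def pod_phase_latencies(pod: dict) -> dict:
    """Return seconds from pod creation to each condition turning True.

    `pod` is assumed to be a single pod object as emitted by
    `kubectl get pod <name> -o json`. Conditions that are missing or
    not (yet) True are omitted from the result.
    """
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    latencies = {}
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") in CONDITIONS and cond.get("status") == "True":
            transitioned = parse_ts(cond["lastTransitionTime"])
            latencies[cond["type"]] = (transitioned - created).total_seconds()
    return latencies
```

Running this over the pods of one test run per mode and comparing the `PodScheduled` numbers against the `Ready` numbers would at least show which side of the scheduling boundary to profile first.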

Performance optimization is clearly less important than achieving correctness and doing simplification work. However, to support wide adoption of GPU allocation via DRA, we have to understand the performance implications for users migrating from the device plugin world to DRA.

Metadata

    Labels

    perf (issue/pr related to performance)

    Projects

    Status

    Backlog
