GPU product node label supports only one product type #1659

@brgavino

Describe the bug
The gpu.intel.com/product label supports only one value, such as 'Flex_140' or 'Flex_170'. When both types of cards are installed, only 'Flex_140' is written as the label value (due to the rule order in https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/nfd/overlays/node-feature-rules/platform-labeling-rules.yaml). This matters when pods have differing preferences: for example, a pod may perform better on Flex 170 than on Flex 140, so it should prefer scheduling onto nodes that have a Flex 170. Other logic may handle the actual consumption of that resource.
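
To illustrate the kind of scheduling preference meant here, the following is a minimal sketch of a pod that prefers Flex 170 nodes by selecting on a per-device-ID label instead of gpu.intel.com/product. It assumes that PCI device ID 56c0 corresponds to Flex 170; the pod name and image are placeholders, not taken from any real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Soft preference: the pod can still land on other nodes
      # if no node carries the label.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              # Assumption: 56c0 is the Flex 170 device ID.
              - key: gpu.intel.com/device-id.0380-56c0.present
                operator: In
                values: ["true"]
  containers:
    - name: app
      image: registry.example.com/gpu-app:latest   # placeholder image
```

This works on a mixed node, but the pod author has to know the PCI device ID rather than the product name.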

To Reproduce
Steps to reproduce the behavior:

  1. Node has both a Flex 140 card and a Flex 170 card installed
  2. Device plugins are installed per the instructions
  3. Apply platform-labeling-rules.yaml
  4. Only gpu.intel.com/product=Flex_140 appears in the node labels

Expected behavior
The values for gpu.intel.com/product should follow the pattern of the other labels, such as gpu.intel.com/device-id.0380-56c0.present=true or gpu.intel.com/device-id.0380-56c1.present=true. At present, pods can select on these device-ID labels, but they are not transparent with respect to Intel product naming. I suggest something like gpu.intel.com/product.Flex_140.present=true, etc., which would indicate the presence of each commonly understood Intel GPU product name.
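
As a rough sketch of what the suggestion could look like, assuming the labeling rules are expressed as NFD NodeFeatureRule objects and that device IDs 56c1 and 56c0 map to Flex 140 and Flex 170 respectively, each product would get its own label key so that the rules no longer compete for a single value. The object name, rule names, and label keys below are hypothetical and do not exist in the project today.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-product-presence          # hypothetical name
spec:
  rules:
    - name: "intel.gpu.flex140.present"     # hypothetical rule name
      labels:
        # Proposed label key from this issue, not an existing one.
        "gpu.intel.com/product.Flex_140.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            class: {op: In, value: ["0380"]}
            device: {op: In, value: ["56c1"]}   # assumed Flex 140 ID
    - name: "intel.gpu.flex170.present"     # hypothetical rule name
      labels:
        "gpu.intel.com/product.Flex_170.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            class: {op: In, value: ["0380"]}
            device: {op: In, value: ["56c0"]}   # assumed Flex 170 ID
```

With rules of this shape, a node holding both cards would carry both labels, and a pod could express a preference by product name.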

System (please complete the following information):

  • OpenShift 4.13.11
  • Device plugins version: v0.28.0
  • Hardware info: SPR with Flex 140 + Flex 170 in same system

Additional context
n/a
