Description
Hi,
As of right now the GPU operator only supports one sandbox workload mode per node, set through the `.spec.sandboxWorkloads.defaultWorkload` field or through an annotation on the node.
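For reference, the current per-node selection roughly looks like this. This is only a sketch: the default `cluster-policy` object name, the `nvidia.com/gpu.workload.config` node label key, and the `worker-1` node name are assumptions based on current operator releases.

```bash
# Cluster-wide default workload type (documented values: container, vm-passthrough, vm-vgpu)
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"sandboxWorkloads": {"enabled": true, "defaultWorkload": "container"}}}'

# Per-node override: the whole node is switched to a single mode at once
kubectl label node worker-1 nvidia.com/gpu.workload.config=vm-passthrough --overwrite
```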
Segregating vGPU driver nodes from "standard" driver nodes this way makes perfect sense, as the two drivers cannot coexist on the same kernel.
But it would also make sense to be able to mix containers and PCIe passthrough on the same node: N cards can be bound to vfio-pci (and exposed to virtualization engines such as KVM), while the other cards stay on the nvidia driver and are then "forwarded" to containers using nvidia-ctk.
I've validated (by modifying the operator resources) that this is possible; a dirty PoC works by doing these steps in order (a rough shell sketch follows the list):
- Set up the nvidia driver on the node
- Temporarily kill the `nvidia-persistenced` daemon (if present) to free its handles on the cards, so that the selected cards can be unbound from the nvidia driver
- Run the `/bin/vfio-manage.sh` script in the vfio-manager with `-d` (instead of `-all` currently) to bind the selected cards to the vfio driver
- Run the device plugin pod as usual to discover the VFIO-bound devices and expose them to KubeVirt
- Proceed as usual with the container toolkit pod setup
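For illustration, here is a minimal shell sketch of what these steps boil down to for a single card when done by hand; the PCI address is a placeholder and the `systemctl` calls assume `nvidia-persistenced` runs as a host systemd service, whereas the actual PoC performs the same actions through the operator's driver and vfio-manager containers.

```bash
# Hypothetical PCI address of the card to hand over to passthrough
GPU=0000:3b:00.0

# Stop nvidia-persistenced so it releases its handles on the device
systemctl stop nvidia-persistenced

# Unbind only the selected card from the nvidia driver
echo "$GPU" > /sys/bus/pci/drivers/nvidia/unbind

# Bind that card to vfio-pci via driver_override
modprobe vfio-pci
echo vfio-pci > "/sys/bus/pci/devices/$GPU/driver_override"
echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/bind

# Restart persistenced for the cards that stay on the nvidia driver
systemctl start nvidia-persistenced
```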
This presents a net gain in flexibility when using multiple GPUs per node.
Concerning the implementation, we could imagine selecting what each device is used for either through a field in the ClusterPolicy object (a map per node) or through a special annotation on each node; both options are sketched below.
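To make that concrete, here is a hypothetical sketch of both options; neither exists today, and the `nvidia.com/gpu.workload.devices` annotation key and the `nodeDeviceWorkloads` field are made-up names purely to illustrate the shape of the API.

```bash
# Option A (hypothetical annotation): map PCI addresses to workload types per node
kubectl annotate node worker-1 --overwrite \
  nvidia.com/gpu.workload.devices='{"0000:3b:00.0":"vm-passthrough","0000:d8:00.0":"container"}'

# Option B (hypothetical ClusterPolicy field): a per-node device map under spec.sandboxWorkloads
kubectl patch clusterpolicy cluster-policy --type merge -p '
{"spec": {"sandboxWorkloads": {"nodeDeviceWorkloads":
  {"worker-1": {"0000:3b:00.0": "vm-passthrough", "0000:d8:00.0": "container"}}}}}'
```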
If this sounds like something that could benefit users of the GPU operator, I'd be willing to open a draft PR with a usable solution; let me know what you think.