Skip to content

Conversation

Alex-Welsh
Copy link
Member

@Alex-Welsh Alex-Welsh commented Mar 28, 2025

This change is designed to drastically simplify pci passthrough GPU configuration.

The idea is that we have a data dictionary containing common GPU types, and templating to write that data to nova config files based on a simple group-to-gpu map.

Configuring passthrough is as simple as creating a dictionary like this:

gpu_group_map:
  compute_a100:
    - a100_80
  compute_v100:
    - v100_32
  compute_multi_gpu:
    - a100_80
    - v100_32

(and passing through the group names to Kolla-Ansible)

The pci-passthrough.yml playbook manages host configuration (and is pre-hooked to overcloud host configure)

Templates for nova-compute.conf, nova-api.conf, and nova-scheduler.conf have been added.

Changes have now been tested in a prod environment

@Alex-Welsh Alex-Welsh force-pushed the pci-passthrough-defaults branch from 115d514 to 8a06ffb Compare March 28, 2025 15:53
@Alex-Welsh Alex-Welsh requested a review from MoteHue March 28, 2025 16:02
@Alex-Welsh Alex-Welsh force-pushed the pci-passthrough-defaults branch from 0856eb0 to 9cd7720 Compare March 28, 2025 17:06
@priteau
Copy link
Member

priteau commented Apr 3, 2025

This should be using the stackhpc.linux collection if possible: stackhpc/ansible-collection-linux#28

@Alex-Welsh Alex-Welsh force-pushed the pci-passthrough-defaults branch from 2c3562c to bbd2eaa Compare April 8, 2025 10:26
@Alex-Welsh
Copy link
Member Author

This should be using the stackhpc.linux collection if possible: stackhpc/ansible-collection-linux#28

Agreed, though I'd rather get this merged so we can start using it, then update once the collection supports it

@Alex-Welsh Alex-Welsh marked this pull request as ready for review April 8, 2025 10:28
@Alex-Welsh Alex-Welsh requested a review from a team as a code owner April 8, 2025 10:28
@Alex-Welsh Alex-Welsh changed the title WIP: Add defaults for GPU PCI passthrough configuration Add defaults for GPU PCI passthrough Apr 8, 2025
@Alex-Welsh
Copy link
Member Author

I tried out the changes on a client deployment with three GPU types, worked very well

MoteHue
MoteHue previously requested changes Apr 9, 2025
Copy link
Contributor

@MoteHue MoteHue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some docs changes now.
I've used this at a different customer site, worked a treat

@Alex-Welsh
Copy link
Member Author

Since we're not far from the 2025.1 release, should we target that instead?

@MoteHue
Copy link
Contributor

MoteHue commented Apr 15, 2025

Since we're not far from the 2025.1 release, should we target that instead?

As long as we backport, because we'll still need this for deployments before the next upgrade season

@Alex-Welsh
Copy link
Member Author

Since we're not far from the 2025.1 release, should we target that instead?

As long as we backport, because we'll still need this for deployments before the next upgrade season

If we want it in 2024.1 anyway, we should just include it here and merge it up like normal

Copy link
Contributor

@MoteHue MoteHue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks @Alex-Welsh!

@Alex-Welsh Alex-Welsh merged commit 68fd836 into stackhpc/2024.1 Apr 17, 2025
14 checks passed
@Alex-Welsh Alex-Welsh deleted the pci-passthrough-defaults branch April 17, 2025 13:30
@bbezak
Copy link
Member

bbezak commented Aug 18, 2025

bringing defaults nova-scheduling filters for everybody - not sure if that is best way. Plus empty pci stanzas in default nova configs. Can this be refactored as a opt-in feature ? - potentially as a gpu-passthrough mix-in environment, or at least opt-in variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants