Add MIG config support when MIG-backed vGPU type #93
force-pushed from 65cac5a to 6348f09
/cc @cdesiniotis
force-pushed from 9c10da3 to 18cd4c8
force-pushed from 9198b86 to 0b7ae0f
cdesiniotis left a comment:
Thanks @mresvanis for the work on this! This is a great start.
The main feedback I have concerns the logic which converts vGPU configuration to mig-parted configuration.
cmd/nvidia-k8s-vgpu-dm/main.go (outdated)

```go
}

func determineMIGConfig(selectedConfig string) (string, error) {
	vgpuType, err := types.ParseVGPUType(selectedConfig)
```
This will only work if selectedConfig is a valid vGPU type string, e.g. A40-48Q. However, this is not guaranteed as the keys in the vGPU config file can take on arbitrary values. With that said, I understand why you made this assumption -- all of the configs (except the default config) in the default vGPU ConfigMap included in the gpu-operator are named as valid vGPU type strings: https://github.com/NVIDIA/gpu-operator/blob/6324d2aca562edf46d93cbf9d2a0837ab5c12e59/assets/state-vgpu-device-manager/0500_configmap.yaml
This will definitely return an error when the default configuration is applied. For your reference, the default configuration is applied when a node is not labeled with nvidia.com/vgpu.config:
(see vgpu-device-manager, cmd/nvidia-k8s-vgpu-dm/main.go, lines 195 to 207 at 5f37569)
A "proper" solution would likely entail
- Parse the vGPU config file
- Get the
selectedConfigentry - Convert the vGPU config into an equivalent mig-parted config
- Write the mig-parted config to a file
- Invoke
reconfigureMIGand set theMigPartedConfigFileFlagto the file created in Step 4.
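The steps above could be sketched roughly as follows. This is an illustrative skeleton only: `parseVGPUConfig` and `toMigPartedConfig` are hypothetical stand-ins for the real parsing/conversion code, and the final step only prints what the real code would pass to `reconfigureMIG`.

```go
package main

import (
	"fmt"
	"os"
)

// Hypothetical stand-ins for the real parsing and conversion helpers.
func parseVGPUConfig(path, selected string) (string, error) { return selected, nil }
func toMigPartedConfig(vgpuSpec string) string {
	return vgpuSpec + ":\n- devices: [0]\n  mig-enabled: false\n"
}

// applyVGPUConfig sketches steps 1-5: parse the vGPU config, pick the
// selected entry, convert it to a mig-parted config, write that config to
// a temporary file, and hand the file to reconfigureMIG.
func applyVGPUConfig(configFile, selectedConfig string) error {
	vgpuSpec, err := parseVGPUConfig(configFile, selectedConfig) // steps 1-2
	if err != nil {
		return err
	}
	migYAML := toMigPartedConfig(vgpuSpec) // step 3

	f, err := os.CreateTemp("", "mig-parted-*.yaml") // step 4
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(migYAML); err != nil {
		return err
	}
	// Step 5: the real code would invoke reconfigureMIG with
	// MigPartedConfigFileFlag set to f.Name().
	fmt.Println("would invoke reconfigureMIG with MigPartedConfigFileFlag =", f.Name())
	return f.Close()
}

func main() {
	if err := applyVGPUConfig("/etc/vgpu/config.yaml", "custom-config"); err != nil {
		fmt.Println("error:", err)
	}
}
```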
For example, the below vGPU configuration
custom-config:
- devices: [0]
vgpu-devices:
"A100-4C": 10
- devices: [1]
vgpu-devices:
"A100-1-5C": 2
"A100-2-10C": 1
"A100-3-20C": 1
would get converted to the following mig-parted configuration:

```yaml
custom-config:
- devices: [0]
  mig-enabled: false
- devices: [1]
  mig-enabled: true
  mig-devices:
    "1g.5gb": 2
    "2g.10gb": 1
    "3g.20gb": 1
```
My proposed solution has the following benefits when compared to the current implementation:

- We no longer depend on the mig-parted configuration file. This simplifies UX -- for example, if one applied the `custom-config` from my above example, the `custom-config` entry would only need to be present in the vGPU ConfigMap and not also in the mig-parted ConfigMap. Additionally, this simplifies the GPU Operator implementation -- the GPU Operator no longer needs to deploy the mig-parted ConfigMap for this use case.
- No assumptions are made concerning how configurations are named.
- We can naturally support custom / mixed configurations like my example above. The current implementation assumes a single configuration, where a single vGPU type is used across all GPUs on a node.
Thank you for the clear and detailed explanation! That definitely makes sense. Thank you also for validating the proper solution logic, which you first proposed in the design document but which I had skipped until now (apologies).
I think I already made the respective adjustments and I'm now looking through additional edge cases. PTAL and let me know if you think that's what you described above (or at least in the same direction) 🙏
@mresvanis thanks for writing this in Go!
Out of scope: In my opinion, the goal state should be to update the mig-parted repository so that https://github.com/NVIDIA/mig-parted/blob/c0ae956dd7d5414d8a450d478f58f1e17f83ab2e/deployments/container/reconfigure-mig.sh is rewritten in Go. That way, both mig-manager and vgpu-device-manager can leverage the same code. Since the vgpu-device-manager does not need all of the functionality from reconfigure-mig.sh (e.g. vgpu-device-manager does not need reconfigure-mig.sh to stop / restart GPU clients in the operator namespace), we may need to introduce some new options to opt-out of certain functionality.
For now, I am okay if we proceed with introducing reconfigure_mig.go in this PR. Once we rewrite the code in Go in the mig-parted repo, we can update vgpu-device-manager to reuse that code accordingly. I will take the action item to update mig-parted.
> @mresvanis thanks for writing this in Go!

No worries, I saw that you were using Go in this project and assumed the intention is to eventually move the mig-parted reconfigure-mig.sh script to Go as well.

> Out of scope: In my opinion, the goal state should be to update the mig-parted repository so that https://github.com/NVIDIA/mig-parted/blob/c0ae956dd7d5414d8a450d478f58f1e17f83ab2e/deployments/container/reconfigure-mig.sh is rewritten in Go. That way, both mig-manager and vgpu-device-manager can leverage the same code. Since the vgpu-device-manager does not need all of the functionality from `reconfigure-mig.sh` (e.g. vgpu-device-manager does not need `reconfigure-mig.sh` to stop / restart GPU clients in the operator namespace), we may need to introduce some new options to opt-out of certain functionality.
I agree 100%, I also thought I should keep the scope of this change as tight as possible.
> For now, I am okay if we proceed with introducing `reconfigure_mig.go` in this PR. Once we rewrite the code in Go in the mig-parted repo, we can update vgpu-device-manager to reuse that code accordingly. I will take the action item to update mig-parted.
Awesome, please feel free to let me know if I can help with this refactoring. Happy to take this up (since I'm also the one introducing the tech debt here :) )
Driven by the gosec linter error `G204: Audit use of command execution` in reconfigure_mig.go, I added validation for the options (at least for the ones where I think it makes sense).
Please let me know if adding the https://github.com/go-playground/validator package just for this purpose makes sense; otherwise I can roll back these changes, and I think we can safely ignore the respective gosec lint errors.
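For context, the dependency-free alternative would be allow-list validation of any value that ends up interpolated into an `exec.Command` call, which is usually enough to address gosec G204. A minimal sketch, with a hypothetical validator (the regex and function name are assumptions, not the PR's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

// safeConfigName allow-lists characters that can safely appear in a
// config name passed on to exec.Command, as an alternative to pulling in
// go-playground/validator just to quiet gosec G204. Hypothetical helper.
var safeConfigName = regexp.MustCompile(`^[A-Za-z0-9._-]+$`)

func validateConfigName(name string) error {
	if !safeConfigName.MatchString(name) {
		return fmt.Errorf("invalid config name %q", name)
	}
	return nil
}

func main() {
	fmt.Println(validateConfigName("A100-1-5C") == nil)     // true
	fmt.Println(validateConfigName("foo; rm -rf /") == nil) // false
}
```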
@mresvanis thanks for the work on this. I used what you have as a starting point and pulled it into the mig-parted project in NVIDIA/mig-parted#216, where I'm slowly adding the missing functionality there.
My thinking was that we would create a package: github.com/NVIDIA/mig-parted/pkg/mig/reconfigure where we can expose this API and import it into the vgpu-device-manager. Did you want to "migrate" this PR there, or are you ok with me taking over?
@elezar that's great, I think this makes total sense and it will reduce the debt we introduced here. I'm perfectly fine with you taking this over and please feel free to let me know if I can help in any way.
See NVIDIA/mig-parted#218 and #101 for early drafts. I think getting these (or at least 218) in before NVIDIA/mig-parted#216 probably makes sense.
```yaml
imagePullPolicy: IfNotPresent
env:
- name: NAMESPACE
  value: "gpu-operator"
```
Question: Since this DaemonSet is configured to run in the `default` namespace, should this also be `default`?
I am fine with a separate PR to update the samples.
force-pushed from 82479b8 to 84fc0ef
force-pushed from 2b96d3e to 05a6ccd
force-pushed from e7f4439 to aeb4804
This change adds MIG configuration support via nvidia-mig-parted when MIG-backed vGPU types are included in the selected vGPU config, before the vGPU configuration takes place. Specifically:

- Add CLI flags for MIG configuration options.
- Include the nvidia-mig-parted binary in the container image.
- Parse the vGPU config to detect MIG requirements, convert the vGPU config to the respective MIG config, and configure MIG before vGPUs via nvidia-mig-parted.
- nvidia-mig-parted requires NVML. The NVIDIA driver library path is searched and used with the `LD_PRELOAD` env var when running nvidia-mig-parted commands. When NVML is not available, skip MIG configuration and proceed to vGPU configuration. This ensures that the vGPU Device Manager is backwards compatible with components that do not make NVML available to it.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
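The NVML discovery and `LD_PRELOAD` handling described in the commit message could look roughly like the sketch below. The candidate search paths, function names, and skip-on-missing-NVML behavior here are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// findNVMLLib searches candidate driver root paths for libnvidia-ml.so.1.
// The candidate paths are illustrative assumptions.
func findNVMLLib(roots []string) (string, bool) {
	for _, root := range roots {
		matches, _ := filepath.Glob(filepath.Join(root, "libnvidia-ml.so.1"))
		if len(matches) > 0 {
			return matches[0], true
		}
	}
	return "", false
}

// runMigParted runs nvidia-mig-parted with LD_PRELOAD pointing at the
// discovered NVML library; without NVML it reports an error so the caller
// can skip MIG configuration and proceed to vGPU configuration.
func runMigParted(args ...string) error {
	lib, ok := findNVMLLib([]string{
		"/run/nvidia/driver/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
	})
	if !ok {
		return fmt.Errorf("NVML not found, skipping MIG configuration")
	}
	cmd := exec.Command("nvidia-mig-parted", args...)
	cmd.Env = append(os.Environ(), "LD_PRELOAD="+lib)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if err := runMigParted("apply", "--mode-only"); err != nil {
		fmt.Println("MIG configuration skipped or failed:", err)
	}
}
```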
force-pushed from aeb4804 to b1ea9dd
cdesiniotis left a comment:
@mresvanis LGTM. Thanks for the hard work and patience on this!
```go
	}
	migModeChangeRequired = true
}
```
Should we not set the state label to `pending` at this point?
Thanks for catching that! I think the issue with handling the reboot during MIG configuration (when that's enabled) is that we always set the state label to `pending` before handling MIG configuration. The latter makes this check invalid, as the state label will never equal `rebooting`.
We could fix this by:

1. checking if the state label is set to `rebooting` before setting it to `pending` here
2. setting the state label to `pending` after `assertMIGModeOnly` if it's set to `rebooting`
I can open a PR with this change if you also think that this makes sense. WDYT?
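A toy sketch of that proposed ordering, with every name (`nextState`, the stubbed `assertMIGModeOnly`, the state constants) being an assumption rather than the repository's actual identifiers:

```go
package main

import "fmt"

const (
	statePending   = "pending"
	stateRebooting = "rebooting"
)

// assertMIGModeOnly is a stand-in for the real MIG mode assertion; here it
// reports whether a reboot is (still) required. After a reboot the MIG mode
// change has taken effect, so resuming from "rebooting" needs no new reboot.
func assertMIGModeOnly(state string) (rebootRequired bool) {
	return state != stateRebooting
}

// nextState sketches the proposed ordering: only set the label to "pending"
// up front when we are not resuming after a reboot, so the "rebooting"
// check inside the MIG handling remains meaningful.
func nextState(current string) string {
	if current != stateRebooting {
		current = statePending
	}
	if assertMIGModeOnly(current) {
		return stateRebooting
	}
	// Resuming after a reboot: move back to "pending" and continue.
	return statePending
}

func main() {
	fmt.Println(nextState("idle"))         // fresh run needing a MIG mode change
	fmt.Println(nextState(stateRebooting)) // resuming after the node rebooted
}
```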
This PR adds MIG configuration support via `nvidia-mig-parted` when MIG-backed vGPU types are included in the selected vGPU configuration, before the latter takes place.

The changes include:

- Include the `nvidia-mig-parted` binary and `busybox` in the container image.
- Parse the vGPU config to detect MIG requirements, convert it to the respective MIG config, and configure MIG before vGPUs via `nvidia-mig-parted`.
- `nvidia-mig-parted` requires NVML. The NVIDIA driver library path is searched and used with the `LD_PRELOAD` env var when running `nvidia-mig-parted` commands. When NVML is not available, skip MIG configuration and proceed to vGPU configuration. This keeps the vGPU Device Manager backwards compatible with components that do not make NVML available to it.

Design document
nvidia-mig-partedbinary andbusyboxin the container image.nvidia-mig-parted.nvidia-mig-partedrequires NVML. The NVIDIA driver library path is searched and used with theLD_PRELOADenv var when runningnvidia-mig-partedcommands. When NVML is not available, skip MIG configuration and proceed to vGPU configuration. This keeps the vGPU Device Manager backwards compatible with components that do not make NVML available to it.Design document