Conversation
/gcbrun
```go
ctx := namespaces.WithNamespace(context.Background(), namespaces.Default)
if launchSpec.InstallGpuDriver {
	if launchSpec.Experiments.EnableConfidentialGPUSupport {
		installer := gpu.NewDriverInstaller(containerdClient, launchSpec, logger)
		err = installer.InstallGPUDrivers(ctx)
		if err != nil {
			return fmt.Errorf("failed to install gpu drivers: %v", err)
		}
	} else {
		logger.Info("Confidential GPU support experiment flag is not enabled for this project. Ensure that it is enabled when tee-install-gpu-driver is set to true")
		return fmt.Errorf("confidential gpu support experiment flag is not enabled")
	}
}
```
Any reason for moving this block around?
For trusted space, there was a comment asking to move this block: #497 (comment). Followed the same here, as it is at the beginning of `startLauncher`.
```go
if deviceInfo != deviceinfo.NO_GPU {
	logger.Error("GPU is attached, tee-install-gpu-driver is not set")
	return fmt.Errorf("tee-install-gpu-driver is expected to set to true when GPU is attached")
}
```
Did you check with Rene on this requirement? This would be a breaking change to CS on H100, because a regular workload couldn't even run with a GPU device attached.
Agreed, but without this check, if a GPU is present and the driver installation flag is missing, workloads relying on the GPU would likely fail. The nature of that failure depends on how the workload manages its GPU dependency: it might fall back to CPU, fail, or terminate entirely. This check ensures predictable behavior when a GPU is attached but the installation flag is not set.
I assumed that attaching a GPU implies its intended use, thus requiring the driver installation flag to guarantee the necessary drivers are present. I can confirm this with Rene.
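The behavior described above can be summarized as a small decision matrix over the two inputs: whether a GPU is attached and whether `tee-install-gpu-driver` is set. The sketch below is illustrative only (the function and state names are not the launcher's actual API), and the no-GPU-with-flag-set case is an assumption:

```go
package main

import "fmt"

// launchDecision sketches the launch-time behavior discussed in this thread.
// Names are hypothetical; only the GPU-attached cases are confirmed by the PR.
func launchDecision(gpuAttached, installFlagSet bool) string {
	switch {
	case gpuAttached && installFlagSet:
		return "install drivers, expose GPU to workload"
	case gpuAttached && !installFlagSet:
		return "fail launch deterministically"
	case !gpuAttached && installFlagSet:
		return "no GPU device to set up" // assumption: driver install without a device
	default:
		return "normal CPU-only launch"
	}
}

func main() {
	for _, c := range []struct{ gpu, flag bool }{
		{true, true}, {true, false}, {false, true}, {false, false},
	} {
		fmt.Printf("gpu=%v flag=%v -> %s\n", c.gpu, c.flag, launchDecision(c.gpu, c.flag))
	}
}
```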
> workloads relying on the GPU would likely fail
When would this happen? Why would we pass through the GPU device if they don't set the flag?
> When would this happen?

When a GPU is attached to the VM but the customer didn't set the tee-install-gpu-driver flag. In this case the workload expects to interact with the GPU device, but because we didn't install drivers (as the flag was not set), the GPU device is not available to the workload container and workload execution might fail.

> Why would we pass through the GPU device if they don't set the flag?

We do not make the GPU device available to the container if this flag is not set. This check was added only to fail deterministically when a GPU is attached but, mistakenly or knowingly, the installation flag is not set.
> When a GPU is attached to the VM but the customer didn't set the tee-install-gpu-driver flag. In this case the workload expects to interact with the GPU device, but because we didn't install drivers (as the flag was not set), the GPU device is not available to the workload container and workload execution might fail.
I understand, but we could print a warning log instead of stopping launch. Please double check with Rene.
I checked this with Rene. He suggested failing upfront instead of getting into a condition where the workload might not work.
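The fail-upfront approach agreed on above can be sketched as a standalone validation step. This is a minimal illustration, not the launcher's actual code: the type, constant, and function names here are hypothetical stand-ins for `deviceinfo.NO_GPU` and the launch-spec flag:

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceInfo is a hypothetical stand-in for the launcher's deviceinfo type.
type DeviceInfo int

const (
	NoGPU DeviceInfo = iota // mirrors deviceinfo.NO_GPU
	NvidiaH100              // any attached GPU
)

// checkGPUConfig fails fast when a GPU is attached but the
// tee-install-gpu-driver flag is not set, instead of letting the
// workload start without a usable device.
func checkGPUConfig(device DeviceInfo, installGPUDriver bool) error {
	if device != NoGPU && !installGPUDriver {
		return errors.New("tee-install-gpu-driver must be set to true when a GPU is attached")
	}
	return nil
}

func main() {
	// GPU attached, flag unset: launch stops upfront with an error.
	fmt.Println(checkGPUConfig(NvidiaH100, false))
	// No GPU attached, flag unset: launch proceeds (nil error).
	fmt.Println(checkGPUConfig(NoGPU, false))
}
```

The alternative discussed earlier, logging a warning and continuing, would keep the launch alive but defer the failure to an unpredictable point inside the workload; failing upfront makes the misconfiguration visible at launch time.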
This PR contains the following changes:

- `container_runner.startLauncher` function.

Testing: