
Measure GPU CC mode #569

Merged
meetrajvala merged 4 commits into cs_cgpu_h100 from mhvcgpu3 on May 13, 2025
Conversation

@meetrajvala (Contributor) commented May 2, 2025

This PR contains the following changes:

  • Adds the function for measuring the GPU CC mode and uses it in container_runner.
  • Moves the GPU driver installation steps later in the startLauncher function.
  • Adds a check for the scenario where a GPU is attached but the required metadata flag for driver installation is not passed.
  • Adds a new image test for the scenario where a cGPU is attached but the driver-installation metadata flag is not passed in the VM creation command.

Testing:

  • Image integration tests for confidential GPU ran successfully.
  • Unit tests for the newly added function passed.
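The PR does not show the CC-mode measurement function itself, but one plausible shape for it is parsing the status line reported by the driver tooling. The sketch below is an assumption, not the PR's implementation: the helper name `parseCCStatus`, the command `nvidia-smi conf-compute -f`, and the exact output format ("CC status: ON") are all hypothetical illustrations.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCCStatus is a hypothetical helper illustrating one way a GPU
// CC-mode measurement could work: interpret a status line such as
// "CC status: ON" (as might be printed by a command along the lines of
// `nvidia-smi conf-compute -f`) and report whether confidential-compute
// mode is enabled. The real function in this PR may differ.
func parseCCStatus(out string) (bool, error) {
	parts := strings.SplitN(strings.TrimSpace(out), ":", 2)
	if len(parts) != 2 {
		return false, fmt.Errorf("unexpected CC status output: %q", out)
	}
	switch strings.TrimSpace(parts[1]) {
	case "ON":
		return true, nil
	case "OFF":
		return false, nil
	default:
		return false, fmt.Errorf("unknown CC status in %q", out)
	}
}

func main() {
	enabled, err := parseCCStatus("CC status: ON")
	fmt.Println(enabled, err) // true <nil>
}
```

A caller such as container_runner could then fail or log based on the returned boolean rather than re-parsing tool output at each call site.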

@meetrajvala (Contributor Author)

/gcbrun

Comment on lines -192 to -205
```go
ctx := namespaces.WithNamespace(context.Background(), namespaces.Default)
if launchSpec.InstallGpuDriver {
	if launchSpec.Experiments.EnableConfidentialGPUSupport {
		installer := gpu.NewDriverInstaller(containerdClient, launchSpec, logger)
		err = installer.InstallGPUDrivers(ctx)
		if err != nil {
			return fmt.Errorf("failed to install gpu drivers: %v", err)
		}
	} else {
		logger.Info("Confidential GPU support experiment flag is not enabled for this project. Ensure that it is enabled when tee-install-gpu-driver is set to true")
		return fmt.Errorf("confidential gpu support experiment flag is not enabled")
	}
}
```

Collaborator

Any reason why moving this block around?

Contributor Author

For trusted space, there was a review comment asking to move this block: #497 (comment). I followed the same approach here, since this block was at the beginning of startLauncher.

Comment on lines +249 to +252
```go
if deviceInfo != deviceinfo.NO_GPU {
	logger.Error("GPU is attached, tee-install-gpu-driver is not set")
	return fmt.Errorf("tee-install-gpu-driver is expected to set to true when GPU is attached")
}
```
Collaborator

Did you check with Rene on this requirement? This would be a breaking change to CS on H100, because a regular workload couldn't even run with a GPU device attached.

@meetrajvala (Contributor Author) commented May 6, 2025

Agreed, but without this check, if a GPU is present and the driver installation flag is missing, workloads relying on the GPU would likely fail. The nature of that failure would depend on how the workload manages its GPU dependency: it might fall back to CPU usage, fail, or terminate entirely. This check ensures predictable behavior when a GPU is attached but the installation flag is not set.

I assumed that attaching a GPU implies its intended use, thus requiring the driver installation flag to guarantee necessary drivers are present. I can confirm this with Rene.

Contributor

> workloads relying on the GPU would likely fail

When would this happen? Why would we pass through the GPU device if they don't set the flag?

Contributor Author

> When would this happen?

When a GPU is attached to the VM but the customer didn't set the tee-install-gpu-driver flag. In this case, the workload expects to interact with the GPU device, but because we didn't install drivers (since the flag was not set), the GPU device would not be available to the workload container and workload execution might fail.

> Why would we pass through the GPU device if they don't set the flag?

We do not make the GPU device available to the container if this flag is not set. This check was added just to fail deterministically when a GPU is attached but the installation flag is not set, whether by mistake or deliberately.
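The fail-deterministically behavior being discussed can be shown as a minimal runnable sketch. The types `DeviceInfo`, `NO_GPU`, and `H100` below are stand-ins for the launcher's real deviceinfo package, and `checkGPUDriverFlag` is a hypothetical wrapper around the guard quoted in the diff above; only the condition itself mirrors the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceInfo and its constants are minimal stand-ins for the launcher's
// deviceinfo package, just to make the guard runnable here.
type DeviceInfo int

const (
	NO_GPU DeviceInfo = iota
	H100
)

// checkGPUDriverFlag sketches the guard from the diff: if a GPU is
// attached but tee-install-gpu-driver is not set, fail the launch
// upfront rather than letting the workload fail later in an
// unpredictable way (CPU fallback, crash, or silent misbehavior).
func checkGPUDriverFlag(deviceInfo DeviceInfo, installGpuDriver bool) error {
	if !installGpuDriver && deviceInfo != NO_GPU {
		return errors.New("tee-install-gpu-driver is expected to be set to true when GPU is attached")
	}
	return nil
}

func main() {
	fmt.Println(checkGPUDriverFlag(H100, false))   // deterministic launch failure
	fmt.Println(checkGPUDriverFlag(NO_GPU, false)) // <nil>: no GPU, flag not needed
}
```

The key property is that the error depends only on the attached hardware and the metadata flag, not on what the workload happens to do at runtime.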

Contributor

> When a GPU is attached to the VM but the customer didn't set the tee-install-gpu-driver flag. In this case, the workload expects to interact with the GPU device, but because we didn't install drivers (since the flag was not set), the GPU device would not be available to the workload container and workload execution might fail.

I understand, but we could print a warning log instead of stopping launch. Please double check with Rene.

@meetrajvala (Contributor Author) commented May 12, 2025

I checked this with Rene. He suggested failing upfront instead of getting into a condition where the workload might not work.


@meetrajvala merged commit 493e491 into cs_cgpu_h100 on May 13, 2025
11 checks passed