Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 4 additions & 8 deletions cluster-autoscaler/cloudprovider/gce/dynamicresources.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,16 +16,12 @@ limitations under the License.

package gce

import apiv1 "k8s.io/api/core/v1"

const (
// DraGPUDriver name of the driver used to expose NVIDIA GPU resources
DraGPUDriver = "gpu.nvidia.com"
// DraGPULabel is the label added to nodes with GPU resource exposed via DRA.
DraGPULabel = "cloud.google.com/gke-gpu-dra-driver"
import (
apiv1 "k8s.io/api/core/v1"
"k8s.io/autoscaler/cluster-autoscaler/utils/gpu"
)

// GpuDraDriverEnabled checks whether GPU driver is enabled on the node
func GpuDraDriverEnabled(node *apiv1.Node) bool {
return node.Labels[DraGPULabel] == "true"
return node.Labels[gpu.DraGPULabelGKE] == "true"
}
2 changes: 1 addition & 1 deletion cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ func (gce *GceCloudProvider) GetNodeGpuConfig(node *apiv1.Node) *cloudprovider.G
// so we overwrite extended resource name as it won't ever
// be there
if GpuDraDriverEnabled(node) {
gpuConfig.DraDriverName = DraGPUDriver
gpuConfig.DraDriverName = gpu.DraGPUDriver
gpuConfig.ExtendedResourceName = ""
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,6 @@ import (
apiv1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce"
testprovider "k8s.io/autoscaler/cluster-autoscaler/cloudprovider/test"
"k8s.io/autoscaler/cluster-autoscaler/context"
"k8s.io/autoscaler/cluster-autoscaler/utils/gpu"
Expand Down Expand Up @@ -157,7 +156,7 @@ func TestFilterOutNodesWithUnreadyResources(t *testing.T) {
ObjectMeta: metav1.ObjectMeta{
Name: "nodeGPUViaDra",
Labels: map[string]string{
gce.DraGPULabel: "true",
gpu.DraGPULabelGKE: "true",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, sorry for missing this part during review. IMO this test code shouldn't depend on anything GCE/GKE-specific. And it in fact doesn't, this test case doesn't really test anything at the moment. The Node is treated as ready because it doesn't have the GPU label that TestCloudProvider.GPULabel() returns, not because it has the DRA enablement label.

I think something like this would be much better (and would actually test the DRA part of the logic):

  1. Extend TestCloudProvider to allow configuring the result of TestCloudProvider.GetNodeGpuConfig() from the test code.
  2. Change this test to configure the result of TestCloudProvider.GetNodeGpuConfig() differently in different test cases.
  3. Make this one into a test case where the Node does have the TestCloudProvider.GPULabel() label, and the behavior is different based on the result of TestCloudProvider.GetNodeGpuConfig() (the Node is ready if DraDriverName is set, the Node is not ready if ExtendedResourceName is set).

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's a quick & dirty draft of what #1 might look like (and also a sanity check that I'm following your logic):

#8604

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the DraGPULabelGKE is not required for the test, then i'm in favor of using a more neutral label like something from the test provider. as long as we don't introduce the circular dependency ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackfrancis Yup, this is exactly what I meant in step 1 - LGTMd the PR!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

},
CreationTimestamp: metav1.NewTime(start),
},
Expand Down
7 changes: 7 additions & 0 deletions cluster-autoscaler/utils/gpu/gpu.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,13 @@ const (
MetricsNoGPU = ""
)

const (
// DraGPUDriver name of the driver used to expose NVIDIA GPU resources
DraGPUDriver = "gpu.nvidia.com"
// DraGPULabelGKE is the label added to nodes with GPU resource exposed via DRA.
DraGPULabelGKE = "cloud.google.com/gke-gpu-dra-driver"
)

// GetGpuInfoForMetrics returns the name of the custom resource and the GPU used on the node or empty string if there's no GPU
// if the GPU type is unknown, "generic" is returned
// NOTE: current implementation is GKE/GCE-specific
Expand Down
Loading