
Commit 6fc4c59

feat: remove cluster wide logic from namespace restricted operator (#3934)

Signed-off-by: Hannah Zhang <[email protected]>

Parent: 22d910a

File tree

12 files changed: +137 -86 lines


benchmarks/profiler/utils/profiler_argparse.py (13 additions, 4 deletions)

```diff
@@ -158,13 +158,13 @@ def create_profiler_parser() -> argparse.Namespace:
     parser.add_argument(
         "--min-num-gpus-per-engine",
         type=int,
-        default=config.get("hardware", {}).get("min_num_gpus_per_engine", 0),
+        default=config.get("hardware", {}).get("min_num_gpus_per_engine", 1),
         help="minimum number of GPUs per engine",
     )
     parser.add_argument(
         "--max-num-gpus-per-engine",
         type=int,
-        default=config.get("hardware", {}).get("max_num_gpus_per_engine", 0),
+        default=config.get("hardware", {}).get("max_num_gpus_per_engine", 8),
         help="maximum number of GPUs per engine",
     )
     parser.add_argument(
@@ -245,9 +245,15 @@ def create_profiler_parser() -> argparse.Namespace:
     parser.add_argument(
         "--num-gpus-per-node",
         type=int,
-        default=config.get("hardware", {}).get("num_gpus_per_node", 0),
+        default=config.get("hardware", {}).get("num_gpus_per_node", 8),
         help="Number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size",
     )
+    parser.add_argument(
+        "--enable-gpu-discovery",
+        action="store_true",
+        default=config.get("hardware", {}).get("enable_gpu_discovery", False),
+        help="Enable automatic GPU discovery from Kubernetes cluster nodes. When enabled, overrides any manually specified hardware configuration. Requires cluster-wide node access permissions.",
+    )
 
     # Dynamically add all planner arguments from planner_argparse.py
     add_planner_arguments_to_parser(parser, prefix="planner-")
@@ -305,6 +311,9 @@ def create_profiler_parser() -> argparse.Namespace:
     if not args.model and not args.config:
         parser.error("--model or --config is required (provide at least one)")
 
-    auto_generate_search_space(args)
+    # Run auto-generation if GPU discovery is enabled
+    # This will override any manually specified hardware parameters
+    if args.enable_gpu_discovery:
+        auto_generate_search_space(args)
 
     return args
```
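
Taken together, these hunks change the hardware defaults from 0 placeholders to usable values (1/8/8) and make `auto_generate_search_space` opt-in. The resolution order for each hardware parameter is: explicit CLI flag, then the `hardware` section of the profile config, then the new hard default. A minimal, self-contained sketch of that fallback pattern, assuming a `config` dict standing in for whatever `--profile-config` loads (this is an illustration, not the profiler's full parser):

```python
import argparse

# Stand-in for the parsed profile config; only max_num_gpus_per_engine is set.
config = {"hardware": {"max_num_gpus_per_engine": 4}}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--min-num-gpus-per-engine",
    type=int,
    # config-file value wins over the hard default of 1; a CLI flag wins over both
    default=config.get("hardware", {}).get("min_num_gpus_per_engine", 1),
)
parser.add_argument(
    "--max-num-gpus-per-engine",
    type=int,
    default=config.get("hardware", {}).get("max_num_gpus_per_engine", 8),
)
parser.add_argument(
    "--enable-gpu-discovery",
    action="store_true",
    default=config.get("hardware", {}).get("enable_gpu_discovery", False),
)

args = parser.parse_args([])             # no CLI overrides
print(args.min_num_gpus_per_engine)      # 1  (hard default)
print(args.max_num_gpus_per_engine)      # 4  (from the config file)
print(args.enable_gpu_discovery)         # False -> auto_generate_search_space() is skipped
```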

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml (9 additions, 0 deletions)

```diff
@@ -138,6 +138,15 @@ spec:
                     Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
                   type: string
               type: object
+            enableGpuDiscovery:
+              default: false
+              description: |-
+                EnableGpuDiscovery controls whether the profiler should automatically discover GPU
+                resources from the Kubernetes cluster nodes. When enabled, the profiler will override
+                any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,
+                num_gpus_per_node) with values detected from the cluster.
+                Requires cluster-wide node access permissions - only available with cluster-scoped operators.
+              type: boolean
             model:
               description: |-
                 Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").
```
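
For reference, this is how the new field would sit on a DGDR object. A hedged sketch using the kubernetes Python client: the group (`nvidia.com`), version (`v1alpha1`), and plural (`dynamographdeploymentrequests`) are inferred from the CRD filename and the Go package path in this commit, and the spec values are illustrative, with other fields abbreviated:

```python
from kubernetes import client, config

config.load_kube_config()

dgdr = {
    "apiVersion": "nvidia.com/v1alpha1",
    "kind": "DynamoGraphDeploymentRequest",
    "metadata": {"name": "example-dgdr"},      # hypothetical name
    "spec": {
        "model": "Qwen/Qwen3-0.6B",   # example value taken from the CRD docs
        "backend": "vllm",            # enum: vllm;sglang;trtllm
        "enableGpuDiscovery": True,   # rejected by a namespace-restricted operator
        "profilingConfig": {          # abbreviated; config is required by validateSpec
            "config": {
                "hardware": {"min_num_gpus_per_engine": 1, "num_gpus_per_node": 8},
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="nvidia.com",
    version="v1alpha1",
    namespace="default",
    plural="dynamographdeploymentrequests",
    body=dgdr,
)
```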

deploy/cloud/helm/platform/components/operator/templates/_validation.tpl (1 addition, 1 deletion)

```diff
@@ -37,7 +37,7 @@ Prevents all conflict scenarios:
 {{- end -}}
 
 {{- if $namespaceRestrictedOperators -}}
-{{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}}
+{{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set dynamo-operator.namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}}
 {{- end -}}
 {{- end -}}
```

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 addition, 3 deletions)

```diff
@@ -124,9 +124,7 @@ spec:
             - --mpi-run-ssh-secret-name={{ .Values.dynamo.mpiRun.secretName }}
             - --mpi-run-ssh-secret-namespace={{ .Release.Namespace }}
             {{- end }}
-            {{- if .Values.namespaceRestriction.enabled }}
-            - --dgdr-profiling-cluster-role-name={{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-dgdr-profiling-nodes
-            {{- else }}
+            {{- if not .Values.namespaceRestriction.enabled }}
             - --dgdr-profiling-cluster-role-name={{ include "dynamo-operator.fullname" . }}-dgdr-profiling
             - --planner-cluster-role-name={{ include "dynamo-operator.fullname" . }}-planner
             {{- end }}
```

deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (0 additions, 29 deletions)

```diff
@@ -70,35 +70,6 @@ roleRef:
   kind: Role
   name: dgdr-profiling-job
 subjects:
-  - kind: ServiceAccount
-    name: dgdr-profiling-job
-    namespace: {{ .Release.Namespace }}
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRole
-metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-dgdr-profiling-nodes
-  labels:
-    {{- include "dynamo-operator.labels" . | nindent 4 }}
-    app.kubernetes.io/component: dgdr-profiling
-rules:
-  # Nodes - cluster-scoped resource needed for profiling
-  - apiGroups: [""]
-    resources: ["nodes"]
-    verbs: ["get", "list", "watch"]
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-dgdr-profiling-nodes
-  labels:
-    {{- include "dynamo-operator.labels" . | nindent 4 }}
-    app.kubernetes.io/component: dgdr-profiling
-roleRef:
-  apiGroup: rbac.authorization.k8s.io
-  kind: ClusterRole
-  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-dgdr-profiling-nodes
-subjects:
   - kind: ServiceAccount
     name: dgdr-profiling-job
     namespace: {{ .Release.Namespace }}
```
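
The deleted ClusterRole is exactly what made GPU discovery possible in namespace-restricted mode: `nodes` is a cluster-scoped resource, so no namespaced Role can grant `list` on it. A hedged illustration (not the profiler's actual implementation) of the kind of call that permission enables:

```python
from kubernetes import client, config

config.load_kube_config()  # in-cluster code would use config.load_incluster_config()

# Listing nodes requires a ClusterRole granting get/list/watch on "nodes";
# without it the API server returns 403 Forbidden.
gpus_per_node = []
for node in client.CoreV1Api().list_node().items:
    capacity = node.status.capacity or {}
    gpu_count = int(capacity.get("nvidia.com/gpu", "0"))
    if gpu_count > 0:
        gpus_per_node.append(gpu_count)

# A discovery pass could derive e.g. num_gpus_per_node from what it finds.
print(max(gpus_per_node, default=0))
```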

deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (9 additions, 0 deletions)

```diff
@@ -114,6 +114,15 @@ type DynamoGraphDeploymentRequestSpec struct {
 	// +kubebuilder:validation:Enum=vllm;sglang;trtllm
 	Backend string `json:"backend"`
 
+	// EnableGpuDiscovery controls whether the profiler should automatically discover GPU
+	// resources from the Kubernetes cluster nodes. When enabled, the profiler will override
+	// any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,
+	// num_gpus_per_node) with values detected from the cluster.
+	// Requires cluster-wide node access permissions - only available with cluster-scoped operators.
+	// +kubebuilder:default=false
+	// +kubebuilder:validation:Optional
+	EnableGpuDiscovery bool `json:"enableGpuDiscovery,omitempty"`
+
 	// ProfilingConfig provides the complete configuration for the profiling job.
 	// This configuration is passed directly to the profiler.
 	// The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).
```

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml (9 additions, 0 deletions)

```diff
@@ -138,6 +138,15 @@ spec:
                     Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
                   type: string
               type: object
+            enableGpuDiscovery:
+              default: false
+              description: |-
+                EnableGpuDiscovery controls whether the profiler should automatically discover GPU
+                resources from the Kubernetes cluster nodes. When enabled, the profiler will override
+                any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,
+                num_gpus_per_node) with values detected from the cluster.
+                Requires cluster-wide node access permissions - only available with cluster-scoped operators.
+              type: boolean
             model:
               description: |-
                 Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").
```

deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (11 additions, 0 deletions)

```diff
@@ -720,6 +720,11 @@ func (r *DynamoGraphDeploymentRequestReconciler) validateSpec(ctx context.Contex
 		return errors.New("profilingConfig.config is required and must not be empty")
 	}
 
+	// Validate enableGpuDiscovery is only true for cluster-wide operators
+	if dgdr.Spec.EnableGpuDiscovery && r.Config.RestrictedNamespace != "" {
+		return errors.New("enableGpuDiscovery can only be set to true for cluster-wide operators. Namespace-restricted operators cannot access cluster nodes for GPU discovery. Please set enableGpuDiscovery to false and provide hardware configuration (hardware.min_num_gpus_per_engine, hardware.max_num_gpus_per_engine, hardware.num_gpus_per_node) in profilingConfig.config")
+	}
+
 	// Validate ConfigMap if provided (for the DGD base config)
 	if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
 		cm := &corev1.ConfigMap{}
@@ -937,6 +942,12 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 		"--profile-config", string(configYAML),
 	}
 
+	// Add --enable-gpu-discovery flag based on DGDR spec
+	// GPU discovery requires cluster-wide node access
+	if dgdr.Spec.EnableGpuDiscovery {
+		profilerArgs = append(profilerArgs, "--enable-gpu-discovery")
+	}
+
 	// Use profiler image from profilingConfig
 	imageName := dgdr.Spec.ProfilingConfig.ProfilerImage
 	logger.Info("Using profiler image", "image", imageName)
```
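
To make the control flow explicit, here is the same pair of additions restated as plain Python (the authoritative implementation is the Go above): validation rejects discovery whenever the operator runs namespace-restricted, and the job builder forwards the spec field as the profiler CLI flag introduced in the first diff.

```python
def validate_spec(enable_gpu_discovery: bool, restricted_namespace: str) -> None:
    # Namespace-restricted operators no longer ship a nodes ClusterRole,
    # so discovery is rejected up front instead of failing inside the Job.
    if enable_gpu_discovery and restricted_namespace:
        raise ValueError(
            "enableGpuDiscovery can only be set to true for cluster-wide operators"
        )


def profiler_args(enable_gpu_discovery: bool, config_yaml: str) -> list[str]:
    args = ["--profile-config", config_yaml]
    if enable_gpu_discovery:
        args.append("--enable-gpu-discovery")  # consumed by profiler_argparse.py
    return args
```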

docs/benchmarks/sla_driven_profiling.md (31 additions, 2 deletions)

````diff
@@ -42,16 +42,45 @@ The recommended way to profile models is through DGDRs. Sample configurations ar
 - **`profile_sla_moe_dgdr.yaml`**: MoE model profiling
 
 The Dynamo Operator automatically:
-1. Discovers GPU resources
+1. Discovers GPU resources (cluster-scoped operators only)
 2. Runs profiling (AIPerf on real engines or AI Configurator simulation)
 3. Generates optimal DGD configuration with SLA planner
 4. Deploys the DGD to your cluster
 
 See the [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for prerequisites and detailed instructions.
 
+## Hardware Configuration
+
+Hardware parameters have sensible defaults and are **optional** - you can override them if needed:
+
+```yaml
+profilingConfig:
+  config:
+    # Override hardware defaults if needed
+    hardware:
+      min_num_gpus_per_engine: 1
+      max_num_gpus_per_engine: 8
+      num_gpus_per_node: 8
+
+    # Only needed when using AI Configurator (sweep.use_ai_configurator: true)
+    sweep:
+      aic_system: h200_sxm # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
+```
+
+### Automatic GPU Discovery (Optional Feature)
+
+Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
+
+```yaml
+spec:
+  enableGpuDiscovery: true
+```
+
+This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
+
 ## Profiling Method
 
-1. **GPU Discovery**: Detects available GPUs and their specifications
+1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
 2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense model and 4 nodes for MoE models.
 3. **Parallelization Mapping Sweep**: Use the input ISL and OSL, test the performance of the engines with different parallelization mappings. For dense models, we test different TP sizes for both prefill and decode. For MoE models, we test different TEP sizes for prefill and DEP sizes for decode.
    - **Prefill**: For prefill, since there is no in-flight batching (assume isl is long enough to saturate the GPU), we directly measure the TTFT for a request with given isl without kv-reusing. For example, the below plot shows the prefill parallelization mapping sweep results for H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
````
