Commit e9058cc

Call out GKE-specific labels/namespace for service monitor
1 parent e881087 commit e9058cc

2 files changed: +17 -0 lines changed

ai/vllm-deployment/hpa/gpu-dcgm-exporter-service.yaml

Lines changed: 11 additions & 0 deletions
@@ -2,21 +2,32 @@
 # pods. The Prometheus Operator's ServiceMonitor will target this Service
 # to discover and scrape the GPU metrics. This is especially important
 # because the exporter pods are part of a DaemonSet, and their IPs can change.
+#
+# NOTE: This configuration is specific to GKE, which automatically deploys the
+# DCGM exporter in the 'gke-managed-system' namespace. For other cloud
+# providers or on-premise clusters, you would need to deploy your own DCGM
+# exporter (e.g., via a Helm chart) and update this Service's 'namespace'
+# and 'labels' to match your deployment.
 
 apiVersion: v1
 kind: Service
 metadata:
   name: gke-managed-dcgm-exporter
+  # GKE-SPECIFIC: GKE deploys its managed DCGM exporter in this namespace.
+  # On other platforms, this would be the namespace where you deploy the exporter.
   namespace: gke-managed-system
   labels:
     # This label is critical. The ServiceMonitor uses this label to find this
     # specific Service. If the labels don't match, Prometheus will not be
     # able to discover the metrics endpoint.
+    # GKE-SPECIFIC: This label is used by GKE's managed service. For a custom
+    # deployment, you would use a more generic label like 'nvidia-dcgm-exporter'.
     app.kubernetes.io/name: gke-managed-dcgm-exporter
 spec:
   selector:
     # This selector tells the Service which pods to route traffic to.
     # It must match the labels on the DCGM exporter pods.
+    # GKE-SPECIFIC: This selector matches the labels on GKE's managed DCGM pods.
     app.kubernetes.io/name: gke-managed-dcgm-exporter
   ports:
   - # The 'name' of this port is important. The ServiceMonitor will specifically
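
The NOTE added in this file says that non-GKE clusters need their own DCGM exporter and a Service whose namespace and labels match it. As a rough sketch of what that adaptation might look like, the manifest below uses the 'nvidia-dcgm-exporter' label suggested in the comment above. The Service name, the `gpu-monitoring` namespace, and port 9400 are assumptions made for illustration and are not part of this commit; check them against your own exporter deployment.

```yaml
# Hypothetical Service for a self-managed DCGM exporter (not part of this commit).
apiVersion: v1
kind: Service
metadata:
  # Assumed name and namespace: use whatever your exporter deployment uses.
  name: nvidia-dcgm-exporter
  namespace: gpu-monitoring
  labels:
    # The ServiceMonitor's matchLabels must match this label.
    app.kubernetes.io/name: nvidia-dcgm-exporter
spec:
  selector:
    # Must match the labels on your DCGM exporter pods.
    app.kubernetes.io/name: nvidia-dcgm-exporter
  ports:
    - # The ServiceMonitor refers to this port by name.
      name: metrics
      # Assumed port: verify the port your exporter actually listens on.
      port: 9400
      targetPort: 9400
```

The design point is the same as in the GKE version: the Service's own label, its pod selector, and the port name are the three values that the ServiceMonitor and the exporter pods have to agree on.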

ai/vllm-deployment/hpa/gpu-service-monitor.yaml

Lines changed: 6 additions & 0 deletions
@@ -17,11 +17,17 @@ spec:
   # the labels on the 'gke-managed-dcgm-exporter' Service.
   selector:
     matchLabels:
+      # GKE-SPECIFIC: This label matches the Service for GKE's managed DCGM
+      # exporter. If you are using a different DCGM deployment, you must
+      # update this label to match the label of the corresponding Service.
       app.kubernetes.io/name: gke-managed-dcgm-exporter
   # This selector specifies which namespace to search for the target Service.
   # For GKE, the DCGM service is in 'gke-managed-system'.
   namespaceSelector:
     matchNames:
+    # GKE-SPECIFIC: This is the namespace for GKE's managed DCGM exporter.
+    # For other environments, this should be the namespace where you have
+    # deployed the DCGM exporter Service.
     - gke-managed-system
   endpoints:
   - port: metrics
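
The comments added here say that the `matchLabels` and `matchNames` values must be updated for a non-GKE DCGM deployment. A minimal sketch of such a ServiceMonitor follows, assuming a self-managed exporter Service labeled `app.kubernetes.io/name: nvidia-dcgm-exporter` in a namespace called `gpu-monitoring`. Those two values, and the ServiceMonitor's own name and namespace, are hypothetical and not taken from this commit.

```yaml
# Hypothetical ServiceMonitor for a self-managed DCGM exporter (not part of this commit).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # Assumed name and namespace for the ServiceMonitor itself.
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      # Must equal the label on the exporter Service you deployed.
      app.kubernetes.io/name: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
      # Assumed namespace: the one that contains the exporter Service.
      - gpu-monitoring
  endpoints:
    - # Refers to the Service port by name, as in the original manifest.
      port: metrics
```

Depending on how your Prometheus resource is configured, the ServiceMonitor may also need labels of its own so that Prometheus selects it; that part is outside what this commit covers.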
