Commit 6e01132

added prometheus receivers to AI Pod workshop

File tree: 3 files changed (+376 −0 lines changed)

Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
---
title: Configure the Prometheus Receiver
linkTitle: 7. Configure the Prometheus Receiver
weight: 7
time: 10 minutes
---

Now that our LLM is up and running, we'll add the Prometheus receiver to our
OpenTelemetry collector to gather metrics from it.

## Capture the NVIDIA DCGM Exporter metrics

The NVIDIA DCGM exporter is running in our OpenShift cluster. It
exposes GPU metrics that we can send to Splunk.

To do this, let's customize the configuration of the collector by editing the
`otel-collector-values.yaml` file that we used earlier when deploying the collector.

Add the following content, just below the `kubeletstats` section:

``` yaml
receiver_creator/nvidia:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [ k8s_observer ]
  receivers:
    prometheus/dcgm:
      config:
        config:
          scrape_configs:
            - job_name: gpu-metrics
              scrape_interval: 10s
              static_configs:
                - targets:
                    - '`endpoint`:9400'
      rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
```

This tells the collector to look for pods with a label of `app=nvidia-dcgm-exporter`.
When it finds a pod with this label, it scrapes the default `/metrics` endpoint on port 9400.
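
If you'd like to confirm that the rule will match something before applying the change, you
can look for pods carrying this label and spot-check the exporter endpoint. A quick sketch
using the `oc` CLI, where `NAMESPACE` and `POD` are placeholders to fill in from the first
command's output:

``` bash
# List pods matching the label targeted by the receiver_creator rule.
oc get pods -A -l app=nvidia-dcgm-exporter

# Forward the exporter port locally and fetch a sample metric; with no
# metrics_path configured, Prometheus scrapes the default /metrics path.
oc port-forward -n "$NAMESPACE" "$POD" 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```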

To ensure the receiver is used, we'll need to add a new pipeline to the `otel-collector-values.yaml` file
as well.

Add the following code to the bottom of the file:

``` yaml
service:
  pipelines:
    metrics/nvidia-metrics:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
      receivers:
        - receiver_creator/nvidia
```

Before applying the changes, let's add one more Prometheus receiver in the next section.

## Capture the NVIDIA NIM metrics

The `meta-llama-3-2-1b-instruct` LLM that we just deployed with NVIDIA NIM also
includes a Prometheus endpoint that we can scrape with the collector. Let's add the
following to the `otel-collector-values.yaml` file, just below the receiver we added earlier:

``` yaml
prometheus/nim-llm:
  config:
    config:
      scrape_configs:
        - job_name: nim-for-llm-metrics
          scrape_interval: 10s
          metrics_path: /v1/metrics
          static_configs:
            - targets:
                - '`endpoint`:8000'
  rule: type == "pod" && labels["app"] == "meta-llama-3-2-1b-instruct"
```

This tells the collector to look for pods with a label of `app=meta-llama-3-2-1b-instruct`.
When it finds a pod with this label, it scrapes the `/v1/metrics` endpoint on port 8000.

There's no need to make changes to the pipeline, as this receiver will already be picked up
as part of the `receiver_creator/nvidia` receiver.
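
As before, you can optionally verify the target before applying the change. A sketch using
the `oc` CLI, where `NAMESPACE` and `POD` are placeholders to fill in from the first
command's output:

``` bash
# Find the NIM pod by the label used in the rule above.
oc get pods -A -l app=meta-llama-3-2-1b-instruct

# NIM exposes its Prometheus metrics at /v1/metrics on port 8000.
oc port-forward -n "$NAMESPACE" "$POD" 8000:8000 &
curl -s localhost:8000/v1/metrics | head
```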

## Add a Filter Processor

Prometheus endpoints can expose a large number of metrics, sometimes with high cardinality.

Let's add a filter processor that defines exactly what metrics we want to send to Splunk.
Specifically, we'll send only the metrics that are utilized by a dashboard chart or an
alert detector.

Add the following code to the `otel-collector-values.yaml` file, after the exporters section
but before the receivers section:

``` yaml
processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - DCGM_FI_DEV_FB_USED
          - DCGM_FI_DEV_GPU_TEMP
          - DCGM_FI_DEV_GPU_UTIL
          - DCGM_FI_DEV_MEM_CLOCK
          - DCGM_FI_DEV_MEM_COPY_UTIL
          - DCGM_FI_DEV_MEMORY_TEMP
          - DCGM_FI_DEV_POWER_USAGE
          - DCGM_FI_DEV_SM_CLOCK
          - DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION
          - DCGM_FI_PROF_DRAM_ACTIVE
          - DCGM_FI_PROF_GR_ENGINE_ACTIVE
          - DCGM_FI_PROF_PCIE_RX_BYTES
          - DCGM_FI_PROF_PCIE_TX_BYTES
          - DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
          - generation_tokens_total
          - go_info
          - go_memstats_alloc_bytes
          - go_memstats_alloc_bytes_total
          - go_memstats_buck_hash_sys_bytes
          - go_memstats_frees_total
          - go_memstats_gc_sys_bytes
          - go_memstats_heap_alloc_bytes
          - go_memstats_heap_idle_bytes
          - go_memstats_heap_inuse_bytes
          - go_memstats_heap_objects
          - go_memstats_heap_released_bytes
          - go_memstats_heap_sys_bytes
          - go_memstats_last_gc_time_seconds
          - go_memstats_lookups_total
          - go_memstats_mallocs_total
          - go_memstats_mcache_inuse_bytes
          - go_memstats_mcache_sys_bytes
          - go_memstats_mspan_inuse_bytes
          - go_memstats_mspan_sys_bytes
          - go_memstats_next_gc_bytes
          - go_memstats_other_sys_bytes
          - go_memstats_stack_inuse_bytes
          - go_memstats_stack_sys_bytes
          - go_memstats_sys_bytes
          - go_sched_gomaxprocs_threads
          - gpu_cache_usage_perc
          - gpu_total_energy_consumption_joules
          - http.server.active_requests
          - num_request_max
          - num_requests_running
          - num_requests_waiting
          - process_cpu_seconds_total
          - process_max_fds
          - process_open_fds
          - process_resident_memory_bytes
          - process_start_time_seconds
          - process_virtual_memory_bytes
          - process_virtual_memory_max_bytes
          - promhttp_metric_handler_requests_in_flight
          - promhttp_metric_handler_requests_total
          - prompt_tokens_total
          - python_gc_collections_total
          - python_gc_objects_collected_total
          - python_gc_objects_uncollectable_total
          - python_info
          - request_finish_total
          - request_success_total
          - system.cpu.time
          - e2e_request_latency_seconds
          - time_to_first_token_seconds
          - time_per_output_token_seconds
          - request_prompt_tokens
          - request_generation_tokens
```
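
Note that `match_type: strict` requires an exact match on each metric name; metrics not in
the list are dropped from this pipeline. If you later want to include a whole family of
metrics without naming each one, the filter processor also supports regular expressions.
A hypothetical sketch for illustration only (the `filter/dcgm_regexp_example` name is made
up and not part of this workshop's configuration):

``` yaml
filter/dcgm_regexp_example:
  metrics:
    include:
      match_type: regexp
      metric_names:
        # Matches every DCGM device-level metric with one pattern.
        - DCGM_FI_DEV_.*
```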

Ensure the `filter/metrics_to_be_included` processor is included in the pipeline we added
earlier at the bottom of the file:

``` yaml
service:
  pipelines:
    metrics/nvidia-metrics:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - filter/metrics_to_be_included
        - batch
        - resourcedetection
        - resource
      receivers:
        - receiver_creator/nvidia
```

## Verify Changes

Before applying the configuration changes to the collector, take a moment to compare the
contents of your modified `otel-collector-values.yaml` file with the `otel-collector-values-with-nvidia.yaml` file.
Update your file as needed to ensure the contents match. Remember that indentation is significant
in `yaml` files and needs to be precise.
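
To compare the two files quickly, you can use `diff`; no output means the files match.
This sketch assumes both files live in the `./otel-collector/` directory referenced by
the Helm command below:

``` bash
# Any output lines point to differences that need to be reconciled.
diff ./otel-collector/otel-collector-values.yaml \
  ./otel-collector/otel-collector-values-with-nvidia.yaml
```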

## Update the OpenTelemetry Collector Config

Now we can update the OpenTelemetry collector configuration by running the
following Helm command:

``` bash
helm upgrade splunk-otel-collector \
  --set="clusterName=$CLUSTER_NAME" \
  --set="environment=$ENVIRONMENT_NAME" \
  --set="splunkObservability.accessToken=$SPLUNK_ACCESS_TOKEN" \
  --set="splunkObservability.realm=$SPLUNK_REALM" \
  --set="splunkPlatform.endpoint=$SPLUNK_HEC_URL" \
  --set="splunkPlatform.token=$SPLUNK_HEC_TOKEN" \
  --set="splunkPlatform.index=$SPLUNK_INDEX" \
  -f ./otel-collector/otel-collector-values.yaml \
  -n otel \
  splunk-otel-collector-chart/splunk-otel-collector
```
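
After the upgrade completes, it's worth confirming that the collector pods restarted cleanly
and that the new scrape targets aren't producing errors. A quick sketch, assuming the
daemonset follows the chart's default naming for the `splunk-otel-collector` release:

``` bash
# Confirm the collector pods are running in the otel namespace.
oc get pods -n otel

# Scan recent agent logs for scrape failures; the daemonset name is an
# assumption based on the chart's default naming convention.
oc logs -n otel ds/splunk-otel-collector-agent --since=5m | grep -iE "error|fail"
```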

## Confirm Metrics are Sent to Splunk

Navigate to the [Cisco AI Pod](https://app.us1.signalfx.com/#/dashboard/GvmWJyPA4Ak?startTime=-15m&endTime=Now&variables%5B%5D=K8s%20cluster%3Dk8s.cluster.name:%5B%22rosa-test%22%5D&groupId=GvmVcarA4AA&configId=GuzVkWWA4BE)
dashboard in Splunk Observability Cloud. Ensure it's filtered on your OpenShift cluster name, and that
the charts are populated as in the following example:

![Kubernetes Pods](../images/Cisco-AI-Pod-dashboard.png)
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
distribution: openshift
readinessProbe:
  initialDelaySeconds: 180
livenessProbe:
  initialDelaySeconds: 180
operator:
  enabled: false
operatorcrds:
  installed: false
gateway:
  enabled: false
splunkObservability:
  profilingEnabled: true
clusterReceiver:
  resources:
    limits:
      cpu: 200m
      memory: 2000Mi
agent:
  discovery:
    enabled: true
  resources:
    limits:
      cpu: 200m
      memory: 2000Mi
  config:
    exporters:
      signalfx:
        send_otlp_histograms: true
    processors:
      filter/metrics_to_be_included:
        metrics:
          # Include only metrics used in charts and detectors
          include:
            match_type: strict
            metric_names:
              - DCGM_FI_DEV_FB_FREE
              - DCGM_FI_DEV_FB_USED
              - DCGM_FI_DEV_GPU_TEMP
              - DCGM_FI_DEV_GPU_UTIL
              - DCGM_FI_DEV_MEM_CLOCK
              - DCGM_FI_DEV_MEM_COPY_UTIL
              - DCGM_FI_DEV_MEMORY_TEMP
              - DCGM_FI_DEV_POWER_USAGE
              - DCGM_FI_DEV_SM_CLOCK
              - DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION
              - DCGM_FI_PROF_DRAM_ACTIVE
              - DCGM_FI_PROF_GR_ENGINE_ACTIVE
              - DCGM_FI_PROF_PCIE_RX_BYTES
              - DCGM_FI_PROF_PCIE_TX_BYTES
              - DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
              - generation_tokens_total
              - go_info
              - go_memstats_alloc_bytes
              - go_memstats_alloc_bytes_total
              - go_memstats_buck_hash_sys_bytes
              - go_memstats_frees_total
              - go_memstats_gc_sys_bytes
              - go_memstats_heap_alloc_bytes
              - go_memstats_heap_idle_bytes
              - go_memstats_heap_inuse_bytes
              - go_memstats_heap_objects
              - go_memstats_heap_released_bytes
              - go_memstats_heap_sys_bytes
              - go_memstats_last_gc_time_seconds
              - go_memstats_lookups_total
              - go_memstats_mallocs_total
              - go_memstats_mcache_inuse_bytes
              - go_memstats_mcache_sys_bytes
              - go_memstats_mspan_inuse_bytes
              - go_memstats_mspan_sys_bytes
              - go_memstats_next_gc_bytes
              - go_memstats_other_sys_bytes
              - go_memstats_stack_inuse_bytes
              - go_memstats_stack_sys_bytes
              - go_memstats_sys_bytes
              - go_sched_gomaxprocs_threads
              - gpu_cache_usage_perc
              - gpu_total_energy_consumption_joules
              - http.server.active_requests
              - num_request_max
              - num_requests_running
              - num_requests_waiting
              - process_cpu_seconds_total
              - process_max_fds
              - process_open_fds
              - process_resident_memory_bytes
              - process_start_time_seconds
              - process_virtual_memory_bytes
              - process_virtual_memory_max_bytes
              - promhttp_metric_handler_requests_in_flight
              - promhttp_metric_handler_requests_total
              - prompt_tokens_total
              - python_gc_collections_total
              - python_gc_objects_collected_total
              - python_gc_objects_uncollectable_total
              - python_info
              - request_finish_total
              - request_success_total
              - system.cpu.time
              - e2e_request_latency_seconds
              - time_to_first_token_seconds
              - time_per_output_token_seconds
              - request_prompt_tokens
              - request_generation_tokens
    receivers:
      kubeletstats:
        insecure_skip_verify: true
      receiver_creator/nvidia:
        # Name of the extensions to watch for endpoints to start and stop.
        watch_observers: [ k8s_observer ]
        receivers:
          prometheus/dcgm:
            config:
              config:
                scrape_configs:
                  - job_name: gpu-metrics
                    scrape_interval: 10s
                    static_configs:
                      - targets:
                          - '`endpoint`:9400'
            rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
          prometheus/nim-llm:
            config:
              config:
                scrape_configs:
                  - job_name: nim-for-llm-metrics
                    scrape_interval: 10s
                    metrics_path: /v1/metrics
                    static_configs:
                      - targets:
                          - '`endpoint`:8000'
            rule: type == "pod" && labels["app"] == "meta-llama-3-2-1b-instruct"
    service:
      pipelines:
        metrics/nvidia-metrics:
          exporters:
            - signalfx
          processors:
            - memory_limiter
            - filter/metrics_to_be_included
            - batch
            - resourcedetection
            - resource
          receivers:
            - receiver_creator/nvidia
