---
title: Configure the Prometheus Receiver
linkTitle: 7. Configure the Prometheus Receiver
weight: 7
time: 10 minutes
---

Now that our LLM is up and running, we'll add the Prometheus receiver to our
OpenTelemetry collector to gather metrics from it.

## Capture the NVIDIA DCGM Exporter metrics

The NVIDIA DCGM exporter is running in our OpenShift cluster. It
exposes GPU metrics that we can send to Splunk.

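If you'd like to confirm the exporter is present first, you can look for pods carrying the
`app=nvidia-dcgm-exporter` label that the collector will match on shortly. This is a quick
sketch, assuming you have `kubectl` (or `oc`) access to the cluster:

``` bash
# List DCGM exporter pods across all namespaces and show their labels
kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter --show-labels
```
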
To do this, let's customize the configuration of the collector by editing the
`otel-collector-values.yaml` file that we used earlier when deploying the collector.

Add the following content, just below the `kubeletstats` section:

``` yaml
  receiver_creator/nvidia:
    # Name of the extensions to watch for endpoints to start and stop.
    watch_observers: [ k8s_observer ]
    receivers:
      prometheus/dcgm:
        config:
          config:
            scrape_configs:
              - job_name: gpu-metrics
                scrape_interval: 10s
                static_configs:
                  - targets:
                      - '`endpoint`:9400'
        rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
```

This tells the collector to look for pods with a label of `app=nvidia-dcgm-exporter`.
When it finds a pod with this label, it scrapes the exporter's default `/metrics` endpoint on port 9400.

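If you want to confirm the exporter is actually serving metrics on that port before updating the
collector, a quick spot check is to port-forward to the pod and request the endpoint yourself.
The namespace and pod name below are placeholders:

``` bash
# Port-forward to a DCGM exporter pod and spot-check the metrics endpoint
kubectl port-forward -n <namespace> pod/<dcgm-exporter-pod> 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```
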
To ensure the receiver is used, we'll need to add a new pipeline to the `otel-collector-values.yaml` file
as well.

Add the following code to the bottom of the file:

``` yaml
  service:
    pipelines:
      metrics/nvidia-metrics:
        exporters:
          - signalfx
        processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
        receivers:
          - receiver_creator/nvidia
```

Before applying the changes, let's add one more Prometheus receiver in the next section.

## Capture the NVIDIA NIM metrics

The `meta-llama-3-2-1b-instruct` LLM that we just deployed with NVIDIA NIM also
includes a Prometheus endpoint that we can scrape with the collector. Let's add the
following to the `otel-collector-values.yaml` file, just below the receiver we added earlier:

``` yaml
      prometheus/nim-llm:
        config:
          config:
            scrape_configs:
              - job_name: nim-for-llm-metrics
                scrape_interval: 10s
                metrics_path: /v1/metrics
                static_configs:
                  - targets:
                      - '`endpoint`:8000'
        rule: type == "pod" && labels["app"] == "meta-llama-3-2-1b-instruct"
```

This tells the collector to look for pods with a label of `app=meta-llama-3-2-1b-instruct`.
When it finds a pod with this label, it scrapes the `/v1/metrics` endpoint on port 8000.

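As with the DCGM exporter, you can spot-check this endpoint before relying on the scrape job.
This sketch assumes you can port-forward to the NIM pod; the namespace and pod name are placeholders:

``` bash
# Confirm the NIM Prometheus endpoint responds at /v1/metrics on port 8000
kubectl port-forward -n <namespace> pod/<nim-llm-pod> 8000:8000 &
curl -s http://localhost:8000/v1/metrics | head -n 20
```
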
There's no need to make changes to the pipeline, as this receiver will already be picked up
as part of the `receiver_creator/nvidia` receiver.

## Add a Filter Processor

Prometheus endpoints can expose a large number of metrics, sometimes with high cardinality.

Let's add a filter processor that defines exactly which metrics we want to send to Splunk.
Specifically, we'll send only the metrics that are utilized by a dashboard chart or an
alert detector.

Add the following code to the `otel-collector-values.yaml` file, after the exporters section
but before the receivers section:

``` yaml
  processors:
    filter/metrics_to_be_included:
      metrics:
        # Include only metrics used in charts and detectors
        include:
          match_type: strict
          metric_names:
            - DCGM_FI_DEV_FB_FREE
            - DCGM_FI_DEV_FB_USED
            - DCGM_FI_DEV_GPU_TEMP
            - DCGM_FI_DEV_GPU_UTIL
            - DCGM_FI_DEV_MEM_CLOCK
            - DCGM_FI_DEV_MEM_COPY_UTIL
            - DCGM_FI_DEV_MEMORY_TEMP
            - DCGM_FI_DEV_POWER_USAGE
            - DCGM_FI_DEV_SM_CLOCK
            - DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION
            - DCGM_FI_PROF_DRAM_ACTIVE
            - DCGM_FI_PROF_GR_ENGINE_ACTIVE
            - DCGM_FI_PROF_PCIE_RX_BYTES
            - DCGM_FI_PROF_PCIE_TX_BYTES
            - DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
            - generation_tokens_total
            - go_info
            - go_memstats_alloc_bytes
            - go_memstats_alloc_bytes_total
            - go_memstats_buck_hash_sys_bytes
            - go_memstats_frees_total
            - go_memstats_gc_sys_bytes
            - go_memstats_heap_alloc_bytes
            - go_memstats_heap_idle_bytes
            - go_memstats_heap_inuse_bytes
            - go_memstats_heap_objects
            - go_memstats_heap_released_bytes
            - go_memstats_heap_sys_bytes
            - go_memstats_last_gc_time_seconds
            - go_memstats_lookups_total
            - go_memstats_mallocs_total
            - go_memstats_mcache_inuse_bytes
            - go_memstats_mcache_sys_bytes
            - go_memstats_mspan_inuse_bytes
            - go_memstats_mspan_sys_bytes
            - go_memstats_next_gc_bytes
            - go_memstats_other_sys_bytes
            - go_memstats_stack_inuse_bytes
            - go_memstats_stack_sys_bytes
            - go_memstats_sys_bytes
            - go_sched_gomaxprocs_threads
            - gpu_cache_usage_perc
            - gpu_total_energy_consumption_joules
            - http.server.active_requests
            - num_request_max
            - num_requests_running
            - num_requests_waiting
            - process_cpu_seconds_total
            - process_max_fds
            - process_open_fds
            - process_resident_memory_bytes
            - process_start_time_seconds
            - process_virtual_memory_bytes
            - process_virtual_memory_max_bytes
            - promhttp_metric_handler_requests_in_flight
            - promhttp_metric_handler_requests_total
            - prompt_tokens_total
            - python_gc_collections_total
            - python_gc_objects_collected_total
            - python_gc_objects_uncollectable_total
            - python_info
            - request_finish_total
            - request_success_total
            - system.cpu.time
            - e2e_request_latency_seconds
            - time_to_first_token_seconds
            - time_per_output_token_seconds
            - request_prompt_tokens
            - request_generation_tokens
```

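If you want to see what else an endpoint exposes before deciding what to include, one way is to
dump the unique metric names from the scrape endpoint. This builds on the earlier port-forward
sketches and uses port 9400 purely as an example:

``` bash
# Strip comment lines and labels to list the unique metric names exposed
curl -s http://localhost:9400/metrics | grep -v '^#' | sed -E 's/[{ ].*//' | sort -u
```
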
Ensure this processor is included in the pipeline we added earlier to the
bottom of the file:

``` yaml
  service:
    pipelines:
      metrics/nvidia-metrics:
        exporters:
          - signalfx
        processors:
          - memory_limiter
          - filter/metrics_to_be_included
          - batch
          - resourcedetection
          - resource
        receivers:
          - receiver_creator/nvidia
```

## Verify Changes

Before applying the configuration changes to the collector, take a moment to compare the
contents of your modified `otel-collector-values.yaml` file with the `otel-collector-values-with-nvidia.yaml` file.
Update your file as needed to ensure the contents match. Remember that indentation matters
in YAML files and needs to be precise.

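A `diff` makes the comparison easier than eyeballing the two files. The location of the reference
file is an assumption here, so adjust the paths to match your setup:

``` bash
# Compare your edited values file against the reference copy
diff -u ./otel-collector/otel-collector-values.yaml ./otel-collector/otel-collector-values-with-nvidia.yaml
```
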
## Update the OpenTelemetry Collector Config

Now we can update the OpenTelemetry collector configuration by running the
following Helm command:

``` bash
helm upgrade splunk-otel-collector \
  --set="clusterName=$CLUSTER_NAME" \
  --set="environment=$ENVIRONMENT_NAME" \
  --set="splunkObservability.accessToken=$SPLUNK_ACCESS_TOKEN" \
  --set="splunkObservability.realm=$SPLUNK_REALM" \
  --set="splunkPlatform.endpoint=$SPLUNK_HEC_URL" \
  --set="splunkPlatform.token=$SPLUNK_HEC_TOKEN" \
  --set="splunkPlatform.index=$SPLUNK_INDEX" \
  -f ./otel-collector/otel-collector-values.yaml \
  -n otel \
  splunk-otel-collector-chart/splunk-otel-collector
```

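Once the upgrade completes, it can be helpful to confirm the collector pods in the `otel`
namespace rolled out cleanly and that the new scrape jobs aren't logging errors. This is a rough
check; the label selector is an assumption and may differ depending on the chart version:

``` bash
# Watch the collector pods pick up the new configuration
kubectl get pods -n otel

# Check collector pod logs for scrape or configuration errors
kubectl logs -n otel -l app=splunk-otel-collector --tail=100 | grep -iE 'error|warn'
```
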
## Confirm Metrics are Sent to Splunk

Navigate to the [Cisco AI Pod](https://app.us1.signalfx.com/#/dashboard/GvmWJyPA4Ak?startTime=-15m&endTime=Now&variables%5B%5D=K8s%20cluster%3Dk8s.cluster.name:%5B%22rosa-test%22%5D&groupId=GvmVcarA4AA&configId=GuzVkWWA4BE)
dashboard in Splunk Observability Cloud. Ensure it's filtered on your OpenShift cluster name, and that
the charts are populated as in the following example:

![Dashboard](../images/cisco-ai-pod-dashboard.png)