Commit b2b38de

Collect GPU usage metrics with prometheus
We use [prometheus node exporter](https://github.com/prometheus/node_exporter), deployed as part of our prometheus chart, to collect metrics about CPU and memory usage. This commit also deploys NVIDIA's [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter), which collects information about GPU usage.

As we work towards more cost and usage monitoring, collecting this information should allow us to help users get more value from their GPU use. Since metrics only exist from the point the exporters are deployed, this starts the collection process now, even though the data is not yet directly visible to end users.

Works towards https://2i2c.productboard.com/entity-detail/features/30046512, initially requested as part of https://2i2c.freshdesk.com/a/tickets/2545.
1 parent b68995e commit b2b38de

3 files changed, 23 additions and 0 deletions

helm-charts/support/Chart.yaml

Lines changed: 6 additions & 0 deletions
@@ -59,3 +59,9 @@ dependencies:
     version: "0.0.1-0.dev.git.72.hadbe1d4"
     repository: https://2i2c.org/gcp-filestore-backups
     condition: gcpFilestoreBackups.enabled
+
+  # Provide metrics about GPU usage
+  # https://github.com/NVIDIA/dcgm-exporter
+  - name: dcgm-exporter
+    version: 3.6.1
+    repository: https://nvidia.github.io/dcgm-exporter/helm-charts

helm-charts/support/values.schema.yaml

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,9 @@ properties:
   global:
     type: object
     additionalProperties: true
+  dcgm-exporter:
+    type: object
+    additionalProperties: true
 
   # These provide values for objects we create, so we validate their schema
   # to the best of our ability.

helm-charts/support/values.yaml

Lines changed: 14 additions & 0 deletions
@@ -494,6 +494,20 @@ cryptnono:
 aws-ce-grafana-backend:
   enabled: false
 
+dcgm-exporter:
+  serviceMonitor:
+    enabled: false
+  podAnnotations:
+    prometheus.io/path: "/metrics"
+    prometheus.io/port: "12121"
+    prometheus.io/scrape: "true"
+  tolerations:
+    - key: nvidia.com/gpu
+      operator: Equal
+      value: present
+      effect: NoSchedule
+
+
 # Configuration of templates provided directly by this chart
 # -------------------------------------------------------------------------------
 #
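
The `dcgm-exporter` values in this commit rely on two conventions. The `prometheus.io/*` pod annotations are not interpreted by Prometheus itself; they only take effect if a scrape job with matching relabeling rules is configured, which is assumed here. The toleration lets the exporter pods schedule onto nodes tainted `nvidia.com/gpu=present:NoSchedule`, which is assumed to be the taint on the GPU nodes. The Python sketch below illustrates both under those assumptions. The names `Taint`, `Toleration`, `tolerates`, and `scrape_url` are hypothetical helpers written for this illustration, not part of Kubernetes, Prometheus, or the chart.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes defaults to Equal when omitted
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (Kubernetes matching rules)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if not toleration.key:
        # An empty key is only meaningful with Exists, and then matches any key.
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True  # the value is ignored for Exists
    return toleration.operator == "Equal" and toleration.value == taint.value


def scrape_url(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Build the URL an annotation-driven scrape job would target, or None.

    Simplification: this sketch requires an explicit port annotation and
    always uses plain HTTP.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


# Values copied from the values.yaml hunk above.
DCGM_ANNOTATIONS = {
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "12121",
    "prometheus.io/scrape": "true",
}
DCGM_TOLERATION = Toleration(
    key="nvidia.com/gpu", operator="Equal", value="present", effect="NoSchedule"
)
# Assumed taint on GPU nodes; not stated in the commit itself.
GPU_TAINT = Taint(key="nvidia.com/gpu", value="present", effect="NoSchedule")
```

With these definitions, `tolerates(DCGM_TOLERATION, GPU_TAINT)` is true, and `scrape_url` called with a pod IP and `DCGM_ANNOTATIONS` yields a target on port 12121 at `/metrics`. Because `serviceMonitor.enabled` is `false`, no `ServiceMonitor` object is created, so the annotations are the only discovery mechanism configured in this change.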
