Collect GPU usage metrics with prometheus #5296

Draft
yuvipanda wants to merge 2 commits into 2i2c-org:main from yuvipanda:gpu-metrics

@yuvipanda
Member

We use [Prometheus node exporter](https://github.com/prometheus/node_exporter), deployed as part of our prometheus chart, to collect metrics about CPU and memory usage.

This PR deploys NVIDIA's [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter), which collects information about GPU usage.

As we work towards more cost monitoring and usage monitoring, collecting this information should allow us to help users get more bang for the buck from their GPU use. Metrics only exist from the point the exporter is deployed, so deploying it now starts the collection process even though the data is not yet directly visible to end users.

Works towards https://2i2c.productboard.com/entity-detail/features/30046512, initially requested as part of https://2i2c.freshdesk.com/a/tickets/2545.

@github-actions
Contributor

github-actions bot commented Dec 19, 2024

Merging this PR will trigger the following deployment actions.

Support deployments

| Cloud Provider | Cluster Name | Reason for Redeploy |
| --- | --- | --- |
| aws | bnext-bio | Support helm chart has been modified |
| aws | smithsonian | Support helm chart has been modified |
| aws | jupyter-health | Support helm chart has been modified |
| aws | nmfs-openscapes | Support helm chart has been modified |
| gcp | hhmi | Support helm chart has been modified |
| kubeconfig | utoronto | Support helm chart has been modified |
| aws | openscapeshub | Support helm chart has been modified |
| aws | ucmerced | Support helm chart has been modified |
| aws | reflective | Support helm chart has been modified |
| aws | 2i2c-aws-us | Support helm chart has been modified |
| aws | opensci | Support helm chart has been modified |
| kubeconfig | projectpythia-binder | Support helm chart has been modified |
| gcp | 2i2c-uk | Support helm chart has been modified |
| aws | aimatx-2i2c-hub | Support helm chart has been modified |
| gcp | dubois | Support helm chart has been modified |
| aws | temple | Support helm chart has been modified |
| aws | berkeley-geojupyter | Support helm chart has been modified |
| aws | nasa-cryo | Support helm chart has been modified |
| gcp | 2i2c | Support helm chart has been modified |
| gcp | leap | Support helm chart has been modified |
| aws | projectpythia | Support helm chart has been modified |
| aws | strudel | Support helm chart has been modified |
| aws | nasa-ghg-hub | Support helm chart has been modified |
| gcp | awi-ciroh | Support helm chart has been modified |
| gcp | cloudbank | Support helm chart has been modified |
| kubeconfig | 2i2c-jetstream2 | Support helm chart has been modified |
| aws | victor | Support helm chart has been modified |
| aws | disasters | Support helm chart has been modified |
| aws | earthscope | Support helm chart has been modified |
| aws | nasa-veda | Support helm chart has been modified |
| aws | maap | Support helm chart has been modified |

Staging deployments

No staging hub upgrades will be triggered

Production deployments

No production hub upgrades will be triggered

@yuvipanda
Member Author

Unfortunately this doesn't work on GCP yet:

  Warning  FailedCreate  9m (x18 over 19m)  daemonset-controller  Error creating: insufficient quota to match these scopes: [{PriorityClass In [system-node-critical system-cluster-critical]}]

yuvipanda marked this pull request as draft December 19, 2024 03:27
@yuvipanda
Member Author

We can set `.priorityClassName` to get around this on GCP. But we don't yet have a clean way to schedule this only on GPU nodes, and we need one, because the exporter will just crash and burn on non-GPU nodes.
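
For illustration only, here is a rough sketch of what such an override might look like in Helm values. It is not what this PR implements. Several things here are assumptions: that the exporter is pulled in as a subchart under the key `dcgm-exporter`, that the chart exposes `priorityClassName`, `affinity`, and `tolerations` values, and that a PriorityClass named `example-gpu-metrics` (hypothetical) exists in the cluster. The `cloud.google.com/gke-accelerator` label and the `nvidia.com/gpu` taint are GKE conventions, so this would only cover GCP; other providers label GPU nodes differently, which is part of why there is no clean cross-cluster answer yet.

```yaml
# Sketch only: keys and names below are assumptions, not taken from this PR.
dcgm-exporter:  # assumed subchart key in the parent chart
  # Use a PriorityClass outside the system-* classes, which are the ones
  # the GCP quota error above complains about. Hypothetical name.
  priorityClassName: example-gpu-metrics

  # Only schedule onto nodes that carry the GKE accelerator label, so the
  # DaemonSet does not create pods on non-GPU nodes.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists

  # GKE taints GPU nodes with nvidia.com/gpu; tolerate it so the exporter
  # pods are allowed to land there.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

A required node affinity with `operator: Exists` matches any GPU type without listing accelerator models, at the cost of being tied to one provider's label.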

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Mar 19, 2025