MIG device support for hpc_job metric labels #369

@jbrobstw

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

Please provide a clear description of the problem this feature solves

Currently the hpc_job metric label can only be applied to full GPUs, even though there are metrics for individual MIG partitions on MIG-enabled GPUs. If a job runs on only one 1g MIG partition, all the metrics associated with that GPU carry the label for that job, and those metrics may be duplicated when a separate job is running on a different partition of the same GPU. Ideally there would be an option to apply the hpc_job label more granularly on MIG-enabled GPUs, which would allow cleaner queries of those metrics when necessary (e.g. for a Slurm HPC cluster).

Feature Description

On machines with MIG-enabled GPUs, when the executable is called with --hpc-job-mapping-dir=<HPC_DIR> or DCGM_HPC_JOB_MAPPING_DIR=<HPC_DIR> is set, and the <HPC_DIR> directory contains files whose names reflect both the GPU and a MIG partition of that GPU (e.g. 0.2, 1.3, etc.) and whose contents are one job ID per line, dcgm-exporter shall set the hpc_job label only on metrics that have a matching gpu label and a matching MIG partition label, both parsed from the file name (e.g. 0.2 for GPU=0 and GPU_I_ID=2), for each job ID contained in the file.

GPU_I_ID is currently set for metrics on MIG devices and would probably work just fine as the MIG partition label, but my instinct would be to implement a label associated with the EntityID as part of this change and use that instead.
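To make the proposed layout concrete, here is a minimal Go sketch of turning mapping files into a lookup table keyed by file name. It is not dcgm-exporter code: `readJobIDs` and `buildJobMap` are hypothetical names, and file contents are passed in as strings (an assumption made so the example runs without touching the filesystem).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readJobIDs returns the trimmed, non-empty lines of a mapping file,
// treating each line as one job ID.
func readJobIDs(contents string) ([]string, error) {
	var jobs []string
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		if line := strings.TrimSpace(scanner.Text()); line != "" {
			jobs = append(jobs, line)
		}
	}
	return jobs, scanner.Err()
}

// buildJobMap keys job IDs by mapping file name, e.g. "2" for all of GPU 2
// or "0.2" for GPU 0, MIG partition 2. The map[name]contents input is a
// stand-in for reading <HPC_DIR> (hypothetical, for illustration only).
func buildJobMap(files map[string]string) (map[string][]string, error) {
	jobMap := make(map[string][]string, len(files))
	for name, contents := range files {
		jobs, err := readJobIDs(contents)
		if err != nil {
			return nil, fmt.Errorf("reading mapping file %q: %w", name, err)
		}
		jobMap[name] = jobs
	}
	return jobMap, nil
}

func main() {
	jobMap, err := buildJobMap(map[string]string{
		"0.2": "jobid42\n",
		"1.3": "jobid43\njobid44\n",
		"2":   "jobid45\n",
	})
	if err != nil {
		panic(err)
	}
	// A metric with gpu="1" and GPU_I_ID="3" would look up the key "1.3".
	fmt.Println(jobMap["1.3"])
}
```

Keeping the file name as the map key means a metric only needs to build the same string from its own labels to find its jobs.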

Describe your ideal solution

Either parse filenames in the form GPU.GPU_I_ID and apply the label accordingly (code below), or keep the filename parsing the same and extend the parsing of the file contents to allow specifying the MIG partition ID alongside the jobid (e.g. file 0 contains the line 9: jobid42, so hpc_job="jobid42" would get applied to metrics with gpu="0" and GPU_I_ID="9").
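For the second option, a rough Go sketch of the per-line parsing might look like the following. The `jobEntry` type and `parseJobLine` function are hypothetical, and the sketch assumes that job IDs never contain a colon, since the first colon is treated as the separator between partition ID and job ID.

```go
package main

import (
	"fmt"
	"strings"
)

// jobEntry is a hypothetical parsed line of a mapping file.
type jobEntry struct {
	GPUInstanceID string // empty when the job applies to the whole GPU
	JobID         string
}

// parseJobLine accepts either "jobid" (whole GPU) or "<partition>: jobid".
// It reports false for blank lines and for lines with an empty field.
func parseJobLine(line string) (jobEntry, bool) {
	line = strings.TrimSpace(line)
	if line == "" {
		return jobEntry{}, false
	}
	if id, job, found := strings.Cut(line, ":"); found {
		id, job = strings.TrimSpace(id), strings.TrimSpace(job)
		if id == "" || job == "" {
			return jobEntry{}, false
		}
		return jobEntry{GPUInstanceID: id, JobID: job}, true
	}
	return jobEntry{JobID: line}, true
}

func main() {
	// Contents of a hypothetical mapping file named "0".
	for _, line := range []string{"9: jobid42", "jobid7", ""} {
		entry, ok := parseJobLine(line)
		fmt.Printf("%q -> %+v (valid: %t)\n", line, entry, ok)
	}
}
```

This variant keeps existing whole-GPU files valid, at the cost of reserving the colon as a separator.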

Additional context

Here's the diff for a small change I did to implement this in our cluster, which is currently running as expected.

diff --git a/pkg/dcgmexporter/hpc.go b/pkg/dcgmexporter/hpc.go
index e360b09..61a95c3 100644
--- a/pkg/dcgmexporter/hpc.go
+++ b/pkg/dcgmexporter/hpc.go
@@ -18,6 +18,7 @@ package dcgmexporter

 import (
        "bufio"
+       "fmt"
        sysOS "os"
        "path"
        "strconv"
@@ -73,7 +74,7 @@ func (p *hpcMapper) Process(metrics MetricsByCounter, sysInfo SystemInfo) error
        for counter := range metrics {
                var modifiedMetrics []Metric
                for _, metric := range metrics[counter] {
-                       jobs, exists := gpuToJobMap[metric.GPU]
+                       jobs, exists := gpuToJobMap[getJobMapID(metric)]
                        if exists {
                                for _, job := range jobs {
                                        modifiedMetric, err := deepCopy(metric)
@@ -146,7 +147,7 @@ func getGPUFiles(dirPath string) ([]string, error) {
                        continue // Skip directories
                }

-               _, err = strconv.Atoi(file.Name())
+               _, err = strconv.ParseFloat(file.Name(), 64)
                if err != nil {
                        logrus.Debugf("HPC mapper: file %q name doesn't match with GPU ID convention", file.Name())
                        continue
@@ -156,3 +157,10 @@ func getGPUFiles(dirPath string) ([]string, error) {

        return mappingFiles, nil
 }
+
+func getJobMapID(m Metric) (string) {
+       if m.MigProfile != "" {
+               return fmt.Sprintf("%s.%s", m.GPU, m.GPUInstanceID)
+       }
+       return m.GPU
+}
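One caveat with the diff above: strconv.ParseFloat accepts more than the GPU.GPU_I_ID form, for example "1e3", "inf", or ".5", so such files would pass the name check even though they can never match a key built by getJobMapID. A stricter check could be sketched as follows; `parseMappingName` and the regular expression are hypothetical and not part of dcgm-exporter.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// mappingName matches "<gpu>" or "<gpu>.<partition>", decimal digits only.
var mappingName = regexp.MustCompile(`^([0-9]+)(?:\.([0-9]+))?$`)

// parseMappingName splits a mapping file name into its GPU index and,
// if present, its MIG partition ID. gi is empty for a whole-GPU file.
func parseMappingName(name string) (gpu, gi string, ok bool) {
	m := mappingName.FindStringSubmatch(name)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// Compare the float-based check from the diff with the strict parser.
	for _, name := range []string{"0", "0.2", "1e3", "inf", ".5", "0.2.1"} {
		_, floatErr := strconv.ParseFloat(name, 64)
		_, _, ok := parseMappingName(name)
		fmt.Printf("%-8q ParseFloat accepts: %-5t strict parser accepts: %t\n",
			name, floatErr == nil, ok)
	}
}
```

Because the lookup key is compared as a string, the file name has to match the "%s.%s" key exactly; a name such as "0.02" would pass either check but would not match GPU_I_ID="2".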

And here's a sample of the metrics from one of our machines. You can see that job 2115078 requested four 1g.10gb partitions and that the rest of the metrics for gpu 0 and 1 are not marked with that job id (and also that whoever submitted that job needs some training concerning their resource utilization).

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.784913
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115116"} 0.290336
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115107"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115116"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.709745
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115115"} 0.290853
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115115"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000
