Performance Metrics Overview

In the ProPE project the focus is on metrics that mainly quantify resource utilization. The lists below specify the name of each metric, what it means, the smallest granularity for which it is valid, and how its values can be acquired. If likwid-perfctr is used for measuring the HPM metrics, all metrics below can be acquired with just two performance groups: MEM_DP and FLOPS_SP. For the kernel-fs metrics, short sketches after the lists show how the underlying counters can be read.

Basic metrics

  • cpu_used - CPU core utilization (between 0 and 1) / cpu level / kernel fs
  • ipc - avg ipc of active cores (cores executing instructions) / cpu level / HPM
  • mem_used - memory capacity used / node level / kernel fs
  • mem_bw - memory bandwidth / socket level / HPM
  • flops_any - total flop rate with DP flops scaled up / cpu level / HPM
  • rapl_power - CPU power consumption / socket level / HPM
  • lustre_bw - total Lustre fs bandwidth / node level / kernel fs
  • ib_bw - total InfiniBand or Omni-Path bandwidth / node level / kernel fs
  • gpu_used - GPU utilization / GPU level / NVML (NVIDIA GPUs only)
  • gpu_mem_used - GPU memory capacity used / GPU level / NVML (NVIDIA GPUs only)
  • gpu_power - GPU power consumption / GPU level / NVML (NVIDIA GPUs only)
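
Many of the kernel-fs metrics can be sampled with a few lines of Python. The sketch below reads cpu_used (aggregated over all cores here for brevity; the metric itself is defined per core) and mem_used from the standard Linux /proc interfaces:

    import time

    def cpu_times():
        # Aggregate (busy, total) jiffies from the first line of /proc/stat.
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return sum(fields) - idle, sum(fields)

    def mem_used_bytes():
        # mem_used = MemTotal - MemAvailable; /proc/meminfo reports kB.
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])
        return (info["MemTotal"] - info["MemAvailable"]) * 1024

    # cpu_used is the busy fraction between two samples (0..1).
    busy0, total0 = cpu_times()
    time.sleep(10)
    busy1, total1 = cpu_times()
    print("cpu_used:", (busy1 - busy0) / (total1 - total0))
    print("mem_used [bytes]:", mem_used_bytes())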

Extended metrics

  • clock - avg core frequency / cpu level / HPM
  • flops_sp - SP flop rate / cpu level / HPM
  • flops_dp - DP flop rate / cpu level / HPM
  • eth_read_bw - Ethernet read bandwidth / node level / kernel fs
  • eth_write_bw - Ethernet write bandwidth / node level / kernel fs
  • lustre_read_bw - Lustre read bandwidth / node level / kernel fs
  • lustre_write_bw - Lustre write bandwidth / node level / kernel fs
  • lustre_read_req - Lustre read requests / node level / kernel fs
  • lustre_write_req - Lustre write requests / node level / kernel fs
  • lustre_inodes - Lustre inodes / node level / kernel fs
  • lustre_accesses - Lustre open/close operations / node level / kernel fs
  • lustre_fsync - Lustre fsync / node level / kernel fs
  • lustre_create - Lustre create / node level / kernel fs
  • ib_read_bw - InfiniBand/Omni-Path read bandwidth / node level / kernel fs
  • ib_write_bw - InfiniBand/Omni-Path write bandwidth / node level / kernel fs
  • ib_congestion - InfiniBand/Omni-Path congestion / node level / kernel fs
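
The Lustre client counters live under /proc/fs/lustre/llite/<fs>/stats. The exact line layout varies between Lustre versions, so the parsing below (name, count, and a trailing byte sum for read_bytes/write_bytes, as commonly seen on Lustre 2.x clients) is an assumption to verify on the target system:

    import glob

    def lustre_counters():
        # Sum llite counters over all mounted Lustre file systems.
        # Assumed line layout: "name count samples [unit] min max sum".
        totals = {}
        for statfile in glob.glob("/proc/fs/lustre/llite/*/stats"):
            with open(statfile) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 2 or parts[0] == "snapshot_time":
                        continue
                    name, count = parts[0], int(parts[1])
                    totals[name] = totals.get(name, 0) + count
                    if name in ("read_bytes", "write_bytes") and len(parts) >= 7:
                        # Last column is the byte sum for read/write.
                        totals[name + "_sum"] = totals.get(name + "_sum", 0) + int(parts[6])
        return totals

    # Bandwidths and request rates follow from differencing two samples
    # over the sampling interval, as for the other kernel-fs counters.
    print(lustre_counters())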

InfluxDB schema

InfluxDB uses its own nomenclature for the database schema, but there are corresponding structures in relational-database terms. In InfluxDB a database is structured into measurements; a measurement has tags (strings) and fields (numbers). A measurement in InfluxDB is similar to a table in SQL, where a tag is a column with an index optimized for queries and a field is a regular column without an index.

For low-overhead reporting and storage of metrics in InfluxDB it makes sense to group related metrics into one measurement, since all fields in one measurement statement must have the same timestamp. A timestamp granularity of seconds should be the right choice.
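
As a rough illustration, writing such a grouped point with the Python influxdb client (InfluxDB 1.x) could look as follows; host, database name and values are placeholders, not taken from the ProPE setup:

    import time
    from influxdb import InfluxDBClient  # InfluxDB 1.x client: pip install influxdb

    # Connection parameters are placeholders.
    client = InfluxDBClient(host="localhost", port=8086, database="prope")

    now = int(time.time())  # one shared timestamp with second granularity
    points = [{
        "measurement": "cpu",
        "tags": {"host": "node042", "cpu": "3"},
        "fields": {"load": 0.97, "cpi": 1.3, "flops_any": 2.1e9, "clock": 2.4e9},
        "time": now,
    }]
    client.write_points(points, time_precision="s")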

One could, for example, use the smallest topological entities of a node as measurements (a line-protocol sketch follows the list):

  • cpu
    • tags: host, cpu
    • fields: load, cpi, flops_any, clock
  • socket
    • tags: host, socket
    • fields: rapl_power, mem_bw
  • node
    • tags: host
    • fields: mem_used, lustre_bw, ib_bw
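
In InfluxDB line protocol, one sample of this layout written with second precision would read roughly as follows; host names and values are made up:

    cpu,host=node042,cpu=3 load=0.97,cpi=1.3,flops_any=2100000000,clock=2400000000 1608300000
    socket,host=node042,socket=0 rapl_power=95.2,mem_bw=41500000000 1608300000
    node,host=node042 mem_used=54000000000,lustre_bw=120000000,ib_bw=890000000 1608300000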

As there are so many additional I/O- and network-related metrics, it could make sense to create extra measurements for them (a sketch for reading the network counters follows the list):

  • network
    • tags: host
    • fields: ib_read_bw, ib_write_bw, eth_read_bw, eth_write_bw
  • fileIO
    • tags: host
    • fields: lustre_read_bw, lustre_write_bw, lustre_read_requests, lustre_write_requests, lustre_create, lustre_open, lustre_close, lustre_seek, lustre_fsync
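
A sketch for the network fields, assuming the usual Linux sysfs layout (InfiniBand's port_rcv_data/port_xmit_data count 4-byte words; Omni-Path exposes different counter names, so treat this as an approximation):

    import glob
    import time

    def read_counter(path):
        with open(path) as f:
            return int(f.read())

    def net_bytes():
        # Return (ib_read, ib_write, eth_read, eth_write) byte totals.
        ib_rx = sum(read_counter(p) for p in
                    glob.glob("/sys/class/infiniband/*/ports/*/counters/port_rcv_data"))
        ib_tx = sum(read_counter(p) for p in
                    glob.glob("/sys/class/infiniband/*/ports/*/counters/port_xmit_data"))
        # "e*" is a heuristic for eth0/enp…/ens… interface names.
        eth_rx = sum(read_counter(p) for p in glob.glob("/sys/class/net/e*/statistics/rx_bytes"))
        eth_tx = sum(read_counter(p) for p in glob.glob("/sys/class/net/e*/statistics/tx_bytes"))
        return ib_rx * 4, ib_tx * 4, eth_rx, eth_tx  # IB counters are 4-byte words

    s0, t0 = net_bytes(), time.time()
    time.sleep(10)
    s1, t1 = net_bytes(), time.time()
    for name, a, b in zip(("ib_read_bw", "ib_write_bw", "eth_read_bw", "eth_write_bw"), s0, s1):
        print(name, "[B/s]:", (b - a) / (t1 - t0))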

The following InfluxDB measurements are currently used in Dresden's ProPE database (one measurement per data source; a sketch for acquiring the NVML fields follows the list):

  • cpu
    • tags: hostname, cpu
    • fields: used
  • infiniband
    • tags: hostname
    • fields: bw
  • likwid_cpu
    • tags: hostname, cpu
    • fields: cpi, flops_any
  • likwid_socket
    • tags: hostname, cpu
    • fields: mem_bw, rapl_power
  • lustre_[scratch|highiops] (Dresden has two Lustre file systems)
    • tags: hostname
    • fields: read_bw, write_bw, read_requests, write_requests, create, open, close, seek, fsync
  • memory
    • tags: hostname
    • fields: used
  • nvml
    • tags: hostname, gpu
    • fields: gpu_used, mem_used, power, temp
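
The fields of the nvml measurement map directly onto NVML queries. A minimal sketch with the pynvml bindings (the nvidia-ml-py package; units noted in the comments) could look like this:

    from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                        nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates,
                        nvmlDeviceGetMemoryInfo, nvmlDeviceGetPowerUsage,
                        nvmlDeviceGetTemperature, NVML_TEMPERATURE_GPU)

    nvmlInit()
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        fields = {
            "gpu_used": nvmlDeviceGetUtilizationRates(handle).gpu,  # percent
            "mem_used": nvmlDeviceGetMemoryInfo(handle).used,       # bytes
            "power": nvmlDeviceGetPowerUsage(handle) / 1000.0,      # milliwatts -> W
            "temp": nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU),  # deg C
        }
        print("gpu", i, fields)
    nvmlShutdown()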
