Missing CPU Metrics for Libvirt Instances #402

vurmil · 2025-08-26T15:10:20Z

vurmil
Aug 26, 2025

I'm using ceems version 0.10.2 to collect metrics from Libvirt, but I'm unable to get CPU usage metrics for the virtual machines. While other metrics, such as memory and some CPU-related counters (ceems_compute_unit_cpu_user_seconds_total, ceems_compute_unit_cpu_system_seconds_total), are being scraped, key metrics like CPU utilization are missing.

Additionally, I'm seeing the following error in the logs, which seems to be related to the issue:

time=2025-08-26T15:01:06.269Z level=ERROR source=libvirt.go:499 msg="Failed to run inside security context" collector=libvirt instance_id=instance-0000004f err="stat /etc/libvirt/qemu/instance-0000004f.xml: no such file or directory"
This error indicates that ceems is failing to find the domain XML files for the Libvirt instances in the expected path, /etc/libvirt/qemu/.

Steps Taken to Troubleshoot
Checked existing metrics: I've confirmed that the Prometheus exporter is successfully scraping some metrics, but none of them provide detailed CPU usage (e.g., in percentage or utilization). The only CPU-related metrics available are:

ceems_compute_unit_cpu_psi_seconds
ceems_compute_unit_cpu_system_seconds_total
ceems_compute_unit_cpu_user_seconds_total
ceems_compute_unit_cpus

Analyzed the error: The log error clearly states that the instance-0000004f.xml file is not found at /etc/libvirt/qemu/.

Attempted a symlink fix: To address the no such file or directory error, I created a symbolic link for the directory. However, this did not resolve the problem. Instead, after creating the symlink, the ceems exporter stopped finding any instances at all.

Environment Details
ceems version: 0.10.2

Operating System: Ubuntu 24.04

Libvirt version: 10.0.0

Metrics:

# HELP ceems_compute_unit_cpu_psi_seconds Total CPU PSI in seconds
# TYPE ceems_compute_unit_cpu_psi_seconds gauge
ceems_compute_unit_cpu_psi_seconds{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 1555.242543
# HELP ceems_compute_unit_cpu_system_seconds_total Total job CPU system seconds
# TYPE ceems_compute_unit_cpu_system_seconds_total counter
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 27282.467533
# HELP ceems_compute_unit_cpu_user_seconds_total Total job CPU user seconds
# TYPE ceems_compute_unit_cpu_user_seconds_total counter
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 88819.629639
# HELP ceems_compute_unit_cpus Total number of job CPUs
# TYPE ceems_compute_unit_cpus gauge
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 2
# HELP ceems_compute_unit_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_compute_unit_memory_cache_bytes gauge
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 98304
# HELP ceems_compute_unit_memory_fail_count Memory fail count
# TYPE ceems_compute_unit_memory_fail_count gauge
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 0
# HELP ceems_compute_unit_memory_psi_seconds Total memory PSI in seconds
# TYPE ceems_compute_unit_memory_psi_seconds gauge
ceems_compute_unit_memory_psi_seconds{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 0
# HELP ceems_compute_unit_memory_rss_bytes Memory RSS used in bytes
# TYPE ceems_compute_unit_memory_rss_bytes gauge
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 3.139444736e+09
# HELP ceems_compute_unit_memory_total_bytes Memory total in bytes
# TYPE ceems_compute_unit_memory_total_bytes gauge
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 1.081857384448e+12
# HELP ceems_compute_unit_memory_used_bytes Memory used in bytes
# TYPE ceems_compute_unit_memory_used_bytes gauge
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 3.152396288e+09
# HELP ceems_compute_unit_memsw_fail_count Swap fail count
# TYPE ceems_compute_unit_memsw_fail_count gauge
ceems_compute_unit_memsw_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 0
# HELP ceems_compute_unit_memsw_total_bytes Swap total in bytes
# TYPE ceems_compute_unit_memsw_total_bytes gauge
ceems_compute_unit_memsw_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 8.589930496e+09
# HELP ceems_compute_unit_memsw_used_bytes Swap used in bytes
# TYPE ceems_compute_unit_memsw_used_bytes gauge
ceems_compute_unit_memsw_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="instance-0000004f"} 0
# HELP ceems_compute_units Total number of jobs
# TYPE ceems_compute_units gauge
.......

Answered by mahendrapaipuri

Sep 2, 2025

@vurmil A new release, v0.11.0, has been made with support for runtime XML files for libvirt. There are some breaking changes in metric labeling. Please consult the changelog.

mahendrapaipuri · 2025-08-27T05:39:37Z

mahendrapaipuri
Aug 27, 2025
Maintainer

Hello @vurmil

Thanks for the detailed report. CPU usage is a derived metric that can be estimated from the following three metrics:

ceems_compute_unit_cpu_system_seconds_total
ceems_compute_unit_cpu_user_seconds_total
ceems_compute_unit_cpus

It is generally a good idea to always export raw metrics and use Prometheus' recording rules to estimate the derived metrics. To estimate the CPU usage percentage, you will need to use the following query:

(
    irate(ceems_compute_unit_cpu_user_seconds_total[1m])
  +
    irate(ceems_compute_unit_cpu_system_seconds_total[1m])
) * 100
/
  (ceems_compute_unit_cpus > 0)

Or you can add this rule to your Prometheus instance, which will estimate the CPU and memory usage of virtual machines in real time and create new metrics.
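
To make the arithmetic behind the query concrete, here is a minimal Python sketch of the same calculation for a single VM. It assumes two consecutive scrapes of the user and system counters (roughly what `irate` works from) and ignores counter resets. The function name and the sample values are hypothetical and are not taken from CEEMS.

```python
def cpu_usage_percent(user_prev, user_last, sys_prev, sys_last,
                      interval_seconds, cpus):
    """Estimate CPU usage (%) from two consecutive counter samples."""
    if cpus <= 0 or interval_seconds <= 0:
        # Mirrors the `> 0` filter in the query: no CPUs, no result.
        return None
    user_rate = (user_last - user_prev) / interval_seconds
    sys_rate = (sys_last - sys_prev) / interval_seconds
    # Seconds of CPU time per wall-clock second, normalised by vCPU count.
    return (user_rate + sys_rate) * 100 / cpus


# Hypothetical samples taken 15 s apart for a VM with 2 vCPUs.
usage = cpu_usage_percent(100.0, 101.2, 40.0, 40.3, 15.0, 2)
print(f"{usage:.2f}%")
```

Dividing by `ceems_compute_unit_cpus` is what turns the value into a fraction of the VM's allocated capacity, so a VM that keeps all of its vCPUs busy reads close to 100.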

Regarding the error about the XML file: time=2025-08-26T15:01:06.269Z level=ERROR source=libvirt.go:499 msg="Failed to run inside security context" collector=libvirt instance_id=instance-0000004f err="stat /etc/libvirt/qemu/instance-0000004f.xml: no such file or directory", the exporter needs to read this file to get the VM's UUID. Can you confirm that the file /etc/libvirt/qemu/instance-0000004f.xml exists on the hypervisor where the VM is running?

Cheers!


vurmil · 2025-08-27T10:13:16Z

vurmil
Aug 27, 2025
Author

Hello @mahendrapaipuri

Regarding the error about the XML file, I can confirm that the file does not exist in the location you mentioned. The instance-0000004f.xml file is located at a different path: /run/libvirt/qemu/instance-0000004f.xml.

Despite the file not being in the expected location, we are successfully collecting metrics for this virtual machine and data is being stored in Prometheus. We are getting a CPU usage value of 5.78175166667279 from the following expression:

(irate(ceems_compute_unit_cpu_user_seconds_total[1m]) + irate(ceems_compute_unit_cpu_system_seconds_total[1m])) * 100 / (ceems_compute_unit_cpus > 0)

However, I've noticed an inconsistency in the collected data. The metric labels show uuid="instance-0000004f", which appears to be the instance name, not the UUID. The instance_id in the log error also uses the instance name. This is an issue as the metric should ideally be collected and labeled with the VM's true UUID for proper identification and tracking.

Could you please assist me in understanding why the uuid label is being populated with the instance name instead of the actual UUID?


mahendrapaipuri Aug 27, 2025
Maintainer

Thanks @vurmil for the confirmation.

UUIDs are not available because the exporter is not able to find the XML files for the VMs. Could you please add --collector.libvirt.xml-dir=/run/libvirt/qemu to the exporter CLI args? This should fix the error, and the uuid labels in the metrics should then contain the VM's UUID. This CLI option is currently hidden, so you will not see it in ceems_exporter --help, but that is fine: it exists, and it will ensure that the exporter looks in the correct directory.

Could you please tell me which Virtual Machine Manager (VMM) you are using? Is it OpenStack or something else? /etc/libvirt/qemu used to be the default location for VMs' XML files. This might have changed in newer libvirt versions. I will take a look into it.
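
To check quickly on a hypervisor which of the two directories actually holds a VM's XML file, something along these lines could be used. This is only an illustrative sketch and not how the exporter does the lookup. The function name and the search order are assumptions; the two paths are the ones mentioned in this thread.

```python
from pathlib import Path

# Directories mentioned in this thread; the search order is an assumption.
CANDIDATE_DIRS = ("/etc/libvirt/qemu", "/run/libvirt/qemu")


def find_domain_xml(instance_name, dirs=CANDIDATE_DIRS):
    """Return the first existing <dir>/<instance_name>.xml, or None."""
    for directory in dirs:
        candidate = Path(directory) / f"{instance_name}.xml"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Prints the path if found, otherwise None. Reading these
    # directories typically requires elevated privileges.
    print(find_domain_xml("instance-0000004f"))
```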

Cheers!

vurmil · 2025-08-27T12:11:36Z

vurmil
Aug 27, 2025
Author

Hello, @mahendrapaipuri

Thank you for your guidance. I have added the --collector.libvirt.xml-dir=/run/libvirt/qemu flag to the ceems_exporter CLI arguments.

I can confirm that the ceems-exporter has stopped logging the XML file error. However, the uuid label is now empty in all collected metrics: the label is still present, but its value is an empty string.

# HELP ceems_compute_unit_cpu_psi_seconds Total CPU PSI in seconds
# TYPE ceems_compute_unit_cpu_psi_seconds gauge
ceems_compute_unit_cpu_psi_seconds{cgrouphostname="",hostname="node01",manager="libvirt",uuid=""} 1672.228112
# HELP ceems_compute_unit_cpu_system_seconds_total Total job CPU system seconds
# TYPE ceems_compute_unit_cpu_system_seconds_total counter
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid=""} 29249.532252
# HELP ceems_compute_unit_cpu_user_seconds_total Total job CPU user seconds
# TYPE ceems_compute_unit_cpu_user_seconds_total counter
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid=""} 95252.970887
# HELP ceems_compute_unit_cpus Total number of job CPUs
# TYPE ceems_compute_unit_cpus gauge
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid=""} 2
# HELP ceems_compute_unit_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_compute_unit_memory_cache_bytes gauge
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid=""} 98304

I am using OpenStack 2025.1 as my Virtual Machine Manager.

Here is the content of the XML file:

cat /run/libvirt/qemu/instance-0000004f.xml

<!--
WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
OVERWRITTEN AND LOST. Changes to this xml configuration should be made using:
  virsh edit instance-0000004f
or other application using the libvirt API.
-->

<domstatus state='running' reason='booted' pid='1252374'>
  <monitor path='/var/lib/libvirt/qemu/domain-3-instance-0000004f/monitor.sock' type='unix'/>
  <namespaces>
    <mount/>
  </namespaces>
  <vcpus>
    <vcpu id='0' pid='1252579'/>
    <vcpu id='1' pid='1252583'/>
  </vcpus>
  <qemuCaps>
    <flag name='kvm'/>
    <flag name='sdl'/>
    <flag name='hda-duplex'/>
    <flag name='ccid-emulated'/>
    <flag name='ccid-passthru'/>
    <flag name='piix3-usb-uhci'/>
    <flag name='piix4-usb-uhci'/>
    <flag name='usb-ehci'/>
    <flag name='ich9-usb-ehci1'/>
    <flag name='pci-ohci'/>
    <flag name='usb-redir'/>
    <flag name='usb-hub'/>
    <flag name='ich9-ahci'/>
    <flag name='virtio-blk-pci.scsi'/>
    <flag name='scsi-disk.channel'/>
    <flag name='scsi-block'/>
    <flag name='hda-micro'/>
    <flag name='nec-usb-xhci'/>
    <flag name='lsi'/>
    <flag name='virtio-scsi-pci'/>
    <flag name='usb-redir.filter'/>
    <flag name='seccomp-sandbox'/>
    <flag name='vnc'/>
    <flag name='VGA'/>
    <flag name='cirrus-vga'/>
    <flag name='vmware-svga'/>
    <flag name='usb-serial'/>
    <flag name='virtio-rng'/>
    <flag name='rng-random'/>
    <flag name='rng-egd'/>
    <flag name='megasas'/>
    <flag name='tpm-passthrough'/>
    <flag name='tpm-tis'/>
    <flag name='pci-bridge'/>
    <flag name='vfio-pci'/>
    <flag name='dmi-to-pci-bridge'/>
    <flag name='usb-storage'/>
    <flag name='virtio-mmio'/>
    <flag name='ich9-intel-hda'/>
    <flag name='kvm-pit-lost-tick-policy'/>
    <flag name='pvpanic'/>
    <flag name='usb-kbd'/>
    <flag name='usb-audio'/>
    <flag name='rtc-reset-reinjection'/>
    <flag name='migrate-rdma'/>
    <flag name='VGA.vgamem_mb'/>
    <flag name='vmware-svga.vgamem_mb'/>
    <flag name='pc-dimm'/>
    <flag name='machine-vmport-opt'/>
    <flag name='pci-serial'/>
    <flag name='gpex-pcihost'/>
    <flag name='ioh3420'/>
    <flag name='x3130-upstream'/>
    <flag name='xio3130-downstream'/>
    <flag name='rtl8139'/>
    <flag name='e1000'/>
    <flag name='virtio-net'/>
    <flag name='virtio-gpu'/>
    <flag name='virtio-keyboard'/>
    <flag name='virtio-mouse'/>
    <flag name='virtio-tablet'/>
    <flag name='virtio-input-host'/>
    <flag name='virtio-balloon-pci.deflate-on-oom'/>
    <flag name='mptsas1068'/>
    <flag name='pxb'/>
    <flag name='pxb-pcie'/>
    <flag name='intel-iommu'/>
    <flag name='virtio-vga'/>
    <flag name='ivshmem-plain'/>
    <flag name='ivshmem-doorbell'/>
    <flag name='vhost-scsi'/>
    <flag name='query-cpu-model-expansion'/>
    <flag name='nvdimm'/>
    <flag name='pcie-root-port'/>
    <flag name='query-cpu-definitions'/>
    <flag name='qemu-xhci'/>
    <flag name='intel-iommu.intremap'/>
    <flag name='intel-iommu.caching-mode'/>
    <flag name='intel-iommu.eim'/>
    <flag name='intel-iommu.device-iotlb'/>
    <flag name='chardev-reconnect'/>
    <flag name='vmcoreinfo'/>
    <flag name='isa-serial'/>
    <flag name='pcie-pci-bridge'/>
    <flag name='nbd-tls'/>
    <flag name='tpm-crb'/>
    <flag name='pr-manager-helper'/>
    <flag name='screendump_device'/>
    <flag name='hda-output'/>
    <flag name='vmgenid'/>
    <flag name='vhost-vsock'/>
    <flag name='tpm-emulator'/>
    <flag name='mch'/>
    <flag name='mch.extended-tseg-mbytes'/>
    <flag name='egl-headless'/>
    <flag name='memory-backend-memfd'/>
    <flag name='memory-backend-memfd.hugetlb'/>
    <flag name='egl-headless.rendernode'/>
    <flag name='memory-backend-file.pmem'/>
    <flag name='nvdimm.unarmed'/>
    <flag name='virtio-pci-non-transitional'/>
    <flag name='nbd-bitmap'/>
    <flag name='x86-max-cpu'/>
    <flag name='cpu-unavailable-features'/>
    <flag name='canonical-cpu-features'/>
    <flag name='bochs-display'/>
    <flag name='migration-file-drop-cache'/>
    <flag name='dbus-vmstate'/>
    <flag name='vhost-user-gpu'/>
    <flag name='vhost-user-vga'/>
    <flag name='incremental-backup'/>
    <flag name='ramfb'/>
    <flag name='drive-nvme'/>
    <flag name='smp-dies'/>
    <flag name='i8042'/>
    <flag name='rng-builtin'/>
    <flag name='vhost-user-fs'/>
    <flag name='query-named-block-nodes.flat'/>
    <flag name='blockdev-snapshot.allow-write-only-overlay'/>
    <flag name='blockdev-reopen'/>
    <flag name='fsdev.multidevs'/>
    <flag name='pcie-root-port.hotplug'/>
    <flag name='aio.io_uring'/>
    <flag name='tcg'/>
    <flag name='virtio-blk-pci.scsi.default.disabled'/>
    <flag name='pvscsi'/>
    <flag name='cpu.migratable'/>
    <flag name='intel-iommu.aw-bits'/>
    <flag name='numa.hmat'/>
    <flag name='usb-host.hostdevice'/>
    <flag name='virtio-balloon.free-page-reporting'/>
    <flag name='block-export-add'/>
    <flag name='netdev.vhost-vdpa'/>
    <flag name='dc390'/>
    <flag name='am53c974'/>
    <flag name='virtio-pmem-pci'/>
    <flag name='vhost-user-fs.bootindex'/>
    <flag name='vhost-user-blk'/>
    <flag name='cpu-max'/>
    <flag name='memory-backend-file.x-use-canonical-path-for-ramblock-id'/>
    <flag name='migration-param.block-bitmap-mapping'/>
    <flag name='vnc-power-control'/>
    <flag name='object.qapified'/>
    <flag name='rotation-rate'/>
    <flag name='compat-deprecated'/>
    <flag name='acpi-index'/>
    <flag name='input-linux'/>
    <flag name='confidential-guest-support'/>
    <flag name='set-action'/>
    <flag name='virtio-blk.queue-size'/>
    <flag name='virtio-mem-pci'/>
    <flag name='memory-backend-file.reserve'/>
    <flag name='piix4.acpi-root-pci-hotplug'/>
    <flag name='netdev.json'/>
    <flag name='query-dirty-rate'/>
    <flag name='rbd-encryption'/>
    <flag name='sev-guest-kernel-hashes'/>
    <flag name='sev-inject-launch-secret'/>
    <flag name='device.json+hotplug'/>
    <flag name='virtio-mem-pci.prealloc'/>
    <flag name='calc-dirty-rate'/>
    <flag name='dirtyrate-param.mode'/>
    <flag name='blockdev.nbd.tls-hostname'/>
    <flag name='memory-backend-file.prealloc-threads'/>
    <flag name='virtio-iommu-pci'/>
    <flag name='virtio-iommu.boot-bypass'/>
    <flag name='virtio-net.rss'/>
    <flag name='chardev.qemu-vdagent'/>
    <flag name='display-dbus'/>
    <flag name='iothread.thread-pool-max'/>
    <flag name='usb-host.guest-resets-all'/>
    <flag name='migration.blocked-reasons'/>
    <flag name='query-stats'/>
    <flag name='query-stats-schemas'/>
    <flag name='thread-context'/>
    <flag name='screenshot-format-png'/>
    <flag name='machine-hpet'/>
    <flag name='netdev.stream'/>
    <flag name='virtio-crypto'/>
    <flag name='pvpanic-pci'/>
    <flag name='netdev.stream.reconnect'/>
    <flag name='virtio-gpu.blob'/>
    <flag name='rbd-encryption-layering'/>
    <flag name='rbd-encryption-luks-any'/>
    <flag name='qcow2-discard-no-unref'/>
    <flag name='run-with.async-teardown'/>
  </qemuCaps>
  <devices>
    <device alias='input0'/>
    <device alias='pci.16'/>
    <device alias='pci.7'/>
    <device alias='pci.13'/>
    <device alias='pci.4'/>
    <device alias='pci.1'/>
    <device alias='pci.10'/>
    <device alias='serial0'/>
    <device alias='usb'/>
    <device alias='balloon0'/>
    <device alias='pci.18'/>
    <device alias='pci.9'/>
    <device alias='pci.6'/>
    <device alias='pci.15'/>
    <device alias='pci.12'/>
    <device alias='pci.3'/>
    <device alias='rng0'/>
    <device alias='input1'/>
    <device alias='pci.17'/>
    <device alias='pci.8'/>
    <device alias='pci.14'/>
    <device alias='pci.5'/>
    <device alias='pci.11'/>
    <device alias='pci.2'/>
    <device alias='net0'/>
    <device alias='video0'/>
    <device alias='ua-a3c6c231-d11d-4e5a-865e-0e204c5a0d1b'/>
  </devices>
  <libDir path='/var/lib/libvirt/qemu/domain-3-instance-0000004f'/>
  <channelTargetDir path='/run/libvirt/qemu/channel/3-instance-0000004f'/>
  <cpu mode='host-model' check='partial'>
    <topology sockets='2' dies='1' cores='1' threads='1'/>
  </cpu>
  <rememberOwner/>
  <nodename index='1'/>
  <fdset index='1'/>
  <blockjobs active='no'/>
  <agentTimeout>-2</agentTimeout>
  <domain type='kvm' id='3'>
    <name>instance-0000004f</name>
    <uuid>2dcd7203-555c-4b8a-8700-711a92af820a</uuid>
    <metadata>
      <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.1">
        <nova:package version="31.0.1"/>
        <nova:name>K8S-W02</nova:name>
        <nova:creationTime>2025-08-14 12:33:29</nova:creationTime>
        <nova:flavor name="m1.medium">
          <nova:memory>4096</nova:memory>
          <nova:disk>40</nova:disk>
          <nova:swap>0</nova:swap>
          <nova:ephemeral>0</nova:ephemeral>
          <nova:vcpus>2</nova:vcpus>
        </nova:flavor>
        <nova:owner>
          <nova:user uuid="d3dda58c2e2949fe881732a4e8181a2a">admin</nova:user>
          <nova:project uuid="79749c34b0de4a9c9bda0df7115867b8">admin</nova:project>
        </nova:owner>
        <nova:root type="volume" uuid=""/>
        <nova:ports>
          <nova:port uuid="b4349b87-e002-4bf0-b943-4c09aea7b280">
            <nova:ip type="fixed" address="10.0.5.145" ipVersion="4"/>
          </nova:port>
        </nova:ports>
      </nova:instance>
    </metadata>
    <memory unit='KiB'>4194304</memory>
    <currentMemory unit='KiB'>4194304</currentMemory>
    <vcpu placement='static'>2</vcpu>
    <resource>
      <partition>/machine</partition>
    </resource>
    <sysinfo type='smbios'>
      <system>
        <entry name='manufacturer'>OpenStack Foundation</entry>
        <entry name='product'>OpenStack Nova</entry>
        <entry name='version'>31.0.1</entry>
        <entry name='serial'>2dcd7203-555c-4b8a-8700-711a92af820a</entry>
        <entry name='uuid'>2dcd7203-555c-4b8a-8700-711a92af820a</entry>
        <entry name='family'>Virtual Machine</entry>
      </system>
    </sysinfo>
    <os>
      <type arch='x86_64' machine='pc-q35-8.2'>hvm</type>
      <boot dev='hd'/>
      <smbios mode='sysinfo'/>
    </os>
    <features>
      <acpi/>
      <apic/>
      <vmcoreinfo state='on'/>
    </features>
    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-Genoa</model>
      <vendor>AMD</vendor>
      <topology sockets='2' dies='1' cores='1' threads='1'/>
      <feature policy='require' name='x2apic'/>
      <feature policy='require' name='tsc-deadline'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='tsc_adjust'/>
      <feature policy='require' name='spec-ctrl'/>
      <feature policy='require' name='stibp'/>
      <feature policy='require' name='flush-l1d'/>
      <feature policy='require' name='arch-capabilities'/>
      <feature policy='require' name='ssbd'/>
      <feature policy='require' name='cmp_legacy'/>
      <feature policy='require' name='virt-ssbd'/>
      <feature policy='require' name='lbrv'/>
      <feature policy='require' name='tsc-scale'/>
      <feature policy='require' name='vmcb-clean'/>
      <feature policy='require' name='flushbyasid'/>
      <feature policy='require' name='pause-filter'/>
      <feature policy='require' name='pfthreshold'/>
      <feature policy='require' name='v-vmsave-vmload'/>
      <feature policy='require' name='vgif'/>
      <feature policy='require' name='rdctl-no'/>
      <feature policy='require' name='skip-l1dfl-vmentry'/>
      <feature policy='require' name='mds-no'/>
      <feature policy='require' name='pschange-mc-no'/>
      <feature policy='require' name='gds-no'/>
      <feature policy='require' name='topoext'/>
    </cpu>
    <clock offset='utc'>
      <timer name='pit' tickpolicy='delay'/>
      <timer name='rtc' tickpolicy='catchup'/>
      <timer name='hpet' present='no'/>
    </clock>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <devices>
      <emulator>/usr/bin/qemu-system-x86_64</emulator>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
        <auth username='cinder'>
          <secret type='ceph' uuid='9575cec0-1291-467a-9ee3-d29abdbd5161'/>
        </auth>
        <source protocol='rbd' name='volumes/volume-a3c6c231-d11d-4e5a-865e-0e204c5a0d1b' tlsFromConfig='0' index='1'>
          <host name='198.19.1.21' port='6789'/>
          <host name='198.19.1.22' port='6789'/>
          <host name='198.19.1.23' port='6789'/>
          <privateData>
            <nodenames>
              <nodename type='storage' name='libvirt-1-storage'/>
              <nodename type='format' name='libvirt-1-format'/>
            </nodenames>
            <objects>
              <secret type='auth' alias='libvirt-1-storage-auth-secret0'/>
            </objects>
          </privateData>
        </source>
        <target dev='vda' bus='virtio'/>
        <serial>a3c6c231-d11d-4e5a-865e-0e204c5a0d1b</serial>
        <alias name='ua-a3c6c231-d11d-4e5a-865e-0e204c5a0d1b'/>
        <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
        <privateData>
          <qom name='/machine/peripheral/ua-a3c6c231-d11d-4e5a-865e-0e204c5a0d1b/virtio-backend'/>
        </privateData>
        <diskSecretsPlacement auth='true'/>
      </disk>
      <controller type='pci' index='0' model='pcie-root'>
        <alias name='pcie.0'/>
      </controller>
      <controller type='pci' index='1' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='1' port='0x10'/>
        <alias name='pci.1'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
      </controller>
      <controller type='pci' index='2' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='2' port='0x11'/>
        <alias name='pci.2'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
      </controller>
      <controller type='pci' index='3' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='3' port='0x12'/>
        <alias name='pci.3'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
      </controller>
      <controller type='pci' index='4' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='4' port='0x13'/>
        <alias name='pci.4'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
      </controller>
      <controller type='pci' index='5' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='5' port='0x14'/>
        <alias name='pci.5'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
      </controller>
      <controller type='pci' index='6' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='6' port='0x15'/>
        <alias name='pci.6'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
      </controller>
      <controller type='pci' index='7' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='7' port='0x16'/>
        <alias name='pci.7'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
      </controller>
      <controller type='pci' index='8' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='8' port='0x17'/>
        <alias name='pci.8'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
      </controller>
      <controller type='pci' index='9' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='9' port='0x18'/>
        <alias name='pci.9'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
      </controller>
      <controller type='pci' index='10' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='10' port='0x19'/>
        <alias name='pci.10'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
      </controller>
      <controller type='pci' index='11' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='11' port='0x1a'/>
        <alias name='pci.11'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
      </controller>
      <controller type='pci' index='12' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='12' port='0x1b'/>
        <alias name='pci.12'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
      </controller>
      <controller type='pci' index='13' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='13' port='0x1c'/>
        <alias name='pci.13'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
      </controller>
      <controller type='pci' index='14' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='14' port='0x1d'/>
        <alias name='pci.14'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
      </controller>
      <controller type='pci' index='15' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='15' port='0x1e'/>
        <alias name='pci.15'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x6'/>
      </controller>
      <controller type='pci' index='16' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='16' port='0x1f'/>
        <alias name='pci.16'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x7'/>
      </controller>
      <controller type='usb' index='0' model='piix3-uhci'>
        <alias name='usb'/>
        <address type='pci' domain='0x0000' bus='0x12' slot='0x01' function='0x0'/>
      </controller>
      <controller type='sata' index='0'>
        <alias name='ide'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
      </controller>
      <controller type='pci' index='17' model='pcie-root-port'>
        <model name='pcie-root-port'/>
        <target chassis='17' port='0x20'/>
        <alias name='pci.17'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </controller>
      <controller type='pci' index='18' model='pcie-to-pci-bridge'>
        <model name='pcie-pci-bridge'/>
        <alias name='pci.18'/>
        <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </controller>
      <interface type='ethernet'>
        <mac address='fa:16:3e:cf:c8:aa'/>
        <target dev='tapb4349b87-e0'/>
        <model type='virtio'/>
        <mtu size='1442'/>
        <alias name='net0'/>
        <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </interface>
      <serial type='pty'>
        <source path='/dev/pts/2'/>
        <log file='/var/lib/nova/instances/2dcd7203-555c-4b8a-8700-711a92af820a/console.log' append='off'/>
        <target type='isa-serial' port='0'>
          <model name='isa-serial'/>
        </target>
        <alias name='serial0'/>
      </serial>
      <console type='pty' tty='/dev/pts/2'>
        <source path='/dev/pts/2'/>
        <log file='/var/lib/nova/instances/2dcd7203-555c-4b8a-8700-711a92af820a/console.log' append='off'/>
        <target type='serial' port='0'/>
        <alias name='serial0'/>
      </console>
      <input type='tablet' bus='usb'>
        <alias name='input0'/>
        <address type='usb' bus='0' port='1'/>
      </input>
      <input type='keyboard' bus='usb'>
        <alias name='input1'/>
        <address type='usb' bus='0' port='2'/>
      </input>
      <input type='mouse' bus='ps2'>
        <alias name='input2'/>
      </input>
      <input type='keyboard' bus='ps2'>
        <alias name='input3'/>
      </input>
      <graphics type='vnc' port='5900' autoport='yes' websocketGenerated='no' listen='198.19.1.21'>
        <listen type='address' address='198.19.1.21' fromConfig='0' autoGenerated='no'/>
      </graphics>
      <audio id='1' type='none'/>
      <video>
        <model type='virtio' heads='1' primary='yes'/>
        <alias name='video0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
      </video>
      <watchdog model='itco' action='reset'>
        <alias name='watchdog0'/>
      </watchdog>
      <memballoon model='virtio'>
        <stats period='10'/>
        <alias name='balloon0'/>
        <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </memballoon>
      <rng model='virtio'>
        <backend model='random'>/dev/urandom</backend>
        <alias name='rng0'/>
        <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
      </rng>
    </devices>
    <seclabel type='dynamic' model='dac' relabel='yes'>
      <label>+42436:+42436</label>
      <imagelabel>+42436:+42436</imagelabel>
    </seclabel>
  </domain>
</domstatus>
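
Note that in this runtime file the `<domain>` element is nested inside a `<domstatus>` wrapper rather than sitting at the root. Whether that is why the uuid label came out empty is only a guess, but a parser that handles both layouts can be sketched in Python as below. The function name is hypothetical, and the assumption that the files under /etc/libvirt/qemu use `<domain>` as the root element is not confirmed in this thread.

```python
import xml.etree.ElementTree as ET


def domain_uuid(xml_text):
    """Return the text of <domain>/<uuid>, or None if it is absent.

    Accepts either a document whose root is <domain> or one where
    <domain> is a direct child of the root (as in <domstatus> above).
    """
    root = ET.fromstring(xml_text)
    domain = root if root.tag == "domain" else root.find("domain")
    if domain is None:
        return None
    return domain.findtext("uuid")


# Trimmed-down version of the runtime file shown above.
runtime_xml = """
<domstatus state='running' reason='booted' pid='1252374'>
  <domain type='kvm' id='3'>
    <name>instance-0000004f</name>
    <uuid>2dcd7203-555c-4b8a-8700-711a92af820a</uuid>
  </domain>
</domstatus>
"""
print(domain_uuid(runtime_xml))
```

Looking only at direct children of the root keeps the lookup from accidentally picking up the `uuid` attributes inside the nova metadata block.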


mahendrapaipuri Aug 28, 2025
Maintainer

@vurmil I have made a PR with the necessary changes to support the runtime directory as well. You can download the exporter binary from the CI artifacts. Could you please try this patch and see if it works as intended on your deployment? With the new version, there is no need to pass the --collector.libvirt.xml-dir CLI flag.

Cheers

vurmil Aug 28, 2025
Author

@mahendrapaipuri This looks much better, thanks.

My colleague in the server room is servicing a host, and I don't have a fully functional OpenStack test environment to confirm everything is working properly. The OpenStack API isn't working for me right now, and CEEMS has a problem with the token, but I'm sure the exporter is working correctly now :) Thanks again. I'll let you know later if the whole process is okay.

# HELP ceems_compute_unit_blkio_write_total_bytes Total block IO write bytes
# TYPE ceems_compute_unit_blkio_write_total_bytes gauge
ceems_compute_unit_blkio_write_total_bytes{cgrouphostname="",device="dm-1",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 4096
# HELP ceems_compute_unit_blkio_write_total_requests Total block IO write requests
# TYPE ceems_compute_unit_blkio_write_total_requests gauge
ceems_compute_unit_blkio_write_total_requests{cgrouphostname="",device="dm-1",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1
# HELP ceems_compute_unit_cpu_psi_seconds Total CPU PSI in seconds
# TYPE ceems_compute_unit_cpu_psi_seconds gauge
ceems_compute_unit_cpu_psi_seconds{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1804.145451
# HELP ceems_compute_unit_cpu_system_seconds_total Total job CPU system seconds
# TYPE ceems_compute_unit_cpu_system_seconds_total counter
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 32116.986989
# HELP ceems_compute_unit_cpu_user_seconds_total Total job CPU user seconds
# TYPE ceems_compute_unit_cpu_user_seconds_total counter
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 102468.603563
# HELP ceems_compute_unit_cpus Total number of job CPUs
# TYPE ceems_compute_unit_cpus gauge
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 2
# HELP ceems_compute_unit_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_compute_unit_memory_cache_bytes gauge
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 98304
# HELP ceems_compute_unit_memory_fail_count Memory fail count
# TYPE ceems_compute_unit_memory_fail_count gauge
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 0
# HELP ceems_compute_unit_memory_psi_seconds Total memory PSI in seconds
# TYPE ceems_compute_unit_memory_psi_seconds gauge
ceems_compute_unit_memory_psi_seconds{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 0
# HELP ceems_compute_unit_memory_rss_bytes Memory RSS used in bytes
# TYPE ceems_compute_unit_memory_rss_bytes gauge
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 3.252555776e+09
# HELP ceems_compute_unit_memory_total_bytes Memory total in bytes
# TYPE ceems_compute_unit_memory_total_bytes gauge
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1.081857384448e+12
# HELP ceems_compute_unit_memory_used_bytes Memory used in bytes
# TYPE ceems_compute_unit_memory_used_bytes gauge
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 3.266187264e+09
# HELP ceems_compute_unit_memsw_fail_count Swap fail count
# TYPE ceems_compute_unit_memsw_fail_count gauge
ceems_compute_unit_memsw_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 0
# HELP ceems_compute_unit_memsw_total_bytes Swap total in bytes
# TYPE ceems_compute_unit_memsw_total_bytes gauge
ceems_compute_unit_memsw_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 8.589930496e+09
# HELP ceems_compute_unit_memsw_used_bytes Swap used in bytes
# TYPE ceems_compute_unit_memsw_used_bytes gauge
ceems_compute_unit_memsw_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 0
# HELP ceems_compute_units Total number of jobs
# TYPE ceems_compute_units gauge
ceems_compute_units{hostname="node01",manager="libvirt"} 1

mahendrapaipuri Aug 28, 2025
Maintainer

@vurmil Awesome. Thanks a lot for the quick testing, I appreciate it. I will merge it and try to cut a new release soon!

mahendrapaipuri Sep 2, 2025
Maintainer

@vurmil New release v0.11.0 has been made with support for runtime XML files for libvirt. There are some breaking changes in metric labeling. Please consult the changelog.

Answer selected by mahendrapaipuri

vurmil Sep 4, 2025
Author

I'm still seeing a problem with the ceems_exporter that seems to go beyond the initial tests.

Even though I can see some ceems_compute_unit metrics in the /metrics endpoint, the crucial CPU usage metric, ceems_compute_unit_cpu_usage, is missing.

Here's what I've found so far:

Existing Metrics: The exporter successfully scrapes other metrics for my VMs, such as ceems_compute_unit_cpu_system_seconds_total, ceems_compute_unit_cpus, and all memory-related metrics. This indicates that the libvirt collector is at least partially working.

cgroups Data: I've manually confirmed that the CPU usage data for my VMs exists in the system. The cgroup files, like /sys/fs/cgroup/machine/qemu-706-instance-0000004f.libvirt-qemu/cpu.stat, contain valid values for usage_usec and user_usec. This confirms the data is present at the source.

Missing Metrics: Despite the data being available, the specific ceems_compute_unit_cpu_usage metric is not being exposed by the exporter. It seems there's a disconnect between what the exporter is able to read and what it's exposing. I noticed that the cgroup directory names are not the same as the VM UUIDs, but rather follow the pattern qemu-706-instance-0000004f.libvirt-qemu. Could this naming convention be causing the exporter to fail to correctly match the data?

I'm happy to provide additional logs or information if it helps you diagnose this issue.

# ls -l /sys/fs/cgroup/machine/ |grep qemu
drwxr-xr-x 5 root root 0 Sep  4 18:27 qemu-706-instance-0000004f.libvirt-qemu
drwxr-xr-x 5 root root 0 Sep  4 18:27 qemu-729-instance-00000022.libvirt-qemu
drwxr-xr-x 5 root root 0 Sep  4 18:27 qemu-731-instance-0000005a.libvirt-qemu
drwxr-xr-x 5 root root 0 Sep  4 18:27 qemu-732-instance-0000005d.libvirt-qemu

# ls -l /sys/fs/cgroup/machine/qemu-706-instance-0000004f.libvirt-qemu/
total 0
-r--r--r-- 1 root root 0 Sep  4 18:27 cgroup.controllers
-r--r--r-- 1 root root 0 Sep  4 18:27 cgroup.events
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.freeze
--w------- 1 root root 0 Sep  4 18:27 cgroup.kill
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.max.depth
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.pressure
-rw-r--r-- 1 root root 0 Sep  4 01:13 cgroup.procs
-r--r--r-- 1 root root 0 Sep  4 18:27 cgroup.stat
-rw-r--r-- 1 root root 0 Sep  4 01:13 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.threads
-rw-r--r-- 1 root root 0 Sep  4 18:27 cgroup.type
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.idle
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.max
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.max.burst
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.pressure
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpuset.cpus
-r--r--r-- 1 root root 0 Sep  4 18:27 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpuset.cpus.exclusive
-r--r--r-- 1 root root 0 Sep  4 18:27 cpuset.cpus.exclusive.effective
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpuset.mems
-r--r--r-- 1 root root 0 Sep  4 18:27 cpuset.mems.effective
-r--r--r-- 1 root root 0 Sep  4 18:27 cpu.stat
-r--r--r-- 1 root root 0 Sep  4 18:27 cpu.stat.local
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.uclamp.max
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.uclamp.min
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.weight
-rw-r--r-- 1 root root 0 Sep  4 18:27 cpu.weight.nice
drwxr-xr-x 2 root root 0 Sep  4 18:27 emulator
-rw-r--r-- 1 root root 0 Sep  4 18:27 io.max
-rw-r--r-- 1 root root 0 Sep  4 18:27 io.pressure
-rw-r--r-- 1 root root 0 Sep  4 18:27 io.prio.class
-r--r--r-- 1 root root 0 Sep  4 18:27 io.stat
-rw-r--r-- 1 root root 0 Sep  4 18:27 io.weight
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.current
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.events
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.events.local
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.high
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.low
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.max
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.min
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.numa_stat
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.oom.group
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.peak
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.pressure
--w------- 1 root root 0 Sep  4 18:27 memory.reclaim
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.stat
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.swap.current
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.swap.events
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.swap.high
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.swap.max
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.swap.peak
-r--r--r-- 1 root root 0 Sep  4 18:27 memory.zswap.current
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.zswap.max
-rw-r--r-- 1 root root 0 Sep  4 18:27 memory.zswap.writeback
drwxr-xr-x 2 root root 0 Sep  4 18:27 vcpu0
drwxr-xr-x 2 root root 0 Sep  4 18:27 vcpu1

# cat /sys/fs/cgroup/machine/qemu-706-instance-0000004f.libvirt-qemu/cpu.stat
usage_usec 7624783531
user_usec 5894832419
system_usec 1729951111
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
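
The counters in cpu.stat are cumulative values in microseconds. As a minimal sketch, assuming the exporter's `*_seconds_total` metrics are derived from `user_usec` and `system_usec` (the exporter's code is not shown in this thread), the file can be parsed and converted to seconds like this. The function names are illustrative only.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the flat 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


def usec_to_seconds(usec: int) -> float:
    """Convert a microsecond counter to seconds."""
    return usec / 1_000_000


# First three lines of the cpu.stat output shown above.
sample = """\
usage_usec 7624783531
user_usec 5894832419
system_usec 1729951111
"""

stats = parse_cpu_stat(sample)
user_seconds = usec_to_seconds(stats["user_usec"])
system_seconds = usec_to_seconds(stats["system_usec"])
print(f"user={user_seconds:.6f}s system={system_seconds:.6f}s")
```

These are running totals since the cgroup was created, so on their own they say nothing about current utilization; that requires comparing two readings taken at different times.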

Why Do the Names Have This Structure?
The name structure qemu-706-instance-0000004f.libvirt-qemu is made up of several key components that Libvirt uses to identify and manage virtual machines (VMs):

qemu: This indicates that the cgroup container is for a QEMU process.
706: This is the libvirt domain ID, a number that libvirt assigns to the domain each time it is started. It is not the PID of the QEMU process.
instance-0000004f: This is the libvirt domain name (here, the instance name assigned by OpenStack Nova), not the UUID. It does not change when the VM is restarted, unlike the domain ID.
libvirt-qemu: This suffix clearly indicates that the name was generated by Libvirt.

x.x.x.x/metrics (cgroups missing?) - maybe this is the problem



# HELP ceems_compute_unit_cpu_system_seconds_total Total job CPU system seconds
# TYPE ceems_compute_unit_cpu_system_seconds_total counter
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 152.40381
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 17414.004313
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1682.534799
ceems_compute_unit_cpu_system_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 18396.708952
# HELP ceems_compute_unit_cpu_user_seconds_total Total job CPU user seconds
# TYPE ceems_compute_unit_cpu_user_seconds_total counter
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 406.177377
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 20754.48814
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 5731.781339
ceems_compute_unit_cpu_user_seconds_total{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 22131.994789
# HELP ceems_compute_unit_cpus Total number of job CPUs
# TYPE ceems_compute_unit_cpus gauge
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 2
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 2
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 2
ceems_compute_unit_cpus{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 2
# HELP ceems_compute_unit_memory_cache_bytes Memory cache used in bytes
# TYPE ceems_compute_unit_memory_cache_bytes gauge
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 98304
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 0
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 98304
ceems_compute_unit_memory_cache_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 0
# HELP ceems_compute_unit_memory_fail_count Memory fail count
# TYPE ceems_compute_unit_memory_fail_count gauge
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 0
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 0
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 0
ceems_compute_unit_memory_fail_count{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 0
# HELP ceems_compute_unit_memory_rss_bytes Memory RSS used in bytes
# TYPE ceems_compute_unit_memory_rss_bytes gauge
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 1.540296704e+09
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 4.9823744e+07
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1.957965824e+09
ceems_compute_unit_memory_rss_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 4.343808e+07
# HELP ceems_compute_unit_memory_total_bytes Memory total in bytes
# TYPE ceems_compute_unit_memory_total_bytes gauge
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 1.081857384448e+12
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 1.081857384448e+12
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1.081857384448e+12
ceems_compute_unit_memory_total_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 1.081857384448e+12
# HELP ceems_compute_unit_memory_used_bytes Memory used in bytes
# TYPE ceems_compute_unit_memory_used_bytes gauge
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="26e1afa9-e553-4eb1-bc28-8651bc794194"} 1.550077952e+09
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2723224e-76c4-4882-a825-87a6c28a1228"} 5.718016e+07
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="2dcd7203-555c-4b8a-8700-711a92af820a"} 1.968685056e+09
ceems_compute_unit_memory_used_bytes{cgrouphostname="",hostname="node01",manager="libvirt",uuid="bc53b3e9-89e0-49fb-941f-2e8f8bb9503b"} 5.0130944e+07
# HELP ceems_compute_units Total number of jobs
# TYPE ceems_compute_units gauge
ceems_compute_units{hostname="node01",manager="libvirt"} 4
# HELP ceems_cpu_count Number of CPUs.
# TYPE ceems_cpu_count gauge
ceems_cpu_count{hostname="node01"} 128
# HELP ceems_cpu_per_core_count Number of logical CPUs per physical core.
# TYPE ceems_cpu_per_core_count gauge
ceems_cpu_per_core_count{hostname="node01"} 2
# HELP ceems_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE ceems_cpu_seconds_total counter
ceems_cpu_seconds_total{hostname="node01",mode="idle"} 2.83584237e+08
ceems_cpu_seconds_total{hostname="node01",mode="iowait"} 32047.32
ceems_cpu_seconds_total{hostname="node01",mode="irq"} 0
ceems_cpu_seconds_total{hostname="node01",mode="nice"} 137.03
ceems_cpu_seconds_total{hostname="node01",mode="softirq"} 41534.42
ceems_cpu_seconds_total{hostname="node01",mode="steal"} 0
ceems_cpu_seconds_total{hostname="node01",mode="system"} 4.86514034e+06
ceems_cpu_seconds_total{hostname="node01",mode="user"} 1.146293848e+07

....

that's the whole beginning, all from libvirt

mahendrapaipuri Sep 5, 2025
Maintainer

Hello @vurmil

Thanks for testing the new version.

I think there is some misunderstanding here. As I mentioned in my previous comment, the CPU usage (in %) is not available in cgroups directly. cgroups only expose the CPU times spent in user and kernel space, and that is what the exporter exports. The CPU usage (in %) must be computed from the "raw" CPU times exported by the exporter. I recommend installing this Prometheus rule file as Prometheus recording rules; the CPU and memory usage (in %) will then be estimated in real time. They will be available under the names uuid:ceems_cpu_usage:ratio_irate and uuid:ceems_memory_usage:ratio, respectively.
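
To illustrate the idea of deriving a usage figure from the raw counters, here is a minimal sketch. It is not the recording rule itself (the rule file is not reproduced in this thread); it only assumes that usage is the CPU time consumed between two scrapes divided by the CPU time that was available to the VM. The function name and the sample values are hypothetical.

```python
def cpu_usage_ratio(user_0, system_0, user_1, system_1, elapsed_seconds, num_cpus):
    """Fraction of the allocated CPU time used between two samples.

    user_*/system_* are cumulative CPU seconds (as in the *_seconds_total
    counters), elapsed_seconds is the wall-clock time between the samples,
    and num_cpus is the number of CPUs allocated to the VM.
    """
    if elapsed_seconds <= 0 or num_cpus <= 0:
        raise ValueError("elapsed_seconds and num_cpus must be positive")
    used = (user_1 - user_0) + (system_1 - system_0)
    return used / (elapsed_seconds * num_cpus)


# Hypothetical samples: two scrapes 60 s apart on a 2-vCPU VM.
ratio = cpu_usage_ratio(
    user_0=100.0, system_0=50.0,
    user_1=130.0, system_1=56.0,
    elapsed_seconds=60.0, num_cpus=2,
)
print(f"CPU usage: {ratio * 100:.1f}%")
```

In Prometheus the same calculation is normally expressed with a rate function over the counters, which is why it belongs in a recording rule and not in the exporter.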

From your cgroups layout, I see there are 4 VMs on the hypervisor, and I see the exporter exporting metrics for four VMs. So I guess the exporter is picking up all the VMs and not missing any. The naming of the cgroups should not be an issue, as the exporter uses a regex to find all the cgroup folders for the VMs on the hypervisor.
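
As a rough illustration of regex-based discovery, the sketch below matches the directory names from the listing earlier in the thread. The pattern is an assumption made for this example; the exporter's actual regex is not shown here and may differ.

```python
import re

# Hypothetical pattern for names like qemu-706-instance-0000004f.libvirt-qemu.
# This is NOT taken from the exporter's source.
CGROUP_DIR = re.compile(
    r"^qemu-(?P<number>\d+)-(?P<name>instance-[0-9a-f]+)\.libvirt-qemu$"
)


def match_vm_cgroups(entries):
    """Return {instance name: numeric field} for entries that look like VM cgroups."""
    found = {}
    for entry in entries:
        m = CGROUP_DIR.match(entry)
        if m:
            found[m.group("name")] = int(m.group("number"))
    return found


# Directory names from the listing above, plus one non-VM entry.
entries = [
    "qemu-706-instance-0000004f.libvirt-qemu",
    "qemu-729-instance-00000022.libvirt-qemu",
    "qemu-731-instance-0000005a.libvirt-qemu",
    "qemu-732-instance-0000005d.libvirt-qemu",
    "cgroup.procs",
]

vms = match_vm_cgroups(entries)
print(vms)
```

With a pattern of this kind, the numeric field and the instance name do not need to equal the UUID used in the metric labels; they only need to be recognizable in the directory name.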

Let me know if that answers your question.

vurmil · 2025-09-05T08:23:59Z

vurmil
Sep 5, 2025
Author

Sorry, I have two environments and I got them mixed up. Of course, I have the Prometheus rule added.

When I inspect the data in Grafana, it doesn't show any data. Do you know why?

More info:

Inspect: ⚙️ Average Utilization by admin
curl -X 'GET' -H 'Accept: application/json;q=0.9,text/plain' -H 'Accept-Encoding: gzip' -H 'X-Grafana-User: xxxxxxxx' 'http://198.19.1.21:9020/api/v1/usage/current/admin?cluster_id=os-0&field=avg_cpu_usage&field=avg_cpu_mem_usage&field=avg_gpu_usage&field=avg_gpu_mem_usage&field=username&from=1749281619&project=admin&to=1757057619&user=admin'

returns

{"status":"success","data":[]}
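
When debugging an empty response like this, it can help to decode the query string. The sketch below assumes that `from` and `to` are Unix timestamps in seconds, which is consistent with the time window that appears in the queryParams of the API server log. The helper name is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# URL copied from the curl command above.
url = (
    "http://198.19.1.21:9020/api/v1/usage/current/admin"
    "?cluster_id=os-0&field=avg_cpu_usage&field=avg_cpu_mem_usage"
    "&field=avg_gpu_usage&field=avg_gpu_mem_usage&field=username"
    "&from=1749281619&project=admin&to=1757057619&user=admin"
)


def describe_query(url):
    """Decode the requested fields and the time window of a usage query."""
    params = parse_qs(urlsplit(url).query)
    start = datetime.fromtimestamp(int(params["from"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(params["to"][0]), tz=timezone.utc)
    return {
        "fields": params["field"],
        "from": start.isoformat(),
        "to": end.isoformat(),
        "days": (end - start).days,
    }


info = describe_query(url)
print(info)
```

Under that assumption, the request covers a 90-day window and asks for four aggregate fields plus the username.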

In the API logs:
time=2025-09-05T08:03:52.661Z level=INFO source=middleware.go:98 msg=middleware logged_user=admin url="/api/v1/usage/current/admin?cluster_id=os-0&field=avg_cpu_usage&field=avg_cpu_mem_usage&field=avg_gpu_usage&field=avg_gpu_mem_usage&field=username&from=1749281619&project=admin&to=1757057619&user=admin"
time=2025-09-05T08:03:52.662Z level=DEBUG source=querier.go:230 msg="DB query" query="SELECT DISTINCT json_each.key AS name FROM units, json_each(avg_gpu_mem_usage) WHERE json_each.key IS NOT NULL AND project IN (?) AND cluster_id IN (?) AND (last_updated_at BETWEEN (?) AND (?))" queryParams=admin,os-0,2025-06-07T07:30:00,2025-09-05T07:45:00
time=2025-09-05T08:03:52.662Z level=DEBUG source=querier.go:230 msg="DB query" query="SELECT DISTINCT json_each.key AS name FROM units, json_each(avg_cpu_usage) WHERE json_each.key IS NOT NULL AND project IN (?) AND cluster_id IN (?) AND (last_updated_at BETWEEN (?) AND (?))" queryParams=admin,os-0,2025-06-07T07:30:00,2025-09-05T07:45:00
time=2025-09-05T08:03:52.663Z level=DEBUG source=querier.go:230 msg="DB query" query="SELECT DISTINCT json_each.key AS name FROM units, json_each(avg_cpu_mem_usage) WHERE json_each.key IS NOT NULL AND project IN (?) AND cluster_id IN (?) AND (last_updated_at BETWEEN (?) AND (?))" queryParams=admin,os-0,2025-06-07T07:30:00,2025-09-05T07:45:00
time=2025-09-05T08:03:52.663Z level=DEBUG source=querier.go:230 msg="DB query" query="SELECT DISTINCT json_each.key AS name FROM units, json_each(avg_gpu_usage) WHERE json_each.key IS NOT NULL AND project IN (?) AND cluster_id IN (?) AND (last_updated_at BETWEEN (?) AND (?))" queryParams=admin,os-0,2025-06-07T07:30:00,2025-09-05T07:45:00
time=2025-09-05T08:03:52.664Z level=DEBUG source=querier.go:230 msg="DB query" query="SELECT json_object() AS avg_cpu_usage,json_object() AS avg_cpu_mem_usage,json_object() AS avg_gpu_usage,json_object() AS avg_gpu_mem_usage,username FROM (units AS u LEFT JOIN json_each(total_time_seconds,'$.alloc_cputime') AS total_time_seconds_alloc_cputime LEFT JOIN json_each(total_time_seconds,'$.alloc_cpumemtime') AS total_time_seconds_alloc_cpumemtime LEFT JOIN json_each(total_time_seconds,'$.alloc_gputime') AS total_time_seconds_alloc_gputime LEFT JOIN json_each(total_time_seconds,'$.alloc_gpumemtime') AS total_time_seconds_alloc_gpumemtime) WHERE project IN (SELECT name FROM projects WHERE EXISTS (SELECT 1 FROM json_each(users) WHERE value IN (?,?))) AND project IN (?) AND cluster_id IN (?) AND (last_updated_at BETWEEN (?) AND (?)) GROUP BY project,username ORDER BY cluster_id ASC, username ASC, project ASC " queryParams=admin:svc,admin,admin,os-0,2025-06-07T07:30:00,2025-09-05T07:45:00
time=2025-09-05T08:03:52.664Z level=DEBUG source=helpers.go:171 msg="usage admin endpoint" duration=2.023653ms

If I remove all parameters (http://198.19.1.21:9020/api/v1/usage/current/admin), it starts returning some data:

curl -X 'GET' -H 'Accept: application/json;q=0.9,text/plain' -H 'Accept-Encoding: gzip' -H 'X-Grafana-User: admin' 'http://198.19.1.21:9020/api/v1/usage/current/admin'

{"status":"success","data":[{"cluster_id":"os-0","resource_manager":"openstack","num_units":11,"project":"admin","groupname":"","username":"admin","total_time_seconds":{"alloc_cpumemtime":1909501468.61924005,"alloc_cputime":699280.32298068,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":145683.40062100}}]}

The CPU columns are empty.

I see that this data is missing in the returned JSON:

curl -X 'GET' -H 'Accept: application/json;q=0.9,text/plain' -H 'Accept-Encoding: gzip' -H 'X-Grafana-User: admin' 'http://198.19.1.21:9020/api/v1/units/admin?cluster_id=os-0&from=1757056525&project=admin&running=1&to=1757056825&user=admin'

{"status":"success","data":[{"cluster_id":"os-0","resource_manager":"openstack","uuid":"26e1afa9-e553-4eb1-bc28-8651bc794194","name":"MSSQL","project":"admin","username":"admin","created_at":"2025-08-06T13:55:27+0000","started_at":"2025-09-03T13:54:44+0000","ended_at":"N/A","created_at_ts":1754488527000,"started_at_ts":1756907684000,"elapsed":"1-18:12:52","state":"ACTIVE","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":119835521.81138636,"alloc_cputime":58513.43838447,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":29256.71919224},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"RUNNING","reservation_id":"r-akif5a6i","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"2723224e-76c4-4882-a825-87a6c28a1228","name":"NVIDIA-License","project":"admin","username":"admin","created_at":"2025-09-04T07:49:12+0000","started_at":"2025-09-04T07:49:25+0000","ended_at":"N/A","created_at_ts":1756972152000,"started_at_ts":1756972165000,"elapsed":"1-00:18:11","state":"ACTIVE","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":119835521.81138636,"alloc_cputime":58513.43838447,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":29256.71919224},"tags":{"az":"W1-az","metadata":{},"power_state":"RUNNING","reservation_id":"r-kunh83n4","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"2dcd7203-555c-4b8a-8700-711a92af820a","name":"K8S-W02","project":"admin","username":"admin","created_at":"2025-08-06T14:09:13+0000","started_at":"2025-08-06T14:09:38+0000","ended_at":"N/A","created_at_ts":1754489353000,"started_at_ts":1754489378000,"elapsed":"29-17:57:58","state":"ACTIVE","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":119835521.81138636,"alloc_cputime":58513.43838447,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":29256.71919224},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True"},"power_state":"RUNNING","reservation_id":"r-49gy1n4c","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"43dde474-bbff-4d17-b4d5-a7f23eb48550","name":"K8S-W03","project":"admin","username":"admin","created_at":"2025-08-06T14:09:35+0000","started_at":"2025-08-06T14:10:03+0000","ended_at":"N/A","created_at_ts":1754489375000,"started_at_ts":1754489403000,"elapsed":"29-17:57:33","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-crhgosga","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"5f5b453a-9e9b-4b56-9c87-0006f5077713","name":"K8S-W01","project":"admin","username":"admin","created_at":"2025-08-06T14:08:54+0000","started_at":"2025-08-06T14:09:19+0000","ended_at":"N/A","created_at_ts":1754489334000,"started_at_ts":1754489359000,"elapsed":"29-17:58:17","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-974i0fn7","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"61e036d6-62f3-4888-8cb4-70c853d1cc38","name":"K8S-M01","project":"admin","username":"admin","created_at":"2025-08-06T14:07:53+0000","started_at":"2025-08-06T14:08:24+0000","ended_at":"N/A","created_at_ts":1754489273000,"started_at_ts":1754489304000,"elapsed":"29-17:59:12","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-x6i1f8on","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"bc53b3e9-89e0-49fb-941f-2e8f8bb9503b","name":"NVIDIA-LICENSE","project":"admin","username":"admin","created_at":"2025-09-04T07:08:26+0000","started_at":"2025-09-04T07:09:56+0000","ended_at":"N/A","created_at_ts":1756969706000,"started_at_ts":1756969796000,"elapsed":"1-00:57:40","state":"ACTIVE","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":119835521.81138636,"alloc_cputime":58513.43838447,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":29256.71919224},"tags":{"az":"W1-az","metadata":{},"power_state":"RUNNING","reservation_id":"r-8mtgiqn6","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"c4d410eb-eb8d-4a2b-a704-7c54316b8b3c","name":"K8S-M03","project":"admin","username":"admin","created_at":"2025-08-06T14:08:33+0000","started_at":"2025-08-06T14:08:58+0000","ended_at":"N/A","created_at_ts":1754489313000,"started_at_ts":1754489338000,"elapsed":"29-17:58:38","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-qufrxlwj","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"d9252ef3-81f4-4bd1-ac30-775364e4bf2f","name":"gpu-server","project":"admin","username":"admin","created_at":"2025-09-04T10:30:19+0000","started_at":"2025-09-04T10:30:38+0000","ended_at":"N/A","created_at_ts":1756981819000,"started_at_ts":1756981838000,"elapsed":"21:36:58","state":"ACTIVE","allocation":{"disk":100,"extra_specs":{"pci_passthrough:alias":"nvidia-a40:1"},"mem":49152,"name":"a40_flavor","swap":0,"vcpus":16},"total_time_seconds":{"alloc_cpumemtime":1438026261.73663640,"alloc_cputime":468107.50707573,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":29256.71919224},"tags":{"az":"W1-az","metadata":{},"power_state":"RUNNING","reservation_id":"r-igsgoal0","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"d962d6aa-70de-4839-95c8-7d9719285bd2","name":"K8S-M02","project":"admin","username":"admin","created_at":"2025-08-06T14:08:13+0000","started_at":"2025-08-06T14:08:41+0000","ended_at":"N/A","created_at_ts":1754489293000,"started_at_ts":1754489321000,"elapsed":"29-17:58:55","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-bbh2xyt3","tags":[]}},{"cluster_id":"os-0","resource_manager":"openstack","uuid":"dd04be1a-d64a-4ab0-8c03-0b0f7747abff","name":"POSTGRES","project":"admin","username":"admin","created_at":"2025-08-06T13:55:55+0000","started_at":"2025-08-06T13:56:23+0000","ended_at":"N/A","created_at_ts":1754488555000,"started_at_ts":1754488583000,"elapsed":"29-18:11:13","state":"SHUTOFF","allocation":{"disk":40,"extra_specs":{},"mem":4096,"name":"m1.medium","swap":0,"vcpus":2},"total_time_seconds":{"alloc_cpumemtime":0,"alloc_cputime":0,"alloc_gpumemtime":0,"alloc_gputime":0,"walltime":0},"tags":{"az":"W1-az","metadata":{"HA_Enabled":"True","instance_ha_enabled":"true"},"power_state":"SHUTDOWN","reservation_id":"r-ah3nfh62","tags":[]}}]}

1 reply

mahendrapaipuri Sep 5, 2025
Maintainer

Hello @vurmil

No worries.

Are you sure that you installed the Prometheus rule file? Could you post a screenshot of "Rule Health" from Prometheus UI? It can be accessed from <Prometheus URL>/rules.
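
The same rule health information is also available from the Prometheus HTTP API at /api/v1/rules. As a minimal sketch, the helper below lists the recording rules and their health from such a response. The sample payload is hypothetical and only mimics the shape of the API response; the function names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_rules(prometheus_url):
    """Fetch the rule groups from a Prometheus server (not called in this sketch)."""
    with urlopen(f"{prometheus_url}/api/v1/rules") as resp:
        return json.load(resp)


def recording_rule_health(payload):
    """Return (rule name, health, last error) for every recording rule."""
    results = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "recording":
                results.append(
                    (rule["name"], rule.get("health", "unknown"), rule.get("lastError", ""))
                )
    return results


# Hypothetical payload in the shape returned by /api/v1/rules.
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-group",
                "file": "/etc/prometheus/rules/example.rules",
                "rules": [
                    {
                        "name": "uuid:ceems_cpu_usage:ratio_irate",
                        "query": "...",
                        "health": "ok",
                        "lastError": "",
                        "type": "recording",
                    }
                ],
            }
        ]
    },
}

for name, health, error in recording_rule_health(sample):
    print(f"{name}: {health} {error}".rstrip())
```

An empty result, or a rule whose health is not "ok", would suggest that the rule file is not loaded or is failing to evaluate.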

Could you please post the config you are using for the CEEMS API server (after redacting any secrets)?

Cheers!

vurmil · 2025-09-05T09:41:33Z

vurmil
Sep 5, 2025
Author

ceems_api_server:
  data:
    path: /var/lib/ceems
    update_interval: 1m
    retention_period: 1y
    backup_path: /var/backups/ceems
    backup_interval: 1d
  web:
    url: http://198.19.1.21:9020
  admin:
    users:
      - admin
      - adm1
      - adm2
    grafana:
      url: http://198.19.1.21:3001
      teams_ids:
        - 1
      authorization:
        type: Bearer
        credentials: xxxxxx

clusters:
  - id: os-0
    manager: openstack
    updaters:
      - tsdb-0
    web:
      http_headers:
        X-OpenStack-Nova-API-Version:
          values:
            - latest
    extra_config:
      api_service_endpoints:
        compute: http://198.19.1.21:8774/v2.1
        identity: http://198.19.1.21:5000
      auth:
        identity:
          methods:
            - application_credential
          application_credential:
            id: xxxxxxxx
            secret: xxxxxx

updaters:
  - id: tsdb-0
    updater: tsdb
    web:
      url: http://198.19.1.21:9091
    extra_config:
      cutoff_duration: 5m
      queries:
        # Average CPU utilization
        avg_cpu_usage:
          global: |
            avg_over_time(avg by (uuid) (unit:ceems_compute_unit_cpu_usage:ratio_rate1m{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:])

        # Average CPU Memory utilization
        avg_cpu_mem_usage:
          global: |
            avg_over_time(avg by (uuid) (unit:ceems_compute_unit_memory_usage:ratio{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:])

        # Total CPU energy usage in kWh
        total_cpu_energy_usage_kwh:
          total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_cpu_energy_usage:sum{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 3.6e9

        # Total CPU emissions in gms
        total_cpu_emissions_gms:
          rte_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_cpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="rte"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3

          emaps_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_cpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="emaps"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3

          owid_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_cpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="owid"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3

        # Average GPU utilization
        avg_gpu_usage:
          global: |
            avg_over_time(avg by (uuid) (unit:ceems_compute_unit_gpu_usage:ratio{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:])

        # Average GPU memory utilization
        avg_gpu_mem_usage:
          global: |
            avg_over_time(avg by (uuid) (unit:ceems_compute_unit_gpu_memory_usage:ratio{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:])

        # Total GPU energy usage in kWh
        total_gpu_energy_usage_kwh:
          total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_gpu_energy_usage:sum{uuid=~"{{.UUIDs}}"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 3.6e9

        # Total GPU emissions in gms
        total_gpu_emissions_gms:
          rte_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_gpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="rte"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3

          emaps_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_gpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="emaps"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3

          owid_total: |
            sum_over_time(sum by (uuid) (unit:ceems_compute_unit_gpu_emissions:sum{uuid=~"{{.UUIDs}}",provider="owid"} > 0 < inf)[{{.Range}}:{{.ScrapeInterval}}]) * {{.ScrapeIntervalMilli}} / 1e3
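For readers wondering about the `* {{.ScrapeIntervalMilli}} / 3.6e9` factor in the energy queries above, here is a small Python sketch of the arithmetic. It assumes the recording rule yields instantaneous power in watts and that each sample holds for one scrape interval. That assumption is consistent with the divisor, but it is not confirmed anywhere in this thread. The function name `energy_kwh` is hypothetical.

```python
def energy_kwh(power_samples_watts: list[float], scrape_interval_ms: float) -> float:
    """Approximate energy in kWh from evenly spaced power samples.

    Mirrors sum_over_time(...) * ScrapeIntervalMilli / 3.6e9 from the config,
    under the assumption that the samples are power readings in watts.
    """
    # watts * milliseconds = millijoules
    millijoules = sum(power_samples_watts) * scrape_interval_ms
    # 1 kWh = 3.6e6 J = 3.6e9 mJ
    return millijoules / 3.6e9


# One hour of a constant 100 W draw, sampled every 30 s (120 samples).
samples = [100.0] * 120
print(energy_kwh(samples, 30_000))
```

The emissions queries follow the same pattern with `/ 1e3`, which converts the millisecond interval to seconds.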

9 replies

vurmil Sep 5, 2025
Author

Hello again,

Thanks for getting back to me. I appreciate your help with this.

Unfortunately, I don't have a free node with a GPU in vGPU mode right now. Another team is currently working on it, and I can't reconfigure it.

I did, however, check a node where the GPU is configured for passthrough. It seems that the OS can't access the card directly, as shown by the nvidia-smi command output. This is expected, as the entire PCI slot is passed through to the VM. The drivers are still installed on the host, but the OS just can't communicate with the card anymore.

Here's the output I got:

root@node02:/home/ochk# lspci |grep A40
82:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)

root@node02:/home/ochk# nvidia-smi --query --xml-format
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
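The combination above (device visible in `lspci`, `nvidia-smi` failing) can be captured as a rough heuristic. The Python sketch below is only an illustration of that reasoning, not how CEEMS detects GPUs. The function names are hypothetical, and a host with a missing or broken driver shows the same symptom, so the result is a hint rather than proof of passthrough.

```python
import re

# Matches lines such as:
#   82:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
LSPCI_LINE = re.compile(r"^(?P<addr>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<desc>.*)$")


def nvidia_devices(lspci_output: str) -> list[dict[str, str]]:
    """Return the NVIDIA devices listed in plain `lspci` output."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and "NVIDIA" in match.group("desc"):
            devices.append(match.groupdict())
    return devices


def passthrough_suspected(lspci_output: str, nvidia_smi_ok: bool) -> bool:
    """True when an NVIDIA device is on the PCI bus but nvidia-smi cannot reach it."""
    return bool(nvidia_devices(lspci_output)) and not nvidia_smi_ok


lspci_text = "82:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)"
print(nvidia_devices(lspci_text))
print(passthrough_suspected(lspci_text, nvidia_smi_ok=False))
```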

I hope this helps you with your work. Let me know if you have any other questions or need me to test something else when a node becomes available.

Cheers,
Vurmil

mahendrapaipuri Sep 5, 2025
Maintainer

Cheers @vurmil for the quick turnaround.

Fair enough. That is good enough! So, I understand there is no guarantee of getting GPU device details in OpenStack environments. I will put up a PR with the fix, and you will be able to get a patched binary from the CI artifacts. I will let you know once it is ready.

mahendrapaipuri Sep 5, 2025
Maintainer

Hello again @vurmil

You can get the new patched binary from CI artifacts. Please let me know if that fixes your issue.

vurmil Sep 5, 2025
Author

@mahendrapaipuri
I confirm it works. I'll continue working on it next week. Have a nice weekend.

mahendrapaipuri Sep 6, 2025
Maintainer

@vurmil A new patch release v0.11.1 has been made with the fix!
