feat: Add Grafonnet-based OpenStack Libvirt Dashboard #109

Open
aravindh-murugesan wants to merge 1 commit into inovex:master from aravindh-murugesan:add-grafonnet-dashboard

Conversation

@aravindh-murugesan

Description

This PR adds a new Grafana dashboard generated using Grafonnet.

Closes #17

@aravindh-murugesan
Author

@frittentheke Any feedback on this PR? I'd be happy to revise the patch if it needs any tweaks.

@frittentheke
Collaborator

Thanks @aravindh-murugesan for taking the time and effort to make this dashboard and to also convert it to jsonnet.
This hopefully makes things easier down the road.

I just took this board for a quick spin -- looks good so far. @Knalltuete5000 will test this in a proper environment with lots of projects to see if everything works. We'll merge it once no issues show up!

@Knalltuete5000
Collaborator

Thanks for the PR.
I have had time to render and import the dashboard in our environment. At first sight it looks good, and the dashboard provides a great drill-down into the individual VMs per project, except that no data is shown.
I think I have already found the reason:

The variables for the project name, the VM name and the VM ID are extracted correctly, but please keep in mind that these variables are OpenStack-specific. That is why the rendered query, e.g. libvirt_domain_info_state{domain="$vm_id"} for the up state (from the rendered dashboard), does not work: it mixes up the OpenStack VM ID and the libvirt domain ID.
The correct query should look like this: libvirt_domain_info_state * on(domain, instance) group_left(instance_name) libvirt_domain_openstack_info{instance_id="$vm_id"}
All of the queries must be adjusted in this way, since none of them shows any data. Can you fix this?
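
For illustration only, the "adjust all of the queries" step could be sketched as a small Python helper that appends the join from the comment above to any metric name. The helper name `join_openstack` and the constant `OPENSTACK_JOIN` are hypothetical and not part of the PR; the sketch also assumes that libvirt_domain_openstack_info is an info-style metric with the value 1, so the multiplication leaves the left-hand value unchanged.

```python
# Hypothetical helper, not part of the PR. The join clause is copied
# from the review comment; everything else is an illustration.
OPENSTACK_JOIN = (
    ' * on(domain, instance) group_left(instance_name) '
    'libvirt_domain_openstack_info{instance_id="$vm_id"}'
)


def join_openstack(metric: str) -> str:
    """Return a PromQL expression that selects `metric` by OpenStack instance ID."""
    return f"{metric}{OPENSTACK_JOIN}"


print(join_openstack("libvirt_domain_info_state"))
```

The same helper could wrap any other per-domain metric, which keeps the join in one place instead of repeating it in every panel.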

@Knalltuete5000 left a comment
Collaborator

There is a mix-up between the OpenStack VM ID and the libvirt domain ID. These IDs do not match each other, but they can be matched via the libvirt_domain_openstack_info metric.

@Knalltuete5000
Collaborator

Knalltuete5000 commented Feb 25, 2026

Can you also adjust queries such as the one for CPU usage %, ((rate(libvirt_domain_info_cpu_time_seconds_total{domain="$vm_id"}[5m])/libvirt_domain_vcpu_current{domain="$vm_id"}) * 100), so that they use the $__rate_interval provided by Grafana instead of a hardcoded interval?
Or is there an advantage to a hardcoded interval of 5 minutes, or a disadvantage to using $__rate_interval?

Other queries also contain hardcoded intervals, such as 2 minutes.
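
As a rough sketch of the requested change, the snippet below rewrites literal range selectors such as [5m] or [2m] in a query string to [$__rate_interval]. The function name `use_rate_interval` is hypothetical, and the regular expression is a simplification that only covers single-unit durations; it is not how the PR itself is implemented.

```python
import re

# Matches a literal single-unit range selector such as [5m] or [30s].
# Simplification: compound durations like [1h30m] are not handled.
FIXED_RANGE = re.compile(r"\[\d+(?:ms|[smhdwy])\]")


def use_rate_interval(expr: str) -> str:
    """Replace hardcoded range selectors with Grafana's $__rate_interval."""
    return FIXED_RANGE.sub("[$__rate_interval]", expr)


cpu_usage = (
    '((rate(libvirt_domain_info_cpu_time_seconds_total{domain="$vm_id"}[5m])'
    '/libvirt_domain_vcpu_current{domain="$vm_id"}) * 100)'
)
print(use_rate_interval(cpu_usage))
```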

@Knalltuete5000
Collaborator

I have figured out why this is the case: Nova has a feature that lets it create VMs so that the libvirt instance and the Nova server have the same ID. But this needs to be enabled, and in some OpenStack environments it is not. In my test environment the IDs differ, but it should be safe to always use group_left to fix this issue, so that the dashboard works with both settings.

@aravindh-murugesan
Author

Understood. In my environment, Nova servers and libvirt domains have the same ID, so I assumed that would always be the case. I can try to improve this over the weekend. I will also read up on $__rate_interval and how it calculates the rate interval from the time range the user selects, and make that change as well.

Thanks for the valuable feedback.
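
For reference, a minimal sketch of how $__rate_interval is derived, to the best of my understanding of Grafana's Prometheus data source documentation: it is the larger of $__interval plus the scrape interval and four times the scrape interval. The function name and the 15-second default scrape interval are assumptions for illustration; the real value depends on the data source configuration.

```python
def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Approximate Grafana's $__rate_interval in seconds.

    Assumed formula: max($__interval + scrape interval, 4 * scrape interval).
    The 15 s default scrape interval is an assumption, not a measured value.
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: the window never drops below four scrapes.
print(rate_interval(5))
# Zoomed out: the window grows with the panel's step.
print(rate_interval(120))
```

Under this formula the window always spans several scrapes, which is why a fixed [5m] is not needed to keep rate() from returning empty results on short time ranges.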

@aravindh-murugesan
Author

@Knalltuete5000 @frittentheke I have a different suggestion. Instead of modifying all the queries to include joins, can we add another dependent variable that identifies the domain ID, and use it as a filter in all our queries?

So we will have 4 variables:

  1. project_name
  2. vm_name
  3. vm_id
  4. New Variable: dom_id - label_values(libvirt_domain_openstack_info{instance_name="$vm_name", project_name="$project_name", instance_id="$vm_id"},domain)

We would then modify the queries to use it, for example: (sum by (domain) (rate(libvirt_domain_vcpu_delay_seconds_total{domain="$dom_id"}[5m]))/sum by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{domain="$dom_id"}[5m]))) * 100

I think this would be an efficient solution. Let me know what you think.
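
A minimal sketch of the proposal, written as plain Python data rather than Grafonnet: the fourth variable resolves the libvirt domain from the three OpenStack-facing selections, and panel queries then filter on it. The `hide` and `refresh` values reflect my reading of Grafana's dashboard JSON model and are assumptions, as are the names `dom_id_variable`, `by_domain` and `delay_pct`; the label_values query and the metric names come from the comment above.

```python
import json

# Assumed shape of a Grafana query variable in the dashboard JSON model.
dom_id_variable = {
    "name": "dom_id",
    "type": "query",
    "hide": 2,     # assumption: 2 hides the variable from the dashboard UI
    "refresh": 2,  # assumption: 2 re-evaluates on time range change
    "query": (
        'label_values(libvirt_domain_openstack_info{instance_name="$vm_name", '
        'project_name="$project_name", instance_id="$vm_id"},domain)'
    ),
}


def by_domain(metric: str, window: str = "$__rate_interval") -> str:
    """Build a per-domain rate expression filtered by the helper variable."""
    return f'sum by (domain) (rate({metric}{{domain="$dom_id"}}[{window}]))'


delay_pct = (
    f'({by_domain("libvirt_domain_vcpu_delay_seconds_total")}'
    f'/{by_domain("libvirt_domain_vcpu_time_seconds_total")}) * 100'
)

print(json.dumps(dom_id_variable, indent=2))
print(delay_pct)
```

Because the variable depends on $vm_name, $project_name and $vm_id, it should follow whatever the user selects, so the panel queries need no join at all.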

@Knalltuete5000
Collaborator

Sure, if this is a cleaner and simpler way to handle the differing IDs.

Do I also understand correctly that dom_id is just a helper variable that updates when a new VM name, project or VM ID is selected, and that it is not visible to the user in the dashboard? That would be great.

Please let me know if I should test any of the updates with the differing IDs.

@frittentheke
Collaborator

Subsequently we modify the query to use this - for example (sum by (domain) (rate(libvirt_domain_vcpu_delay_seconds_total{domain="$dom_id"}[5m]))/sum by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{domain="$dom_id"}[5m]))) * 100

you meant to use $__rate_interval :-P

(sum by (domain) (rate(libvirt_domain_vcpu_delay_seconds_total{domain="$dom_id"}[$__rate_interval]))/sum by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{domain="$dom_id"}[$__rate_interval]))) * 100

You can also set the unit to Percent (0.0-1.0) in the Grafana graph to avoid the need for multiplication in the query.
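
To make the unit suggestion concrete, here is a hedged sketch of a panel fragment as a Python dictionary. It assumes that Grafana's unit ID for "Percent (0.0-1.0)" is `percentunit` and that units live under `fieldConfig.defaults.unit`; the names `ratio_expr` and `panel_fragment` are hypothetical.

```python
# The same ratio as in the query above, without the trailing "* 100".
ratio_expr = (
    'sum by (domain) (rate(libvirt_domain_vcpu_delay_seconds_total'
    '{domain="$dom_id"}[$__rate_interval]))'
    '/sum by (domain) (rate(libvirt_domain_vcpu_time_seconds_total'
    '{domain="$dom_id"}[$__rate_interval]))'
)

# Assumed panel layout: Grafana renders a value of 0.25 as 25% when the
# unit is "percentunit", so the query can return the raw ratio.
panel_fragment = {
    "fieldConfig": {"defaults": {"unit": "percentunit"}},
    "targets": [{"expr": ratio_expr}],
}

print(panel_fragment["fieldConfig"]["defaults"]["unit"])
```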

@aravindh-murugesan
Author

aravindh-murugesan commented Mar 3, 2026

Yes, it would be a hidden variable and won't show up as a drop-down for the user.

Yes, I switched to $__rate_interval after I made that comment :)

I will update the PR in a few minutes. I will let @Knalltuete5000 confirm whether this works for him when the OpenStack ID and the domain ID are different.

And thanks for the tip about Percent (0.0-1.0); I did not know that. We use this same query in our billing system to query metrics, so I left it unchanged for now.

aravindh-murugesan force-pushed the add-grafonnet-dashboard branch from cb4b760 to 6dc4825 on March 3, 2026 10:31

@aravindh-murugesan
Author

@Knalltuete5000 Can you test the new changes please?

@Knalltuete5000
Collaborator

I have just pasted the rendered dashboard into our environment, and at first sight it looks pretty good.

Some metrics currently do not show up, but I think part of the problem is that we still run an older version of the libvirt exporter.
Do you mind adding a warning for when the libvirt exporter version is too old for the board, either in the board itself or as a README in the dashboards folder or in the specific dashboards/openstack-libvirt-dashboard folder where the dashboard is currently located?

And can you add some short instructions on how to render the dashboards (requirements, etc.)? That would be great.

In the meantime I will test some scenarios with the dashboard over the next few days and will report back if additional changes are required (but I do not think so 😄).

Thanks again for this great PR.

@aravindh-murugesan
Author

@Knalltuete5000 Sure thing. I will add instructions to render and add the version this dashboard is tested against. Just curious, what panels are not working for you?

@Knalltuete5000
Collaborator

After an update of the libvirt exporter, some of the panels now work. The version was just too old.

But the whole storage section fails with a query error. One of the queries is

rate(sum by (domain, target_device) ({__name__=~"libvirt_domain_block_stats_(read|write)_requests_total", domain="$dom_id"} > 0)[$__rate_interval])

which results in the following error:

bad_data: invalid parameter "query": 4:4: parse error: ranges only allowed for vector selectors

I think the query should look like this:

sum by (domain, target_device) (rate({__name__=~"libvirt_domain_block_stats_read_requests_total", domain="$dom_id"}[$__rate_interval]))

Also, as some resources point out, e.g. https://www.robustperception.io/rate-then-sum-never-sum-then-rate/, it is better to take the sum of a rate than the rate of a sum.
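
A toy simulation may help show why the order matters. The sketch below uses a simplified `increase` function that treats any drop in a counter as a reset to zero, which is the idea behind rate() but without Prometheus' extrapolation. The sample values and the names `disk_a` and `disk_b` are made up for illustration.

```python
def increase(samples):
    """Total increase of a counter series, treating any drop as a reset to zero.

    Simplified stand-in for what rate()/increase() do; no extrapolation.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


disk_a = [100, 110, 120, 130]  # steady counter, +10 per sample
disk_b = [500, 510, 5, 15]     # counter resets after the second sample

# Rate then sum: each series handles its own reset correctly.
sum_of_increases = increase(disk_a) + increase(disk_b)

# Sum then rate: the reset of disk_b looks like a reset of the whole sum,
# so the remaining value of disk_a is counted again as new requests.
summed = [a + b for a, b in zip(disk_a, disk_b)]
increase_of_sum = increase(summed)

print(sum_of_increases, increase_of_sum)  # 55.0 165.0
```

In this made-up data the real number of new requests is 55, while summing first reports 165, so the per-series rate has to come before the aggregation.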

@aravindh-murugesan
Author

I will take a look.

This query works for me, but I'm not sure whether the difference comes down to VictoriaMetrics (which is what I use) versus Prometheus (which I assume you are using?).
[image]

Thanks for the heads-up. Will fix it.
