
Add 'switch' label containing hostname #36

Open

metfan1981 wants to merge 2 commits into treydock:main from metfan1981:main

Conversation

@metfan1981

Adds a switch label containing the device's hostname (or whatever ibnetdiscover reports) to every switch metric, improving readability.

The label stays present even if --ibnetdiscover.node-name-map is not provided.
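For example, a switch counter that previously carried only guid and port would then look something like this (GUID, hostname, and value are illustrative):

infiniband_switch_port_transmit_data_bytes_total{guid="0xb8001db1da", port="1", switch="ib-sw01"} 1.234e+12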

@treydock
Owner

This is already possible with record rules:

expr: irate(infiniband_switch_port_transmit_data_bytes_total[5m]) * on(guid,port) group_left(switch, host, uplink, uplink_port) infiniband_switch_uplink_info

Most of the metrics are counters, so you can use record rules to define the rate used and pull in the switch name and other info.
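Wrapped in a rule file, that might look like the following sketch (group name, record name, and interval are illustrative):

groups:
  - name: infiniband-switch
    interval: 60s
    rules:
      - record: infiniband:switch_port_transmit_data_bytes:irate5m
        expr: >
          irate(infiniband_switch_port_transmit_data_bytes_total[5m])
          * on(guid, port) group_left(switch, host, uplink, uplink_port)
          infiniband_switch_uplink_info

Dashboards and alerts then query the recorded series directly instead of repeating the join.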

@metfan1981
Author

This is not ideal for fabrics with more than one instance running the exporter (e.g. the SM nodes):

Error executing query: found duplicate series for the match group {guid="0xb8001db1da", port="1"} on the right hand-side of the operation: [{__name__="infiniband_switch_uplink_info", datacenter="dc", guid="0xb8001db1da", instance="opensm1:9315", job="infiniband", port="1"}, {__name__="infiniband_switch_uplink_info", datacenter="dc", guid="0xb8001db1da", instance="opensm2:9315", job="infiniband", port="1"}]; many-to-many matching not allowed: matching labels must be unique on one side

Also, running the join for a sufficiently large fabric creates unnecessary resource strain on Prometheus, especially with this number of metrics and a record rule for each.

@treydock
Owner

We have UFM HA, but only the primary server runs this exporter; the secondary doesn't get scraped, and Prometheus scrapes follow the VIP that UFM uses to identify the primary. That's obviously not possible, or at least not as easy, with plain OpenSM, but it might be worth trying, since you end up with duplicate metrics if multiple targets are scraped with essentially the same data.
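A minimal scrape-config sketch of that setup (the VIP hostname is illustrative; the port matches the instances in the error above):

scrape_configs:
  - job_name: infiniband
    static_configs:
      - targets: ['ufm-vip.example.com:9315']  # VIP always points at the current primary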

If you are creating dashboards or alerts where you end up computing rates over and over again, that may create more strain on Prometheus than storing the record rules and querying just the record rule. That's generally why record rules exist: to reduce the load on Prometheus during queries.

For the record rules you'd have to add instance to on() since you have multiple instances doing the scrapes.
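For the example above that would be something like the following sketch, assuming both exporters expose identical guid/port series:

expr: irate(infiniband_switch_port_transmit_data_bytes_total[5m]) * on(guid, port, instance) group_left(switch, host, uplink, uplink_port) infiniband_switch_uplink_info

With instance in on(), each side of the join is unique per scraped exporter, so the duplicate-series error goes away.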

@treydock
Owner

Also, not sure about your scale, but we have 81 switches with around 48 ports per switch, and we've had few issues with the record rules. We have multiple kinds of rates for each metric, and even record rules that consume other record rules. Prometheus shows our record rules take a little under 2 seconds to generate on each interval, which is 60 seconds.
