
Add 'switch' label containing hostname #36

Open

metfan1981 wants to merge 2 commits into treydock:main from metfan1981:main

Conversation

@metfan1981

Adds a switch label containing the device's hostname (or whatever ibnetdiscover reports) to every switch metric, improving readability.

The label stays present even if --ibnetdiscover.node-name-map is not provided.
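For example, a switch counter that previously carried only guid and port would then look something like this (GUID, hostname, and value are illustrative):

infiniband_switch_port_transmit_data_bytes_total{guid="0xb8001db1da", port="1", switch="ib-sw01"} 1.234e+12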

@treydock
Owner

This is already possible with record rules:

expr: irate(infiniband_switch_port_transmit_data_bytes_total[5m]) * on(guid,port) group_left(switch, host, uplink, uplink_port) infiniband_switch_uplink_info

Most of the metrics are counters, so you can use record rules to define the rate used and pull in the switch name and other info.
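Wrapped in a rule file, that might look like the following sketch (group name, record name, and interval are illustrative):

groups:
  - name: infiniband-switch
    interval: 60s
    rules:
      - record: infiniband:switch_port_transmit_data_bytes:irate5m
        expr: >
          irate(infiniband_switch_port_transmit_data_bytes_total[5m])
          * on(guid, port) group_left(switch, host, uplink, uplink_port)
          infiniband_switch_uplink_info

Dashboards and alerts then query the recorded series directly instead of repeating the join.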

@metfan1981
Author

This is not ideal for fabrics with more than one instance running the exporter (e.g. the SM nodes):

Error executing query: found duplicate series for the match group {guid="0xb8001db1da", port="1"} on the right hand-side of the operation: [{__name__="infiniband_switch_uplink_info", datacenter="dc", guid="0xb8001db1da", instance="opensm1:9315", job="infiniband", port="1"}, {__name__="infiniband_switch_uplink_info", datacenter="dc", guid="0xb8001db1da", instance="opensm2:9315", job="infiniband", port="1"}]; many-to-many matching not allowed: matching labels must be unique on one side

Also, running the join for a sufficiently large fabric creates unnecessary resource strain on Prometheus, especially with this number of metrics and a record rule for each.

@treydock
Owner

We have UFM HA, but only the primary server runs this exporter; the secondary doesn't get scraped, and Prometheus scrapes follow the VIP that UFM uses to identify the primary. That's obviously not possible, or at least not as easy, with plain OpenSM, but it might be worth trying, since you end up with duplicate metrics if multiple targets are scraped with essentially the same data.
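A minimal scrape-config sketch of that setup (the VIP hostname is illustrative; the port matches the instances in the error above):

scrape_configs:
  - job_name: infiniband
    static_configs:
      - targets: ['ufm-vip.example.com:9315']  # VIP always points at the current primary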

If you are creating dashboards or alerts where you end up computing rates over and over again, that may create more strain on Prometheus than storing the record rules and querying just the record rule. That's generally why record rules exist: to reduce the load on Prometheus during queries.

For the record rules you'd have to add instance to on() since you have multiple instances doing the scrapes.
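For the example above that would be something like the following sketch, assuming both exporters expose identical guid/port series:

expr: irate(infiniband_switch_port_transmit_data_bytes_total[5m]) * on(guid, port, instance) group_left(switch, host, uplink, uplink_port) infiniband_switch_uplink_info

With instance in on(), each side of the join is unique per scraped exporter, so the duplicate-series error goes away.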

@treydock
Owner

Also, not sure about your scale, but we have 81 switches with around 48 ports per switch, and we've had few issues with the record rules. We have multiple kinds of rates for each metric, and even record rules that consume other record rules. Prometheus shows our record rules take a little under 2 seconds to generate on each interval, which is 60 seconds.
