prometheus-node-exporter-ucode: new modules and extended info #28016

Open
systemcrash wants to merge 2 commits into openwrt:master from systemcrash:prom

@systemcrash
Contributor

-thermal management (temp and cooling devices)
-uname amendment to oneline and help string
-realtek-poe
-odhcp6c
-entropy help strings
-time help string
-hwmon
-time clocksources
-watchdog
-metrics help strings

📦 Package Details

Maintainer: @dhewg (?) @feckert


🧪 Run Testing Details

  • OpenWrt Version: 24.10.4 / 25 snapshot

✅ Formalities

  • I have reviewed the CONTRIBUTING.md file for detailed contributing guidelines.

@dhewg
Contributor

dhewg commented Dec 5, 2025 via email

@systemcrash
Contributor Author

systemcrash commented Dec 5, 2025

Wow, micro-optimisations! Yeah, I ran "reset ; prometheus-node-exporter-ucode thermal" a lot. Very helpful aspect :)

>   • creating metrics of the same name should always happen once in the main scope, and not in a loop (time, watchdog)

So we get ordered results like this? (IIRC this accelerates ingestion?):

# HELP node_cooling_device_max_state Maximum throttle state of the cooling device
# TYPE node_cooling_device_max_state gauge
node_cooling_device_max_state{name="0",type="Processor"} 0
node_cooling_device_max_state{name="1",type="Processor"} 0
node_cooling_device_max_state{name="10",type="Processor"} 0
...
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="10",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="10",mode="user"} 0
...

> Using many on slow devices in a tight polling interval really adds up:

Maniacs crushing their routers with 5 seconds of polling :D

> The watchdog collector needs to be added to the Makefile.

It's in base, so it's installed automatically. Should it be an extra?

@GeorgeSapkin this commit length problem is not good. See how many characters one has to work with on this module?? :/

@GeorgeSapkin
Member

GeorgeSapkin commented Dec 5, 2025

@systemcrash like I said before, I'm not married to the exact number. That one was the only thing in the official guidelines, so that's what I went with. If more people feel differently we can change the limit to something else, which I do think is important to have. We don't need to look far for whole commit messages being one long line.

Edit: this also comes back to blocking build if formal fails. Which I'm still on the fence about.

@systemcrash
Contributor Author

> @systemcrash like I said before, I'm not married to the exact number. That one was the only thing in the official guidelines, so that's what I went with. If more people feel differently we can change the limit to something else, which I do think is important to have. We don't need to look far for whole commit messages being one long line.
>
> Edit: this also comes back to blocking build if formal fails. Which I'm still on the fence about.

I'm OK with a limit at the established 72 characters, but I don't see why that should be a hard limit. A warning, perhaps. A hard limit at something like 120-150 would mostly flag messages that use no newlines anywhere.
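
The two-tier idea in this comment could look roughly like the sketch below: a soft limit that only warns, and a much higher hard limit that fails the check. The 72-character figure comes from the comment; the 120-character hard limit is simply one value picked from the suggested 120-150 range, and `check_commit_lines` is a hypothetical name, not part of any existing formality checker.

```python
SOFT_LIMIT = 72   # established limit mentioned in the comment: warn only
HARD_LIMIT = 120  # assumption: lower end of the suggested 120-150 range


def check_commit_lines(message: str) -> list[tuple[str, int, int]]:
    """Return (severity, line_number, length) for every over-long line."""
    findings = []
    for number, line in enumerate(message.splitlines(), start=1):
        length = len(line)
        if length > HARD_LIMIT:
            # Very long lines usually mean the message was never wrapped.
            findings.append(("error", number, length))
        elif length > SOFT_LIMIT:
            findings.append(("warning", number, length))
    return findings
```

A caller could fail the build only when a finding has severity "error" and print the warnings as advice.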

@systemcrash
Contributor Author

@dhewg if you could take another look? I optimised netdev and netclass for ordered output.

@systemcrash
Contributor Author

pinging @dhewg - think I've addressed everything here.

@dhewg
Contributor

dhewg commented Dec 9, 2025

> >   • creating metrics of the same name should always happen once in the main scope, and not in a loop (time, watchdog)
>
> So we get ordered results like this? (IIRC this accelerates ingestion?):

Sorry, I just meant:

const m = metric(...);
for (...) {
    m();
}

instead of:

for (...) {
    const m = metric(...);
    m();
}
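
A minimal Python sketch of the two forms (an illustration, not the exporter's ucode). It assumes a hypothetical `metric()` factory that writes the HELP and TYPE header when it is created and returns a function that writes one sample; the real `metric()` helper in prometheus-node-exporter-ucode may behave differently.

```python
NAME = "node_watchdog_timeout_seconds"
HELP = "Value of /sys/class/watchdog/<watchdog>/timeout"


def metric(out, name, mtype, help_text):
    """Hypothetical factory: emit the header once, return a sample writer."""
    out.append(f"# HELP {name} {help_text}")
    out.append(f"# TYPE {name} {mtype}")

    def sample(labels, value):
        rendered = ",".join(f'{k}="{v}"' for k, v in labels.items())
        out.append(f"{name}{{{rendered}}} {value}")

    return sample


def hoisted(devices):
    """Create the metric once in the outer scope, then call it per device."""
    out = []
    m = metric(out, NAME, "gauge", HELP)
    for name, timeout in devices:
        m({"name": name}, timeout)
    return out


def in_loop(devices):
    """Create the metric inside the loop: the header is repeated per device."""
    out = []
    for name, timeout in devices:
        m = metric(out, NAME, "gauge", HELP)
        m({"name": name}, timeout)
    return out
```

Under that assumption, `hoisted()` produces one header followed by all samples, while `in_loop()` repeats the header before every sample and redoes the setup work on each iteration.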

I appreciate the optimizations, but is the ordered output really worth it? Like in terms of time spent by the prometheus daemon to parse the output? It makes the collectors less readable, especially netdev.

> > The watchdog collector needs to be added to the Makefile.
>
> It's in base, so it's installed automatically. Should it be an extra?

Let's make it an extra. This is what I get on one of my devices:
node_watchdog_info{name="watchdog0",options="null",identity="null",state="null",status="null",pretimeout_governor="null"} 1
So it'll just be baggage there.

On another device I get:
node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor="null"} 1
It's been a while, but those null values don't look right; they should probably be checked for?

For hwmon, is this the expected chip?
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1

That's from:

ls -l /sys/class/hwmon/hwmon0/
lrwxrwxrwx    1 root     root             0 Dec  9 09:02 device -> ../../thermal_zone0
-r--r--r--    1 root     root          4096 Dec  9 09:02 name
drwxr-xr-x    2 root     root             0 Dec  9 09:07 power
lrwxrwxrwx    1 root     root             0 Dec  9 09:07 subsystem -> ../../../../../class/hwmon
-r--r--r--    1 root     root          4096 Dec  9 09:02 temp1_input
-rw-r--r--    1 root     root          4096 Dec  9 09:07 uevent

@systemcrash
Contributor Author

systemcrash commented Dec 9, 2025

> For hwmon, is this the expected chip?
> node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1

Yes, that looks correct. It pulls the chip_name from:

ls -l /sys/class/hwmon/hwmon0/
...
-r--r--r--    1 root     root          4096 Dec  9 09:02 name

Can you check that file content?

Here's what I get from node-exporter v1.9 on an i3:

# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="0000:00:1c_4_0000:03:00_0",chip_name="i350bb"} 1
node_hwmon_chip_names{chip="platform_coretemp_0",chip_name="coretemp"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="acpitz"} 1

vs:

ls -l /sys/class/hwmon/hwmon0/
total 0
lrwxrwxrwx 1 root root    0 Dec  9 13:17 device -> ../../thermal_zone0
-r--r--r-- 1 root root 4096 Dec  9 13:17 name
drwxr-xr-x 2 root root    0 Dec  9 13:17 power
lrwxrwxrwx 1 root root    0 Dec  9 13:17 subsystem -> ../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Dec  9 13:17 temp1_input
-r--r--r-- 1 root root 4096 Dec  9 13:17 temp2_input
-rw-r--r-- 1 root root 4096 Dec  9 13:17 uevent
# cat /sys/class/hwmon/hwmon0/name
acpitz

> is the ordered output really worth it? Like in terms of time spent by the prometheus daemon to parse the output? It makes the collectors less readable, especially netdev.

It makes the output less jumbled and more readable: each metric's HELP and TYPE lines appear once, directly above all of that metric's samples, instead of appearing only before the first sample while later samples of the same metric are scattered further down. It also contributes to lower data-cardinality at ingestion, which IIRC helps coalesce writes.

I think it makes the code more readable - since the loops are distinct and not nested.
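
As a rough illustration of the two orderings being discussed, the Python sketch below (not the collector's ucode) emits the same samples first per device, then per metric. The function names, device names, counter names and values are made up for the example.

```python
def per_device(stats):
    """One nested loop: samples of different metrics interleave per device."""
    lines = []
    for device, counters in stats.items():
        for metric, value in counters.items():
            lines.append(f'node_network_{metric}{{device="{device}"}} {value}')
    return lines


def per_metric(stats):
    """Metric-major order: all samples of one metric are emitted together."""
    # First pass: collect metric names in first-seen order.
    metrics = []
    for counters in stats.values():
        for metric in counters:
            if metric not in metrics:
                metrics.append(metric)
    # Second pass: for each metric, walk every device.
    lines = []
    for metric in metrics:
        for device, counters in stats.items():
            if metric in counters:
                value = counters[metric]
                lines.append(f'node_network_{metric}{{device="{device}"}} {value}')
    return lines
```

Both functions return the same set of samples; only the order differs, at the cost of an extra pass over the data in the per-metric version.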

@systemcrash
Contributor Author

OK. It's an extra now. Much appreciated if you could take hwmon for another spin @dhewg

let line = oneline(devroot + m);
metrics[m]({ device }, line);
for (let m in metrics) {
for (let i = 0; i < dev_length; i++) {
Member

Does ucode not support for..of?

Contributor Author

No. But were we to do it that way, it creates a new object on every iteration, whereas the classic loop does not. Aiming for speed, here :)

release: oneline("/proc/sys/kernel/osrelease"),
version: oneline("/proc/sys/kernel/version"),
machine: poneline("uname -m"), // TODO lame
machine: oneline("/proc/sys/kernel/arch"),
Member

Still tabs here.

Contributor Author

I prefer not to modify how it was already there. I know you want consistency...

Contributor Author

(they all align whatever the width anyway)

identity: iden,
state: stat,
status: stus,
pretimeout_governor: prtg,
Member

Tabs here as well.

Contributor Author

(they all align whatever the width anyway)

@systemcrash
Contributor Author

ping @dhewg for final test run

@dhewg
Contributor

dhewg commented Dec 19, 2025

Sorry for the delay, pre-xmas stress...

Well, I agree to not modify the whitespaces!
But then please revert the whitespace changes in netclass/netdev, because that is now inconsistent.
I'm not married to whatever I used initially, but let's not bikeshed this please.

> > is the ordered output really worth it? Like in terms of time spent by the prometheus daemon to parse the output?
>
> I think it makes the code more readable - since the loops are distinct and not nested.

Sorry, still disagreeing. While it's not rocket science, it's just more code to read. In netdev we now have 4 loops instead of one. You even added comments to make it clear what it does. It wasn't unclear before at all.
It's similar for most of the new files.

So again: Is this reduced readability really worth it?
Don't get me wrong, it may as well be! I just don't know and don't have the time to dig in atm.
But on the other side I can't see a prometheus daemon really sucking that much if not every collector on earth is optimized in this specific way?

As for hwmon/watchdog, the null values are gone, they're empty string now.
Remind me, is that what prometheus wants? Or should we just not output empty labels?

@systemcrash
Contributor Author

systemcrash commented Dec 19, 2025

> Don't get me wrong, it may as well be! I just don't know and don't have the time to dig in atm.

(my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.

> As for hwmon/watchdog, the null values are gone, they're empty string now.
> Remind me, is that what prometheus wants? Or should we just not output empty labels?

I'll take another look at that. Empty isn't bad per-se, but we might as well avoid the output altogether.

Could you paste an example of the output? Edit: never mind. Found it :)

-netdev
-netclass

Fix metric output to be ordered per-metric (and not per-device) as is
done by node_exporter. In testing, netdev and netclass scrape time
increased by a mean of 2 ms as a result of the extra loops.

In netdev, we skip some checks since the format of /proc/net/dev is
stable.

Signed-off-by: Paul Donald <newtwen+github@gmail.com>
-thermal management (temp and cooling devices)
-uname amendment to oneline and help string
-realtek-poe
-odhcp6c
-entropy help strings
-time help string
-hwmon
-time clocksources
-watchdog
-metrics help strings

Signed-off-by: Paul Donald <newtwen+github@gmail.com>
@dhewg
Contributor

dhewg commented Dec 19, 2025

> (my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.

I got that, but why should scrape output readability trump code readability? Debugging comes to mind, but we can just use our standard tools, like prometheus-node-exporter-ucode hwmon|grep foo|sort|cut|awk|whatever

"Other downstream benefits" may be the reason to go this way, but that's a bit too hand-wavy argument for my taste ;)

Here's watchdog on an OpenWrt One, we got some (harmless) warnings and an empty pretimeout_governor:

prometheus-node-exporter-ucode watchdog
prometheus-node-exporter-ucode now serving requests with 16 collectors
Status: 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8

# HELP node_watchdog_bootstatus Value of /sys/class/watchdog/<watchdog>/bootstatus
# TYPE node_watchdog_bootstatus gauge
node_watchdog_bootstatus{name="watchdog0"} 0
node_watchdog_bootstatus{name="watchdog1"} 0
# HELP node_watchdog_fw_version Value of /sys/class/watchdog/<watchdog>/fw_version
# TYPE node_watchdog_fw_version gauge
node_watchdog_fw_version{name="watchdog0"} 0
node_watchdog_fw_version{name="watchdog1"} 0
# HELP node_watchdog_nowayout Value of /sys/class/watchdog/<watchdog>/nowayout
# TYPE node_watchdog_nowayout gauge
node_watchdog_nowayout{name="watchdog0"} 0
node_watchdog_nowayout{name="watchdog1"} 0
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
# HELP node_watchdog_timeout_seconds Value of /sys/class/watchdog/<watchdog>/timeout
# TYPE node_watchdog_timeout_seconds gauge
node_watchdog_timeout_seconds{name="watchdog0"} 30
node_watchdog_timeout_seconds{name="watchdog1"} 31
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds)
# HELP node_watchdog_pretimeout_seconds Value of /sys/class/watchdog/<watchdog>/pretimeout
# TYPE node_watchdog_pretimeout_seconds gauge
node_watchdog_pretimeout_seconds{name="watchdog1"} 15
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
# HELP node_watchdog_info Info of /sys/class/watchdog/<watchdog>
# TYPE node_watchdog_info gauge
node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor=""} 1
node_watchdog_info{name="watchdog1",options="0x8380",identity="mtk-wdt",state="inactive",status="0x0",pretimeout_governor="panic"} 1
# HELP node_watchdog_available Info of /sys/class/watchdog/<watchdog>/pretimeout_available_governors
# TYPE node_watchdog_available gauge
node_watchdog_available{available="panic",device="watchdog1"} 1
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="watchdog"} 0.005309986
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="watchdog"} 1

@dhewg
Contributor

dhewg commented Dec 19, 2025

avm,fritzbox-7530 looks even more funny:

$ prometheus-node-exporter-ucode watchdog|grep \"\"
prometheus-node-exporter-ucode now serving requests with 20 collectors
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_bootstatus)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_fw_version)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_nowayout)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeout_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
node_watchdog_info{name="watchdog0",options="",identity="",state="",status="",pretimeout_governor=""} 1

@systemcrash
Contributor Author

> > (my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.
>
> I got that, but why should scrape output readability trump code readability? Debugging comes to mind, but we can just use our standard tools, like prometheus-node-exporter-ucode hwmon|grep foo|sort|cut|awk|whatever
>
> "Other downstream benefits" may be the reason to go this way, but that's a bit too hand-wavy argument for my taste ;)

Fair. Well, when you're comparing with Prometheus node_exporter, it helps quite a bit.

> Here's watchdog on an OpenWrt One, we got some (harmless) warnings and an empty pretimeout_governor:

Ah, thanks!

> DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
> DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
> DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds)
> DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
> DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)

These should all be resolved with the latest push.

> # HELP node_watchdog_info Info of /sys/class/watchdog/<watchdog>
> # TYPE node_watchdog_info gauge
> node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor=""} 1

As should this, without things like pretimeout_governor="". Should be fixed.

The same fixes should also work on the Fritz. We now skip it if the path is unavailable.
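
The behaviour described here (skip attributes whose sysfs file is unavailable, and leave empty labels out of the info metric) might be sketched as follows. This is a Python illustration under assumptions, not the collector's ucode: `read_attr` and `watchdog_info_labels` are hypothetical helpers, and the attribute list is taken from the labels shown in the `node_watchdog_info` output earlier in the thread.

```python
import os

# Label names as they appear in the node_watchdog_info samples in this thread.
INFO_ATTRS = ("options", "identity", "state", "status", "pretimeout_governor")


def read_attr(path):
    """Return the stripped first line of a file, or None if it is unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.readline().strip()
    except OSError:
        return None


def watchdog_info_labels(devroot, name):
    """Build the info labels, leaving out attributes that are missing or empty."""
    labels = {"name": name}
    for attr in INFO_ATTRS:
        value = read_attr(os.path.join(devroot, name, attr))
        if value:  # skips both None (file unavailable) and "" (empty file)
            labels[attr] = value
    return labels
```

With this approach a device that exposes only `identity` yields just the `name` and `identity` labels, rather than a sample padded with `""` or `"null"` values.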

vooon added a commit to vooon/my-openwrt-feed that referenced this pull request Jan 22, 2026
Copy variant from PR:
openwrt/packages#28016

Signed-off-by: Vladimir Ermakov <vooon341@gmail.com>