Hi!
Found this a few years back now and love it. Use it every day.
However, I just tried updating to the newest Prometheus, v3.5.0 (released today), and ran into a number of issues.
It seems rule evaluation is enforced more strictly now? I haven't dug too deeply into the "why" yet, but I have some logs and notes, and noticed no one had opened an issue yet.
There's also a new feature flag, --enable-feature=promql-delayed-name-removal, that seems to alleviate some of the issues. The docs for that feature are here, and they lead to a GitHub issue with even more info.
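For anyone else who wants to test: here is a sketch of how I enabled the flag. The service layout below is hypothetical (adjust to your own compose file); the flag name itself is the one from the Prometheus docs.

```yaml
# Hypothetical docker-compose fragment -- enables the new feature flag.
# Paths and other flags will differ in your setup.
services:
  prometheus:
    image: prom/prometheus:v3.5.0
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--enable-feature=promql-delayed-name-removal"
```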
tl;dr: it seems a number of the rules are going to need updating to work with anything past Prometheus 3.4.x ...
Some sample logs from running the new v3.5.0 are below:
prometheus | time=2025-07-14T17:55:45.844Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/node_alerting_rules.yml group=node.rules name=HostOutOfMemory index=0 rule="alert: HostOutOfMemory\nexpr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10\nfor: 2m\nlabels:\n severity: warning\nannotations:\n description: |-\n Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Host out of memory (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:45.845Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/node_alerting_rules.yml group=node.rules name=HostOutOfDiskSpace index=6 rule="alert: HostOutOfDiskSpace\nexpr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and on\n (instance, device, mountpoint) node_filesystem_readonly == 0\nfor: 2m\nlabels:\n severity: warning\nannotations:\n description: |-\n Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Host out of disk space (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:45.845Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/node_alerting_rules.yml group=node.rules name=HostDiskWillFillIn24Hours index=7 rule="alert: HostDiskWillFillIn24Hours\nexpr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and on\n (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~\"tmpfs\"}[1h],\n 24 * 3600) < 0 and on (instance, device, mountpoint) node_filesystem_readonly ==\n 0\nfor: 2m\nlabels:\n severity: warning\nannotations:\n description: |-\n Filesystem is predicted to run out of space within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:45.854Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/node_alerting_rules.yml group=node.rules name=HostSwapIsFillingUp index=17 rule="alert: HostSwapIsFillingUp\nexpr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80\nfor: 2m\nlabels:\n severity: warning\nannotations:\n description: |-\n Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Host swap is filling up (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:45.863Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/node_alerting_rules.yml group=node.rules name=HostConntrackLimit index=28 rule="alert: HostConntrackLimit\nexpr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8\nfor: 5m\nlabels:\n severity: warning\nannotations:\n description: |-\n The number of conntrack is approaching limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Host conntrack limit (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:55.963Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/prometheus_alerting_rules.yml group=prometheus.rules name=PrometheusTargetMissingWithWarmupTime index=2 rule="alert: PrometheusTargetMissingWithWarmupTime\nexpr: sum by (instance, job) ((up == 0) * on (instance) group_right (job) (node_time_seconds\n - node_boot_time_seconds > 600))\nlabels:\n severity: critical\nannotations:\n description: |-\n Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Prometheus target missing with warmup time (instance {{ $labels.instance\n }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:55.963Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/prometheus_alerting_rules.yml group=prometheus.rules name=PrometheusTooManyRestarts index=3 rule="alert: PrometheusTooManyRestarts\nexpr: changes(process_start_time_seconds{job=~\"prometheus|pushgateway|alertmanager\"}[15m])\n > 2\nlabels:\n severity: warning\nannotations:\n description: |-\n Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Prometheus too many restarts (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:55.964Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/prometheus_alerting_rules.yml group=prometheus.rules name=PrometheusNotificationsBacklog index=8 rule="alert: PrometheusNotificationsBacklog\nexpr: min_over_time(prometheus_notifications_queue_length[10m]) > 0\nlabels:\n severity: warning\nannotations:\n description: |-\n The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Prometheus notifications backlog (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:57.091Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/loki_alerting_rules.yml group=loki.rules name=LokiProcessTooManyRestarts index=0 rule="alert: LokiProcessTooManyRestarts\nexpr: changes(process_start_time_seconds{job=~\".*loki.*\"}[15m]) > 2\nlabels:\n severity: warning\nannotations:\n description: |-\n A loki process had too many restarts (target {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Loki process too many restarts (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:57.412Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/redis_alerting_rules.yml group=redis.rules name=RedisDisconnectedSlaves index=3 rule="alert: RedisDisconnectedSlaves\nexpr: count without (job) (redis_connected_slaves) - sum without (job) (redis_connected_slaves)\n - 1 > 0\nlabels:\n severity: critical\nannotations:\n description: |-\n Redis not replicating for all slaves. Consider reviewing the redis replication status.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Redis disconnected slaves (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
prometheus | time=2025-07-14T17:55:57.412Z level=WARN source=group.go:544 msg="Evaluating rule failed" component="rule manager" file=/etc/prometheus/rules/alerting/redis_alerting_rules.yml group=redis.rules name=RedisMissingBackup index=6 rule="alert: RedisMissingBackup\nexpr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24\nlabels:\n severity: critical\nannotations:\n description: |-\n Redis has not been backuped for 24 hours\n VALUE = {{ $value }}\n LABELS = {{ $labels }}\n summary: Redis missing backup (instance {{ $labels.instance }})\n" err="vector cannot contain metrics with the same labelset"
Happy to generate more logs, etc. I believe the logs above are from an instance where I did not provide the new flag; providing it only lessened the number of failing rules, it did not fully restore the v3.4.x working behavior.
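In case it's useful as a data point: the one workaround I've found for individual rules (besides the flag) is wrapping the expression in an aggregation, which forces a unique labelset per group. This is a sketch of the HostOutOfMemory rule above rewritten that way, not a confirmed or recommended fix, and note that aggregating by instance drops the other labels, so {{ $labels }} in the annotation gets thinner:

```yaml
# Workaround sketch, not a root-cause fix: max by (instance) collapses any
# series that collide on the same labelset after the metric name is dropped,
# which sidesteps "vector cannot contain metrics with the same labelset".
# Caveat: labels other than "instance" are lost from the alert.
- alert: HostOutOfMemory
  expr: max by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) < 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host out of memory (instance {{ $labels.instance }})
    description: |-
      Node memory is filling up (< 10% left)
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
```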
Thanks again for the amazing work on this repo!