[Prometheus] Metrics are sometimes returning 0 #15355
Describe the bug
We recently upgraded RabbitMQ running under Docker from rabbitmq:4.1.3-management to rabbitmq:4.2.2-management. After the upgrade, several Prometheus metrics started behaving inconsistently. This began about 7 hours post upgrade, at midnight. The metrics we noticed this with are:
We are scraping the /metrics/per-object endpoint every 10s. Metric counters would randomly return 0 and then go back up to what appears to be a correct value. Not all of them return 0, only some, at random. Metrics returned from the Management Plugin do not seem to have this issue.

Reproduction steps
Expected behavior
Time out or return the correct metric.

Additional context
Only one log message from this time window, and it does not seem unusual:
rabbit_sysmon_handler busy_dist_port <0.859.0> [{name,delegate_management_4},{initial_call,{delegate,init,1}},{gen_server2,process_next_msg,1},{message_queue_len,0}] {#Port<0.18>,unknown}
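To document the symptom more precisely, a small poller can record every sample where a counter goes backwards, since counters are expected to be monotonic between restarts. This is only a rough sketch, not an official RabbitMQ tool: the endpoint URL, port 15692, the absence of authentication/TLS, and the rabbitmq_ prefix filter are assumptions to adjust for your deployment.

```python
# Minimal sketch: poll /metrics/per-object and flag counter regressions.
# Assumptions: endpoint on localhost:15692, no auth/TLS, 10s interval.
import time
import urllib.request

ENDPOINT = "http://localhost:15692/metrics/per-object"  # assumed host/port
WATCH_PREFIX = "rabbitmq_"   # track every rabbitmq_* sample line
INTERVAL_S = 10              # matches the 10s scrape interval in the report

def scrape():
    """Fetch the text exposition format and return {sample_with_labels: value}."""
    samples = {}
    with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            if not raw or raw.startswith("#") or not raw.startswith(WATCH_PREFIX):
                continue
            # "metric{labels} value" -> value is the last whitespace-separated field
            name, _, value = raw.rpartition(" ")
            try:
                samples[name] = float(value)
            except ValueError:
                pass
    return samples

previous = {}
while True:
    current = scrape()
    now = time.strftime("%Y-%m-%dT%H:%M:%S")
    for key, value in current.items():
        prev = previous.get(key)
        # A counter should never go down; a sudden drop to 0 and back up
        # matches the symptom described above.
        if prev is not None and value < prev:
            print(f"{now} regression: {key} went {prev} -> {value}")
    previous = current
    time.sleep(INTERVAL_S)
```

Correlating the printed timestamps with entries in the node's log (such as the busy_dist_port message above) can help narrow down what changes around midnight.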
@dforste that "doesn't seem unusual" message tells you that the inter-node communication link on that node has been overloaded for a certain amount of time continuously. That can directly affect the metrics that are aggregated across all nodes: not all responses arrive within the short timeout such operations use, and therefore you get underreported metrics in the UI. Without clear evidence of other scenarios, that's my conclusion. Perhaps you have periodic processes that publish large messages running at midnight, or something like that. |