-
Notifications
You must be signed in to change notification settings - Fork 14
Description
I'm opening a new issue for some odd behavior I'm seeing with host notifications. I've seen this behavior occur about 4 times now but I haven't necessarily been able to replicate it intentionally yet.
I was using escalations for host notifications but removed that in favor of delay_first_notification. Sometimes it appears that Naemon/Merlin chooses to send out a host notification earlier than when it should. Take the following example:
Sep 17 06:31:05 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:05 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:05 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:05 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:05 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:05 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:24 node_b naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;DOWN;dcops_host-notify-by-email;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;DOWN;dcops_host-notify-by-email-pager;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_a naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:32:24 node_c naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).
Sep 17 06:32:27 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:27 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:27 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:51 node_b naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;UP;dcops_host-notify-by-email;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;UP;dcops_host-notify-by-email-pager;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_a naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:32:51 node_c naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:33:13 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:13 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:13 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:27 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:33:27 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:33:27 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:34:16 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:34:16 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:34:16 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_b naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_a naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).
Sep 17 06:36:48 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:36:48 node_c naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;DOWN;dcops_host-notify-by-email;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;DOWN;dcops_host-notify-by-email-pager;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Look particularly at 06:32:24. 'node_b' logs a HARD state and immediately sends a notification. At the same time node_a reports that the notification is blocked by a NEB module and will be handled by node_b (that makes sense). But then node_c reports something entirely different:
Sep 17 06:32:24 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).
So for some reason node_c realizes first_notification_delay is in play ... but somehow node_a doesn't. I can confirm that all three nodes have the exact same configuration.
Here is the relevant configuration for host checks:
define host {
active_checks_enabled 1
check_command team_check-host-alive
check_freshness 0
check_period team_24x7
contact_groups admin1+admin2+admin3_email
event_handler_enabled 0
flap_detection_enabled 1
max_check_attempts 5
name team_generic_host_template
notification_interval 2
notification_options d,r
notification_period team_24x7
notifications_enabled 1
first_notification_delay 3
obsess_over_host 0
passive_checks_enabled 0
process_perf_data 1
register 0
retain_nonstatus_information 1
retain_status_information 1
}