Naemon stops executing checks and doesn't respawn Core Worker processes #418

@ccztux

On a system running Naemon Core 1.3.0, we ran into an issue where Naemon stops executing checks. No worker processes were left. I did not see anything suspicious in the system journal or dmesg: no SIGSEGV and no oom_killer activity.

Snippet from the Naemon log (host and service names anonymized):

[1677024519] Warning:  Check of host 'myhost' did not exit properly!
[1677024519] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024520] wproc: Socket to worker Core Worker 4261 broken, removing
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] HOST ALERT: myhost;DOWN;SOFT;3;CRITICAL - 10.0.0.63: rta nan, lost 100%
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] Warning:  Check of host 'myhost' did not exit properly!
[1677024521] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024521] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4258 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4258 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4260 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4260 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4259 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4259 broken, removing
[1677024526] Warning:  Check of host 'myhost' did not exit properly!
[1677024526] HOST ALERT: myhost;DOWN;SOFT;3;(Host check did not exit properly)
[1677024526] wproc: nm_bufferqueue_read() from Core Worker 4257 returned -1: Connection reset by peer
[1677024526] wproc: Socket to worker Core Worker 4257 broken, removing
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)

Independent of the root cause of the broken Core Worker processes, I think Naemon should respawn Core Worker processes whenever none are running or fewer than the desired number remain.

This also happens with a manual installation of the current master branch, Naemon Core 1.4.1.g2916d626.20230223.

Found this to reproduce the issue.

After looking into the source code, I expected to hit the following if condition, but that doesn't happen:

    if (workers.len <= 0) {
        /* there aren't global workers left, we can't run any more checks
         * we should try respawning a few of the standard ones
         */
        nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, we can't do anything!");
    }
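
To make the proposal concrete, here is a rough sketch of the "top up to the desired count" behaviour I have in mind. It is not Naemon code: `struct worker_pool`, `respawn_missing_workers()`, `spawn_fn` and `spawn_with_budget()` are hypothetical names made up for illustration, and the real worker list and spawn routine in Naemon will look different.

```c
#include <stddef.h>

/* Hypothetical stand-in for the global worker list; not Naemon's real struct. */
struct worker_pool {
    unsigned int len;      /* workers currently alive */
    unsigned int desired;  /* workers we want to have running */
};

/* Hypothetical spawn callback: returns 0 on success, non-zero on failure. */
typedef int (*spawn_fn)(void *ctx);

/*
 * Spawn workers until the pool reaches pool->desired.
 * Returns the number of workers spawned. Stops at the first spawn
 * failure so that a broken environment cannot cause an endless loop.
 */
unsigned int respawn_missing_workers(struct worker_pool *pool, spawn_fn spawn, void *ctx)
{
    unsigned int spawned = 0;

    if (pool == NULL || spawn == NULL)
        return 0;

    while (pool->len < pool->desired) {
        if (spawn(ctx) != 0)
            break;
        pool->len++;
        spawned++;
    }
    return spawned;
}

/* Demo callback (assumption): succeeds until the given budget is used up. */
int spawn_with_budget(void *ctx)
{
    int *budget = ctx;

    if (*budget <= 0)
        return -1;
    (*budget)--;
    return 0;
}
```

The point of the sketch is that the check is `len < desired` rather than `len <= 0`, so it covers both the "no workers left" case and the "fewer than desired" case. It would have to run whenever a worker is removed or before a check is dispatched; where exactly that belongs in Naemon is left open here.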

I will provide a fix for the missing respawn via a pull request.
