
Signalilo Heartbeat Implementation Question #123

@dragoangel


This is more a question for discussion, but from what I see:

  1. README.md states the alert will be in UNKNOWN state when the heartbeat is triggered, but it is actually in CRITICAL state. I think the heartbeat service example was changed at some point and someone forgot to update the description.
  2. README.md states: On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error. I read this as: if the heartbeat service doesn't exist (404), or any other 4xx/5xx failure occurs, Signalilo will die. But I don't see this behavior now. Maybe it got broken at some point?
  3. In my view, Signalilo should report in some way when it has issues writing alerts received from Alertmanager to Icinga. Right now it can break silently, and if nobody checks the Signalilo logs, they will never know about it. This could happen due to Icinga downtime, or because somebody breaks something on the Icinga side, even simply dropping the host or the API user. There are a couple of options that could resolve this:
    • Update the heartbeat service status when we get errors from the Icinga API in certain places. I don't like this approach, as it would be confusing.
    • Work like a proxy between Alertmanager and Icinga: don't reply to Alertmanager with a 200 status code until we get one from Icinga. This way we would know on the Alertmanager side that the Icinga integration currently looks dead, and a "fallback" route could be used to notify via AlertmanagerFailedToSendAlerts|AlertmanagerClusterFailedToSendAlerts. The problem is that it would introduce delays.
    • Last option, and I think the most preferable: have a separate mandatory service, IcingaApiErrors, for such error handling. It would have to be created the same way as the heartbeat service, and it would show whether there were any errors in the last minute. A small minus: with multiple Signalilo replicas it could start flapping. If even the update of the IcingaApiErrors service fails, Signalilo can instantly reply to Alertmanager with a failure. After 1 minute with no Icinga API errors (since no requests were made), we try to update IcingaApiErrors again: if that fails, wait another minute and keep replying 500 to Alertmanager; if it passes, start accepting alerts from Alertmanager again.

What do you think?
