
Signalilo Heartbeat Implementation Question #123

@dragoangel


This is more a question for discussion, but from what I see:

  1. README.md states the alert will be in UNKNOWN state when the heartbeat is triggered, but it is actually in CRITICAL state. I think the heartbeat service example was changed at some point and someone forgot to update the description.
  2. README.md states: On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error. I read this as: if the heartbeat service doesn't exist (404), or any other 4xx/5xx failure occurs, Signalilo will die. But I don't see this behavior now. Maybe it got broken at some point?
  3. In my view, Signalilo should report in some way when it has issues writing alerts received from Alertmanager to Icinga. Right now it can break silently, and if nobody checks the Signalilo logs, they will never know about it. This could happen due to Icinga downtime, or because somebody breaks something on the Icinga side, even simply dropping the host or the API user. There are a couple of options that could resolve this:
    • Update the heartbeat service status when we get errors from the Icinga API in certain places. I don't like this approach, as it would be confusing.
    • Work like a proxy between Alertmanager and Icinga: don't reply to Alertmanager with a 200 status code until we get one from Icinga. This way we would know on the Alertmanager side that the Icinga integration currently looks dead, and a "fallback" route could be used to notify via AlertmanagerFailedToSendAlerts|AlertmanagerClusterFailedToSendAlerts. The problem is that it would introduce delays.
    • Last option, and I think the most preferable: have a separate mandatory service, IcingaApiErrors, for such error handling. It would have to be created the same way as the heartbeat service, and it would show whether there were any errors in the last minute. A small minus: with multiple Signalilo replicas it could start flapping. If even the update of the IcingaApiErrors service fails, Signalilo can instantly reply to Alertmanager with a failure. After 1 minute with no Icinga API errors (since no requests were made), we try to update IcingaApiErrors again: if that fails, wait another minute and keep replying 500 to Alertmanager; if it passes, start accepting alerts from Alertmanager again.

What do you think?
