This is more of a question to discuss, but from what I see:
1. `README.md` states the alert will be in UNKNOWN state when the heartbeat is triggered, but it actually ends up in CRITICAL state. I think the heartbeat service example was changed at some point and someone forgot to update the description.
2. `README.md` states: "On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error." My understanding of this is that if the heartbeat service doesn't exist (`404`), or any other failure occurs (`4xx`/`5xx`), Signalilo should die, but I don't see this behavior now. Maybe it got broken at some point?
3. In my view, Signalilo should report somehow when it has issues writing alerts received from Alertmanager to Icinga. Right now it can break silently, and unless somebody checks the Signalilo logs, nobody will know about it. This could happen due to Icinga downtime, or because somebody breaks something on the Icinga side, even simply dropping the host or the API user. There are a couple of options that could resolve this:
- Update the heartbeat service status when we get errors from the Icinga API in certain places. I don't like this approach, as it would be confusing.
- Work like a proxy between Alertmanager and Icinga: don't reply to Alertmanager with a 200 status code until we get one from Icinga. That way Alertmanager would know that the Icinga integration currently looks dead, and a "fallback" route could be used to notify about `AlertmanagerFailedToSendAlerts`/`AlertmanagerClusterFailedToSendAlerts`. The problem is that this introduces delays.
- The last option, which I think is the most preferable: have a separate mandatory service for such error handling, `IcingaApiErrors`, which must be created in the same way as `Heartbeat` and which would show whether there were any errors in the last minute. A small minus: with multiple Signalilo replicas it could start flapping. If even updating the `IcingaApiErrors` service fails, Signalilo can immediately report failures back to Alertmanager. After one minute there will be no Icinga API errors, since no requests were made, and we try to update `IcingaApiErrors` again: if that fails, wait another minute and reply 500 to Alertmanager; if it passes, start accepting alerts from Alertmanager again.
What do you think?