I'd like to propose an enhancement to the alert configuration system that would significantly improve our ability to manage service status changes and automate appropriate responses.
Current Limitation:
Currently, we are limited to a single alert configuration per status type (e.g., one DOWN alert, one DEGRADED alert). This restricts our ability to implement nuanced, escalating responses to service degradation or outages.
Proposed Feature: Multiple, Configurable Alerts per Status
My suggestion is to allow users to configure zero to N (0-N) distinct alerts for a given service status, such as DOWN or DEGRADED. Each of these alerts would have its own independent configuration settings, including the status it applies to (e.g., DOWN, DEGRADED).
Benefits of this Approach:
This flexible configuration would enable us to implement sophisticated, escalating alert patterns tailored to the severity and duration of an issue. This offers several key advantages:
Illustrative Example: Escalating "DOWN" Alerts
Consider the following scenario for a service in DOWN status, demonstrating how multiple alerts could work in an escalating fashion:

First Alert - Initial Blip Notification
- Status: DOWN

Second Alert - Sustained Issue Notification
- Status: DOWN
- Notify via: email (to a broader team distribution list for awareness, but not yet a high-priority PagerDuty alert)
- When the DOWN status persists beyond a minor blip, this alert notifies a wider group that the problem is not self-correcting.

Third Alert - Critical Outage Escalation
- Status: DOWN
- Notify via: webhook (to an incident management system like PagerDuty or Opsgenie, triggering on-call rotations and critical response procedures)

This proposed functionality would give users much greater control and automation capabilities for their monitoring and incident response strategies.
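
To make the escalation idea concrete, here is a minimal Python sketch of how 0-N alerts per status might be modelled and selected. Everything in it is an assumption for illustration: the `AlertRule` schema, the `after_seconds` threshold field, the `due_alerts` helper, the threshold values, and the `chat` channel used for the first alert are all hypothetical and are not part of any existing configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One of the 0-N alerts configured for a status (hypothetical schema)."""
    status: str          # status that triggers the alert, e.g. "DOWN" or "DEGRADED"
    channel: str         # where the alert is sent, e.g. "email" or "webhook"
    after_seconds: int   # assumed field: how long the status must persist first
    description: str = ""


def due_alerts(rules, status, seconds_in_status):
    """Return the rules for `status` whose duration threshold has been reached."""
    return [
        rule for rule in rules
        if rule.status == status and seconds_in_status >= rule.after_seconds
    ]


# Illustrative values only: thresholds and the first alert's channel are assumptions.
RULES = [
    AlertRule("DOWN", "chat", 60, "initial blip notification"),
    AlertRule("DOWN", "email", 300, "sustained issue notification"),
    AlertRule("DOWN", "webhook", 900, "critical outage escalation"),
]

if __name__ == "__main__":
    for elapsed in (30, 60, 400, 1000):
        channels = [rule.channel for rule in due_alerts(RULES, "DOWN", elapsed)]
        print(f"DOWN for {elapsed}s -> {channels}")
```

With a list of rules per status, an empty list naturally covers the "zero alerts" case, and each additional rule adds one escalation step without changing the selection logic.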