diff --git a/content/en/incident_response/on-call/_index.md b/content/en/incident_response/on-call/_index.md
index 9ce7d996740..ecdf503b83a 100644
--- a/content/en/incident_response/on-call/_index.md
+++ b/content/en/incident_response/on-call/_index.md
@@ -25,7 +25,7 @@ Datadog On-Call integrates monitoring, paging, and incident response into one pl

 - **Pages** represent something to get alerted for, such as a monitor, incident, or security signal. A Page can have a status of `Triggered`, `Acknowledged`, or `Resolved`.
 - **Teams** are groups configured within Datadog to handle specific types of Pages, based on expertise and operational roles.
-- **Routing rules** allow Teams to finely adjust their reactions to specific types of incoming events. These rules can set a Page's urgency level and route Pages to different escalation policies depending on the event's metadata.
+- **Routing rules** allow Teams to finely adjust their reactions to specific types of incoming events. These rules can set a Page's urgency level, route Pages to different escalation policies depending on the event's metadata, and configure [support hours][7] to delay escalation notifications to defined time windows.
 - **Escalation policies** determine how Pages are escalated within or across Teams.
 - **Schedules** set timetables for when specific Team members are on-call to respond to Pages.

@@ -106,3 +106,4 @@ On-Call is a seat-based SKU. To learn more about how On-Call is billed and how t
 [4]: /account_management/rbac/#role-based-access-control
 [5]: https://www.datadoghq.com/pricing/?product=incident-response#products
 [6]: /account_management/billing/incident_response/
+[7]: /incident_response/on-call/routing_rules#support-hours
diff --git a/content/en/incident_response/on-call/routing_rules.md b/content/en/incident_response/on-call/routing_rules.md
index eb9c2e79ee6..b9613213a0f 100644
--- a/content/en/incident_response/on-call/routing_rules.md
+++ b/content/en/incident_response/on-call/routing_rules.md
@@ -23,6 +23,10 @@ With routing rules, you can define granular logic to control how alerts reach yo
   - During business hours, route alerts to an escalation policy.
   - After hours, route critical alerts to paging, and non-critical alerts to chat.

+- Delay escalation outside of support hours:
+  - Define [support hours](#support-hours) on an escalation policy action to postpone notifications until the next active window.
+  - For example, a Page that arrives at 2:00 AM on Saturday creates a case immediately, but does not notify responders until 9:00 AM on Monday.
+
 - Use Dynamic Urgency to automatically detect urgency from the monitor alert:
   - `warn` status ➝ low urgency
   - `alert` status ➝ high urgency
@@ -55,11 +59,81 @@ Routing rules use [Datadog query syntax][3] and support multiple `if/else` condi
 | `priority` | Monitor priority (1–5) | `priority:(1 OR 2)` |
 | `alert_status` | Monitor status (`error`, `warn`, `success`) | `alert_status:(error OR warn)` |

+## Support hours
+
+Support hours let you define time windows during which an escalation policy actively notifies responders. When a Page arrives outside of support hours, Datadog creates the Page immediately but **postpones** the escalation policy until the next active support hours window. After the postponement period ends, the escalation policy begins executing normally.
+
+### How support hours work
+
+1. An alert triggers a Page to an On-Call team.
+1. Routing rules are evaluated from top to bottom to find a matching rule.
+1. The matching rule's escalation policy action checks the current time against the configured support hours:
+   - **Inside support hours**: The escalation policy executes immediately and responders are notified.
+   - **Outside support hours**: The Page is created and the escalation policy is postponed. Datadog records a timeline entry on the Page indicating the postponement. When support hours resume, the escalation policy begins executing.
+
+### Support hours compared to time restrictions
+
+Routing rules support two types of time-based controls. They serve different purposes:
+
+| Feature | What it controls | Behavior outside the time window |
+|---------|-----------------|----------------------------------|
+| **Time restrictions** | When the routing rule **evaluates** | The rule is skipped and the next rule is tried. No Page is created by this rule. |
+| **Support hours** | When the escalation policy **notifies responders** | The Page is created immediately, but notifications are postponed until the next active window. |
+
+For example, if your team handles priority 2 alerts and wants to track all alerts but only page responders during business hours, use **support hours**. If your team should not handle certain alerts at all outside of business hours (and another rule or team should handle them instead), use **time restrictions**.
+
+
You cannot configure both time restrictions and support hours on the same routing rule. Use one or the other.
+
+### Configure support hours
+
+To add support hours to a routing rule's escalation policy action, configure a time zone and one or more time windows (restrictions).
+
+Each support hours configuration includes:
+- **Time zone**: An IANA time zone (for example, `America/New_York`, `Europe/Paris`, or `Asia/Tokyo`).
+- **Restrictions**: One or more time windows that define when the escalation policy is active. Each restriction specifies:
+  - A **start day** and **start time**
+  - An **end day** and **end time**
+
+Times use the `HH:MM:SS` format (for example, `09:00:00` for 9:00 AM).
+
+If multiple restriction windows are defined, the escalation policy is active if the current time matches **any** of the windows.
+
+#### Example: Business hours only (Monday through Friday, 9 AM to 5 PM)
+
+Set a single restriction window:
+- **Start day**: Monday, **Start time**: 09:00:00
+- **End day**: Friday, **End time**: 17:00:00
+- **Time zone**: `America/New_York`
+
+Pages that arrive outside this window (for example, at 2:00 AM on Saturday) are postponed until 9:00 AM on the following Monday.
+
+#### Example: Split shift (mornings and afternoons)
+
+Define two restriction windows to cover non-contiguous hours:
+
+**Window 1:**
+- **Start day**: Monday, **Start time**: 09:00:00
+- **End day**: Friday, **End time**: 12:00:00
+
+**Window 2:**
+- **Start day**: Monday, **Start time**: 14:00:00
+- **End day**: Friday, **End time**: 18:00:00
+
+Pages that arrive between 12:00 PM and 2:00 PM are postponed until the afternoon window opens.
+
 ## Best practices

 - Balance visibility with urgency:
   - Use paging and escalation policies for critical alerts that require immediate action.
-  - Use Slack or Teams for lower-severity issues that need awareness but don’t warrant an on-call response.
+  - Use Slack or Teams for lower-severity issues that need awareness but don't warrant an on-call response.
+
+- Use support hours to protect responders from off-hours notifications:
+  - For non-critical alerts, configure support hours to match your team's working hours. Pages are tracked immediately but responders are only notified during active windows.
+  - For critical alerts that require immediate attention regardless of time, do **not** set support hours on the escalation policy.
+
+- Choose between time restrictions and support hours based on your routing needs:
+  - Use **time restrictions** when a different routing rule or team should handle the alert outside of business hours.
+  - Use **support hours** when your team should own the alert at all times but only page responders during defined hours.

 ## Further reading