
Commit 094b0e3

[E&A] Alerts & Cases reorg and refine (#377)
1 parent 2b697af commit 094b0e3

32 files changed

+262
-528
lines changed

explore-analyze/alerts-cases.md

Lines changed: 0 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -6,17 +6,6 @@ mapped_urls:
66

77
# Alerts and cases
88

9-
% What needs to be done: Write from scratch
10-
11-
% Use migrated content from existing pages that map to this page:
12-
13-
% - [ ] ./raw-migrated-files/kibana/kibana/alerting-getting-started.md
14-
% - [ ] ./raw-migrated-files/docs-content/serverless/project-settings-alerts.md
15-
16-
$$$alerting-concepts-actions$$$
17-
18-
$$$alerting-concepts-conditions$$$
19-
209
Alerting tools in Elasticsearch and Kibana provide functionality to monitor data and notify you about significant changes or events in real time. This page provides an overview of how the key components work.
2110

2211
## Alerts

explore-analyze/alerts-cases/alerts.md

Lines changed: 100 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -7,24 +7,112 @@ mapped_urls:
77

88
# Alerts
99

10-
% What needs to be done: Align serverless/stateful
10+
## {{rules-app}} [rules]
1111

12-
% Scope notes: connection to kibana connectors reference prod considerations
12+
In general, a rule consists of three parts:
1313

14-
% Use migrated content from existing pages that map to this page:
14+
* *Conditions*: what needs to be detected?
15+
* *Schedule*: when/how often should detection checks run?
16+
* *Actions*: what happens when a condition is detected?
1517

16-
% - [ ] ./raw-migrated-files/kibana/kibana/alerting-getting-started.md
17-
% - [ ] ./raw-migrated-files/docs-content/serverless/rules.md
18-
% - [ ] ./raw-migrated-files/cloud/cloud/ec-organizations-notifications-domain-allowlist.md
18+
For example, when monitoring a set of servers, a rule might:
1919

20-
% Internal links rely on the following IDs being on this page (e.g. as a heading ID, paragraph ID, etc):
20+
* Check for average CPU usage > 0.9 on each server for the last two minutes (condition).
21+
* Check every minute (schedule).
22+
* Send a warning email message via SMTP with subject `CPU on {{server}} is high` (action).
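As a rough illustration only, the three parts of the server-monitoring rule can be modeled as plain data. This sketch is not the {{kib}} rule schema: the class names, the `cpu.avg` metric name, and the `smtp-email` connector label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    metric: str            # hypothetical metric name, e.g. "cpu.avg"
    threshold: float       # the condition is met when the metric exceeds this value
    window_minutes: int    # look-back window for the check


@dataclass
class Action:
    connector: str         # hypothetical connector label
    subject_template: str  # template filled in per alert


@dataclass
class Rule:
    condition: Condition
    check_interval_minutes: int  # schedule: how often the condition is checked
    actions: list[Action] = field(default_factory=list)


# The example rule from the text: CPU > 0.9 over two minutes, checked every minute.
cpu_rule = Rule(
    condition=Condition(metric="cpu.avg", threshold=0.9, window_minutes=2),
    check_interval_minutes=1,
    actions=[Action(connector="smtp-email",
                    subject_template="CPU on {{server}} is high")],
)
```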
2123

22-
$$$alerting-getting-started$$$
24+
### Conditions [rules-conditions]
2325

24-
$$$rules-alerts$$$
26+
Each project type supports a specific set of rule types. Each *rule type* provides its own way of defining the conditions to detect, but an expression formed by a series of clauses is a common pattern. For example, in an {{es}} query rule, you specify an index, a query, and a threshold, which uses a metric aggregation operation (`count`, `average`, `max`, `min`, or `sum`):
2527

26-
$$$alerting-concepts-conditions$$$
28+
:::{image} ../../images/serverless-es-query-rule-conditions.png
29+
:alt: UI for defining rule conditions in an {{es}} query rule
30+
:class: screenshot
31+
:::
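To make the threshold idea concrete, the following sketch applies one of the listed aggregations to a set of values and compares the result with a threshold. It is a simplified assumption of how such a check could work, not the rule type's implementation: the function name `condition_met` is hypothetical, and only an "is above" comparison is shown.

```python
from statistics import mean

# Aggregation names taken from the text; the mapping itself is illustrative.
AGGREGATIONS = {
    "count": len,
    "average": mean,
    "max": max,
    "min": min,
    "sum": sum,
}


def condition_met(values, aggregation, threshold):
    """Return True when the aggregated value is above the threshold."""
    if aggregation != "count" and not values:
        return False  # nothing to aggregate, so the condition cannot be met
    return AGGREGATIONS[aggregation](values) > threshold


# Average CPU over the window is above 0.9, so the condition is met.
high_cpu = condition_met([0.95, 0.92], "average", 0.9)
```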
2732

28-
$$$alerting-concepts-scheduling$$$
33+
### Schedule [rules-schedule]
2934

30-
$$$alerting-concepts-actions$$$
35+
All rules must have a check interval, which defines how often to evaluate the rule conditions. Checks are queued; they run as close to the defined value as capacity allows.
36+
37+
::::{important}
38+
The intervals of rule checks in {{kib}} are approximate. Their timing is affected by factors such as the frequency at which tasks are claimed and the task load on the system. Refer to [Alerting production considerations](../../deploy-manage/production-guidance/kibana-alerting-production-considerations.md).
39+
40+
::::
41+
42+
### Actions [rules-actions]
43+
44+
You can add one or more actions to your rule to generate notifications when its conditions are met. Recovery actions likewise run when rule conditions are no longer met.
45+
46+
When defining actions in a rule, you specify:
47+
48+
* A connector
49+
* An action frequency
50+
* A mapping of rule values to properties exposed for that type of action
51+
52+
Each action uses a connector, which provides connection information for a {{kib}} service or third-party integration, depending on where you want to send the notifications. The specific list of connectors that you can use in your rule varies by project type. Refer to [{{connectors-app}}](../../deploy-manage/manage-connectors.md).
53+
54+
After you select a connector, set the *action frequency*. If you want to reduce the number of notifications you receive without affecting their timeliness, some rule types support alert summaries. For example, if you create an {{es}} query rule, you can set the action frequency such that you receive summaries of the new, ongoing, and recovered alerts on a custom interval:
55+
56+
:::{image} ../../images/serverless-es-query-rule-action-summary.png
57+
:alt: UI for defining alert summary action frequency in an {{es}} query rule
58+
:class: screenshot
59+
:::
60+
61+
Alternatively, you can set the action frequency such that the action runs for each alert. If the rule type does not support alert summaries, this is your only available option. You must choose when the action runs (for example, at each check interval, only when the alert status changes, or at a custom action interval). You must also choose an action group, which affects whether the action runs. Each rule type has a specific set of valid action groups. For example, you can set *Run when* to `Query matched` or `Recovered` for the {{es}} query rule:
62+
63+
:::{image} ../../images/serverless-es-query-rule-recovery-action.png
64+
:alt: UI for defining a recovery action
65+
:class: screenshot
66+
:::
67+
68+
Each connector supports a specific set of actions for each action group and enables different action properties. For example, you can have actions that create an {{opsgenie}} alert when rule conditions are met and recovery actions that close the {{opsgenie}} alert.
69+
70+
Some types of rules enable you to further refine the conditions under which actions run. For example, you can specify that actions run only when an alert occurs within a specific time frame or when it matches a KQL query.
71+
72+
::::{tip}
73+
If you are not using alert summaries, actions are triggered per alert and a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the action frequency is `On check intervals`:
74+
75+
* Minute 1: server X123 > 0.9. *One email* is sent for server X123.
76+
* Minute 2: X123 and Y456 > 0.9. *Two emails* are sent, one for X123 and one for Y456.
77+
* Minute 3: X123, Y456, Z789 > 0.9. *Three emails* are sent, one for each of X123, Y456, Z789.
78+
79+
In this example, three emails are sent for server X123 in the span of 3 minutes for the same rule. Often, it’s desirable to suppress these re-notifications. If you set the action frequency to `On custom action intervals` with an interval of 5 minutes, you reduce noise by getting emails only every 5 minutes for servers that continue to exceed the threshold:
80+
81+
* Minute 1: server X123 > 0.9. *One email* will be sent for server X123.
82+
* Minute 2: X123 and Y456 > 0.9. *One email* will be sent for Y456.
83+
* Minute 3: X123, Y456, Z789 > 0.9. *One email* will be sent for Z789.
84+
85+
To get notified only once when a server exceeds the threshold, you can set the action frequency to `On status changes`. Alternatively, if the rule type supports alert summaries, consider using them to reduce the volume of notifications.
86+
87+
::::
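The email counts in the preceding tip can be reproduced with a small simulation. This is a sketch under stated assumptions rather than how {{kib}} schedules actions: the function `emails_sent` and the frequency labels `check`, `custom`, and `status` are hypothetical stand-ins for `On check intervals`, `On custom action intervals`, and `On status changes`, and recovery notifications are not modeled.

```python
def emails_sent(active_by_minute, frequency, interval=5):
    """Return, for each minute, the servers that get an email."""
    last_notified = {}         # server -> minute of its most recent email
    previously_active = set()  # servers over the threshold at the previous check
    sent = []
    for minute, active in enumerate(active_by_minute, start=1):
        notified = []
        for server in sorted(active):
            if frequency == "check":
                due = True  # notify at every check
            elif frequency == "custom":
                last = last_notified.get(server)
                due = last is None or minute - last >= interval
            else:  # "status": notify only when the alert is new
                due = server not in previously_active
            if due:
                notified.append(server)
                last_notified[server] = minute
        sent.append(notified)
        previously_active = set(active)
    return sent


# The three-minute timeline from the tip.
timeline = [{"X123"}, {"X123", "Y456"}, {"X123", "Y456", "Z789"}]
per_check = emails_sent(timeline, "check")
per_interval = emails_sent(timeline, "custom", interval=5)
```

With this timeline, `per_check` contains one, two, and then three emails, while `per_interval` contains one email per minute, each for the server that newly exceeded the threshold.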
88+
89+
#### Action variables [rules-action-variables]
90+
91+
You can pass rule values to an action at the time a condition is detected. To view the list of variables available for your rule, click the "add rule variable" button:
92+
93+
:::{image} ../../images/serverless-es-query-rule-action-variables.png
94+
:alt: Passing rule values to an action
95+
:class: screenshot
96+
:::
97+
98+
For more information about common action variables, refer to [Rule action variables](../../explore-analyze/alerts-cases/alerts/rule-action-variables.md).
99+
100+
### Alerts [rules-alerts]
101+
102+
When checking for a condition, a rule might identify multiple occurrences of the condition. {{kib}} tracks each of these alerts separately. Depending on the action frequency, an action occurs per alert or at the specified alert summary interval.
103+
104+
Using the server monitoring example, each server with average CPU > 0.9 is tracked as an alert. This means a separate email is sent for each server that exceeds the threshold whenever the alert status changes.
105+
106+
### Putting it all together [rules-putting-it-all-together]
107+
108+
A rule consists of conditions, actions, and a schedule. When conditions are met, alerts are created that render actions and invoke them. To make action setup and update easier, actions use connectors that centralize the information used to connect with {{kib}} services and third-party integrations. The following example ties these concepts together:
109+
110+
:::{image} ../../images/serverless-rule-concepts-summary.svg
111+
:alt: Rules
112+
:class: screenshot
113+
:::
114+
115+
1. Any time a rule’s conditions are met, an alert is created. This example checks for servers with average CPU > 0.9. Three servers meet the condition, so three alerts are created.
116+
2. Alerts create actions according to the action frequency, as long as they are not muted or throttled. When an action is created, its properties are filled with actual values. In this example, three actions are created when the threshold is met, and the template string `{{server}}` is replaced with the appropriate server name for each alert.
117+
3. {{kib}} runs the actions, sending notifications by using a third-party integration like an email service.
118+
4. If the third-party integration has connection parameters or credentials, {{kib}} fetches these from the appropriate connector.
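The template substitution in step 2 can be sketched as follows. This is a simplified stand-in for template rendering, not the rendering code that {{kib}} uses: the `render` helper and the alert dictionaries are hypothetical, and only simple `{{name}}` placeholders are handled.

```python
import re


def render(template, context):
    """Replace each {{name}} placeholder with the matching context value."""
    def substitute(match):
        # Leave unknown placeholders untouched instead of failing.
        return str(context.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Three servers meet the condition, so three alerts (and three actions) exist.
alerts = [{"server": "X123"}, {"server": "Y456"}, {"server": "Z789"}]
subjects = [render("CPU on {{server}} is high", alert) for alert in alerts]
```

Each alert produces its own rendered subject line, which mirrors how one action is created per alert in the example.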

explore-analyze/alerts-cases/alerts/alerting-common-issues.md

Lines changed: 0 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,6 @@ mapped_pages:
77

88
This page describes how to resolve common problems you might encounter with Alerting.
99

10-
1110
## Rules with small check intervals run late [rules-small-check-interval-run-late]
1211

1312
**Problem**
@@ -22,7 +21,6 @@ Either tweak the [{{kib}} Task Manager settings](https://www.elastic.co/guide/en
2221

2322
For more details, see [Tasks with small schedule intervals run late](../../../troubleshoot/kibana/task-manager.md#task-manager-health-scheduled-tasks-small-schedule-interval-run-late).
2423

25-
2624
## Rules with inconsistent cadence [scheduled-rules-run-late]
2725

2826
**Problem**
@@ -39,7 +37,6 @@ Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-
3937

4038
For more details on monitoring and diagnosing tasks in Task Manager, refer to [Health monitoring](../../../deploy-manage/monitor/kibana-task-manager-health-monitoring.md).
4139

42-
4340
## Connectors have TLS errors when running actions [connector-tls-settings]
4441

4542
**Problem**
@@ -50,7 +47,6 @@ A connector gets a TLS socket error when connecting to the server to run an acti
5047

5148
Configuration options are available to specialize connections to TLS servers, including ignoring server certificate validation and providing certificate authority data to verify servers using custom certificates. For more details, see [Action settings](https://www.elastic.co/guide/en/kibana/current/alert-action-settings-kb.html#action-settings).
5249

53-
5450
## Rules take a long time to run [rules-long-run-time]
5551

5652
**Problem**
@@ -62,7 +58,6 @@ By default, only users with a `superuser` role can query the [preview] {{kib}} e
6258

6359
::::
6460

65-
6661
**Solution**
6762

6863
By default, rules have a `5m` timeout. Rules that run longer than this timeout are automatically canceled to prevent them from consuming too much of {{kib}}'s resources. Alerts and actions that may have been scheduled before the rule timed out are discarded. When a rule times out, you will see this error in the {{kib}} logs:
@@ -157,7 +152,6 @@ GET /.kibana-event-log*/_search
157152
4. This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
158153
5. This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
159154

160-
161155
This query returns the following:
162156

163157
```json
@@ -232,10 +226,8 @@ This query returns the following:
232226
1. Most run durations fall within the first bucket (0 - 1 seconds).
233227
2. A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to run.
234228

235-
236229
Use the get rule API to retrieve additional information about rules that take a long time to run.
237230

238-
239231
## Rule cannot decrypt API key [rule-cannot-decrypt-api-key]
240232

241233
**Problem**:
@@ -252,7 +244,6 @@ This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value us
252244
| If another {{kib}} instance with a different encryption key connects to the cluster. | The other {{kib}} instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the {{kib}} instances are the same, and set [decryption only keys](https://www.elastic.co/guide/en/kibana/current/security-settings-kb.html#xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys) for previously used encryption keys. |
253245
| If other scenarios don’t apply. | Generate a new API key for the rule. For example, in **{{stack-manage-app}} > {{rules-ui}}**, select **Update API key** from the action menu. |
254246

255-
256247
## Rules stop running after upgrade [known-issue-upgrade-rule]
257248

258249
**Problem**:

explore-analyze/alerts-cases/alerts/alerting-getting-started.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,7 @@ This section will clarify some of the important differences in the function and
112112
Functionally, the {{alert-features}} differ in that:
113113

114114
* Scheduled checks are run on {{kib}} instead of {{es}}.
115-
* {{kib}} [rules hide the details of detecting conditions](../../../explore-analyze/alerts-cases.md#alerting-concepts-conditions) through rule types, whereas watches provide low-level control over inputs, conditions, and transformations.
115+
* {{kib}} [rules hide the details of detecting conditions](../../../explore-analyze/alerts-cases/alerts/alerting-getting-started.md#alerting-concepts-conditions) through rule types, whereas watches provide low-level control over inputs, conditions, and transformations.
116116
* {{kib}} rules track and persist the state of each detected condition through alerts. This makes it possible to mute and throttle individual alerts, and detect changes in state such as resolution.
117117
* Actions are linked to alerts. Actions are fired for each occurrence of a detected condition, rather than for the entire rule.
118118

explore-analyze/alerts-cases/alerts/alerting-setup.md

Lines changed: 1 addition & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -4,13 +4,9 @@ mapped_pages:
44
- https://www.elastic.co/guide/en/kibana/current/alerting-setup.html
55
---
66

7-
8-
97
# Set up [alerting-setup]
108

11-
12-
{{kib}} {alert-features} are automatically enabled, but might require some additional configuration.
13-
9+
{{kib}} {{alert-features}} are automatically enabled, but might require some additional configuration.
1410

1511
## Prerequisites [alerting-prerequisites]
1612

@@ -25,19 +21,16 @@ If you are using an **on-premises** {{stack}} deployment with [**security**](../
2521

2622
The alerting framework uses queries that require the `search.allow_expensive_queries` setting to be `true`. See the scripts [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-query.html#_allow_expensive_queries_4).
2723

28-
2924
## Production considerations and scaling guidance [alerting-setup-production]
3025

3126
When relying on alerting and actions as mission critical services, make sure you follow the [alerting production considerations](../../../deploy-manage/production-guidance/kibana-alerting-production-considerations.md).
3227

3328
For more information on the scalability of {{alert-features}}, go to [Scaling guidance](../../../deploy-manage/production-guidance/kibana-alerting-production-considerations.md#alerting-scaling-guidance).
3429

35-
3630
## Security [alerting-security]
3731

3832
To use {{alert-features}} in a {{kib}} app, you must have the appropriate feature privileges:
3933

40-
4134
### Give full access to manage alerts, connectors, and rules in **{{stack-manage-app}}** [_give_full_access_to_manage_alerts_connectors_and_rules_in_stack_manage_app]
4235

4336
**{{kib}} privileges**
@@ -57,8 +50,6 @@ The rule type also affects the privileges that are required. For example, to cre
5750

5851
::::
5952

60-
61-
6253
### Give view-only access to alerts, connectors, and rules in **{{stack-manage-app}}** [_give_view_only_access_to_alerts_connectors_and_rules_in_stack_manage_app]
6354

6455
**{{kib}} privileges**
@@ -72,15 +63,12 @@ The rule type also affects the privileges that are required. For example, to vie
7263

7364
::::
7465

75-
76-
7766
### Give view-only access to alerts in **Discover** or **Dashboards** [_give_view_only_access_to_alerts_in_discover_or_dashboards]
7867

7968
**{{kib}} privileges**
8069

8170
* `Read` index privileges for the `.alerts-*` system indices.
8271

83-
8472
### Revoke all access to alerts, connectors, and rules in **{{stack-manage-app}}**, **Discover**, or **Dashboards** [_revoke_all_access_to_alerts_connectors_and_rules_in_stack_manage_app_discover_or_dashboards]
8573

8674
**{{kib}} privileges**
@@ -90,12 +78,10 @@ The rule type also affects the privileges that are required. For example, to vie
9078
* `None` for the **Management > {{connectors-feature}}** feature.
9179
* No index privileges for the `.alerts-*` system indices.
9280

93-
9481
### More details [_more_details]
9582

9683
For more information on configuring roles that provide access to features, go to [Feature privileges](../../../deploy-manage/users-roles/cluster-or-deployment-auth/kibana-privileges.md#kibana-feature-privileges).
9784

98-
9985
### API keys [alerting-authorization]
10086

10187
Rules are authorized using an API key. Its credentials are used to run all background tasks associated with the rule, including condition checks like {{es}} queries and triggered actions.
@@ -113,18 +99,14 @@ If a rule requires certain privileges, such as index privileges, to run and a us
11399

114100
::::
115101

116-
117-
118102
### Restrict actions [alerting-restricting-actions]
119103

120104
For security reasons you may wish to limit the extent to which {{kib}} can connect to external services. You can use [Action settings](https://www.elastic.co/guide/en/kibana/current/alert-action-settings-kb.html#action-settings) to disable certain [*Connectors*](../../../deploy-manage/manage-connectors.md) and allowlist the hostnames that {{kib}} can connect with.
121105

122-
123106
## Space isolation [alerting-spaces]
124107

125108
Rules and connectors are isolated to the {{kib}} space in which they were created. A rule or connector created in one space will not be visible in another.
126109

127-
128110
## {{ccs-cap}} [alerting-ccs-setup]
129111

130112
If you want to use alerting rules with {{ccs}}, you must configure privileges for {{ccs-init}} and {{kib}}. Refer to [Remote clusters](../../../deploy-manage/remote-clusters.md).
