
Commit b6eba61

Adds missing parts of 'Generating alerts for anomaly detection jobs' (#1250)
As raised in [this Slack thread](https://elastic.slack.com/archives/CK7L2T31V/p1745419252793339), some content was missing from the "Generating alerts for anomaly detection jobs" section after the migration. This PR adds the missing parts.
1 parent 6444a03 commit b6eba61

File tree

7 files changed: +360 -0 lines changed

explore-analyze/machine-learning/anomaly-detection/ml-configuring-alerts.md

Lines changed: 360 additions & 0 deletions
@@ -23,3 +23,363 @@ If you have created rules for specific {{anomaly-jobs}} and you want to monitor
::::

In **{{stack-manage-app}} > {{rules-ui}}**, you can create both types of {{ml}} rules. In the **{{ml-app}}** app, you can create only {{anomaly-detect}} alert rules; create them from the {{anomaly-job}} wizard after you start the job or from the {{anomaly-job}} list.

## {{anomaly-detect-cap}} alert rules [creating-anomaly-alert-rules]

When you create an {{anomaly-detect}} alert rule, you must select the job that
the rule applies to.

You must also select a type of {{ml}} result. In particular, you can create rules
based on bucket, record, or influencer results.

:::{image} /explore-analyze/images/ml-anomaly-alert-severity.png
:alt: Selecting result type, severity, and test interval
:screenshot:
:::

For each rule, you can configure the `anomaly_score` that triggers the action.
The `anomaly_score` indicates the significance of a given anomaly compared to
previous anomalies. The default severity threshold is 75, which means every
anomaly with an `anomaly_score` of 75 or higher triggers the associated action.

You can select whether you want to include interim results. Interim results are
created by the {{anomaly-job}} before a bucket is finalized. These results might
disappear after the bucket is fully processed. Include interim results if you
want to be notified earlier about a potential anomaly even if it might be a
false positive. If you want to get notified only about anomalies of fully
processed buckets, do not include interim results.

You can also configure advanced settings. _Lookback interval_ sets an interval
that is used to query previous anomalies during each condition check. Its value
is derived from the bucket span of the job and the query delay of the {{dfeed}} by
default. It is not recommended to set the lookback interval lower than the
default value as it might result in missed anomalies. _Number of latest buckets_
sets how many buckets to check to obtain the highest anomaly from all the
anomalies that are found during the _Lookback interval_. An alert is created
based on the anomaly with the highest anomaly score from the most anomalous
bucket.

You can also test the configured conditions against your existing data and check
the sample results by providing a valid interval for your data. The generated
preview contains the number of potentially created alerts during the relative
time range you defined.

::::{tip}
You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.
::::
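
If you manage rules programmatically rather than through the UI, the same configuration can be expressed as a request to the {{kib}} alerting API. The following Python sketch is illustrative only: the `POST /api/alerting/rule` endpoint, the `kbn-xsrf` header, and the `schedule` and `actions` fields follow the standard {{kib}} alerting API, but the rule type ID (`xpack.ml.anomaly_detection_alert`) and the field names inside `params` (`jobSelection`, `resultType`, `severity`, `includeInterim`, `lookbackInterval`, `topNBuckets`) are assumptions about how the UI options described above map to rule parameters; verify them against your {{kib}} version before relying on them.

```python
# Illustrative sketch: create an anomaly detection alert rule through the
# Kibana alerting API. The rule type ID and the field names inside "params"
# are assumptions; verify them against your Kibana version.
import requests

KIBANA_URL = "https://localhost:5601"   # hypothetical Kibana endpoint
AUTH = ("elastic", "changeme")          # hypothetical credentials

rule = {
    "name": "my-anomaly-alert-rule",
    "rule_type_id": "xpack.ml.anomaly_detection_alert",  # assumed ML rule type ID
    "consumer": "alerts",
    "schedule": {"interval": "15m"},    # check interval, close to the job's bucket span
    "params": {
        "jobSelection": {"jobIds": ["my_anomaly_job"]},  # assumed parameter name
        "resultType": "bucket",         # bucket, record, or influencer results
        "severity": 75,                 # anomaly_score threshold that triggers the action
        "includeInterim": False,        # do not alert on interim results
        "lookbackInterval": None,       # null = derive from bucket span and query delay
        "topNBuckets": 1,               # number of latest buckets to check
    },
    "actions": [],                      # connector actions are added as a later step
}

response = requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "true"},       # required header for Kibana API requests
)
response.raise_for_status()
print("Created rule:", response.json()["id"])
```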

As the last step in the rule creation process, define its [actions](#ml-configuring-alert-actions).

## {{anomaly-jobs-cap}} health rules [creating-anomaly-jobs-health-rules]

When you create an {{anomaly-jobs}} health rule, you must select the job or group
that the rule applies to. If you assign more jobs to the group, they are
included the next time the rule conditions are checked.

You can also use a special character (`*`) to apply the rule to all your jobs.
Jobs created after the rule are automatically included. You can exclude jobs
that are not critically important by using the _Exclude_ field.

Enable the health check types that you want to apply. All checks are enabled by
default. At least one check needs to be enabled to create the rule. The
following health checks are available:

Datafeed is not started
: Notifies if the corresponding {{dfeed}} of the job is not started but the job is
in an opened state. The notification message recommends the necessary
actions to solve the error.

Model memory limit reached
: Notifies if the model memory status of the job reaches the soft or hard model
memory limit. Optimize your job by following
[these guidelines](/explore-analyze/machine-learning/anomaly-detection/anomaly-detection-scale.md) or consider
[amending the model memory limit](/explore-analyze/machine-learning/anomaly-detection/anomaly-detection-scale.md#set-model-memory-limit).

Data delay has occurred
: Notifies when the job missed some data. You can define the threshold for the
number of missing documents you get alerted on by setting
_Number of documents_. You can control the lookback interval for checking
delayed data with _Time interval_. Refer to the
[Handling delayed data](/explore-analyze/machine-learning/anomaly-detection/ml-delayed-data-detection.md) page to see what to do about delayed data.

Errors in job messages
: Notifies when the job messages contain error messages. Review the
notification; it contains the error messages, the corresponding job IDs, and
recommendations on how to fix the issue. This check looks for job errors
that occur after the rule is created; it does not look at historic behavior.

:::{image} /explore-analyze/images/ml-health-check-config.png
:alt: Selecting health checkers
:screenshot:
:::

::::{tip}
You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.
::::
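
As with the {{anomaly-detect}} alert rule above, this configuration could also be created through the {{kib}} alerting API. In the sketch below, only the overall request shape follows the alerting API; the rule type ID (`xpack.ml.anomaly_detection_jobs_health`) and the parameter names (`includeJobs`, `excludeJobs`, `testsConfig`) are assumptions and need to be checked against your {{kib}} version. The body can be posted to `POST /api/alerting/rule` exactly as in the earlier example.

```python
# Illustrative sketch: a possible rule body for an anomaly detection jobs health
# rule. The rule type ID and all "params" field names are assumptions; post it
# to the same /api/alerting/rule endpoint shown in the previous example.
health_rule = {
    "name": "ml-jobs-health",
    "rule_type_id": "xpack.ml.anomaly_detection_jobs_health",  # assumed rule type ID
    "consumer": "alerts",
    "schedule": {"interval": "1h"},                  # check interval
    "params": {
        "includeJobs": {"jobIds": ["*"]},            # assumed: * applies the rule to all jobs
        "excludeJobs": {"jobIds": ["sandbox-job"]},  # assumed: exclude non-critical jobs
        "testsConfig": {                             # assumed: toggle individual health checks
            "datafeed": {"enabled": True},
            "mml": {"enabled": True},
            "delayedData": {"enabled": True, "docsCount": 1, "timeInterval": None},
            "errorMessages": {"enabled": True},
        },
    },
    "actions": [],
}
```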

As the last step in the rule creation process, define its actions.

## Actions [ml-configuring-alert-actions]

You can optionally send notifications when the rule conditions are met and when
they are no longer met. In particular, these rules support:

* alert summaries
* actions that run when the anomaly score matches the conditions (for {{anomaly-detect}} alert rules)
* actions that run when an issue is detected (for {{anomaly-jobs}} health rules)
* recovery actions that run when the conditions are no longer met

Each action uses a connector, which stores connection information for a {{kib}}
service or supported third-party integration, depending on where you want to
send the notifications. For example, you can use a Slack connector to send a
message to a channel. Or you can use an index connector that writes a JSON
object to a specific index. For details about creating connectors, refer to
[Connectors](/deploy-manage/manage-connectors.md#creating-new-connector).

After you select a connector, you must set the action frequency. You can choose
to create a summary of alerts on each check interval or on a custom interval.
For example, send Slack notifications that summarize the new, ongoing, and
recovered alerts:

:::{image} /explore-analyze/images/ml-anomaly-alert-action-summary.png
:alt: Adding an alert summary action to the rule
:screenshot:
:::

::::{tip}
If you choose a custom action interval, it cannot be shorter than the
rule's check interval.
::::
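
When a rule is created through the alerting API, the same summary behavior is expressed on each entry of the rule's `actions` array. The sketch below is illustrative: the `frequency` fields (`summary`, `notify_when`, `throttle`) follow the {{kib}} alerting API, while the connector ID and the action group ID are hypothetical placeholders.

```python
# Illustrative sketch: an action entry that sends a daily summary of new,
# ongoing, and recovered alerts through a Slack connector. The connector ID and
# the action group ID are hypothetical; verify the field names for your version.
summary_action = {
    "id": "my-slack-connector-id",            # hypothetical connector ID
    "group": "anomaly_score_match",           # assumed action group for this rule type
    "params": {
        "message": "{{context.message}}",     # preconstructed alert message
    },
    "frequency": {
        "summary": True,                      # summarize alerts instead of per-alert actions
        "notify_when": "onThrottleInterval",  # run on a custom interval
        "throttle": "1d",                     # must not be shorter than the check interval
    },
}
```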

Alternatively, you can set the action frequency such that actions run for each
alert. Choose how often the action runs (at each check interval, only when the
alert status changes, or at a custom action interval). For {{anomaly-detect}}
alert rules, you must also choose whether the action runs when the anomaly score
matches the condition or when the alert recovers:

:::{image} /explore-analyze/images/ml-anomaly-alert-action-score-matched.png
:alt: Adding an action for each alert in the rule
:screenshot:
:::

In {{anomaly-jobs}} health rules, choose whether the action runs when the issue is
detected or when it is recovered:

:::{image} /explore-analyze/images/ml-health-check-action.png
:alt: Adding an action for each alert in the rule
:screenshot:
:::

You can further refine the rule by specifying that actions run only when they
match a KQL query or when an alert occurs within a specific time frame.

There is a set of variables that you can use to customize the notification
messages for each action. Click the icon above the message text box to get the
list of variables or refer to [action variables](#action-variables). For example:

:::{image} /explore-analyze/images/ml-anomaly-alert-messages.png
:alt: Customizing your message
:screenshot:
:::
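
For reference, this is roughly how a per-alert action with a customized message might look when the rule is defined through the alerting API. The message uses the action variables documented below in standard Mustache syntax; the connector ID and action group ID are hypothetical, and the `frequency` field names should be verified for your {{kib}} version.

```python
# Illustrative sketch: a per-alert action whose message is built from the
# action variables documented in this page. Connector ID and action group ID
# are hypothetical placeholders.
per_alert_action = {
    "id": "my-slack-connector-id",            # hypothetical connector ID
    "group": "anomaly_score_match",           # assumed: run when the anomaly score matches
    "params": {
        "message": (
            "[{{rule.name}}] anomaly score {{context.score}} in job(s) "
            "{{context.jobIds}} at {{context.timestampIso8601}}.\n"
            "Top records:\n"
            "{{#context.topRecords}}"
            "- {{function}}({{field_name}}): score {{score}}\n"
            "{{/context.topRecords}}"
            "Investigate: {{context.anomalyExplorerUrl}}"
        ),
    },
    "frequency": {
        "summary": False,
        "notify_when": "onActionGroupChange",  # run only when the alert status changes
    },
}
```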

After you save the configurations, the rule appears in the
*{{stack-manage-app}} > {{rules-ui}}* list; you can check its status and see the
overview of its configuration information.

When an alert occurs for an {{anomaly-detect}} alert rule, it always has the same
name as the job ID of the associated {{anomaly-job}} that triggered it. You can
review how the alerts that occurred correlate with the {{anomaly-detect}}
results in the **Anomaly explorer** by using the **Anomaly timeline** swimlane
and the **Alerts** panel.

If necessary, you can snooze rules to prevent them from generating actions. For
more details, refer to
[Snooze and disable rules](/explore-analyze/alerts-cases/alerts/create-manage-rules.md#controlling-rules).

## Action variables [action-variables]

The following variables are specific to the {{ml}} rule types. An asterisk (`*`)
marks the variables that you can use in actions related to recovered alerts.

You can also specify [variables common to all rules](/explore-analyze/alerts-cases/alerts/rule-action-variables.md).

### {{anomaly-detect-cap}} alert action variables [anomaly-alert-action-variables]

Every {{anomaly-detect}} alert has the following action variables:

**`context.anomalyExplorerUrl`^*^**
: URL to open in the Anomaly Explorer.

**`context.isInterim`**
: Indicates if top hits contain interim results.

**`context.jobIds`^*^**
: List of job IDs that triggered the alert.

**`context.message`^*^**
: A preconstructed message for the alert.

**`context.score`**
: Anomaly score at the time of the notification action.

**`context.timestamp`**
: The bucket timestamp of the anomaly.

**`context.timestampIso8601`**
: The bucket timestamp of the anomaly in ISO8601 format.

**`context.topInfluencers`**
: The list of top influencers. Limited to a maximum of 3 documents.

:::{dropdown} Properties of `context.topInfluencers`
**`influencer_field_name`**
: The field name of the influencer.

**`influencer_field_value`**
: The entity that influenced, contributed to, or was to blame for the anomaly.

**`score`**
: The influencer score. A normalized score between 0 and 100, which shows the influencer’s overall contribution to the anomalies.
:::

**`context.topRecords`**
: The list of top records. Limited to a maximum of 3 documents.

:::{dropdown} Properties of `context.topRecords`
**`actual`**
: The actual value for the bucket.

**`by_field_value`**
: The value of the by field.

**`field_name`**
: Certain functions require a field to operate on, for example, `sum()`. For those functions, this value is the name of the field to be analyzed.

**`function`**
: The function in which the anomaly occurs, as specified in the detector configuration. For example, `max`.

**`over_field_name`**
: The field used to split the data.

**`partition_field_value`**
: The value of the partition field that was used to segment the analysis.

**`score`**
: A normalized score between 0 and 100, which is based on the probability of the anomalousness of this record.

**`typical`**
: The typical value for the bucket, according to analytical modeling.
:::

### {{anomaly-jobs-cap}} health action variables [anomaly-jobs-health-action-variables]

Every health check has two main variables: `context.message` and
`context.results`. The properties of `context.results` may vary based on the
type of check. You can find the possible properties for all the checks below.

#### Datafeed is not started

**`context.message`^*^**
: A preconstructed message for the alert.

**`context.results`**
: Contains the following properties:

:::{dropdown} Properties of `context.results`
**`datafeed_id`^*^**
: The datafeed identifier.

**`datafeed_state`^*^**
: The state of the datafeed. It can be `starting`, `started`, `stopping`, or `stopped`.

**`job_id`^*^**
: The job identifier.

**`job_state`^*^**
: The state of the job. It can be `opening`, `opened`, `closing`, `closed`, or `failed`.
:::

#### Model memory limit reached

**`context.message`^*^**
: A preconstructed message for the rule.

**`context.results`**
: Contains the following properties:

:::{dropdown} Properties of `context.results`
**`job_id`^*^**
: The job identifier.

**`memory_status`^*^**
: The status of the mathematical model. It can have one of the following values:
- `soft_limit`: The model used more than 60% of the configured memory limit and older unused models will be pruned to free up space. In categorization jobs, no further category examples will be stored.
- `hard_limit`: The model used more space than the configured memory limit. As a result, not all incoming data was processed.
The `memory_status` is `ok` for recovered alerts.

**`model_bytes`^*^**
: The number of bytes of memory used by the models.

**`model_bytes_exceeded`^*^**
: The number of bytes over the high limit for memory usage at the last allocation failure.

**`model_bytes_memory_limit`^*^**
: The upper limit for model memory usage.

**`log_time`^*^**
: The timestamp of the model size statistics according to server time. Time formatting is based on the Kibana settings.

**`peak_model_bytes`^*^**
: The peak number of bytes of memory ever used by the model.
:::

#### Data delay has occurred

**`context.message`^*^**
: A preconstructed message for the rule.

**`context.results`**
: For recovered alerts, `context.results` is either empty (when there is no delayed data) or the same as for an active alert (when the number of missing documents is less than the *Number of documents* threshold set by the user).
Contains the following properties:

:::{dropdown} Properties of `context.results`
**`annotation`^*^**
: The annotation corresponding to the data delay in the job.

**`end_timestamp`^*^**
: Timestamp of the latest finalized bucket with missing documents. Time formatting is based on the Kibana settings.

**`job_id`^*^**
: The job identifier.

**`missed_docs_count`^*^**
: The number of missed documents.
:::

#### Errors in job messages

**`context.message`^*^**
: A preconstructed message for the rule.

**`context.results`**
: Contains the following properties:

:::{dropdown} Properties of `context.results`
**`timestamp`**
: The timestamp of the error message.

**`job_id`**
: The job identifier.

**`message`**
: The error message.

**`node_name`**
: The name of the node that runs the job.
:::