[[ml-count-functions]]
= Count functions

Count functions detect anomalies when the number of events in a bucket is
anomalous.

Use `non_zero_count` functions if your data is sparse and you want to ignore
cases where the bucket count is zero.

Use `distinct_count` functions to determine when the number of distinct values
in one field is unusual, as opposed to the total count.

Use high-sided functions if you want to monitor unusually high event rates.
Use low-sided functions if you want to look at drops in event rate.

The {ml-features} include the following count functions:

* xref:ml-count[`count`, `high_count`, `low_count`]
* xref:ml-nonzero-count[`non_zero_count`, `high_non_zero_count`, `low_non_zero_count`]
* xref:ml-distinct-count[`distinct_count`, `high_distinct_count`, `low_distinct_count`]

[discrete]
[[ml-count]]
== Count, high_count, low_count

The `count` function detects anomalies when the number of events in a bucket is
anomalous.

The `high_count` function detects anomalies when the count of events in a bucket
is unusually high.

The `low_count` function detects anomalies when the count of events in a bucket
is unusually low.

These functions support the following properties:

* `by_field_name` (optional)
* `over_field_name` (optional)
* `partition_field_name` (optional)

For more information about those properties, see the
{ref}/ml-put-job.html#ml-put-job-request-body[create {anomaly-jobs} API].

.Example 1: Analyzing events with the count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example1
{
  "analysis_config": {
    "detectors": [{
      "function" : "count"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

This example is probably the simplest possible analysis. It identifies
time buckets during which the overall count of events is higher or lower than
usual.

When you use this function in a detector in your {anomaly-job}, it models the
event rate and detects when the event rate is unusual compared to its past
behavior.
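Conceptually, the detector buckets events by time and models the resulting
per-bucket totals. The bucketing step can be sketched in plain Python (an
illustration only, not part of the {ml} APIs):

```python
from collections import Counter

def bucket_counts(timestamps_ms, bucket_span_ms):
    """Count events per time bucket: the series a `count` detector models."""
    counts = Counter(ts // bucket_span_ms for ts in timestamps_ms)
    # Fill empty buckets with zero so drops in event rate stay visible.
    lo, hi = min(counts), max(counts)
    return [counts.get(b, 0) for b in range(lo, hi + 1)]
```

Note that empty buckets are kept as zeros here; that is what distinguishes
`count` from the `non_zero_count` family described below.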

.Example 2: Analyzing errors with the high_count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example2
{
  "analysis_config": {
    "detectors": [{
      "function" : "high_count",
      "by_field_name" : "error_code",
      "over_field_name": "user"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

If you use this `high_count` function in a detector in your {anomaly-job}, it
models the event rate for each error code. It detects users that generate an
unusually high count of error codes compared to other users.


.Example 3: Analyzing status codes with the low_count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example3
{
  "analysis_config": {
    "detectors": [{
      "function" : "low_count",
      "by_field_name" : "status_code"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

In this example, the function detects when the count of events for a status code
is lower than usual.

When you use this function in a detector in your {anomaly-job}, it models the
event rate for each status code and detects when a status code has an unusually
low count compared to its past behavior.

.Example 4: Analyzing aggregated data with the count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example4
{
  "analysis_config": {
    "summary_count_field_name" : "events_per_min",
    "detectors": [{
      "function" : "count"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

If you are analyzing an aggregated `events_per_min` field, do not use a sum
function (for example, `sum(events_per_min)`). Instead, use the count function
and the `summary_count_field_name` property. For more information, see
<<ml-configuring-aggregation>>.

[discrete]
[[ml-nonzero-count]]
== Non_zero_count, high_non_zero_count, low_non_zero_count

The `non_zero_count` function detects anomalies when the number of events in a
bucket is anomalous, but it ignores cases where the bucket count is zero. Use
this function if you know your data is sparse or has gaps and the gaps are not
important.

The `high_non_zero_count` function detects anomalies when the number of events
in a bucket is unusually high and it ignores cases where the bucket count is
zero.

The `low_non_zero_count` function detects anomalies when the number of events in
a bucket is unusually low and it ignores cases where the bucket count is zero.

These functions support the following properties:

* `by_field_name` (optional)
* `partition_field_name` (optional)

For more information about those properties, see the
{ref}/ml-put-job.html#ml-put-job-request-body[create {anomaly-jobs} API].

For example, if you have the following number of events per bucket:

====

1,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,43,31,0,0,0,0,0,0,0,0,0,0,0,0,2,1

====

The `non_zero_count` function models only the following data:

====

1,22,2,43,31,2,1

====
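The filtering this family of functions performs can be sketched in plain
Python (an illustration only, not part of the {ml} APIs):

```python
def non_zero_buckets(bucket_counts):
    """Keep only the buckets a `non_zero_count` detector would model."""
    return [c for c in bucket_counts if c != 0]
```

Applied to the series above, this keeps exactly the seven non-zero values.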

.Example 5: Analyzing signatures with the high_non_zero_count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example5
{
  "analysis_config": {
    "detectors": [{
      "function" : "high_non_zero_count",
      "by_field_name" : "signaturename"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

If you use this `high_non_zero_count` function in a detector in your
{anomaly-job}, it models the count of events for the `signaturename` field. It
ignores any buckets where the count is zero and detects when a `signaturename`
value has an unusually high count of events compared to its past behavior.

NOTE: Population analysis (using an `over_field_name` property value) is not
supported for the `non_zero_count`, `high_non_zero_count`, and
`low_non_zero_count` functions. If you want to do population analysis and your
data is sparse, use the `count` functions instead.


[discrete]
[[ml-distinct-count]]
== Distinct_count, high_distinct_count, low_distinct_count

The `distinct_count` function detects anomalies where the number of distinct
values in one field is unusual.

The `high_distinct_count` function detects unusually high numbers of distinct
values in one field.

The `low_distinct_count` function detects unusually low numbers of distinct
values in one field.

These functions support the following properties:

* `field_name` (required)
* `by_field_name` (optional)
* `over_field_name` (optional)
* `partition_field_name` (optional)

For more information about those properties, see the
{ref}/ml-put-job.html#ml-put-job-request-body[create {anomaly-jobs} API].

.Example 6: Analyzing users with the distinct_count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example6
{
  "analysis_config": {
    "detectors": [{
      "function" : "distinct_count",
      "field_name" : "user"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

This `distinct_count` function detects when a system has an unusual number of
logged-in users. When you use this function in a detector in your
{anomaly-job}, it models the distinct count of users and detects when the
number of distinct users is unusual compared to the past.

.Example 7: Analyzing ports with the high_distinct_count function
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example7
{
  "analysis_config": {
    "detectors": [{
      "function" : "high_distinct_count",
      "field_name" : "dst_port",
      "over_field_name": "src_ip"
    }]
  },
  "data_description": {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]

This example detects instances of port scanning. When you use this function in a
detector in your {anomaly-job}, it models the distinct count of ports. It
detects `src_ip` values that connect to an unusually high number of different
`dst_port` values compared to other `src_ip` values.
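The per-source quantity this detector compares can be sketched in plain Python
(an illustration only, not part of the {ml} APIs):

```python
from collections import defaultdict

def distinct_ports_per_source(connections):
    """Distinct `dst_port` count per `src_ip`: the quantity a
    `high_distinct_count` detector compares across sources."""
    ports = defaultdict(set)
    for src_ip, dst_port in connections:
        ports[src_ip].add(dst_port)
    return {ip: len(seen) for ip, seen in ports.items()}
```

A source that probes many different ports produces a far larger set than its
peers, which is what makes port scanning stand out in a population analysis.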
[[ml-functions]]
= Function reference

The {ml-features} include analysis functions that provide a wide variety of
flexible ways to analyze data for anomalies.

When you create {anomaly-jobs}, you specify one or more detectors, which define
the type of analysis that needs to be done. If you are creating your job by
using {ml} APIs, you specify the functions in detector configuration objects.
If you are creating your job in {kib}, you specify the functions differently
depending on whether you are creating single metric, multi-metric, or advanced
jobs.
//For a demonstration of creating jobs in {kib}, see <<ml-getting-started>>.

Most functions detect anomalies in both low and high values. In statistical
terminology, they apply a two-sided test. Some functions offer one-sided low
and high variations (for example, `low_count` and `high_count` for `count`).
These variations detect anomalies only when the values are unusually low or
unusually high, depending on which variation you use.

You can specify a `summary_count_field_name` with any function except `metric`.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-aggregated. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized. In {kib}, use the
**summary_count_field_name** in advanced {anomaly-jobs}. Analyzing aggregated
input data provides a significant boost in performance. For more information, see
<<ml-configuring-aggregation>>.
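As a rough illustration of what pre-aggregated input looks like, the following
Python sketch collapses raw event timestamps into one summary document per
minute (the field names `timestamp` and `events_per_min` are examples, not
requirements; `events_per_min` is what you would name in
`summary_count_field_name`):

```python
from collections import Counter

def summarize_per_minute(timestamps_ms):
    """Collapse raw events into one summary document per minute; the
    per-minute count goes into the `summary_count_field_name` field."""
    counts = Counter(ts // 60_000 for ts in timestamps_ms)
    return [
        {"timestamp": minute * 60_000, "events_per_min": n}
        for minute, n in sorted(counts.items())
    ]
```

In practice this aggregation is usually done with an {es} `date_histogram`
aggregation rather than client-side code.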

If your data is sparse, it may contain gaps, which means some buckets are
empty. You might want to treat these empty buckets as anomalies or you might
want to ignore them; the right choice depends on your use case and on which
functions you use. The `sum` and `count` functions are strongly affected by
empty buckets. For this reason, there are `non_null_sum` and `non_zero_count`
functions, which are tolerant of sparse data and effectively ignore empty
buckets.

* <<ml-count-functions,Count functions>>
* <<ml-geo-functions,Geographic functions>>
* <<ml-info-functions,Information content functions>>
* <<ml-metric-functions,Metric functions>>
* <<ml-rare-functions,Rare functions>>
* <<ml-sum-functions,Sum functions>>
* <<ml-time-functions,Time functions>>