Returns the value at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values and the 50th percentile is the `MEDIAN`.
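To illustrate the `MEDIAN` equivalence, here is a minimal sketch (assuming the `employees` dataset used in the examples below); both expressions compute the same value:

```esql
FROM employees
| STATS median = MEDIAN(salary)
     , p50 = PERCENTILE(salary, 50)
```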
**Supported types**
| number | percentile | result |
| --- | --- | --- |
| double | double | double |
| double | integer | double |
| double | long | double |
| integer | double | double |
| integer | integer | double |
| integer | long | double |
| long | double | double |
| long | integer | double |
| long | long | double |
**Examples**
```esql
FROM employees
| STATS p0 = PERCENTILE(salary, 0)
     , p50 = PERCENTILE(salary, 50)
     , p99 = PERCENTILE(salary, 99)
```

| p0:double | p50:double | p99:double |
| --- | --- | --- |
| 25324 | 47003 | 74970.29 |
The expression can use inline functions. For example, to calculate a percentile of the maximum values of a multivalued column, first use `MV_MAX` to get the maximum value per row, then use the result with the `PERCENTILE` function, as sketched below.
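A minimal sketch of this pattern (assuming a hypothetical multivalued `salary_change` column on the same `employees` index):

```esql
FROM employees
| STATS p80_max_salary_change = PERCENTILE(MV_MAX(salary_change), 80)
```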
### `PERCENTILE` is (usually) approximate [esql-percentile-approximate]
There are many different algorithms for calculating percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you look up the value at `my_array[count(my_array) * 0.5]`.
Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:
It shows how precision is better for extreme percentiles. The reason why error diminishes for large numbers of values is that the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job of summarizing it. This would not be the case on more skewed distributions.
::::{warning}
`PERCENTILE` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.
::::
From `docs/reference/data-analysis/aggregations/search-aggregations-metrics-cardinality-aggregation.md`:

Computing exact counts requires loading values into a hash set and returning its size. This doesn't scale when working on high-cardinality sets and/or large values, as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
This `cardinality` aggregation is based on the [HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf) algorithm, which counts based on the hashes of the values with some interesting properties:
* configurable precision, which decides on how to trade memory for accuracy,
* excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes. For example, the default threshold of 3000 needs roughly 24 kB per counter.
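The same HyperLogLog++ machinery also backs `COUNT_DISTINCT` in ES|QL, where an optional second argument sets the precision threshold. A minimal sketch (assuming the `employees` dataset from the examples above):

```esql
FROM employees
// raise the precision threshold from the default 3000 to 10000,
// trading roughly 80 kB of memory for better accuracy
| STATS distinct_salaries = COUNT_DISTINCT(salary, 10000)
```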
The following chart shows how the error varies before and after the threshold:
For all three thresholds, counts have been accurate up to the configured threshold. Although this is not guaranteed, it is likely to be the case: accuracy in practice depends on the dataset in question, but in general most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6%, as seen in the above graph) even when counting millions of items.
Because the HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distribution of hashes in a dataset can affect the accuracy of the cardinality estimate.