Commit bf1c2b6

Merge branch 'main' into esql_move_limit_in_docs
2 parents: 2df34d7 + 1e25a54

993 files changed: +29,294 −18,946 lines changed


.gitattributes

Lines changed: 5 additions & 2 deletions
@@ -13,6 +13,9 @@ x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/EsqlBasePar
 x-pack/plugin/esql/src/main/generated/** linguist-generated=true
 x-pack/plugin/esql/src/main/generated-src/** linguist-generated=true
 
-# ESQL functions docs are autogenerated. More information at `docs/reference/esql/functions/README.md`
-docs/reference/esql/functions/*/** linguist-generated=true
+# ESQL functions docs are autogenerated. More information at `docs/reference/query-languages/esql/README.md`
+docs/reference/query-languages/esql/_snippets/functions/*/** linguist-generated=true
+#docs/reference/query-languages/esql/_snippets/operators/*/** linguist-generated=true
+docs/reference/query-languages/esql/images/** linguist-generated=true
+docs/reference/query-languages/esql/kibana/** linguist-generated=true
 

docs/changelog/124581.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 124581
+summary: New `vector_rescore` parameter as a quantized index type option
+area: Vector Search
+type: enhancement
+issues: []

docs/changelog/124676.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+pr: 124676
+summary: TO_LOWER processes all values
+area: ES|QL
+type: bug
+issues:
+ - 124002

docs/changelog/124739.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 124739
+summary: Improve rolling up metrics
+area: Downsampling
+type: enhancement
+issues: []

docs/docset.yml

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@ project: 'Elasticsearch'
 exclude:
   - README.md
   - internal/*
-  - reference/esql/functions/kibana/docs/*
-  - reference/esql/functions/README.md
+  - reference/query-languages/esql/kibana/docs/**
+  - reference/query-languages/esql/README.md
 cross_links:
   - beats
   - cloud
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+* configurable precision, which decides on how to trade memory for accuracy,
+* excellent accuracy on low-cardinality sets,
+* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
+
+For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.
+
+The following chart shows how the error varies before and after the threshold:
+
+![cardinality error](/images/cardinality_error.png "")
+
+For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed,
+this is likely to be the case. Accuracy in practice depends on the dataset in question. In general,
+most datasets show consistently good accuracy. Also note that even with a threshold as low as 100,
+the error remains very low (1-6% as seen in the above graph) even when counting millions of items.
+
+The HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distributions of
+hashes in a dataset can affect the accuracy of the cardinality.
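
For readers of the snippet above, the precision trade-off it describes is set per request through the `precision_threshold` option of the `cardinality` aggregation. A minimal sketch, with an illustrative index and field name:

```console
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "product_id",
        "precision_threshold": 3000
      }
    }
  }
}
```

Following the `c * 8` estimate above, a threshold of 3000 costs roughly 24 KB of memory, regardless of how many distinct values the field actually holds.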
@@ -1,60 +1,3 @@
-## `PERCENTILE` [esql-percentile]
-
-**Syntax**
-
-:::{image} ../../../../../images/percentile.svg
-:alt: Embedded
-:class: text-center
-:::
-
-**Parameters**
-
-true
-**Description**
-
-Returns the value at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values and the 50th percentile is the `MEDIAN`.
-
-**Supported types**
-
-| number | percentile | result |
-| --- | --- | --- |
-| double | double | double |
-| double | integer | double |
-| double | long | double |
-| integer | double | double |
-| integer | integer | double |
-| integer | long | double |
-| long | double | double |
-| long | integer | double |
-| long | long | double |
-
-**Examples**
-
-```esql
-FROM employees
-| STATS p0 = PERCENTILE(salary, 0)
-, p50 = PERCENTILE(salary, 50)
-, p99 = PERCENTILE(salary, 99)
-```
-
-| p0:double | p50:double | p99:double |
-| --- | --- | --- |
-| 25324 | 47003 | 74970.29 |
-
-The expression can use inline functions. For example, to calculate a percentile of the maximum values of a multivalued column, first use `MV_MAX` to get the maximum value per row, and use the result with the `PERCENTILE` function
-
-```esql
-FROM employees
-| STATS p80_max_salary_change = PERCENTILE(MV_MAX(salary_change), 80)
-```
-
-| p80_max_salary_change:double |
-| --- |
-| 12.132 |
-
-
-### `PERCENTILE` is (usually) approximate [esql-percentile-approximate]
-
 There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
 
 Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
@@ -72,11 +15,3 @@ The following chart shows the relative error on a uniform distribution depending
 ![percentiles error](/images/percentiles_error.png "")
 
 It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
-
-::::{warning}
-`PERCENTILE` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.
-
-::::
-
-
-

docs/reference/data-analysis/aggregations/search-aggregations-metrics-cardinality-aggregation.md

Lines changed: 2 additions & 13 deletions
@@ -65,19 +65,8 @@ Computing exact counts requires loading values into a hash set and returning its
 
 This `cardinality` aggregation is based on the [HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf) algorithm, which counts based on the hashes of the values with some interesting properties:
 
-* configurable precision, which decides on how to trade memory for accuracy,
-* excellent accuracy on low-cardinality sets,
-* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
-
-For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.
-
-The following chart shows how the error varies before and after the threshold:
-
-![cardinality error](../../../images/cardinality_error.png "")
-
-For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.
-
-The HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distributions of hashes in a dataset can affect the accuracy of the cardinality.
+:::{include} _snippets/search-aggregations-metrics-cardinality-aggregation-explanation.md
+:::
 
 
 ## Pre-computed hashes [_pre_computed_hashes]

docs/reference/data-analysis/aggregations/search-aggregations-metrics-percentile-aggregation.md

Lines changed: 2 additions & 19 deletions
@@ -175,31 +175,14 @@ GET latency/_search
 
 ## Percentiles are (usually) approximate [search-aggregations-metrics-percentile-aggregation-approximation]
 
-There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
-
-Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
-
-The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).
-
-When using this metric, there are a few guidelines to keep in mind:
-
-* Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median
-* For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
-* As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated
-
-The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:
-
-![percentiles error](../../../images/percentiles_error.png "")
-
-It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
+:::{include} /reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-percentile-aggregation-approximate.md
+:::
 
 ::::{warning}
 Percentile aggregations are also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.
-
 ::::
 
 
-
 ## Compression [search-aggregations-metrics-percentile-aggregation-compression]
 
 Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a `compression` parameter:
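
The diff ends right where the `compression` parameter is introduced. As a hedged sketch of how that parameter is passed, a `percentiles` request might look like the following (the `load_time` field is illustrative; the `latency` index comes from the hunk context above):

```console
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "percents": [ 50, 95, 99 ],
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}
```

A larger `compression` keeps more centroids in the t-digest, trading memory for accuracy, which is exactly the balance the surviving paragraph describes.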

docs/reference/elasticsearch/mapping-reference/dense-vector.md

Lines changed: 8 additions & 0 deletions
@@ -287,6 +287,14 @@ $$$dense-vector-index-options$$$
 `confidence_interval`
 : (Optional, float) Only applicable to `int8_hnsw`, `int4_hnsw`, `int8_flat`, and `int4_flat` index types. The confidence interval to use when quantizing the vectors. Can be any value between and including `0.90` and `1.0` or exactly `0`. When the value is `0`, this indicates that dynamic quantiles should be calculated for optimized quantization. When between `0.90` and `1.0`, this value restricts the values used when calculating the quantization thresholds. For example, a value of `0.95` will only use the middle 95% of the values when calculating the quantization thresholds (e.g. the highest and lowest 2.5% of values will be ignored). Defaults to `1/(dims + 1)` for `int8` quantized vectors and `0` for `int4` for dynamic quantile calculation.
 
+`rescore_vector`
+: (Optional, object) Functionality in [preview]. An optional section that configures automatic vector rescoring on knn queries for the given field. Only applicable to quantized index types.
+:::::{dropdown} Properties of `rescore_vector`
+`oversample`
+: (required, float) The amount to oversample the search results by. This value should be greater than `1.0` and less than `10.0`. The higher the value, the more vectors will be gathered and rescored with the raw values per shard.
+: In case a knn query specifies a `rescore_vector` parameter, the query `rescore_vector` parameter will be used instead.
+: See [oversampling and rescoring quantized vectors](docs-content://solutions/search/vector/knn.md#dense-vector-knn-search-rescoring) for details.
+:::::
 ::::
 
 
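
To make the new `rescore_vector` option concrete, here is a hedged mapping sketch for a quantized field; the index name, field name, `dims`, and `similarity` are illustrative, and `oversample: 3.0` simply picks a value inside the documented range between `1.0` and `10.0`:

```console
PUT my-vector-index
{
  "mappings": {
    "properties": {
      "title_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "int8_hnsw",
          "rescore_vector": {
            "oversample": 3.0
          }
        }
      }
    }
  }
}
```

With this setting, a kNN search over `title_embedding` gathers roughly three times as many quantized candidates per shard as requested and rescores them against the raw vectors, unless the query itself supplies its own `rescore_vector` parameter.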
