Commit 71345d4

ESQL autogenerate docs v3 (elastic#124312)
Building on the work started in elastic#123904, we now auto-generate most of the small subfiles from the ES|QL functions unit tests. This work also investigates any remaining discrepancies between the original asciidoc version and the new markdown, and minimizes those differences so the new docs do not look too different from the old. The kibana json and markdown files are moved to a new location, and the operator docs are a little more generated than before (although still largely manual).
1 parent 40187a6 commit 71345d4

File tree

679 files changed (+26,063 / -1,497 lines)

docs/docset.yml

Lines changed: 2 additions & 2 deletions

```
@@ -2,8 +2,8 @@ project: 'Elasticsearch'
 exclude:
   - README.md
   - internal/*
-  - reference/esql/functions/kibana/docs/*
-  - reference/esql/functions/README.md
+  - reference/query-languages/esql/kibana/docs/**
+  - reference/query-languages/esql/README.md
 cross_links:
   - beats
   - cloud
```

docs/reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-cardinality-aggregation-explanation.md

Lines changed: 17 additions & 0 deletions (new file)

* configurable precision, which decides on how to trade memory for accuracy,
* excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.

The following chart shows how the error varies before and after the threshold:

![cardinality error](/images/cardinality_error.png "")

For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.

Because the HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distribution of hashes in a dataset can affect the accuracy of the cardinality estimate.
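
To connect the `c * 8` estimate to something concrete, here is a minimal sketch of a `cardinality` aggregation that sets the threshold explicitly (the index and field names are hypothetical; `precision_threshold` is the option this snippet describes):

```
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 3000
      }
    }
  }
}
```

At a threshold of 3000, the `c * 8` rule of thumb works out to roughly 24 kB per bucket, and counts up to about 3000 unique values should stay close to exact.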

docs/reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-percentile-aggregation-approximate.md

Lines changed: 17 additions & 0 deletions (new file)

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale: the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).

When using this metric, there are a few guidelines to keep in mind:

* Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.
* For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
* As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and the volume of data being aggregated.

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

![percentiles error](/images/percentiles_error.png "")

It shows how precision is better for extreme percentiles. The reason why error diminishes for large numbers of values is that the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job at summarizing it. This would not be the case on more skewed distributions.
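
As a sketch of the guidelines above, a `percentiles` aggregation can request the median alongside an extreme percentile on the same field (the index and field names are hypothetical; `percents` is the standard parameter for choosing which percentiles to return):

```
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_percentiles": {
      "percentiles": {
        "field": "load_time",
        "percents": [50, 95, 99.9]
      }
    }
  }
}
```

Given that accuracy is proportional to `q(1-q)`, the `99.9` estimate should carry less relative error than the median once the bucket holds many values.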

docs/reference/data-analysis/aggregations/search-aggregations-metrics-cardinality-aggregation.md

Lines changed: 2 additions & 13 deletions

```
@@ -65,19 +65,8 @@ Computing exact counts requires loading values into a hash set and returning its
 
 This `cardinality` aggregation is based on the [HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf) algorithm, which counts based on the hashes of the values with some interesting properties:
 
-* configurable precision, which decides on how to trade memory for accuracy,
-* excellent accuracy on low-cardinality sets,
-* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
-
-For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.
-
-The following chart shows how the error varies before and after the threshold:
-
-![cardinality error](../../../images/cardinality_error.png "")
-
-For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.
-
-The HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distributions of hashes in a dataset can affect the accuracy of the cardinality.
+:::{include} _snippets/search-aggregations-metrics-cardinality-aggregation-explanation.md
+:::
 
 
 ## Pre-computed hashes [_pre_computed_hashes]
```

docs/reference/data-analysis/aggregations/search-aggregations-metrics-percentile-aggregation.md

Lines changed: 2 additions & 19 deletions

```
@@ -175,31 +175,14 @@ GET latency/_search
 
 ## Percentiles are (usually) approximate [search-aggregations-metrics-percentile-aggregation-approximation]
 
-There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
-
-Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
-
-The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).
-
-When using this metric, there are a few guidelines to keep in mind:
-
-* Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median
-* For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
-* As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated
-
-The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:
-
-![percentiles error](../../../images/percentiles_error.png "")
-
-It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
+:::{include} /reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-percentile-aggregation-approximate.md
+:::
 
 ::::{warning}
 Percentile aggregations are also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.
-
 ::::
 
 
-
 ## Compression [search-aggregations-metrics-percentile-aggregation-compression]
 
 Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a `compression` parameter:
```
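
A minimal sketch of how that parameter is passed, assuming the `tdigest` object form of the `percentiles` aggregation (the index and field names are hypothetical):

```
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_percentiles": {
      "percentiles": {
        "field": "load_time",
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}
```

Larger `compression` values keep more centroids, improving accuracy at the cost of memory; smaller values do the reverse.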

docs/reference/query-languages/esql/README.md

Lines changed: 50 additions & 0 deletions (new file)

The ES|QL documentation is composed of static content and generated content.
The static content exists in this directory and can be edited by hand.
However, the sub-directories `_snippets`, `images` and `kibana` contain mostly
generated content.

### _snippets

In `_snippets` there are files that can be included within other files
using the [File Inclusion](https://elastic.github.io/docs-builder/syntax/file_inclusion/)
feature of the Elastic Docs V3 system.
Most, but not all, files in this directory are generated.
In particular, the directories `_snippets/functions/*` and `_snippets/operators/*`
contain subdirectories that are mostly generated:

* `description` - description of each function, scraped from `@FunctionInfo#description`
* `examples` - examples of each function, scraped from `@FunctionInfo#examples`
* `parameters` - description of each function's parameters, scraped from `@Param`
* `signature` - railroad diagram of the syntax to invoke each function
* `types` - a table of each supported combination of parameter types, generated from tests
* `layout` - a fully generated description for each function

Most functions can use the docs generated in the `layout` directory.
If we need something more custom for a function, we can make a file in this
directory that includes any parts of the files above, as in the sketch below.
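
A hand-written layout for `CASE` might look something like this, using the same File Inclusion syntax that appears elsewhere in this commit (the exact snippet paths are illustrative):

```
## CASE [esql-case]

:::{include} _snippets/functions/description/case.md
:::

:::{include} _snippets/functions/examples/case.md
:::
```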

To regenerate the files for a function, run its tests using Gradle.
For example, to generate docs for the `CASE` function:
```
./gradlew :x-pack:plugin:esql:test -Dtests.class='CaseTests'
```

To regenerate the files for all functions, run all of ESQL's tests using Gradle:
```
./gradlew :x-pack:plugin:esql:test
```

### images

The `images` directory contains `functions` and `operators` sub-directories with
the `*.svg` files used to describe the syntax of each function or operator.
These are all generated by the same tests that generate the functions and operators docs above.

### kibana

The `kibana` directory contains `definition` and `docs` sub-directories that are generated:

* `kibana/definition` - function definitions for Kibana's ES|QL editor
* `kibana/docs` - the inline docs for Kibana

These are also generated as part of the unit tests described above.

Lines changed: 24 additions & 0 deletions (new file)

% This is generated by ESQL's AbstractFunctionTestCase. Do not edit it. See ../README.md for how to regenerate it.

### Counts are approximate [esql-agg-count-distinct-approximate]

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values, as the required memory usage and the need to communicate those
per-shard sets between nodes would use too many resources of the cluster.

This `COUNT_DISTINCT` function is based on the
[HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf)
algorithm, which counts based on the hashes of the values with some interesting
properties:

:::{include} /reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-cardinality-aggregation-explanation.md
:::

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The `precision_threshold` option allows trading memory
for accuracy, and defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become a bit more fuzzy. The
maximum supported value is `40000`; thresholds above this number have the
same effect as a threshold of `40000`. The default value is `3000`.
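
As a usage sketch of that optional second parameter (the index and field names here are made up; the threshold semantics are as described above):

```
FROM logs
| STATS unique_clients = COUNT_DISTINCT(client.ip, 40000)
```

Raising the threshold toward its `40000` cap keeps counts near-exact for more distinct values, at the cost of roughly `c * 8` bytes of memory.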

Lines changed: 6 additions & 0 deletions (new file)

% This is generated by ESQL's AbstractFunctionTestCase. Do not edit it. See ../README.md for how to regenerate it.

::::{warning}
`MEDIAN` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm).
This means you can get slightly different results using the same data.
::::
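
For context, a minimal `MEDIAN` call might look like this (index and field names are hypothetical):

```
FROM logs
| STATS median_duration = MEDIAN(event.duration)
```

Per the warning, two runs of this query over identical data may return slightly different medians.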

Lines changed: 6 additions & 0 deletions (new file)

% This is generated by ESQL's AbstractFunctionTestCase. Do not edit it. See ../README.md for how to regenerate it.

::::{warning}
`MEDIAN_ABSOLUTE_DEVIATION` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm).
This means you can get slightly different results using the same data.
::::
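
Likewise, a minimal `MEDIAN_ABSOLUTE_DEVIATION` call (hypothetical names again):

```
FROM logs
| STATS mad_duration = MEDIAN_ABSOLUTE_DEVIATION(event.duration)
```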

Lines changed: 11 additions & 0 deletions (new file)

% This is generated by ESQL's AbstractFunctionTestCase. Do not edit it. See ../README.md for how to regenerate it.

### `PERCENTILE` is (usually) approximate [esql-percentile-approximate]

:::{include} /reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-percentile-aggregation-approximate.md
:::

::::{warning}
`PERCENTILE` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm).
This means you can get slightly different results using the same data.
::::
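
A minimal usage sketch, with hypothetical index and field names:

```
FROM logs
| STATS p50 = PERCENTILE(event.duration, 50), p99 = PERCENTILE(event.duration, 99)
```

Because of the TDigest approximation described in the included snippet, repeated runs over the same data may not return byte-identical results.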
