There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
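The naive approach described above can be sketched in a few lines of Python (illustrative only; `naive_percentile` is a hypothetical helper, not part of Elasticsearch):

```python
# Naive percentile: keep every value in a sorted array and index into it.
# Memory grows linearly with the number of values, which is why Elasticsearch
# switches to approximate algorithms such as TDigest at scale.
def naive_percentile(values, p):
    ordered = sorted(values)
    # index at count * p/100, clamped to the last element
    idx = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[idx]

print(naive_percentile(range(1, 101), 50))  # 51
```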

The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).

docs/reference/aggregations/pipeline.md
An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile.

## Dealing with gaps in the data [gap-policy]

Data in the real world is often noisy and sometimes contains **gaps** — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:

* Documents falling into a bucket do not contain a required field
* There are no documents matching the query for one or more buckets
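The two most common gap policies can be sketched as follows (`apply_gap_policy` is a hypothetical helper; the policy names `skip` and `insert_zeros` are the actual Elasticsearch `gap_policy` values):

```python
# Sketch of how a pipeline aggregation could treat gaps in its input buckets.
def apply_gap_policy(bucket_values, policy="skip"):
    if policy == "skip":
        # ignore buckets with missing values entirely
        return [v for v in bucket_values if v is not None]
    if policy == "insert_zeros":
        # replace gaps with zero so downstream metrics still see a value
        return [0 if v is None else v for v in bucket_values]
    raise ValueError(f"unknown gap policy: {policy}")

print(apply_gap_policy([3, None, 5], "insert_zeros"))  # [3, 0, 5]
```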
For instance, an index sorted on `username` and `timestamp`:

```console
PUT my-index-000001
{
  "settings": {
    "index": {
      "sort.field": [ "username", "timestamp" ],
      "sort.order": [ "asc", "desc" ]
    }
  }
}
```

1. This index is sorted by `username` first then by `timestamp`.
2. … in ascending order for the `username` field and in descending order for the `timestamp` field.

This index could be used to optimize these composite aggregations:




The response will contain all the buckets having the relative day of the week as key: 1 for Monday, 2 for Tuesday… 7 for Sunday.
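This is the same ISO weekday numbering used by Python's `datetime` module, which makes a quick sanity check easy:

```python
import datetime

# ISO weekday numbering: 1 = Monday … 7 = Sunday
d = datetime.date(2024, 1, 1)  # 2024-01-01 was a Monday
print(d.isoweekday())  # 1
```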



# Rare terms aggregation [search-aggregations-bucket-rare-terms-aggregation]


A multi-bucket value source based aggregation which finds "rare" terms — terms that are at the long-tail of the distribution and are not frequent. Conceptually, this is like a `terms` aggregation that is sorted by `_count` ascending. As noted in the [terms aggregation docs](/reference/aggregations/search-aggregations-bucket-terms-aggregation.md#search-aggregations-bucket-terms-aggregation-order), actually ordering a `terms` agg by count ascending has unbounded error. Instead, you should use the `rare_terms` aggregation.
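Conceptually, the selection can be sketched as counting terms and keeping only those at or below a maximum document count (a minimal illustration, not the actual implementation, which uses probabilistic data structures):

```python
from collections import Counter

# rare_terms ≈ terms ordered by _count ascending, capped at a max doc count.
docs = ["the", "the", "the", "swift", "fox", "the", "fox"]
counts = Counter(docs)
max_doc_count = 1  # the rare_terms default
rare = sorted(t for t, c in counts.items() if c <= max_doc_count)
print(rare)  # ['swift']
```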

## Syntax [_syntax_3]

This does, however, mean that a large number of results can be returned if chosen incorrectly.

## Max Bucket Limit [search-aggregations-bucket-rare-terms-aggregation-max-buckets]

The Rare Terms aggregation is more liable to trip the `search.max_buckets` soft limit than other aggregations due to how it works. The `max_bucket` soft-limit is evaluated on a per-shard basis while the aggregation is collecting results. It is possible for a term to be "rare" on a shard but become "not rare" once all the shard results are merged together. This means that individual shards tend to collect more buckets than are truly rare, because they only have their own local view. This list is ultimately pruned to the correct, smaller list of rare terms on the coordinating node… but a shard may have already tripped the `max_buckets` soft limit and aborted the request.
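The shard-local over-collection described above can be sketched with plain counters (illustrative; shard contents and the merge are simplified):

```python
from collections import Counter

# A term can look "rare" on each shard yet exceed the threshold once
# shard results are merged on the coordinating node.
shard1 = Counter({"alpha": 1, "beta": 1})  # both look rare locally
shard2 = Counter({"alpha": 1})
merged = shard1 + shard2                   # alpha: 2, beta: 1
max_doc_count = 1
print(sorted(t for t, c in merged.items() if c <= max_doc_count))  # ['beta']
```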

When aggregating on fields that have potentially many "rare" terms, you may need to increase the `max_buckets` soft limit. Alternatively, you might need to find a way to filter the results to return fewer rare values (smaller time span, filter by category, etc), or re-evaluate your definition of "rare" (e.g. if something appears 100,000 times, is it truly "rare"?)

Re-analyzing *large* result sets will require a lot of time and memory. It is recommended to limit the analysis to a small selection of top-matching documents.
* Suggesting "H5N1" when users search for "bird flu" to help expand queries
* Suggesting keywords relating to stock symbol $ATI for use in an automated news classifier

In these cases the words being selected are not simply the most popular terms in results. The most popular words tend to be very boring (*and, of, the, we, I, they* …). The significant words are the ones that have undergone a significant change in popularity measured between a *foreground* and *background* set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
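The "big swing in frequency" is easy to quantify with the numbers from the example (a simple frequency ratio for illustration; the real significance heuristics, such as JLH, differ in detail):

```python
# Foreground vs background frequency for the term "H5N1".
fg_hits, fg_size = 4, 100             # 4 of 100 documents in the results
bg_hits, bg_size = 5, 10_000_000      # 5 of 10M documents in the index

fg_freq = fg_hits / fg_size           # 0.04
bg_freq = bg_hits / bg_size           # 0.0000005
print(fg_freq / bg_freq)              # ≈ 80000: a huge popularity shift
```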

## Basic use [_basic_use_2]

When aggregating on multiple indices the type of the aggregated field may not be the same in all indices.

### Failed Trying to Format Bytes [_failed_trying_to_format_bytes]

When running a terms aggregation (or other aggregation, but in practice usually terms) over multiple indices, you may get an error that starts with "Failed trying to format bytes…". This is usually caused by two of the indices not having the same mapping type for the field being aggregated.

**Use an explicit `value_type`** Although it’s best to correct the mappings, you can work around this issue if the field is unmapped in one of the indices. Setting the `value_type` parameter can resolve the issue by coercing the unmapped field into the correct type.

```console
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}
```
1. Compression controls memory usage and approximation error


The TDigest algorithm uses a number of "nodes" to approximate percentiles — the more nodes available, the higher the accuracy (and larger memory footprint) proportional to the volume of data. The `compression` parameter limits the maximum number of nodes to `20 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is `100`.
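The node cap is simple arithmetic (`max_tdigest_nodes` is a hypothetical helper named for illustration):

```python
# Maximum number of TDigest nodes as a function of the compression setting.
def max_tdigest_nodes(compression):
    return 20 * compression

print(max_tdigest_nodes(100))  # 2000 nodes at the default compression
print(max_tdigest_nodes(200))  # 4000 nodes: more accuracy, more memory
```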

By default, the `percentile` metric will generate a range of percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`.

As you can see, the aggregation will return a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the webpage normally loads in 10-720ms, but occasionally spikes to 940-980ms.

Often, administrators are only interested in outliers — the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be a value between 0-100 inclusive):

```console
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "percents": [ 95, 99, 99.9 ]
      }
    }
  }
}
```

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).

```console
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}
```
1. Compression controls memory usage and approximation error


The TDigest algorithm uses a number of "nodes" to approximate percentiles — the more nodes available, the higher the accuracy (and larger memory footprint) proportional to the volume of data. The `compression` parameter limits the maximum number of nodes to `20 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is `100`.


A `single-value` metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents. These values can be extracted from specific numeric fields in the documents.

When calculating a regular average, each datapoint has an equal "weight" … it contributes equally to the final value. Weighted averages, on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the document.

As a formula, a weighted average is the `∑(value * weight) / ∑(weight)`
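The formula translates directly into code (`weighted_average` is an illustrative helper, not the aggregation itself):

```python
# Weighted average: sum(value * weight) / sum(weight)
def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# datapoints 1, 2, 3 with weights 3, 5, 2 -> (3 + 10 + 6) / 10
print(weighted_average([1, 2, 3], [3, 5, 2]))  # 1.9
```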

It supports the `mode` and `user_dictionary` settings from the `kuromoji_tokenizer` tokenizer.

The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC dictionary to split text into tokens. The dictionary includes some full-width characters, such as `ｏ` and `ｆ`. If a text contains full-width characters, the tokenizer can produce unexpected tokens.

For example, the `kuromoji_tokenizer` tokenizer converts the text `Culture ｏｆ Japan` to the tokens `[ culture, ｏ, ｆ, japan ]` instead of `[ culture, of, japan ]`.

To avoid this, add the [`icu_normalizer` character filter](/reference/elasticsearch-plugins/analysis-icu-normalization-charfilter.md) to a custom analyzer based on the `kuromoji` analyzer. The `icu_normalizer` character filter converts full-width characters to their normal equivalents.
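The full-width-to-ASCII folding that `icu_normalizer` performs is the same kind of mapping Unicode NFKC normalization applies, which Python can demonstrate directly:

```python
import unicodedata

# NFKC normalization folds full-width characters to their normal equivalents.
text = "Culture ｏｆ Japan"
print(unicodedata.normalize("NFKC", text))  # Culture of Japan
```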

docs/reference/elasticsearch-plugins/integrations.md
Integrations are not plugins, but are external tools or modules that make it easier to work with Elasticsearch.
* [Ingest processor template](https://github.com/spinscale/cookiecutter-elasticsearch-ingest-processor): A template for creating new ingest processors.
* [Kafka Standalone Consumer (Indexer)](https://github.com/BigDataDevs/kafka-elasticsearch-consumer): Kafka Standalone Consumer [Indexer] will read messages from Kafka in batches, processes(as implemented) and bulk-indexes them into Elasticsearch. Flexible and scalable. More documentation in above GitHub repo’s Wiki.
* [Scrutineer](https://github.com/Aconex/scrutineer): A high performance consistency checker to compare what you’ve indexed with your source of truth content (e.g. DB)
* [FS Crawler](https://github.com/dadoonet/fscrawler): The File System (FS) crawler allows you to index documents (PDF, Open Office…) from your local file system and over SSH. (by David Pilato)
* [Elasticsearch Evolution](https://github.com/senacor/elasticsearch-evolution): A library to migrate elasticsearch mappings.
* [PGSync](https://pgsync.com): A tool for syncing data from Postgres to Elasticsearch.

The `transport.compress` setting always configures local cluster request compression and is the fallback setting for remote cluster request compression.

### Response compression [response-compression]

The compression settings do not configure compression for responses. {{es}} will compress a response if the inbound request was compressed—even when compression is not enabled. Similarly, {{es}} will not compress a response if the inbound request was uncompressed—even when compression is enabled. The compression scheme used to compress a response will be the same scheme the remote node used to compress the request.
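The rule is simply that a response mirrors its request, which can be stated as a one-line predicate (illustrative; `compress_response` is a hypothetical name):

```python
# A response is compressed if and only if the inbound request was compressed;
# the node's own compression setting plays no role in the response.
def compress_response(request_was_compressed, compression_enabled):
    return request_was_compressed

print(compress_response(True, False))   # True: mirrors the request
print(compress_response(False, True))   # False: setting is ignored
```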




# Use index sorting to speed up conjunctions [index-modules-index-sorting-conjunctions]

Index sorting can be useful in order to organize Lucene doc ids (not to be conflated with `_id`) in a way that makes conjunctions (a AND b AND …) more efficient. In order to be efficient, conjunctions rely on the fact that if any clause does not match, then the entire conjunction does not match. By using index sorting, we can put documents that do not match together, which will help skip efficiently over large ranges of doc IDs that do not match the conjunction.

This trick only works with low-cardinality fields. A rule of thumb is that you should sort first on fields that both have a low cardinality and are frequently used for filtering. The sort order (`asc` or `desc`) does not matter as we only care about putting values that would match the same clauses close to each other.
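Why sorting helps can be sketched with binary search: once documents with the same value are contiguous, every non-matching doc ID falls into ranges that can be skipped in one jump (illustrative only; Lucene's skipping machinery is more involved):

```python
import bisect

# Doc IDs ordered by a low-cardinality sort field: matching docs for "b"
# form one contiguous range instead of being scattered.
sorted_field_values = ["a", "a", "a", "b", "b", "c", "c", "c"]
lo = bisect.bisect_left(sorted_field_values, "b")
hi = bisect.bisect_right(sorted_field_values, "b")
print((lo, hi))  # (3, 5): everything outside this doc-ID range is skipped
```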

docs/reference/elasticsearch/index-settings/sorting.md
```console
PUT my-index-000001
{
  "settings": {
    "index": {
      "sort.field": "date",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "properties": {
      "date": {
        "type": "date"
      }
    }
  }
}
```

1. This index is sorted by the `date` field
2. … in descending order.


It is also possible to sort the index by more than one field:
```console
PUT my-index-000001
{
  "settings": {
    "index": {
      "sort.field": [ "username", "date" ],
      "sort.order": [ "asc", "desc" ]
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "keyword"
      },
      "date": {
        "type": "date"
      }
    }
  }
}
```

1. This index is sorted by `username` first then by `date`
2. … in ascending order for the `username` field and in descending order for the `date` field.


Index sorting supports the following settings:
docs/reference/elasticsearch/mapping-reference/array.md
When adding a field dynamically, the first value in the array determines the field type. All subsequent values must be of the same data type, or it must at least be possible to coerce them to it.

Arrays with a mixture of data types are *not* supported: [ `10`, `"some string"` ]

An array may contain `null` values, which are either replaced by the configured [`null_value`](/reference/elasticsearch/mapping-reference/null-value.md) or skipped entirely. An empty array `[]` is treated as a missing field — a field with no values.

Nothing needs to be pre-configured in order to use arrays in documents; they are supported out of the box.
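The null and empty-array behavior above can be sketched in Python (`normalize_array` is a hypothetical helper, not an Elasticsearch API):

```python
# null values are replaced by a configured null_value or skipped entirely,
# and an empty array behaves like a missing field.
def normalize_array(values, null_value=None):
    out = [null_value if v is None else v for v in values]
    out = [v for v in out if v is not None]
    return out or None  # empty array -> missing field

print(normalize_array([1, None, 3]))               # [1, 3] (nulls skipped)
print(normalize_array([1, None, 3], null_value=0))  # [1, 0, 3]
print(normalize_array([]))                          # None (missing field)
```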

```console
PUT my-index-000001/_mapping
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```

When `eager_global_ordinals` is enabled, global ordinals are built when a shard is [refreshed](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-refresh) — Elasticsearch always loads them before exposing changes to the content of the index. This shifts the cost of building global ordinals from search to index-time. Elasticsearch will also eagerly build global ordinals when creating a new copy of a shard, as can occur when increasing the number of replicas or relocating a shard onto a new node.

Eager loading can be disabled at any time by updating the `eager_global_ordinals` setting:

Expand Down