docs/reference/data-streams/logs.asciidoc (+205 lines)
@@ -50,3 +50,208 @@ DELETE _index_template/my-index-template
----
// TEST[continued]
////

[[logsdb-default-settings]]

[discrete]
[[logsdb-synthetic-source]]
=== Synthetic source

By default, `logsdb` mode uses <<synthetic-source,synthetic `_source`>>, which omits storing the original `_source`
field and instead synthesizes it from doc values or stored fields at document retrieval time. Synthetic source comes
with a few restrictions, which are described in the <<synthetic-source,documentation>> section dedicated to it.

NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
are preserved for <<synthetic-source,Synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.
For more details on this setting and ways to refine or bypass it, refer to <<synthetic-source-keep,this section>>.
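
To illustrate (the index, field names, and values below are hypothetical), with the `logsdb` default of `arrays`,
duplicate values and their ordering survive source reconstruction:

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb"
  }
}

PUT my-logsdb-index/_doc/1
{
  "@timestamp": "2024-10-23T12:00:00Z",
  "host.name": "host-1",
  "dns.answers": ["10.1.2.3", "10.1.2.1", "10.1.2.3"]
}

GET my-logsdb-index/_doc/1
----

The synthesized `_source` of the returned document keeps `dns.answers` with its duplicate entry and original order,
whereas regular synthetic source would return the values sorted and deduplicated.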

[discrete]
[[logsdb-data-streams]]
=== LogsDB for logs data streams

In Elasticsearch, `logsdb` mode is applied by default for data streams whose name matches the pattern `logs-*-*`.
This pattern identifies a logs data stream, and Elasticsearch automatically configures it to use LogsDB.
We recommend enabling `logsdb` index mode for <<data-streams, data streams>> through standard or custom (component)
templates.

Users can opt out of `logsdb` index mode by overriding the `index.mode` setting, either directly in the index settings
or through composable index or component templates. This allows for flexibility in choosing the appropriate indexing
mode for different data streams if LogsDB is not desired.

For data streams not matching the pattern `logs-*-*` and for standalone indices, users can still use the `index.mode`
setting to enable LogsDB.
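
For example, a minimal index template (the template name and pattern below are illustrative) that enables LogsDB for a
custom data stream:

[source,console]
----
PUT _index_template/my-logsdb-template
{
  "index_patterns": ["my-logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "logsdb"
    }
  }
}
----

Data streams created from this template, such as `my-logs-app1`, use LogsDB even though their names do not match the
`logs-*-*` pattern.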

[discrete]
[[logsdb-sort-settings]]
=== Index sort settings

The following settings are applied by default when using the `logsdb` mode for index sorting:

* `index.sort.field`: `["host.name", "@timestamp"]`
In `logsdb` mode, indices are sorted by `host.name` and `@timestamp` fields by default. For data streams, the
`@timestamp` field is automatically injected if it is not present.

* `index.sort.order`: `["desc", "desc"]`
The default sort order for both fields is descending (`desc`), prioritizing the latest data.

* `index.sort.mode`: `["min", "min"]`
The default sort mode is `min`, so documents with multi-value sort fields are ordered by their lowest value.

* `index.sort.missing`: `["_first", "_first"]`
Missing values are sorted to appear first (`_first`) in `logsdb` index mode.

`logsdb` index mode allows users to override the default sort settings. For instance, users can specify their own
fields and sort order by modifying `index.sort.field` and `index.sort.order`, as in the example below.
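
A sketch of a standalone index that sorts on a hypothetical `service.name` field instead of `host.name` (sort fields
must be mapped when the index is created):

[source,console]
----
PUT my-sorted-logs-index
{
  "settings": {
    "index.mode": "logsdb",
    "index.sort.field": ["service.name", "@timestamp"],
    "index.sort.order": ["asc", "desc"]
  },
  "mappings": {
    "properties": {
      "service.name": { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}
----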

When using default sort settings, the `host.name` field is automatically injected into the mappings of the
index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and
retrieved based on the `host.name` and `@timestamp` fields.

NOTE: If `subobjects` is set to `true` (which is the default), the `host.name` field will be mapped as an object field
named `host`, containing a `name` child field of type `keyword`. On the other hand, if `subobjects` is set to `false`,
a single `host.name` field will be mapped as a `keyword` field.

Once an index is created, the sort settings are immutable and cannot be modified. To apply different sort settings,
a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index
rollover after updating relevant (component) templates.

If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort
settings can influence indexing throughput, query latency, and may affect compression efficiency due to the way data
is organized after sorting. For more details, refer to our documentation on
<<index-modules-index-sorting,index sorting>>.

NOTE: For <<data-streams, data streams>>, the `@timestamp` field is automatically injected if not already present.
However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings, but it is not
automatically added to the list of sort fields.

[discrete]
[[logsdb-specialized-codecs]]
=== Specialized codecs

`logsdb` index mode uses the `best_compression` codec by default, which applies {wikipedia}/Zstd[ZSTD] compression to
stored fields.

Users can override the default compression codec. If desired, they can switch to the `default` codec for faster
compression at the expense of a slightly larger storage footprint.

* `index.codec`: `"best_compression"`
This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
efficiency.

* `index.codec`: `"best_speed"`
If faster indexing performance is required, users can opt for the `default` codec, which applies LZ4 compression and
sacrifices some storage efficiency for higher indexing throughput. For more details, refer to the documentation for
the <<index-modules,`index.codec`>> index setting.
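
For example, a sketch of overriding the codec on a standalone LogsDB index (index name hypothetical):

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb",
    "index.codec": "default"
  }
}
----

Since `index.codec` is a static setting, it takes effect at index creation; for data streams, update the relevant
(component) templates and perform a rollover.
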
`logsdb` index mode adopts specialized codecs for numeric doc values that are crafted to optimize storage usage.
Users can rely on these specialized codecs being applied by default when using `logsdb` index mode.

Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the
following order: delta encoding, offset encoding, GCD encoding, and finally Frame Of Reference (FOR) encoding.
The decision to apply each encoding is based on heuristics determined by the data distribution. For example, before
applying delta encoding, the algorithm checks if the data is monotonically non-decreasing or non-increasing. If the data
fits this pattern, delta encoding is applied; otherwise, the next encoding is considered.

The encoding is specific to each Lucene segment and is also re-applied at segment merging time. The merged Lucene segment
may use a different encoding compared to the original Lucene segments, based on the characteristics of the merged data.

The following methods are applied sequentially:

* **Delta encoding**:
a compression method that stores the difference between consecutive values instead of the actual values.

* **Offset encoding**:
a compression method that stores the difference from a base value rather than between consecutive values.

* **Greatest Common Divisor (GCD) encoding**:
a compression method that finds the greatest common divisor of a set of values and stores the differences
as multiples of the GCD.

* **Frame Of Reference (FOR) encoding**:
a compression method that determines the smallest number of bits required to encode a block of values and uses
bit-packing to fit such values into larger 64-bit blocks.
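
As a worked illustration (the exact choice depends on the heuristics described above), a block of monotonically
increasing timestamps such as `1700000000000, 1700000001000, 1700000002000` can be delta-encoded into a base value
plus deltas of `1000`; GCD encoding can then reduce the deltas to multiples of `1000`, and FOR encoding bit-packs the
resulting small integers into a few bits each instead of 64.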

For keyword fields, Run Length Encoding (RLE) is applied to the ordinals, which represent positions in the Lucene
segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.

[discrete]
[[logsdb-ignored-settings]]
=== `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit` and `_ignored_source`

By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields
to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
fields contain invalid or improperly formatted data.

Users can override this behavior by setting `index.mapping.ignore_malformed` to `false`. However, this is not
recommended, as it might result in documents with malformed fields being rejected and not indexed at all.
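
A minimal sketch of the default behavior (the index and field names below are hypothetical):

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb"
  },
  "mappings": {
    "properties": {
      "http.response.status_code": { "type": "long" }
    }
  }
}

PUT my-logsdb-index/_doc/1
{
  "@timestamp": "2024-10-23T12:00:00Z",
  "http.response.status_code": "unavailable"
}
----

The second request succeeds: the malformed value is skipped, and `http.response.status_code` is listed in the
document's `_ignored` metadata field instead of causing an indexing failure.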

In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is 8191
**characters**; since a UTF-8 character may occupy up to four bytes, this corresponds to a limit of up to 32764 bytes.
The mapping-level `ignore_above` setting still takes precedence: if a specific field has an `ignore_above` value
defined in its mapping, that value overrides the index-level `index.mapping.ignore_above` value. This default behavior
helps optimize indexing performance by preventing excessively large string values from being indexed, while still
allowing users to customize the limit, either by overriding it at the mapping level or by changing the index-level
default setting.
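
For example, a sketch that lowers the index-level default (the value is illustrative):

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb",
    "index.mapping.ignore_above": 1024
  }
}
----

Keyword values longer than 1024 characters are then ignored rather than indexed, except for fields that define their
own `ignore_above` in the mapping.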

In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by
default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document
rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The
`index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic
and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document
indexing, ensuring continued log ingestion without errors.

NOTE: When automatically injected, `host.name` and `@timestamp` count toward the limit of mapped fields. When
`host.name` is mapped with `subobjects: true`, it counts as two fields (the `host` object and its `name` child). When
`host.name` is mapped with `subobjects: false`, it counts as a single field.

`logsdb` index mode uses a special field named `_ignored_source` that allows retrieving values for fields that have been
ignored for various reasons (e.g., due to malformed data or indexing rules). This field ensures that even ignored
field values can be accessed if needed. The `_ignored_source` field is not returned by default and must be explicitly
requested via the <<search-fields,fields or stored fields>> API using `_ignored_source` as the field name.
Additionally, the field is encoded, and the encoding format may change over time, so users should not rely on the
encoding or the field name remaining the same.
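
For example, to retrieve it for debugging purposes (index name hypothetical):

[source,console]
----
GET my-logsdb-index/_search
{
  "fields": ["_ignored_source"]
}
----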

[discrete]
[[logsdb-nodocvalue-fields]]
=== Fields without doc values

When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping,
Elasticsearch may set the `store` setting to `true` for that field as a last resort option to ensure that the field's
data is still available for reconstructing the document’s source when retrieving it via
<<synthetic-source,Synthetic source>>.

For example, this happens with text fields when `store` is `false` and there is no suitable multi-field available to
reconstruct the original value in <<synthetic-source,Synthetic source>>.

This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain
fields.
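
For instance, given a hypothetical mapping like the following, Elasticsearch may implicitly store the field's values
to keep synthetic source working:

[source,console]
----
PUT my-logsdb-index
{
  "settings": {
    "index.mode": "logsdb"
  },
  "mappings": {
    "properties": {
      "message": { "type": "text" }
    }
  }
}
----

Here `message` is a `text` field with no `keyword` multi-field and `store` left at `false`, so its original value
cannot be rebuilt from doc values; storing it is the fallback that keeps the synthesized `_source` complete.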

[discrete]
[[logsdb-settings-summary]]
=== LogsDB settings summary

The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch:

* **`index.mode`**: `"logsdb"`

* **`index.mapping.synthetic_source_keep`**: `"arrays"`

* **`index.sort.field`**: `["host.name", "@timestamp"]`

* **`index.sort.order`**: `["desc", "desc"]`

* **`index.sort.mode`**: `["min", "min"]`

* **`index.sort.missing`**: `["_first", "_first"]`

* **`index.codec`**: `"best_compression"`

* **`index.mapping.ignore_malformed`**: `true`

* **`index.mapping.ignore_above`**: `8191`

* **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`