Merged
Changes from 4 commits
90 changes: 52 additions & 38 deletions docs/reference/data-streams/logs.asciidoc
@@ -1,9 +1,9 @@
[[logs-data-stream]]
== Logs data stream

IMPORTANT: The {es} `logsdb` index mode is generally available in Elastic Cloud Hosted
and self-managed Elasticsearch as of version 8.17, and is enabled by default for
logs in https://www.elastic.co/elasticsearch/serverless[{serverless-full}].

A logs data stream is a data stream type that stores log data more efficiently.

@@ -54,57 +54,48 @@ DELETE _index_template/my-index-template
=== Synthetic source

If you have the required https://www.elastic.co/subscriptions[subscription], `logsdb` index mode uses <<synthetic-source,synthetic `_source`>>, which omits storing the original `_source`
field. Instead, the document source is synthesized from doc values or stored fields upon document retrieval.

If you don't have the required https://www.elastic.co/subscriptions[subscription], `logsdb` mode uses the original `_source` field.

Before using synthetic source, make sure to review the <<synthetic-source-restrictions,restrictions>>.

When working with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
are preserved for <<synthetic-source,synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
which retains both duplicate values and the order of entries. However, the exact structure of
array elements and objects is not necessarily retained. Preserving duplicates and ordering can be critical for some
log fields, such as DNS A records, HTTP headers, and log entries that represent sequential or repeated events.

[discrete]
[[logsdb-sort-settings]]
=== Index sort settings

In `logsdb` index mode, the following sort settings are applied by default:
In `logsdb` index mode, indices are sorted by the `host.name` and `@timestamp` fields by default. The `@timestamp`
field is automatically injected if it is not present. The `host.name` field is automatically injected as a `keyword`
field if it is not present and injection is possible. Injection may not be possible if, for instance, `host` is
already mapped as a keyword field. If `host.name` can't be injected or can't be used for sorting (for example, because
it is an IP field), indices are sorted by `@timestamp` only.

`index.sort.field`: `["host.name", "@timestamp"]`::
Indices are sorted by `host.name` and `@timestamp` by default. The `@timestamp` field is automatically injected if it is not present.
NOTE: If `host.name` is injected and `subobjects` is set to `true` (default), the `host` field is mapped as an object
field named `host` with a `name` child field of type `keyword`. If `subobjects` is set to `false`, a single
`host.name` field is mapped as a `keyword` field.
Contributor

If you use the restructuring I suggested in the preceding lines, delete this note. I folded it into my suggestion above.


`index.sort.order`: `["desc", "desc"]`::
Both `host.name` and `@timestamp` are sorted in descending (`desc`) order, prioritizing the latest data.
`host.name` and `@timestamp` are sorted in ascending and descending order respectively, prioritizing the latest data.
Contributor

If you use the restructuring I suggested above, delete this line. I folded it into my suggestion above.


`index.sort.mode`: `["min", "min"]`::
The `min` mode sorts indices by the minimum value of multi-value fields.
You can override the default sort configuration by setting `index.sort.field` and `index.sort.order`. For details,
see <<index-modules-index-sorting>>. To modify the sort configuration of an existing data stream, update the data
stream's component templates, and then perform or wait for a <<data-streams-rollover,rollover>>.

`index.sort.missing`: `["_first", "_first"]`::
Missing values are sorted to appear `_first`.

You can override these default sort settings. For example, to sort on different fields
and change the order, manually configure `index.sort.field` and `index.sort.order`. For more details, see
<<index-modules-index-sorting>>.

When using the default sort settings, the `host.name` field is automatically injected into the index mappings as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and retrieved based on the `host.name` and `@timestamp` fields.

NOTE: If `subobjects` is set to `true` (default), the `host` field is mapped as an object field
named `host` with a `name` child field of type `keyword`. If `subobjects` is set to `false`,
a single `host.name` field is mapped as a `keyword` field.

To apply different sort settings to an existing data stream, update the data stream's component templates, and then
perform or wait for a <<data-streams-rollover,rollover>>.

NOTE: In `logsdb` mode, the `@timestamp` field is automatically injected if it's not already present. If you apply custom sort settings, the `@timestamp` field is injected into the mappings but is not
automatically added to the list of sort fields.
NOTE: If you apply custom sort settings, the `@timestamp` field is injected into the mappings but is not
automatically added to the list of sort fields. It is highly recommended to include it manually, as the last sort
field with `desc` ordering.
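
As an illustration of the note above, here is a minimal Python sketch that builds an index template body with custom sort settings and appends `@timestamp` as the last sort field with `desc` ordering. The helper name, the `my-logs-*` index pattern, and the `service.name` sort field are hypothetical and chosen only for the example. The resulting JSON is the kind of body you would send to the index template API.

```python
import json


def logsdb_template_with_custom_sort(sort_fields, sort_orders):
    """Build an index template body that overrides the default logsdb sort.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    if len(sort_fields) != len(sort_orders):
        raise ValueError("each sort field needs exactly one sort order")

    fields = list(sort_fields)
    orders = list(sort_orders)

    # With custom sort settings, @timestamp is not added to the sort fields
    # automatically, so append it as the last field with desc ordering.
    if "@timestamp" not in fields:
        fields.append("@timestamp")
        orders.append("desc")

    return {
        "index_patterns": ["my-logs-*"],  # hypothetical pattern
        "data_stream": {},
        "template": {
            "settings": {
                "index.mode": "logsdb",
                "index.sort.field": fields,
                "index.sort.order": orders,
            }
        },
    }


body = logsdb_template_with_custom_sort(["service.name"], ["asc"])
print(json.dumps(body, indent=2))
```

For an existing data stream, settings like these take effect only on the next rollover.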

[discrete]
[[logsdb-host-name]]
==== Existing data streams

If you're enabling `logsdb` index mode on a data stream that already exists, make sure to check mappings and sorting. The `logsdb` mode automatically maps `host.name` as a keyword if it's included in the sort settings. If a `host.name` field already exists but has a different type, mapping errors might occur, preventing `logsdb` mode from being fully applied.

To avoid mapping conflicts, consider these options:

@@ -114,7 +105,30 @@ To avoid mapping conflicts, consider these options:

* **Switch to a different <<index-mode-setting,index mode>>**: If resolving `host.name` mapping conflicts is not feasible, you can choose not to use `logsdb` mode.

IMPORTANT: On existing data streams, `logsdb` mode is applied on <<data-streams-rollover,rollover>> (automatic or manual).

[discrete]
[[logsdb-sort-routign]]
==== Optimized routing on sort fields

The storage footprint of `logsdb` indices can be further reduced by enabling a routing optimization that uses
the fields in the sort configuration (except for `@timestamp`) to route documents to shards. The storage savings
depend on the sort configuration and the nature of the logged data. In our benchmarks, we observed 20% storage
reductions compared to the default configuration for `logsdb` mode. Combined with a negligible penalty to ingest
performance (1-4%), this makes the optimization a good option for data streams that are expected to grow
substantially over time.

Configuring the routing optimization requires the following:

* Include the index setting `index.logsdb.route_on_sort_fields: true` in the data stream configuration.
* <<index-modules-index-sorting, Configure index sorting>> with two or more fields, in addition to `@timestamp`.
* Make sure <<mapping-id-field, field `_id`>> is not populated in ingested documents, because it must be
auto-generated.
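
The requirements above can be sketched as a small Python helper that assembles the index settings and rejects sort configurations with fewer than two fields besides `@timestamp`. This is only an illustration under stated assumptions: the helper name, the `cloud.region` and `service.name` fields, and the `asc` ordering for the non-timestamp fields are hypothetical. The `_id` requirement is not enforced here, because it applies to how documents are indexed rather than to the settings.

```python
import json


def routing_optimized_settings(sort_fields):
    """Return logsdb index settings with routing on sort fields enabled.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    # @timestamp does not count towards the required sort fields.
    routing_fields = [f for f in sort_fields if f != "@timestamp"]
    if len(routing_fields) < 2:
        raise ValueError("need 2 or more sort fields in addition to @timestamp")

    return {
        "index.mode": "logsdb",
        "index.logsdb.route_on_sort_fields": True,
        # Keep @timestamp as the last sort field with desc ordering.
        "index.sort.field": routing_fields + ["@timestamp"],
        # Assumption: asc ordering for the routing fields.
        "index.sort.order": ["asc"] * len(routing_fields) + ["desc"],
    }


settings = routing_optimized_settings(["cloud.region", "service.name"])
print(json.dumps(settings, indent=2))
```

Documents written to such a data stream should be indexed without an explicit `_id`, so that it is auto-generated.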

Using a custom sort configuration is required to minimize the possibility of creating hotspots, in case of a
logging spike producing documents that all get routed to a single shard. To prevent this, and to improve storage
efficiency, it is recommended to use a few fields that have a rather low cardinality and don't co-vary
(e.g. `host.name` and `host.id` are likely a bad choice).
Contributor

Suggested change
Using a custom sort configuration is required to minimize the possibility of creating hotspots, in case of a
logging spike producing documents that all get routed to a single shard. To prevent this, and to improve storage
efficiency, it is recommended to use a few fields that have a rather low cardinality and don't co-vary
(e.g. `host.name` and `host.id` are likely a bad choice).
Logging spikes can cause hotspots by producing documents that all get routed to a single
shard. To prevent hotspots and improve storage efficiency, your configuration should use a few sort fields that have a relatively low cardinality and don't co-vary (for example, `host.name` and `host.id` are not optimal).

Contributor Author

The logic requires a custom sort config to reduce the likelihood of hotspots, as opposed to working with the default sort config. I think the updated text (and my version, possibly) missed this part. Maybe we can clarify this better?

Contributor

Ah, thanks, I see what you mean. I'd suggest losing the logging spikes sentence -- WDYT of this?

Suggested change
Using a custom sort configuration is required to minimize the possibility of creating hotspots, in case of a
logging spike producing documents that all get routed to a single shard. To prevent this, and to improve storage
efficiency, it is recommended to use a few fields that have a rather low cardinality and don't co-vary
(e.g. `host.name` and `host.id` are likely a bad choice).
A custom sort configuration is required, to minimize hotspots and improve storage efficiency. For best results, use a few sort fields that have a relatively low cardinality and don't co-vary
(for example, `host.name` and `host.id` are not optimal).

Contributor Author

Sounds good, though it'd be nice to explain what leads to hotspots - I don't think this is mentioned elsewhere in this page. Another possibility is to include such a note above, where we describe the option for custom sort config.

Contributor

OK one more try 🙂

A custom sort configuration is required, to improve storage efficiency and to
minimize hotspots from logging spikes that route documents to a single shard. 
For best results, use a few sort fields that have a relatively low cardinality and
don't co-vary (for example, `host.name` and `host.id` are not optimal).


[discrete]
[[logsdb-specialized-codecs]]
@@ -123,7 +137,7 @@ IMPORTANT: On existing data streams, `logsdb` mode is applied on <<data-streams-
By default, `logsdb` index mode uses the `best_compression` <<index-codec,codec>>, which applies {wikipedia}/Zstd[ZSTD]
compression to stored fields. You can switch to the `default` codec for faster compression with a slightly larger storage footprint.

The `logsdb` index mode also automatically applies specialized codecs for numeric doc values, in order to optimize storage usage. Numeric fields are
encoded using the following sequence of codecs:

* **Delta encoding**:
@@ -173,9 +187,9 @@ _characters._ Using UTF-8 encoding, this results in a limit of 32764 bytes, depe

The mapping-level `ignore_above` setting takes precedence. If a specific field has an `ignore_above` value
defined in its mapping, that value overrides the index-level `index.mapping.ignore_above` value. This default
behavior helps to optimize indexing performance by preventing excessively large string values from being indexed.

If you need to customize the limit, you can override it at the mapping level or change the index level default.

[discrete]
[[logs-db-ignore-limit]]
@@ -202,7 +216,7 @@ reconstructing the original value.
[[logsdb-settings-summary]]
=== Settings reference

The `logsdb` index mode uses the following settings:

* **`index.mode`**: `"logsdb"`
