@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
stream. The exact impact will vary depending on your data set.

- The following features are enabled in a logs data stream:
-
- * <<synthetic-source,Synthetic source>>, which omits storing the `_source` field. When the document source is requested, it is synthesized from document fields upon retrieval.
-
- * Index sorting. This yields a lower storage footprint. By default indices are sorted by `host.name` and `@timestamp` fields at index time.
-
- * More space efficient compression for fields with <<doc-values,`doc_values`>> enabled.
-
[discrete]
[[how-to-use-logsds]]
=== Create a logs data stream
@@ -50,3 +42,175 @@ DELETE _index_template/my-index-template
----
// TEST[continued]
////
+
+ [[logsdb-default-settings]]
+
+ [discrete]
+ [[logsdb-synthetic-source]]
+ === Synthetic source
+
+ By default, `logsdb` mode uses <<synthetic-source,synthetic source>>, which omits storing the original `_source`
+ field and instead synthesizes it from doc values or stored fields at document retrieval time. Synthetic source comes
+ with a few restrictions, which are described in the <<synthetic-source,section dedicated to it>>.
+
+ NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
+ are preserved for <<synthetic-source,synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
+ which retains duplicate values and the order of entries, but not necessarily the exact structure of array elements or
+ objects. Preserving duplicates and ordering can be critical for some log fields, for instance DNS A records, HTTP
+ headers, or log entries that represent sequential or repeated events.
+
+ For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep,this section>>.
+
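+ As a sketch of overriding this default per field (the index and field names here are illustrative), the mapping-level
+ `synthetic_source_keep` parameter can be set to `all` on a field to preserve the exact structure of its values:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "headers": {
+         "type": "keyword",
+         "synthetic_source_keep": "all"
+       }
+     }
+   }
+ }
+ ----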
+ [discrete]
+ [[logsdb-sort-settings]]
+ === Index sort settings
+
+ In `logsdb` mode, the following index sort settings are applied by default:
+
+ * `index.sort.field`: `["host.name", "@timestamp"]`
+ Indices are sorted by the `host.name` and `@timestamp` fields by default. For data streams, the
+ `@timestamp` field is automatically injected if it is not present.
+
+ * `index.sort.order`: `["desc", "desc"]`
+ The default sort order for both fields is descending (`desc`), prioritizing the latest data.
+
+ * `index.sort.mode`: `["min", "min"]`
+ The default sort mode is `min`, ensuring that indices are sorted by the minimum value of multi-value fields.
+
+ * `index.sort.missing`: `["_first", "_first"]`
+ Missing values are sorted to appear first (`_first`) in `logsdb` index mode.
+
+ `logsdb` index mode allows you to override the default sort settings. For instance, you can specify your own fields
+ and order for sorting by modifying the `index.sort.field` and `index.sort.order` settings.
+
+ When using the default sort settings, the `host.name` field is automatically injected into the mappings of the
+ index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and
+ retrieved based on the `host.name` and `@timestamp` fields.
+
+ NOTE: If `subobjects` is set to `true` (the default), the `host.name` field will be mapped as an object field
+ named `host`, containing a `name` child field of type `keyword`. If `subobjects` is set to `false`,
+ a single `host.name` field will be mapped as a `keyword` field.
+
+ Once an index is created, its sort settings are immutable and cannot be modified. To apply different sort settings,
+ a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index
+ rollover after updating the relevant (component) templates.
+
+ If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort
+ settings can influence indexing throughput and query latency, and may affect compression efficiency due to the way data
+ is organized after sorting. For more details, refer to our documentation on
+ <<index-modules-index-sorting,index sorting>>.
+
+ NOTE: For <<data-streams,data streams>>, the `@timestamp` field is automatically injected if not already present.
+ However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings but is not
+ automatically added to the list of sort fields.
+
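+ For example, here is a sketch of overriding the sort configuration in an index template for a data stream (the
+ template name, index pattern, and sort field are illustrative):
+
+ [source,console]
+ ----
+ PUT _index_template/my-logsdb-template
+ {
+   "index_patterns": ["logs-custom-*"],
+   "data_stream": {},
+   "template": {
+     "settings": {
+       "index.mode": "logsdb",
+       "index.sort.field": ["service.name", "@timestamp"],
+       "index.sort.order": ["asc", "desc"]
+     },
+     "mappings": {
+       "properties": {
+         "service": {
+           "properties": {
+             "name": { "type": "keyword" }
+           }
+         }
+       }
+     }
+   }
+ }
+ ----
+
+ Note that custom sort fields must be present in the mappings; only the default `host.name` field is injected
+ automatically.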
+ [discrete]
+ [[logsdb-specialized-codecs]]
+ === Specialized codecs
+
+ `logsdb` index mode uses the `best_compression` <<index-codec,codec>> by default, which applies {wikipedia}/Zstd[ZSTD]
+ compression to stored fields. You can override this and switch to the `default` codec for faster compression
+ at the expense of a slightly larger storage footprint.
+
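+ For instance, a sketch of switching back to the `default` codec at index creation time (the index name is
+ illustrative):
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb",
+     "index.codec": "default"
+   }
+ }
+ ----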
+ `logsdb` index mode also applies specialized codecs for numeric doc values that are crafted to optimize storage usage.
+ These specialized codecs are applied by default when using `logsdb` index mode.
+
+ Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the
+ following order: delta encoding, offset encoding, Greatest Common Divisor (GCD) encoding, and finally Frame Of Reference
+ (FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution.
+ For example, before applying delta encoding, the algorithm checks whether the data is monotonically non-decreasing or
+ non-increasing. If so, delta encoding is applied; otherwise, the next encoding is considered.
+
+ The encoding is specific to each Lucene segment and is re-applied at segment merge time. The merged Lucene segment
+ may use a different encoding than the original segments, based on the characteristics of the merged data.
+
+ The following methods are applied sequentially:
+
+ * **Delta encoding**:
+ a compression method that stores the difference between consecutive values instead of the actual values.
+
+ * **Offset encoding**:
+ a compression method that stores the difference from a base value rather than between consecutive values.
+
+ * **Greatest Common Divisor (GCD) encoding**:
+ a compression method that finds the greatest common divisor of a set of values and stores the differences
+ as multiples of the GCD.
+
+ * **Frame Of Reference (FOR) encoding**:
+ a compression method that determines the smallest number of bits required to encode a block of values and uses
+ bit-packing to fit those values into larger 64-bit blocks.
+
+ For keyword fields, **Run Length Encoding (RLE)** is applied to the ordinals, which represent positions in the Lucene
+ segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.
+
+ [discrete]
+ [[logsdb-ignored-settings]]
+ === `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit`
+
+ By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields
+ to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
+ fields contain invalid or improperly formatted data.
+
+ You can override this by setting `index.mapping.ignore_malformed` to `false`. However, this is not recommended,
+ as it might result in documents with malformed fields being rejected and not indexed at all.
+
+ In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is 8191
+ **characters**. Because a character may occupy up to four bytes in UTF-8, this corresponds to a limit of at most
+ 32764 bytes. The mapping-level `ignore_above` setting still takes precedence: if a specific field has an `ignore_above`
+ value defined in its mapping, that value overrides the index-level `index.mapping.ignore_above` value. This default
+ behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while
+ still allowing you to customize the limit, either by overriding it at the mapping level or by changing the index-level
+ default setting.
+
+ In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by
+ default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document
+ rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The
+ `index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic
+ and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document
+ indexing, ensuring continued log ingestion without errors.
+
+ NOTE: When automatically injected, `host.name` and `@timestamp` count toward the limit of mapped fields. When
+ `host.name` is mapped with `subobjects: true`, it counts as two fields: the `host` object and its `name` child. With
+ `subobjects: false`, `host.name` counts as a single field.
+
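+ A sketch of tightening these defaults for individual fields (the index and field names are illustrative): field-level
+ `ignore_malformed` and `ignore_above` values take precedence over the index-level defaults described above:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "port": {
+         "type": "integer",
+         "ignore_malformed": false
+       },
+       "status_code": {
+         "type": "keyword",
+         "ignore_above": 1024
+       }
+     }
+   }
+ }
+ ----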
+ [discrete]
+ [[logsdb-nodocvalue-fields]]
+ === Fields without doc values
+
+ When `logsdb` index mode uses synthetic `_source` and `doc_values` are disabled for a field in the mapping,
+ Elasticsearch may set the `store` setting to `true` for that field as a last resort, to ensure that the field's
+ data is still available for reconstructing the document's source when retrieving it via
+ <<synthetic-source,synthetic source>>.
+
+ For example, this happens with text fields when `store` is `false` and there is no suitable multi-field available for
+ reconstructing the original value in <<synthetic-source,synthetic source>>.
+
+ This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain
+ fields.
+
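+ For example, in the following sketch (the index and field names are illustrative), `message` is a text field with no
+ keyword multi-field, so Elasticsearch may implicitly store its values in order to reconstruct them for synthetic
+ source:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "message": {
+         "type": "text"
+       }
+     }
+   }
+ }
+ ----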
+ [discrete]
+ [[logsdb-settings-summary]]
+ === LogsDB settings summary
+
+ The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch:
+
+ * **`index.mode`**: `"logsdb"`
+
+ * **`index.mapping.synthetic_source_keep`**: `"arrays"`
+
+ * **`index.sort.field`**: `["host.name", "@timestamp"]`
+
+ * **`index.sort.order`**: `["desc", "desc"]`
+
+ * **`index.sort.mode`**: `["min", "min"]`
+
+ * **`index.sort.missing`**: `["_first", "_first"]`
+
+ * **`index.codec`**: `"best_compression"`
+
+ * **`index.mapping.ignore_malformed`**: `true`
+
+ * **`index.mapping.ignore_above`**: `8191`
+
+ * **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`
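+
+ To check which of these values are in effect on a concrete index, one option is to retrieve its settings along with
+ the defaults (the index name is illustrative):
+
+ [source,console]
+ ----
+ GET my-logsdb-index/_settings?include_defaults=true
+ ----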