@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
stream. The exact impact will vary depending on your data set.

- The following features are enabled in a logs data stream:
-
- * <<synthetic-source,Synthetic source>>, which omits storing the `_source` field. When the document source is requested, it is synthesized from document fields upon retrieval.
-
- * Index sorting. This yields a lower storage footprint. By default indices are sorted by `host.name` and `@timestamp` fields at index time.
-
- * More space efficient compression for fields with <<doc-values,`doc_values`>> enabled.
-
[discrete]
[[how-to-use-logsds]]
=== Create a logs data stream
@@ -50,3 +42,175 @@ DELETE _index_template/my-index-template
----
// TEST[continued]
////
+
+ [[logsdb-default-settings]]
+
+ [discrete]
+ [[logsdb-synthetic-source]]
+ === Synthetic source
+
+ By default, `logsdb` mode uses <<synthetic-source,synthetic source>>, which omits storing the original `_source`
+ field and instead synthesizes it from doc values or stored fields at document retrieval time. Synthetic source comes
+ with a few restrictions, which are described in the <<synthetic-source,section dedicated to it>>.
+
+ NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
+ are preserved for <<synthetic-source,synthetic source>> reconstruction. In `logsdb`, the default value is `arrays`,
+ which retains duplicate values and the order of entries, but not necessarily the exact structure of array elements or
+ objects. Preserving duplicates and ordering can be critical for some log fields, for instance DNS A records, HTTP
+ headers, or log entries that represent sequential or repeated events.
+
+ For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep,this section>>.
+
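+ As a sketch of overriding this default per field (the index and field names here are illustrative), the mapping-level
+ `synthetic_source_keep` parameter can be set to `all` on a field to preserve the exact structure of its values:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "headers": {
+         "type": "keyword",
+         "synthetic_source_keep": "all"
+       }
+     }
+   }
+ }
+ ----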
+ [discrete]
+ [[logsdb-sort-settings]]
+ === Index sort settings
+
+ In `logsdb` mode, the following index sort settings are applied by default:
+
+ * `index.sort.field`: `["host.name", "@timestamp"]`
+ Indices are sorted by the `host.name` and `@timestamp` fields by default. For data streams, the
+ `@timestamp` field is automatically injected if it is not present.
+
+ * `index.sort.order`: `["desc", "desc"]`
+ The default sort order for both fields is descending (`desc`), prioritizing the latest data.
+
+ * `index.sort.mode`: `["min", "min"]`
+ The default sort mode is `min`, ensuring that indices are sorted by the minimum value of multi-value fields.
+
+ * `index.sort.missing`: `["_first", "_first"]`
+ Missing values are sorted to appear first (`_first`) in `logsdb` index mode.
+
+ `logsdb` index mode allows you to override the default sort settings. For instance, you can specify your own fields
+ and order for sorting by modifying the `index.sort.field` and `index.sort.order` settings.
+
+ When using the default sort settings, the `host.name` field is automatically injected into the mappings of the
+ index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and
+ retrieved based on the `host.name` and `@timestamp` fields.
+
+ NOTE: If `subobjects` is set to `true` (the default), the `host.name` field will be mapped as an object field
+ named `host`, containing a `name` child field of type `keyword`. If `subobjects` is set to `false`,
+ a single `host.name` field will be mapped as a `keyword` field.
+
+ Once an index is created, its sort settings are immutable and cannot be modified. To apply different sort settings,
+ a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index
+ rollover after updating the relevant (component) templates.
+
+ If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort
+ settings can influence indexing throughput and query latency, and may affect compression efficiency due to the way data
+ is organized after sorting. For more details, refer to our documentation on
+ <<index-modules-index-sorting,index sorting>>.
+
+ NOTE: For <<data-streams,data streams>>, the `@timestamp` field is automatically injected if not already present.
+ However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings but is not
+ automatically added to the list of sort fields.
+
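+ For example, here is a sketch of overriding the sort configuration in an index template for a data stream (the
+ template name, index pattern, and sort field are illustrative):
+
+ [source,console]
+ ----
+ PUT _index_template/my-logsdb-template
+ {
+   "index_patterns": ["logs-custom-*"],
+   "data_stream": {},
+   "template": {
+     "settings": {
+       "index.mode": "logsdb",
+       "index.sort.field": ["service.name", "@timestamp"],
+       "index.sort.order": ["asc", "desc"]
+     },
+     "mappings": {
+       "properties": {
+         "service": {
+           "properties": {
+             "name": { "type": "keyword" }
+           }
+         }
+       }
+     }
+   }
+ }
+ ----
+
+ Note that custom sort fields must be present in the mappings; only the default `host.name` field is injected
+ automatically.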
+ [discrete]
+ [[logsdb-specialized-codecs]]
+ === Specialized codecs
+
+ `logsdb` index mode uses the `best_compression` <<index-codec,codec>> by default, which applies {wikipedia}/Zstd[ZSTD]
+ compression to stored fields. You can override this and switch to the `default` codec for faster compression
+ at the expense of a slightly larger storage footprint.
+
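+ For instance, a sketch of switching back to the `default` codec at index creation time (the index name is
+ illustrative):
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb",
+     "index.codec": "default"
+   }
+ }
+ ----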
+ `logsdb` index mode also applies specialized codecs for numeric doc values that are crafted to optimize storage usage.
+ These specialized codecs are applied by default when using `logsdb` index mode.
+
+ Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the
+ following order: delta encoding, offset encoding, Greatest Common Divisor (GCD) encoding, and finally Frame Of Reference
+ (FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution.
+ For example, before applying delta encoding, the algorithm checks whether the data is monotonically non-decreasing or
+ non-increasing. If so, delta encoding is applied; otherwise, the next encoding is considered.
+
+ The encoding is specific to each Lucene segment and is re-applied at segment merge time. The merged Lucene segment
+ may use a different encoding than the original segments, based on the characteristics of the merged data.
+
+ The following methods are applied sequentially:
+
+ * **Delta encoding**:
+ a compression method that stores the difference between consecutive values instead of the actual values.
+
+ * **Offset encoding**:
+ a compression method that stores the difference from a base value rather than between consecutive values.
+
+ * **Greatest Common Divisor (GCD) encoding**:
+ a compression method that finds the greatest common divisor of a set of values and stores the differences
+ as multiples of the GCD.
+
+ * **Frame Of Reference (FOR) encoding**:
+ a compression method that determines the smallest number of bits required to encode a block of values and uses
+ bit-packing to fit those values into larger 64-bit blocks.
+
+ For keyword fields, **Run Length Encoding (RLE)** is applied to the ordinals, which represent positions in the Lucene
+ segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.
+
+ [discrete]
+ [[logsdb-ignored-settings]]
+ === `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit`
+
+ By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields
+ to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
+ fields contain invalid or improperly formatted data.
+
+ You can override this by setting `index.mapping.ignore_malformed` to `false`. However, this is not recommended,
+ as it might result in documents with malformed fields being rejected and not indexed at all.
+
+ In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is 8191
+ **characters**. Because a character may occupy up to four bytes in UTF-8, this corresponds to a limit of at most
+ 32764 bytes. The mapping-level `ignore_above` setting still takes precedence: if a specific field has an `ignore_above`
+ value defined in its mapping, that value overrides the index-level `index.mapping.ignore_above` value. This default
+ behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while
+ still allowing you to customize the limit, either by overriding it at the mapping level or by changing the index-level
+ default setting.
+
+ In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by
+ default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document
+ rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The
+ `index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic
+ and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document
+ indexing, ensuring continued log ingestion without errors.
+
+ NOTE: When automatically injected, `host.name` and `@timestamp` count toward the limit of mapped fields. When
+ `host.name` is mapped with `subobjects: true`, it counts as two fields: the `host` object and its `name` child. With
+ `subobjects: false`, `host.name` counts as a single field.
+
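+ A sketch of tightening these defaults for individual fields (the index and field names are illustrative): field-level
+ `ignore_malformed` and `ignore_above` values take precedence over the index-level defaults described above:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "port": {
+         "type": "integer",
+         "ignore_malformed": false
+       },
+       "status_code": {
+         "type": "keyword",
+         "ignore_above": 1024
+       }
+     }
+   }
+ }
+ ----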
+ [discrete]
+ [[logsdb-nodocvalue-fields]]
+ === Fields without doc values
+
+ When `logsdb` index mode uses synthetic `_source` and `doc_values` are disabled for a field in the mapping,
+ Elasticsearch may set the `store` setting to `true` for that field as a last resort, to ensure that the field's
+ data is still available for reconstructing the document's source when retrieving it via
+ <<synthetic-source,synthetic source>>.
+
+ For example, this happens with text fields when `store` is `false` and there is no suitable multi-field available for
+ reconstructing the original value in <<synthetic-source,synthetic source>>.
+
+ This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain
+ fields.
+
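+ For example, in the following sketch (the index and field names are illustrative), `message` is a text field with no
+ keyword multi-field, so Elasticsearch may implicitly store its values in order to reconstruct them for synthetic
+ source:
+
+ [source,console]
+ ----
+ PUT my-logsdb-index
+ {
+   "settings": {
+     "index.mode": "logsdb"
+   },
+   "mappings": {
+     "properties": {
+       "message": {
+         "type": "text"
+       }
+     }
+   }
+ }
+ ----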
+ [discrete]
+ [[logsdb-settings-summary]]
+ === LogsDB settings summary
+
+ The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch:
+
+ * **`index.mode`**: `"logsdb"`
+
+ * **`index.mapping.synthetic_source_keep`**: `"arrays"`
+
+ * **`index.sort.field`**: `["host.name", "@timestamp"]`
+
+ * **`index.sort.order`**: `["desc", "desc"]`
+
+ * **`index.sort.mode`**: `["min", "min"]`
+
+ * **`index.sort.missing`**: `["_first", "_first"]`
+
+ * **`index.codec`**: `"best_compression"`
+
+ * **`index.mapping.ignore_malformed`**: `true`
+
+ * **`index.mapping.ignore_above`**: `8191`
+
+ * **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`
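+
+ To check which of these values are in effect on a concrete index, one option is to retrieve its settings along with
+ the defaults (the index name is illustrative):
+
+ [source,console]
+ ----
+ GET my-logsdb-index/_settings?include_defaults=true
+ ----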