Improve Logsdb docs including default values #115205
Changes from 9 commits
f8336f2
5a80d15
554523f
2e5002e
b366d7e
aa2fb88
b47ab32
1889f8b
80c6e8f
b349842
7f53dba
b2528a0
29bf9cd
c792141
f8151b0
28a58cf
d01a30f
@@ -50,3 +50,196 @@ DELETE _index_template/my-index-template
----
// TEST[continued]
////

[[logsdb-default-settings]]

[discrete]
[[logsdb-synthtic-source]]
=== Synthetic source

By default, `logsdb` mode uses <<synthetic-source,synthetic `_source`>>, which omits storing the original `_source`
field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.

NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
are preserved for <<synthetic-source,synthetic `_source`>> reconstruction. In `logsdb`, the default value is `arrays`,
which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.
Maybe add something like: For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep, this section>>.
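For readers following the NOTE above, here is a minimal sketch of an index template body that sets `index.mapping.synthetic_source_keep` explicitly. The helper name `logsdb_template`, the pattern `logs-myapp-*`, and the priority value are assumptions made for illustration; the body is only built and printed, not sent to a cluster.

```python
import json

# Values accepted by synthetic_source_keep according to the synthetic _source docs.
KEEP_MODES = ("none", "arrays", "all")


def logsdb_template(index_pattern: str, keep: str = "arrays") -> dict:
    """Build a (hypothetical) index template body for a logsdb data stream."""
    if keep not in KEEP_MODES:
        raise ValueError(f"unsupported synthetic_source_keep value: {keep!r}")
    return {
        "index_patterns": [index_pattern],
        "data_stream": {},
        # Assumed value: a priority high enough to win over built-in templates.
        "priority": 500,
        "template": {
            "settings": {
                "index.mode": "logsdb",
                # `arrays` is the logsdb default described in the NOTE;
                # it is spelled out here only to make the choice visible.
                "index.mapping.synthetic_source_keep": keep,
            }
        },
    }


body = logsdb_template("logs-myapp-*")
print(json.dumps(body, indent=2))
```

Passing `keep="all"` or `keep="none"` yields the other two modes; which one is appropriate depends on how much of the original structure a given log field needs.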

[discrete]
[[logsdb-data-streams]]
=== LogsDB for logs data streams

In Elasticsearch, `logsdb` mode is applied by default for data streams whose name matches the pattern `logs-*-*`.

This pattern identifies a logs data stream, and Elasticsearch automatically configures the data stream to use LogsDB.
We recommend using `logsdb` index mode for data streams by means of standard or custom (component) templates.

Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by

Suggested change:
- Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
+ Users can opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
Why is this nicer? You mean less formal?
Yes, not a big deal though.
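As an illustration of the opt-out discussed in this thread, the sketch below overrides `index.mode` in an existing template body. The helper `override_index_mode` and the sample template are hypothetical; the only things taken from the diff are the setting name `index.mode` and the idea of overriding it.

```python
import copy
import json


def override_index_mode(template_body: dict, mode: str = "standard") -> dict:
    """Return a copy of a template body with index.mode overridden."""
    updated = copy.deepcopy(template_body)  # leave the caller's dict untouched
    settings = updated.setdefault("template", {}).setdefault("settings", {})
    settings["index.mode"] = mode
    return updated


# Hypothetical template that would otherwise use logsdb.
original = {
    "index_patterns": ["logs-myapp-*"],
    "data_stream": {},
    "template": {"settings": {"index.mode": "logsdb"}},
}

opted_out = override_index_mode(original)
print(json.dumps(opted_out, indent=2))
```

Working on a deep copy keeps the original template available for comparison, which is handy when reviewing what an override actually changes.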
Maybe add: "In the case of a data stream, this happens through rollover".
Consider adding an example on how to override, and mention that sorting on @timestamp is automatically added (for data streams?).
Well examples for index sorting are available in the page about index sorting...I will just link that page.
There is no such thing as adding sorting on `@timestamp`. For logsdb we either sort on both fields, `host.name` and `@timestamp`, or users override it with whatever they like. We don't add sorting on `@timestamp` other than with the default sort settings. We add the `@timestamp` mapping for data streams, which is already explained elsewhere, but we do not necessarily sort on it. Defining sort fields and injecting the mappings are separate things. If a user defines sorting on something like `agent.id`, for example, we still inject the `@timestamp` field (for a data stream) but we do not sort on it. I will point this out in the documentation by means of a note.
Mapping of `@timestamp` is explained in <<data-streams,data stream>> with:
> Every document indexed to a data stream must contain a @timestamp field, mapped as a [date](https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html) or [date_nanos](https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html) field type. If the index template doesn’t specify a mapping for the @timestamp field, Elasticsearch maps @timestamp as a date field with default options.
martijnvg marked this conversation as resolved.
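The distinction drawn in this thread (sort fields and the injected `@timestamp` mapping are separate things) can be sketched as follows. The helper names are hypothetical; `index.sort.field` and `index.sort.order` are the standard index sorting settings, and `agent.id` is the example field from the comment above.

```python
import json

VALID_ORDERS = ("asc", "desc")


def custom_sort_settings(fields: list, orders: list) -> dict:
    """Build logsdb index settings with user-defined sort fields."""
    if len(fields) != len(orders):
        raise ValueError("each sort field needs exactly one sort order")
    for order in orders:
        if order not in VALID_ORDERS:
            raise ValueError(f"unsupported sort order: {order!r}")
    return {
        "index.mode": "logsdb",
        "index.sort.field": list(fields),
        "index.sort.order": list(orders),
    }


def sorts_on_timestamp(settings: dict) -> bool:
    """True only if @timestamp is among the configured sort fields."""
    return "@timestamp" in settings.get("index.sort.field", [])


settings = custom_sort_settings(["agent.id"], ["asc"])
print(json.dumps(settings, indent=2))
# Per the discussion, a data stream still gets a @timestamp mapping,
# but with these settings the index is not sorted on it.
print("sorted on @timestamp:", sorts_on_timestamp(settings))
```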
Option is named `default` and not `best_speed`. In the codec this is known as best speed, but that isn't what the configuration option's name is.
Right, thanks
Maybe just link to the documentation about the `index.codec` setting? (https://www.elastic.co/guide/en/elasticsearch/reference/8.16/index-modules.html)
`index.mapping.ignore_malformed`?
Suggested change:
- efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is set to 8191
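A short sketch of where a value like 8191 can come from, and what `ignore_above` means for a keyword value. The connection to Lucene's 32766-byte term limit and the 4-byte UTF-8 worst case follows the general `ignore_above` documentation; that this is the reason behind the logsdb default is an assumption here, and the helper names are hypothetical.

```python
LUCENE_MAX_TERM_BYTES = 32766   # Lucene's maximum term length, in bytes
MAX_UTF8_BYTES_PER_CHAR = 4     # worst-case UTF-8 encoding of one character


def safe_ignore_above(max_term_bytes: int = LUCENE_MAX_TERM_BYTES,
                      bytes_per_char: int = MAX_UTF8_BYTES_PER_CHAR) -> int:
    """Largest character count that always fits within the byte limit."""
    return max_term_bytes // bytes_per_char


def is_indexed(value: str, ignore_above: int) -> bool:
    """Approximate the ignore_above rule: longer strings are not indexed.

    len() counts Python code points, which is only an approximation of
    how Elasticsearch counts characters.
    """
    return len(value) <= ignore_above


limit = safe_ignore_above()
print(limit)
print(is_indexed("a" * limit, limit))
print(is_indexed("a" * (limit + 1), limit))
```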
Hm, this is more of an internal implementation detail. I wonder if we should be documenting this, as its use may change in the future. Do we expect users to care about it?
We expose it via the `fields` and `stored_fields` API anyway, so they can actually fetch it. I wrote that they should not rely on the name or the encoding. I think this is fair. The idea is that this should only be used for debugging purposes. If there is an issue, it will be handy to ask them for the value of this field.
We only do this for `text` and `annotated_text` when `store` is `false` and there is no multi-field suitable for synthetic source. If there are no `doc_values`, for all other fields we use fallback synthetic source via `_ignored_source`.
I didn't want to go into the details of which field types we do this for and which not, just to avoid going out of sync if we change something and forget to update the docs. Also, I think it is an implementation detail. I wanted to mention this just to let users know that we sometimes might do this. I will add something like: we sometimes might set `store` to `true`.
Is this the default value? If so, let's skip it.
I am not sure if multi-value fields is clear enough. Maybe "when dealing with arrays of values"?
In other places in the documents we use multi-value fields...which by the way is the correct name. Elasticsearch doesn't normally need to maintain array order because its core functionality revolves around searching based on the presence of values, not their position. This is true also for aggregations. Therefore, it treats arrays and multi-value fields as a set of independent values, where order doesn't play a role in indexing or querying. So, IMO it is where we use "array" that we make a mistake. An array is a (concrete) ordered data structure...a multi-value field is an abstract collection of values where order does not matter. I don't want to sound picky but again...I think "array" is incorrect. A lot of our code is written without considering ordering an issue (including the way synthetic source works normally and aggregations work). If we use "array" we suggest, instead, that ordering matters.
Thanks for this context, sounds good.