ignore_above default to 8191 for logsdb #113442
Pinging @elastic/es-storage-engine (Team:StorageEngine)
The changes look good, but I think 1024 is too small to be the default.
@felixbarny @andrewkroh We would like to know if this value is good for our integrations. My opinion is that we can still have a default, maybe changing it to a larger value (2k, 4k), and make sure that fields requiring a larger-than-default value are explicit about it.
1024 is the default that is used in ECS. I'm not sure where this exact number was determined and why we haven't chosen a slightly larger default. The intention behind it is to avoid document rejections when running into Lucene's byte-length limit of 32kb. I think we can choose a generic default for LogsDB that would prevent us running into that limit (with any unicode character) but that's bigger than the ECS limit of 1024. While I think it makes sense to have a default value for What seems missing in your PR is to actually use the index setting in keyword field mappers and to update the docs to describe that there's a way to set an index-level default, similar to what we do with
@felixbarny @andrewkroh we need feedback here, possibly before GA release. I would rather go for a high value like 8k to 12k rather than not having it.
+1 on having a default value for What's the highest value of ignore_above that would guarantee us to be under the hard-limit in Lucene? How are we encoding chars and what's the highest number of bytes per char? Is there any other static overhead in the encoding? How are we handling unicode code points that consist of multiple characters?
From our documentation:
In Lucene the hard limit is 32766 bytes. The highest value for Applying the value will depend on encoding, and the character count will match the byte count only if all characters are encoded using one byte. UTF-8 encoding may use 1 to 4 bytes per character if we also consider non-ASCII characters, but I guess for logging purposes we can safely assume that we are dealing with ASCII characters. (AFAIK Elasticsearch uses UTF-8 encoding for strings.) As for the static overhead, I don't see any issue: keywords and text are normally converted to arrays of bytes.
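To make the character-versus-byte distinction in the comment above concrete, here is a minimal Java sketch. It is not taken from the PR: the class `Utf8Length` and its helpers are hypothetical names, and it only uses standard JDK calls. It prints, for a few sample strings, the `String.length()` value (UTF-16 code units), the code point count, and the UTF-8 byte length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Elasticsearch: compares character counts with UTF-8 byte counts.
final class Utf8Length {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points (a surrogate pair counts once). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "error",          // ASCII: 1 byte per character
            "\u00E9",         // e with acute accent: 2 bytes in UTF-8
            "\u20AC",         // euro sign: 3 bytes in UTF-8
            "\uD83D\uDE00"    // emoji outside the BMP: 2 UTF-16 units, 4 bytes in UTF-8
        };
        for (String s : samples) {
            System.out.printf("chars=%d codePoints=%d utf8Bytes=%d%n",
                s.length(), codePoints(s), utf8Bytes(s));
        }
    }
}
```

For pure ASCII log lines the three numbers coincide, which is the assumption made in the comment; they diverge as soon as non-ASCII characters appear.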
Let's set the default limit to 8191 then.
Title changed from "ignore_above default to 1024 for logsdb" to "ignore_above default to 8191 for logsdb"
@elasticmachine update branch
merge conflict between base and head
public static final Setting<Integer> IGNORE_ABOVE_SETTING = Setting.intSetting("index.mapping.ignore_above", settings -> {
    if (IndexSettings.MODE.get(settings) == IndexMode.LOGSDB
        && IndexMetadata.SETTING_INDEX_VERSION_CREATED.get(settings).onOrAfter(IndexVersions.ENABLE_IGNORE_ABOVE_LOGSDB)) {
        return "8191";
How can we override this for an index with logsdb mode?
I see, this is the default value. Let's add a comment above, or move it to a static helper function for clarity.
The `settings -> { ... }` lambda just determines the default value if no explicit value is provided for the setting.
I will extract the lambda into a method with a descriptive name.
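A rough, self-contained sketch of that idea, for illustration only. It does not use the real Elasticsearch `Setting` API: settings are modelled as a plain `Map<String, String>`, the class and method names are hypothetical, the index-version check from the diff is left out, and the non-LogsDB default of `Integer.MAX_VALUE` is an assumption. It shows the mode-dependent default living in a named method, with an explicit value still overriding it, which also answers the override question above.

```java
import java.util.Map;

// Hypothetical stand-in for the setting logic; not the actual Elasticsearch implementation.
final class IgnoreAboveDefault {

    static final String MODE_KEY = "index.mode";
    static final String IGNORE_ABOVE_KEY = "index.mapping.ignore_above";

    static final int LOGSDB_DEFAULT = 8191;
    // Assumption for this sketch: other index modes are effectively unlimited.
    static final int STANDARD_DEFAULT = Integer.MAX_VALUE;

    /** Default used only when the index does not set ignore_above explicitly. */
    static int defaultIgnoreAbove(Map<String, String> settings) {
        boolean logsdb = "logsdb".equals(settings.get(MODE_KEY));
        return logsdb ? LOGSDB_DEFAULT : STANDARD_DEFAULT;
    }

    /** An explicit value always wins; otherwise fall back to the mode-dependent default. */
    static int resolveIgnoreAbove(Map<String, String> settings) {
        String explicit = settings.get(IGNORE_ABOVE_KEY);
        return explicit != null ? Integer.parseInt(explicit) : defaultIgnoreAbove(settings);
    }

    public static void main(String[] args) {
        System.out.println(resolveIgnoreAbove(Map.of(MODE_KEY, "logsdb")));
        System.out.println(resolveIgnoreAbove(Map.of(MODE_KEY, "logsdb", IGNORE_ABOVE_KEY, "2048")));
        System.out.println(resolveIgnoreAbove(Map.of()));
    }
}
```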
@martijnvg @felixbarny I need an approval if we want to merge this
LGTM
💔 Backport failed
The backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running
In LogsDB we would like to use a default value of `8191` for the index-level setting `index.mapping.ignore_above`. The value for `ignore_above` is the _character count_, but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` (rounded down) since UTF-8 characters may occupy at most 4 bytes.
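The arithmetic behind `8191` can be checked with a few lines of Java. This is only an illustrative sketch: the class `IgnoreAboveBound` and its constant names are hypothetical, and the two numbers (the 32766-byte Lucene limit and the 4-byte UTF-8 worst case) are the ones quoted in the description above.

```java
// Hypothetical helper that reproduces the 32766 / 4 calculation from the description.
final class IgnoreAboveBound {

    static final int LUCENE_MAX_TERM_BYTES = 32766;   // hard limit quoted in the discussion
    static final int MAX_UTF8_BYTES_PER_CHAR = 4;     // worst case for a UTF-8 encoded character

    /** Largest character count whose worst-case encoding still fits in maxBytes. */
    static int safeCharLimit(int maxBytes, int bytesPerChar) {
        return maxBytes / bytesPerChar; // integer division rounds down
    }

    public static void main(String[] args) {
        int limit = safeCharLimit(LUCENE_MAX_TERM_BYTES, MAX_UTF8_BYTES_PER_CHAR);
        System.out.println("limit = " + limit);
        System.out.println("worst-case bytes at limit = " + limit * MAX_UTF8_BYTES_PER_CHAR);
        System.out.println("worst-case bytes at limit + 1 = " + (limit + 1) * MAX_UTF8_BYTES_PER_CHAR);
    }
}
```

At 8191 characters the worst case is 32764 bytes, which fits under the limit; one more character could reach 32768 bytes and exceed it.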
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
In LogsDB we would like to use a default value of `8191` for the index-level setting `index.mapping.ignore_above`. The value for `ignore_above` is the _character count_, but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8 characters may occupy at most 4 bytes.
(cherry picked from commit 521e434)
# Conflicts:
# server/src/main/java/org/elasticsearch/common/settings/Setting.java
Co-authored-by: Salvatore Campagna <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>