`ignore_above` default to 8191 for `logsdb` #113442

salvatore-campagna · 2024-09-24T10:02:46Z

In LogsDB we would like to use a default value of 8191 for the index-level setting
index.mapping.ignore_above. The value for ignore_above is the character count,
but Lucene counts bytes. Here we set the limit to 32766 / 4 = 8191 since UTF-8
characters may occupy at most 4 bytes.

elasticsearchmachine · 2024-09-24T10:03:10Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

dnhatn · 2024-09-24T14:13:22Z

The changes look good, but I think 1024 is too small to be the default.

salvatore-campagna · 2024-09-24T15:17:10Z

@felixbarny @andrewkroh We would like to know if this value is good for our integrations.
Especially for fields that do not set ignore_above at the mapping level, an index-level setting would apply a possibly unwanted limit. For fields that might include stack traces, we might end up losing the ability to search on them. This would be a regression for fields that do not use ignore_above and expect no limit.

My opinion is that we can still have a default, maybe a larger value such as 2k or 4k, while making sure that fields requiring a larger-than-default value are explicit about it.

felixbarny · 2024-09-25T07:46:42Z

1024 is the default that is used in ECS. I'm not sure how this exact number was determined or why we haven't chosen a slightly larger default. The intention behind it is to avoid document rejections when running into Lucene's byte-length limit of 32kb. I think we can choose a generic default for LogsDB that would prevent us from running into that limit (with any unicode character) but that's bigger than the ECS limit of 1024.

While I think it makes sense to have a default value for ignore_above in LogsDB to avoid rejections, I think we should discuss whether there are better alternatives to ignore_above. For example, we could ignore values only if they would actually exceed the hard-limit in Lucene. Or instead of ignoring them entirely, we could truncate the values. There are some discussions in this issue that we may want to revise: #60329

What seems missing in your PR is to actually use the index setting in keyword field mappers and updating the docs to describe that there's a way to set an index-level default, similar to what we do with ignore_malformed.
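
The two alternatives mentioned above (ignoring a value only when it would actually exceed the byte limit, or truncating it instead) could look roughly like the following. This is a minimal standalone sketch, not Elasticsearch or Lucene code: the class and method names are hypothetical, and the 32766-byte limit is the figure quoted elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ByteLimitAlternatives {
    // Byte limit quoted in this thread for a single Lucene term.
    static final int MAX_TERM_BYTES = 32766;

    // Alternative 1: decide based on the real UTF-8 size, not the character count.
    static boolean exceedsByteLimit(String value, int maxBytes) {
        return value.getBytes(StandardCharsets.UTF_8).length > maxBytes;
    }

    // Alternative 2: keep the longest prefix whose UTF-8 encoding fits in maxBytes,
    // cutting only at code point boundaries so no character is split.
    static String truncateToBytes(String value, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            // UTF-8 width of this code point; a lone surrogate is counted as 3 bytes,
            // which is conservative (it never underestimates the encoded size).
            int width = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + width > maxBytes) {
                break;
            }
            bytes += width;
            i += Character.charCount(cp);
        }
        return value.substring(0, i);
    }

    public static void main(String[] args) {
        String sample = "a\u00e9\ud83d\ude00"; // 1-byte, 2-byte and 4-byte characters
        System.out.println(exceedsByteLimit(sample, MAX_TERM_BYTES));
        System.out.println(truncateToBytes(sample, 3).length());
    }
}
```

Whether truncation is acceptable depends on the semantics discussed in #60329, since a truncated keyword no longer matches exact-value queries for the original string.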

salvatore-campagna · 2024-09-25T08:00:06Z

> 1024 is the default that is used in ECS. I'm not sure how this exact number was determined or why we haven't chosen a slightly larger default. The intention behind it is to avoid document rejections when running into Lucene's byte-length limit of 32kb. I think we can choose a generic default for LogsDB that would prevent us from running into that limit (with any unicode character) but that's bigger than the ECS limit of 1024.
>
> While I think it makes sense to have a default value for ignore_above in LogsDB to avoid rejections, I think we should discuss whether there are better alternatives to ignore_above. For example, we could ignore values only if they would actually exceed the hard-limit in Lucene. Or instead of ignoring them entirely, we could truncate the values. There are some discussions in this issue that we may want to revise: #60329
>
> What seems missing in your PR is to actually use the index setting in keyword field mappers and updating the docs to describe that there's a way to set an index-level default, similar to what we do with ignore_malformed.

The ignore_above index setting was already introduced by #113121.
In this PR we just change the default value from Integer.MAX_VALUE to 1024 (or any other suitable value we decide on) for LogsDB.

salvatore-campagna · 2024-09-30T09:50:44Z

@felixbarny @andrewkroh we need feedback here, possibly before GA release. I would rather go for a high value like 8k to 12k rather than not having it.

felixbarny · 2024-09-30T10:34:29Z

+1 on having a default value for ignore_above for now that's potentially higher than 1024 but still guarantees that we're not hitting the hard limit of 32kb in Lucene.

What's the highest value of ignore_above that would guarantee us to be under the hard-limit in Lucene? How are we encoding chars and what's the highest number of bytes per char? Is there any other static overhead in the encoding? How are we handling unicode code points that consist of multiple characters?

salvatore-campagna · 2024-09-30T10:55:52Z

> +1 on having a default value for ignore_above for now that's potentially higher than 1024 but still guarantees that we're not hitting the hard limit of 32kb in Lucene.
>
> What's the highest value of ignore_above that would guarantee us to be under the hard-limit in Lucene? How are we encoding chars and what's the highest number of bytes per char? Is there any other static overhead in the encoding? How are we handling unicode code points that consist of multiple characters?

From our documentation:

NOTE: The value for `ignore_above` is the _character count_, but Lucene counts
bytes. If you use UTF-8 text with many non-ASCII characters, you may want to
set the limit to `32766 / 4 = 8191` since UTF-8 characters may occupy at most
4 bytes.

In Lucene the hard limit is 32766 bytes.

The highest value for ignore_above that guarantees being under the Lucene limit, accounting for the worst-case 4-byte encoding, is 8191.

How the limit applies depends on the encoding: the character count matches the byte count only if every character is encoded in one byte. In UTF-8 a character takes 1 to 4 bytes once non-ASCII characters are considered, but I guess for logging purposes we can safely assume that we are dealing with ASCII characters.

(AFAIK Elasticsearch uses UTF-8 encoding for strings).

About the static overhead I don't see any issue...keywords and text are normally converted to arrays of bytes.
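
The arithmetic above can be checked with a small standalone program. This is only a sketch: it assumes the limit is counted in Unicode code points and uses U+1F600 as an example of a character that needs 4 bytes in UTF-8. The class and method names are made up for illustration, and how Elasticsearch itself counts characters for `ignore_above` is not stated in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class IgnoreAboveBound {
    // Lucene byte limit as quoted in this thread.
    static final int LUCENE_MAX_TERM_BYTES = 32766;
    // Worst-case UTF-8 width of a single Unicode code point.
    static final int MAX_UTF8_BYTES_PER_CODE_POINT = 4;

    // Largest character count that cannot exceed the byte limit in the worst case.
    static int maxSafeCharCount() {
        return LUCENE_MAX_TERM_BYTES / MAX_UTF8_BYTES_PER_CODE_POINT;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Builds a string of `count` copies of U+1F600, which takes 4 bytes in UTF-8.
    static String fourByteString(int count) {
        return new String(Character.toChars(0x1F600)).repeat(count);
    }

    public static void main(String[] args) {
        int limit = maxSafeCharCount();
        System.out.println(limit);                                     // 8191
        System.out.println(utf8Length(fourByteString(limit)));         // 32764, within the limit
        System.out.println(utf8Length(fourByteString(limit + 1)));     // 32768, over the limit
        System.out.println(utf8Length("a".repeat(limit)));             // 8191, the ASCII case
    }
}
```

For pure ASCII input the byte count equals the character count, so 8191 leaves a lot of headroom; the worst case only matters for values made entirely of 4-byte characters.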

felixbarny · 2024-09-30T11:02:35Z

> The highest value for ignore_above that guarantees being under the Lucene limit, accounting for the worst-case 4-byte encoding, is 8191.

Let's set the default limit to 8191 then.

salvatore-campagna · 2024-09-30T11:52:07Z

@elasticmachine update branch

elasticmachine · 2024-09-30T11:52:10Z

merge conflict between base and head

kkrik-es · 2024-09-30T12:30:45Z

server/src/main/java/org/elasticsearch/index/IndexSettings.java

+    public static final Setting<Integer> IGNORE_ABOVE_SETTING = Setting.intSetting("index.mapping.ignore_above", settings -> {
+        if (IndexSettings.MODE.get(settings) == IndexMode.LOGSDB
+            && IndexMetadata.SETTING_INDEX_VERSION_CREATED.get(settings).onOrAfter(IndexVersions.ENABLE_IGNORE_ABOVE_LOGSDB)) {
+            return "8191";


How can we override this for an index with logsdb mode?

I see, this is the default value. Let's add a comment above, or move it to a static helper function for clarity.

The `settings -> { ... }` lambda just determines the default value if no explicit value is provided for the setting.
I will extract the lambda into a method with a descriptive name.
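
To make the override question concrete, here is a simplified standalone model of the behavior described in this review thread. It does not use the real Elasticsearch `Setting` or `IndexSettings` classes: settings are modeled as a plain map, the `index.mode` key and `logsdb` value are assumptions of this sketch, the index-version check from the diff is omitted, and `resolveIgnoreAbove` is a hypothetical name. Only `getIgnoreAboveDefaultValue` echoes the name used in the later commit.

```java
import java.util.Map;

// Simplified model, for illustration only; not the Elasticsearch implementation.
class IgnoreAboveDefaultSketch {
    static final String MODE_KEY = "index.mode"; // assumed key for the index mode
    static final String IGNORE_ABOVE_KEY = "index.mapping.ignore_above";

    // Computes only the fallback value; it is consulted when no explicit value is set.
    static String getIgnoreAboveDefaultValue(Map<String, String> settings) {
        if ("logsdb".equals(settings.get(MODE_KEY))) {
            return "8191";
        }
        return String.valueOf(Integer.MAX_VALUE);
    }

    // An explicit value always wins over the computed default.
    static int resolveIgnoreAbove(Map<String, String> settings) {
        String explicit = settings.get(IGNORE_ABOVE_KEY);
        String raw = explicit != null ? explicit : getIgnoreAboveDefaultValue(settings);
        return Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        System.out.println(resolveIgnoreAbove(Map.of(MODE_KEY, "logsdb")));
        System.out.println(resolveIgnoreAbove(Map.of(MODE_KEY, "logsdb", IGNORE_ABOVE_KEY, "1024")));
        System.out.println(resolveIgnoreAbove(Map.of()));
    }
}
```

In this model a LogsDB index gets 8191 unless `index.mapping.ignore_above` is set explicitly, which is the override path the question asks about.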

salvatore-campagna · 2024-10-01T08:51:41Z

@elasticmachine update branch

elasticmachine · 2024-10-01T08:51:44Z

merge conflict between base and head

salvatore-campagna · 2024-10-01T12:50:53Z

@elasticmachine update branch

salvatore-campagna · 2024-10-01T14:09:47Z

@martijnvg @felixbarny I need an approval if we want to merge this

martijnvg

LGTM

salvatore-campagna · 2024-10-02T09:26:49Z

@elasticmachine update branch

elasticsearchmachine · 2024-10-02T13:03:06Z

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 113442

lkts · 2024-10-22T22:49:52Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Questions ?

Please refer to the Backport tool documentation

(cherry picked from commit 521e434)

# Conflicts:
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java

lkts · 2024-10-23T18:09:07Z

💚 All backports created successfully

Status	Branch	Result
✅	8.16

Questions ?

Please refer to the Backport tool documentation

feature: ignore_above default to 1024 for logsdb

8f6d6f8

salvatore-campagna added the auto-backport-and-merge, :StorageEngine/Logs, v8.16.0, and v9.0.0 labels Sep 24, 2024

salvatore-campagna self-assigned this Sep 24, 2024

elasticsearchmachine added the Team:StorageEngine label Sep 24, 2024

salvatore-campagna added the >non-issue label Sep 24, 2024

salvatore-campagna requested review from kkrik-es and martijnvg September 24, 2024 11:06

salvatore-campagna requested review from andrewkroh and felixbarny September 24, 2024 15:13

fix: default to 32766 / 4 = 8191

eed95bd

salvatore-campagna changed the title ~~ignore_above default to 1024 for logsdb~~ ignore_above default to 8191 for logsdb Sep 30, 2024

Merge branch 'main' into feature/ignore-above-logsdb

677c222

kkrik-es reviewed Sep 30, 2024

salvatore-campagna requested a review from kkrik-es September 30, 2024 13:00

salvatore-campagna added 2 commits September 30, 2024 15:27

fix: extract method getIgnoreAboveDefaultValue

a47efe3

test: update default value

e95feb7

Merge branch 'main' into feature/ignore-above-logsdb

da0af1c

Merge branch 'main' into feature/ignore-above-logsdb

0c29423

felixbarny approved these changes Oct 1, 2024

martijnvg approved these changes Oct 2, 2024

test: include some more values

5dff174

Merge branch 'main' into feature/ignore-above-logsdb

b4300ee

salvatore-campagna merged commit 521e434 into elastic:main Oct 2, 2024
16 checks passed

elasticsearchmachine added the backport pending label Oct 2, 2024

lkts mentioned this pull request Oct 22, 2024

[8.x] ignore_above default to 8191 for logsdb (#113442) #115373

Merged

lkts mentioned this pull request Oct 23, 2024

[8.16] ignore_above default to 8191 for logsdb (#113442) #115451

Merged

lkts added a commit that referenced this pull request Nov 2, 2024

ignore_above default to 8191 for logsdb (#113442) (#116122)

06d003b

(cherry picked from commit 521e434)
