Skip to content

Conversation

salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented Sep 24, 2024

In LogsDB we would like to use a default value of 8191 for the index-level setting
index.mapping.ignore_above. The value for ignore_above is the character count,
but Lucene counts bytes. Here we set the limit to 32766 / 4 = 8191 since UTF-8
characters may occupy at most 4 bytes.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Copy link
Member

dnhatn commented Sep 24, 2024

The changes look good, but I think 1024 is too small to be the default.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Sep 24, 2024

@felixbarny @andrewkroh We would like to know if this value is good for our integrations.
Especially for fields that do not have a value set at mapping level, an index-level setting would result in applying a possibly unwanted limit. For fields which might include stack traces we might end-up missing the ability to search on them. This would result in a regression for fields not using ignore_above and expecting no limit.

My opinion is that we can still have a default...maybe changing the default to a larger value...2k, 4k...and that making sure that fields requiring larger-then-the-default value are explicit about it.

@felixbarny
Copy link
Member

1024 is the default that is used in ECS. I'm not sure where this exact number was determined and why we haven't chosen a slightly larger default. The intention behind it is to avoid document rejections when running into the Lucene's byte-length limit of 32kb. I think we can choose a generic default for LogsDB that would prevent us running into that limit (with any unicode character) but that's bigger than the ECS limit of 1024.

While I think it makes sense to have a default value for ignore_above in LogsDB to avoid rejections, I think we should discuss whether there are better alternatives to ignore_above. For example, we could ignore values only if they would actually exceed the hard-limit in Lucene. Or instead of ignoring them entirely, we could truncate the values. There are some discussions in this issue that we may want to revise: #60329

What seems missing in your PR is to actually use the index setting in keyword field mappers and updating the docs to describe that there's a way to set an index-level default, similar to what we do with ignore_malformed.

@salvatore-campagna
Copy link
Contributor Author

1024 is the default that is used in ECS. I'm not sure where this exact number was determined and why we haven't chosen a slightly larger default. The intention behind it is to avoid document rejections when running into the Lucene's byte-length limit of 32kb. I think we can choose a generic default for LogsDB that would prevent us running into that limit (with any unicode character) but that's bigger than the ECS limit of 1024.

While I think it makes sense to have a default value for ignore_above in LogsDB to avoid rejections, I think we should discuss whether there are better alternatives to ignore_above. For example, we could ignore values only if they would actually exceed the hard-limit in Lucene. Or instead of ignoring them entirely, we could truncate the values. There are some discussions in this issue that we may want to revise: #60329

What seems missing in your PR is to actually use the index setting in keyword field mappers and updating the docs to describe that there's a way to set an index-level default, similar to what we do with ignore_malformed.

The ignore_above index setting was already introduced by #113121.
In this PR we just change the default value from Inter.MAX_VALUE to 1024 (or any other suitable value we decide) for LogsDB.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Sep 30, 2024

@felixbarny @andrewkroh we need feedback here, possibly before GA release. I would rather go for a high value like 8k to 12k rather than not having it.

@felixbarny
Copy link
Member

+1 on having a default value for ignore_above for now that's potentially higher than 1024 but still guarantees that we're not hitting the hard limit of 32kb in Lucene.

What's the highest value of ignore_above that would guarantee us to be under the hard-limit in Lucene? How are we encoding chars and what's the highest number of bytes per char? Is there any other static overhead in the encoding? How are we handling unicode code points that consist of multiple characters?

@salvatore-campagna
Copy link
Contributor Author

+1 on having a default value for ignore_above for now that's potentially higher than 1024 but still guarantees that we're not hitting the hard limit of 32kb in Lucene.

What's the highest value of ignore_above that would guarantee us to be under the hard-limit in Lucene? How are we encoding chars and what's the highest number of bytes per char? Is there any other static overhead in the encoding? How are we handling unicode code points that consist of multiple characters?

From our documentation:

NOTE: The value for `ignore_above` is the _character count_, but Lucene counts
bytes. If you use UTF-8 text with many non-ASCII characters, you may want to
set the limit to `32766 / 4 = 8191` since UTF-8 characters may occupy at most
4 bytes.

In Lucene the hard limit is 32766 bytes.

The highest value for ignore_above that guarantees being under the Lucene limit, accounting for the worst-case 4-byte encoding, is 8191.

Applying the value will depend on encoding and character count will match bytes count only if all characters are encoded using one byte. In UTF-8 encoding might result in 1 to 4 bytes per character if we consider also non-ascii characters but I guess for logging purposes we can safely assume that we are dealing with ASCII characters.

(AFAIK Elasticsearch uses UTF-8 encoding for strings).

About the static overhead I don't see any issue...keywords and text are normally converted to arrays of bytes.

@felixbarny
Copy link
Member

The highest value for ignore_above that guarantees being under the Lucene limit, accounting for the worst-case 4-byte encoding, is 8191.

Let's set the default limit to 8191 then.

@salvatore-campagna salvatore-campagna changed the title ignore_above default to 1024 for logsdb ignore_above default to 8191 for logsdb Sep 30, 2024
@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

public static final Setting<Integer> IGNORE_ABOVE_SETTING = Setting.intSetting("index.mapping.ignore_above", settings -> {
if (IndexSettings.MODE.get(settings) == IndexMode.LOGSDB
&& IndexMetadata.SETTING_INDEX_VERSION_CREATED.get(settings).onOrAfter(IndexVersions.ENABLE_IGNORE_ABOVE_LOGSDB)) {
return "8191";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we override this for an index with logsdb mode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is the default value.. Let's add a comment above, or move it to a static helper function for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The

settings -> { ... }

lambda just determines the default value if no explicit value is provided for the setting.
I will extract the lambda in a method with a descriptive name.

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna
Copy link
Contributor Author

@martijnvg @felixbarny I need an approval if we ant to merge this

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna salvatore-campagna merged commit 521e434 into elastic:main Oct 2, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 113442

gmarouli pushed a commit to gmarouli/elasticsearch that referenced this pull request Oct 3, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.
@lkts
Copy link
Contributor

lkts commented Oct 22, 2024

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

lkts pushed a commit to lkts/elasticsearch that referenced this pull request Oct 22, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.

(cherry picked from commit 521e434)

# Conflicts:
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java
@lkts
Copy link
Contributor

lkts commented Oct 23, 2024

💚 All backports created successfully

Status Branch Result
8.16

Questions ?

Please refer to the Backport tool documentation

lkts pushed a commit to lkts/elasticsearch that referenced this pull request Oct 23, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.

(cherry picked from commit 521e434)

# Conflicts:
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java
lkts added a commit that referenced this pull request Oct 23, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.

(cherry picked from commit 521e434)

# Conflicts:
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java

Co-authored-by: Salvatore Campagna <[email protected]>
lkts added a commit that referenced this pull request Oct 23, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.

(cherry picked from commit 521e434)

# Conflicts:
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java

Co-authored-by: Salvatore Campagna <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
lkts added a commit to lkts/elasticsearch that referenced this pull request Nov 1, 2024
In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes. Here we set the limit to `32766 / 4 = 8191` since UTF-8
characters may occupy at most 4 bytes.

(cherry picked from commit 521e434)
lkts added a commit that referenced this pull request Nov 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants