Store high-cardinality keyword fields in binary doc values #138548
base: main
Hi @jordan-powers, I've created a changelog YAML for you.
if (hasValue) {
    for (int i = 0; i < sortedBinaryDocValues.docValueCount(); i++) {
        BytesRef bytesRef = sortedBinaryDocValues.nextValue();
        emit(bytesRef.utf8ToString());
I found that the conversion from UTF-8 to UTF-16 and then back here is a significant overhead on wildcard/regex queries.
It's worth removing this round trip, though that probably belongs in a follow-up. Here's a hacky approach I made to fix this: parkertimmins@fa13b3b#diff-cf9d201e04fb4fd754a3981f450cda5e68c551392e781a78aaa4ef8ccc48bccd
Another idea is that maybe you could use BinaryDvConfirmedQuery. It operates on BytesRefs directly, so it probably does not have this round trip. (Though I'm not really sure if it makes sense to use.)
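To illustrate the round trip in question, here is a minimal JDK-only sketch. It is not code from this PR: the class name `Utf8RoundTripSketch` and both methods are hypothetical, and a plain `byte[]` stands in for Lucene's `BytesRef`. It contrasts decoding a stored UTF-8 value into a `String` before matching with comparing the stored bytes against a pre-encoded pattern. A simple prefix match is used in place of a real wildcard/regex automaton.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTripSketch {

    // Round-trip style: decode the stored UTF-8 bytes into a UTF-16 String,
    // then match on the String. Every call allocates and transcodes.
    static boolean startsWithViaString(byte[] utf8Value, String prefix) {
        String decoded = new String(utf8Value, StandardCharsets.UTF_8);
        return decoded.startsWith(prefix);
    }

    // Direct style: compare the stored UTF-8 bytes with a prefix that was
    // encoded once up front. No per-value decoding or allocation.
    static boolean startsWithViaBytes(byte[] utf8Value, byte[] utf8Prefix) {
        if (utf8Prefix.length > utf8Value.length) {
            return false;
        }
        for (int i = 0; i < utf8Prefix.length; i++) {
            if (utf8Value[i] != utf8Prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical stored keyword value and query prefix.
        byte[] stored = "host-42.example.org".getBytes(StandardCharsets.UTF_8);
        String prefix = "host-";
        byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8); // encoded once per query

        System.out.println(startsWithViaString(stored, prefix));
        System.out.println(startsWithViaBytes(stored, prefixBytes));
    }
}
```

For well-formed UTF-8 input the two methods agree, because a byte-level prefix of a valid UTF-8 string that is itself a complete valid encoding always ends on a code point boundary. The saving comes from encoding the pattern once per query rather than decoding every value.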
I'll look into that, thanks! Although if it requires more than a couple of lines to fix, I think it'll be best left as a follow-up. This PR is getting long enough as-is.
This PR adds a mapping parameter to keyword fields, `doc_values.cardinality`. When this parameter is set to `low` (the default), keyword fields will use sorted set doc values as normal. However, when this parameter is set to `high`, keyword fields will instead use binary doc values.

This is an optimization to remove the overhead of looking up keyword values by ordinal when the keyword field has high cardinality.
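The trade-off described above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions, not Lucene's or Elasticsearch's actual storage format: `DocValuesLayoutSketch`, `OrdinalColumn`, and `BinaryColumn` are hypothetical names, and fields are treated as single-valued for brevity. The ordinal layout keeps a sorted dictionary of unique terms plus one ordinal per document, so reading a value takes two steps. The binary layout stores each document's bytes directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeSet;

public class DocValuesLayoutSketch {

    // Ordinal-based layout (in the spirit of sorted set doc values):
    // a sorted dictionary of unique terms and one ordinal per document.
    static final class OrdinalColumn {
        final String[] dictionary;
        final int[] ordByDoc;

        OrdinalColumn(String[] valueByDoc) {
            dictionary = new TreeSet<>(Arrays.asList(valueByDoc)).toArray(new String[0]);
            ordByDoc = new int[valueByDoc.length];
            for (int doc = 0; doc < valueByDoc.length; doc++) {
                ordByDoc[doc] = Arrays.binarySearch(dictionary, valueByDoc[doc]);
            }
        }

        // Two steps: document -> ordinal, then ordinal -> term.
        String value(int doc) {
            return dictionary[ordByDoc[doc]];
        }
    }

    // Binary layout: each document's UTF-8 bytes are stored directly.
    static final class BinaryColumn {
        final byte[][] bytesByDoc;

        BinaryColumn(String[] valueByDoc) {
            bytesByDoc = new byte[valueByDoc.length][];
            for (int doc = 0; doc < valueByDoc.length; doc++) {
                bytesByDoc[doc] = valueByDoc[doc].getBytes(StandardCharsets.UTF_8);
            }
        }

        // One step: document -> bytes (decoded here only for display).
        String value(int doc) {
            return new String(bytesByDoc[doc], StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // Hypothetical high-cardinality field: every document has a distinct value.
        String[] values = { "req-9f3a", "req-01bc", "req-77de", "req-4c20" };
        OrdinalColumn ordinals = new OrdinalColumn(values);
        BinaryColumn binary = new BinaryColumn(values);

        System.out.println("dictionary size: " + ordinals.dictionary.length
            + ", documents: " + values.length);
        System.out.println(ordinals.value(2) + " == " + binary.value(2));
    }
}
```

When almost every document has a distinct value, the dictionary grows to roughly one entry per document, so it deduplicates little while the extra ordinal lookup remains on every read. That is the case the `high` setting is meant to address.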
This is still a work in progress, but I am opening the draft PR to start getting CI running.