-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Esql mv_contains function #133636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Esql mv_contains function #133636
Conversation
…entation improvements. Fix issue with byteref being empty, which caused fold to fail.
…ompatibility test environment. - not sure how to test it as, I feel like the version should be main on main / dev. Doing the dance for now.
…lifecycle metadata Documentation rewording. Co-authored-by: Liam Thompson <[email protected]>
- Fixing tests by removing logic to return null if all parameters are null. The standard generator had to be circumvented, should follow up with separate PR to make it more intelligent to avoid it. - Overwritten part of the test methods to avoid the null expectation.
Co-authored-by: Iván Cea Fontenla <[email protected]>
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @mjmbischoff, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second look on the evaluators, I found what looks like a bug. Added also a comment around how we can test that case to better catch them
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Outdated
Show resolved
Hide resolved
ROW a = "a", b = ["a", "b", "c"], n = null | ||
| EVAL aa = mv_contains(a, a), | ||
bb = mv_contains(b, b), | ||
ab = mv_contains(a, b), | ||
ba = mv_contains(b,a), | ||
na = mv_contains(n, a), | ||
an = mv_contains(a, n), | ||
nn = mv_contains(n,n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reproduce the other comment problem with nulls, this won't be enough, as this always works with a single row.
A possible way to generate the data for that could be:
ROW a = [1, 2, 3], n = null
| MV_EXPAND a # Now we have multiple rows
| EVAL a = CASE(a == 2, null, a) # And we add a null to the non-null column
| EVAL
an = mv_contains(a, n).
na = mv_contains(n, a)
If it works how I think and this effectively uses the MvContainsNullEvaluator, this should end up with an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really love a ROWS
source command that can take multiple ROW
's for both examples and test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unresolving this, as the test for mixed nulls wasn't added yet (Mixed in the same column) 👀
The ROWS
or similar syntax would be handy, but it would be a breaking change, and it's not planned. Usually the test indices have enough things to test anyway, or a new index can be added (Not a quick change, I would avoid that here).
For custom index tests, we would then make a YAML test (Example). But rarely for functions, as CSV tests are usually enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 66ba735 Let me know if I missed any combinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for adding it! You could add the same CASE(...) for the subset too (blocks with some nulls inside for subsets). Since the custom evaluator was removed, this isn't as important. But could be a nice test to have, specially if we autogenerate the evaluator at a later stage
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Outdated
Show resolved
Hide resolved
Refactor null handling in `MvContains` evaluators and add `MvContainsNullSupersetEvaluator` for better type-specific evaluation logic.
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Show resolved
Hide resolved
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Outdated
Show resolved
Hide resolved
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Outdated
Show resolved
Hide resolved
…torFactory` and adding global constant boolean evaluators for improved reusability.
…torFactory` and adding global constant boolean evaluators for improved reusability.
…ed records for `ConstantNull`, `ConstantTrue`, and `ConstantFalse`, adding memory usage tracking.
x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/EvalOperator.java
Show resolved
Hide resolved
...main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContains.java
Outdated
Show resolved
Hide resolved
final var valueCount = subset.getValueCount(position); | ||
final var startIndex = subset.getFirstValueIndex(position); | ||
for (int valueIndex = startIndex; valueIndex < startIndex + valueCount; valueIndex++) { | ||
var value = valueExtractor.extractValue(subset, valueIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm I didn't see this before, but I think boxing could be a problem performance-wise.
We usually have an evaluator/vector/block per type for 2 reasons:
- Avoid itable accesses
- Avoid primitive boxing
Now, how much extra time will this take, I don't know. This MV_CONTAINS
function has more overhead than other scalar functions, so maybe it's not that important. But I'm not sure about that.
There are 2 optimizations that would be ideal here:
- Having a method per type
- Having a specialization for sorted values (See
Block#mvOrdering()
)
The second one can be made later. The first one too, I guess?
It's in preview, so I think it's fine either way. If it's merged as-is, I would create an issue to improve it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, to do this per-type, whether now or later, this could be a case of using StringTemplates (Example), to avoid repeated code.
These functions could be extracted into their own static classes (Or the same, but for the full evaluators, with the functions directly inside).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, should create a followup issue for this. Also want to follow up on improving the Implementers to avoid having Evaluators here and have them autogenerated. Leaving this unresolved until after merge so I don't forget to open an issue.
})); | ||
} | ||
|
||
// Adjusted from static method anyNullIsNull in {@code AbstractScalarFunctionTestCase#} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a line describing exactly what changed here? So it's easier later to remove this override if/when we extend the original to handle this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 23d92f0 but want to leave this unresolved as well to follow up - need to look into making the test class more adaptive.
ROW a = "a", b = ["a", "b", "c"], n = null | ||
| EVAL aa = mv_contains(a, a), | ||
bb = mv_contains(b, b), | ||
ab = mv_contains(a, b), | ||
ba = mv_contains(b,a), | ||
na = mv_contains(n, a), | ||
an = mv_contains(a, n), | ||
nn = mv_contains(n,n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unresolving this, as the test for mixed nulls wasn't added yet (Mixed in the same column) 👀
The ROWS
or similar syntax would be handy, but it would be a breaking change, and it's not planned. Usually the test indices have enough things to test anyway, or a new index can be added (Not a quick change, I would avoid that here).
For custom index tests, we would then make a YAML test (Example). But rarely for functions, as CSV tests are usually enough
...java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvContainsTests.java
Outdated
Show resolved
Hide resolved
… visibility of methods / constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes!
And as a summary, the things to work on later that I remember are:
- Multivalue evaluator implementer with nulls
- Performance improvements: avoiding boxing and using inherent sorting when available (Probably to be solved before removing the preview label? Some microbenchmarks would be interesting here too)
💔 Backport failedThe backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running |
This is a fixup for #133099 which was reverted from main as ESQL: Track memory in evaluators (#133392) got merged to main at the same time. Causing compile errors.