-
Notifications
You must be signed in to change notification settings - Fork 25.5k
ESQL: Fix CASE when conditions are multivalued #112401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When CASE hits a multivalued field it was previously either crashing on fold or evaluating it to the first value. Since booleans are loaded in sorted order from lucene that *usually* means `false`. This changes the behavior to line up with the rest of ESQL - now multivalued fields are treated as `false` with a warning. You might say "hey wait! multivalued fields usually become `null`, not `false`!". Yes, dear reader, you are right. Very right. But! `CASE`'s contract is to immediatly convert its values into `true` or `false` using the standard boolean tri-valued logic. So `null` just become `false` immediately. This is how PostgreSQL, MySQL, and SQLite behave: ``` > SELECT CASE WHEN null THEN 1 ELSE 2 END; 2 ``` They turn that `null` into a false. And we're right there with them. Except, of course, that we're turning `[false, false]` and the like into `null` first. See!? It's consitent. Consistently confusing, but sane at least. The warning message just says "treating multivalued field as false" rather than explaining all of that. This also fixes up a few of CASE's docs which I noticed were kind of busted while working on CASE. I think the docs generation is having a lot of trouble with CASE so I've manually hacked the right thing into place, but we should figure out a better solution eventually. Closes elastic#112359
Documentation preview: |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @nik9000, I've created a changelog YAML for you. |
* Create a new warnings object based on the given mode | ||
* @param warningsMode The warnings collection strategy to use | ||
* @param source used to indicate where in the query the warning occured | ||
* @param source used to indicate where in the query the warning occurred |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've given us a little more flexibility, specifically so I can tell folks that multivalues in CASE default to false
. If it's confusing why we do that read the PR description. Er, well, it's still confusing a bit. But that tries to explain it.
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/conditional/Case.java
Show resolved
Hide resolved
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/conditional/Case.java
Show resolved
Hide resolved
+ ", elseVal=" | ||
+ elseValueFactory | ||
+ ']'; | ||
return "CaseEvaluator[conditions=" + conditionsFactories + ", elseVal=" + elseValueFactory + ']'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just making testing harder and doesn't really provide any debug value.
new IllegalArgumentException("CASE expects a single-valued boolean") | ||
); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the core of the fix! It's fun how it's like 5 lines of fix, 200 testing, 200 ceremony to plumb the fix properly, and another 200 of docs updates.
|
||
List<Set<String>> typesFromSignature = new ArrayList<>(); | ||
Set<String> returnFromSignature = new HashSet<>(); | ||
Set<String> returnFromSignature = new TreeSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes comparing the errors easier because they are in alphabetical order....
// CASE is so weird let's just hack this together manually | ||
Sequence seq = new Sequence(new Literal("condition"), new Syntax(","), new Literal("trueValue")); | ||
expressions.add(new Repetition(seq, 1, null)); | ||
expressions.add(new Repetition(new Literal("elseValue"), 0, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CASE is the only function that has this funny signature. It's actually pretty helpful to see the railroad diagram for CASE too!
DataType.GEO_POINT, | ||
DataType.CARTESIAN_SHAPE, | ||
DataType.GEO_SHAPE | ||
)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously these were all written by hand, but we know a lot more about how to write these now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a second go at this tomorrow; I agree with the proposed semantics of CASE
when presented with multi-valued booleans.
@luigidellaquila: I think CASE
is something where users would profit from keywords like ANY
and ALL
. I think it's reasonable to have a multi-valued field and wanting to express something like CASE (ANY multivalued_field, 1, 0)
or CASE(ALL multivalued_field, 1, 0)
; without the warning that this PR adds, it could look like CASE
already does it, so it's good we are being explicit about it in the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests!
- I'd add one with
ROW foo = CASE([true, true], ...
just to make it obvious thatCASE
doesn't have implicitANY
orALL
logic. - Should we add
ROW a = [true, false] | EVAL c = CASE(MV_SLICE(a, 0, 2), "foo", "bar")
as well? This one failed differently from others, so I guess it exercised a different code path? A test case with an mv expression insideCASE
is probably a good idea, anyway. Or, similarly but more realistically, using a conversion function likeROW a = ["true", "false"] | EVAL c = CASE(a::boolean, "foo", "bar")
- If we want to be supremely paranoid, we could also check the interaction with multivalued union types, e.g.
FROM idx* | EVAL x = CASE(field::boolean, 1, 2)
; I'd only do that if we expect weird interactions with block loading, though, which probably is not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally if you ask "should we add this csv-spec test?" the answer is just yes. It's cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Nik, this is great. I think there's a bug in Case.foldable
which we should fix before merging, otherwise my remarks are rather minor (suggesting more tests, as is becoming tradition :D ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this up to date!
} | ||
} | ||
throw new IllegalStateException("Unreachable"); | ||
return createWarnings(warningsMode, source, "treating result as null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: to avoid drift, we could put treating result as null
into a string constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if (o instanceof List) { | ||
/* | ||
* multivalued fields fold to null which folds to false. | ||
* So they *are* foldable if the value is foldable. | ||
*/ | ||
return condition.value.foldable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong, and it's inconsistent with partiallyFold
.
We special case for a MV - in which case we know that the value will never be evaluated, so we probably should continue
, as we'll fall down into the next condition.
I think this prevents situations where ConstantFolding
could already fully trivialize a Case
and bubble that further up in the expression; instead it'll take another run of the optimizer rules because we need PartiallyFoldCase
to do the job here.
Since the foldability is less than trivial for CASE
, I think we should add one or two cases to the CaseExtraTests
and check if Case.foldable()
returns correct results. Or, insideLogicalPlanOptimizerTests
, create a logical plan with a CASE
and only run ConstantFolding
, resp. only PartiallyFoldCase
, on it, rather than the whole optimizer; currently, PartiallyFoldCase
jumps to the rescue, even if ConstantFolding
can't do its job correctly, so we don't see this bug take effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the list be empty? if so, is that the same as false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think lists occur unless they have at least 2 elements.
I think the intention is to treat any multivalue the same as false, which I think is correct, but the implementation in these lines is a bug.
More interestingly, there could be lists with 1 element which is true
. I'd say that's a bug because that should simply be a boolean literal, but I'm not sure we never create such literals, either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty lists can't occur, not. If it's empty it's null
instead.
Object o = condition.condition.fold(); | ||
if (o instanceof List) { | ||
// multivalued field folds to null which folds to false | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use a helper method (or two) on Condition
to tell us whether the condition is always true/false? That could be reused in Case.foldable
and would avoid the foldability logic to drift apart.
* We treat failures as null just like any other failure. | ||
* It's just that we then *immediately* convert it to | ||
* true or false using the tri-valued boolean logic stuff. | ||
* And that makes it into false. This is, *exactly* what | ||
* happens in PostgreSQL and MySQL and SQLite: | ||
* > SELECT CASE WHEN null THEN 1 ELSE 2 END; | ||
* 2 | ||
* Rather than go into depth about this in the warning message, | ||
* we just say "false". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @costin and @astefan : Nik is nicely making the MV semantics for CASE consistent.
Just FYI because we could also choose to define CASE([true, true], ... ) == null
similarly to functions, which applied to multivalues just become null
. I think Nik's approach is more useful, because making CASE
on multivalues always return null
would probably also imply that CASE(true, a, [true, true], b)
should return null
, even though it doesn't matter that the second condition is multivalued.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Nik's approach is more useful
It's also the same approach as PostgreSQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Case makes the equivalence of null == false rather it checks if the condition is true.
That is SELECT CASE WHEN null THEN 1 ELSE 2 END
returns 2 because the condition null
is not true.
It's a subtle difference but it means CASE(true, a, [true, true], b)
should return a because the condition is true
so the else branch is not even evaluated.
FunctionDefinition definition = definition(name); | ||
if (definition != null) { | ||
EsqlFunctionRegistry.FunctionDescription description = EsqlFunctionRegistry.description(definition); | ||
if (name.equals("case")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we use a constant for case
here and in RailRoadDiagram.java
? Maybe also in the function registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh.
public void testPartialFoldMv() { | ||
Case c = new Case( | ||
Source.synthetic("case"), | ||
new Literal(Source.EMPTY, List.of(true, true), DataType.BOOLEAN), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Maybe let's randomize the booleans and the number of MVs?
- Maybe let's also consider cases with multiple multivalued conditions, not just one?
- Also maybe let's add cases where
partiallyFold
can fully fold, and returns the same asfold
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first two I'm doing now. The last one I'll have a think about. I think I'd like to make a method to test partiallFold inside CaseTests
so it can use the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a bunch of cases.
suppliers.add(threeArgCase(null, false, type, List.of())); | ||
suppliers.add( | ||
threeArgCase( | ||
List.of(true, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's randomize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit out of scope, but: shouldn't we also at least test the 4 and 5 parameter cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. We get those tests from the Extra
tests right now. Not great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be overlooking some implementation details however I would focus on defining the contract of Case as the condition having to be a single-value true; anything else (multi value, false, null, etc...) would cause the else branch to be evaluated which would repeat the above process.
There's a question on whether we should give a warning when checking for true - which we could add (better be safe than sorry) however to your point, this is not about null == false rather about the value not being true.
if (o instanceof List) { | ||
/* | ||
* multivalued fields fold to null which folds to false. | ||
* So they *are* foldable if the value is foldable. | ||
*/ | ||
return condition.value.foldable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the list be empty? if so, is that the same as false?
* We treat failures as null just like any other failure. | ||
* It's just that we then *immediately* convert it to | ||
* true or false using the tri-valued boolean logic stuff. | ||
* And that makes it into false. This is, *exactly* what | ||
* happens in PostgreSQL and MySQL and SQLite: | ||
* > SELECT CASE WHEN null THEN 1 ELSE 2 END; | ||
* 2 | ||
* Rather than go into depth about this in the warning message, | ||
* we just say "false". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Case makes the equivalence of null == false rather it checks if the condition is true.
That is SELECT CASE WHEN null THEN 1 ELSE 2 END
returns 2 because the condition null
is not true.
It's a subtle difference but it means CASE(true, a, [true, true], b)
should return a because the condition is true
so the else branch is not even evaluated.
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/conditional/Case.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alex-spies wanted to have another look at this one with all the new tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Nik, don't wait for another review from me! I was just curious about the added testing that you showed me the other day, but you had my thumbs-up already before that :)
} | ||
Object o = condition.condition.fold(); | ||
if (o instanceof List) { | ||
if (Boolean.TRUE.equals(condition.condition.fold())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
if (Boolean.TRUE.equals(condition.condition.fold())) { | ||
/* | ||
* multivalued fields fold to null which folds to false. | ||
* So they *are* foldable if the value is foldable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment here means the converse.
* So they *are* foldable if the value is foldable. | |
* So they *are* foldable if remaining list of conditions is foldable. |
Same in line 271.
When CASE hits a multivalued field it was previously either crashing on fold or evaluating it to the first value. Since booleans are loaded in sorted order from lucene that usually means
false
. This changes the behavior to line up with the rest of ESQL - now multivalued fields are treated asfalse
with a warning.You might say "hey wait! multivalued fields usually become
null
, notfalse
!". Yes, dear reader, you are right. Very right. But!CASE
's contract is to immediatly convert its values intotrue
orfalse
using the standard boolean tri-valued logic. Sonull
just becomefalse
immediately. This is how PostgreSQL, MySQL, and SQLite behave:They turn that
null
into a false. And we're right there with them. Except, of course, that we're turning[false, false]
and the like intonull
first. See!? It's consitent. Consistently confusing, but sane at least.The warning message just says "treating multivalued field as false" rather than explaining all of that.
This also fixes up a few of CASE's docs which I noticed were kind of busted while working on CASE. I think the docs generation is having a lot of trouble with CASE so I've manually hacked the right thing into place, but we should figure out a better solution eventually.
Closes #112359