Skip to content

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Aug 30, 2024

When CASE hits a multivalued field it was previously either crashing on fold or evaluating it to the first value. Since booleans are loaded in sorted order from lucene that usually means false. This changes the behavior to line up with the rest of ESQL - now multivalued fields are treated as false with a warning.

You might say "hey wait! multivalued fields usually become null, not false!". Yes, dear reader, you are right. Very right. But! CASE's contract is to immediatly convert its values into true or false using the standard boolean tri-valued logic. So null just become false immediately. This is how PostgreSQL, MySQL, and SQLite behave:

> SELECT CASE WHEN null THEN 1 ELSE 2 END;
2

They turn that null into a false. And we're right there with them. Except, of course, that we're turning [false, false] and the like into null first. See!? It's consitent. Consistently confusing, but sane at least.

The warning message just says "treating multivalued field as false" rather than explaining all of that.

This also fixes up a few of CASE's docs which I noticed were kind of busted while working on CASE. I think the docs generation is having a lot of trouble with CASE so I've manually hacked the right thing into place, but we should figure out a better solution eventually.

Closes #112359

When CASE hits a multivalued field it was previously either crashing on
fold or evaluating it to the first value. Since booleans are loaded in
sorted order from lucene that *usually* means `false`. This changes the
behavior to line up with the rest of ESQL - now multivalued fields are
treated as `false` with a warning.

You might say "hey wait! multivalued fields usually become `null`, not
`false`!". Yes, dear reader, you are right. Very right. But! `CASE`'s
contract is to immediatly convert its values into `true` or `false`
using the standard boolean tri-valued logic. So `null` just become
`false` immediately. This is how PostgreSQL, MySQL, and SQLite behave:
```
> SELECT CASE WHEN null THEN 1 ELSE 2 END;
2
```

They turn that `null` into a false. And we're right there with them.
Except, of course, that we're turning `[false, false]` and the like into
`null` first. See!? It's consitent. Consistently confusing, but sane at
least.

The warning message just says "treating multivalued field as false"
rather than explaining all of that.

This also fixes up a few of CASE's docs which I noticed were kind of
busted while working on CASE. I think the docs generation is having a
lot of trouble with CASE so I've manually hacked the right thing into
place, but we should figure out a better solution eventually.

Closes elastic#112359
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 30, 2024
* Create a new warnings object based on the given mode
* @param warningsMode The warnings collection strategy to use
* @param source used to indicate where in the query the warning occured
* @param source used to indicate where in the query the warning occurred
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given us a little more flexibility, specifically so I can tell folks that multivalues in CASE default to false. If it's confusing why we do that read the PR description. Er, well, it's still confusing a bit. But that tries to explain it.

+ ", elseVal="
+ elseValueFactory
+ ']';
return "CaseEvaluator[conditions=" + conditionsFactories + ", elseVal=" + elseValueFactory + ']';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just making testing harder and doesn't really provide any debug value.

new IllegalArgumentException("CASE expects a single-valued boolean")
);
continue;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the core of the fix! It's fun how it's like 5 lines of fix, 200 testing, 200 ceremony to plumb the fix properly, and another 200 of docs updates.


List<Set<String>> typesFromSignature = new ArrayList<>();
Set<String> returnFromSignature = new HashSet<>();
Set<String> returnFromSignature = new TreeSet<>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes comparing the errors easier because they are in alphabetical order....

// CASE is so weird let's just hack this together manually
Sequence seq = new Sequence(new Literal("condition"), new Syntax(","), new Literal("trueValue"));
expressions.add(new Repetition(seq, 1, null));
expressions.add(new Repetition(new Literal("elseValue"), 0, 1));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CASE is the only function that has this funny signature. It's actually pretty helpful to see the railroad diagram for CASE too!

DataType.GEO_POINT,
DataType.CARTESIAN_SHAPE,
DataType.GEO_SHAPE
)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously these were all written by hand, but we know a lot more about how to write these now.

@nik9000 nik9000 requested review from ivancea and alex-spies August 30, 2024 16:46
Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a second go at this tomorrow; I agree with the proposed semantics of CASE when presented with multi-valued booleans.

@luigidellaquila: I think CASE is something where users would profit from keywords like ANY and ALL. I think it's reasonable to have a multi-valued field and wanting to express something like CASE (ANY multivalued_field, 1, 0) or CASE(ALL multivalued_field, 1, 0); without the warning that this PR adds, it could look like CASE already does it, so it's good we are being explicit about it in the warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests!

  • I'd add one with ROW foo = CASE([true, true], ... just to make it obvious that CASE doesn't have implicit ANY or ALL logic.
  • Should we add ROW a = [true, false] | EVAL c = CASE(MV_SLICE(a, 0, 2), "foo", "bar") as well? This one failed differently from others, so I guess it exercised a different code path? A test case with an mv expression inside CASE is probably a good idea, anyway. Or, similarly but more realistically, using a conversion function like ROW a = ["true", "false"] | EVAL c = CASE(a::boolean, "foo", "bar")
  • If we want to be supremely paranoid, we could also check the interaction with multivalued union types, e.g. FROM idx* | EVAL x = CASE(field::boolean, 1, 2); I'd only do that if we expect weird interactions with block loading, though, which probably is not the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally if you ask "should we add this csv-spec test?" the answer is just yes. It's cheap.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nik, this is great. I think there's a bug in Case.foldable which we should fix before merging, otherwise my remarks are rather minor (suggesting more tests, as is becoming tradition :D ).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this up to date!

}
}
throw new IllegalStateException("Unreachable");
return createWarnings(warningsMode, source, "treating result as null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: to avoid drift, we could put treating result as null into a string constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 228 to 233
if (o instanceof List) {
/*
* multivalued fields fold to null which folds to false.
* So they *are* foldable if the value is foldable.
*/
return condition.value.foldable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong, and it's inconsistent with partiallyFold.

We special case for a MV - in which case we know that the value will never be evaluated, so we probably should continue, as we'll fall down into the next condition.

I think this prevents situations where ConstantFolding could already fully trivialize a Case and bubble that further up in the expression; instead it'll take another run of the optimizer rules because we need PartiallyFoldCase to do the job here.

Since the foldability is less than trivial for CASE, I think we should add one or two cases to the CaseExtraTests and check if Case.foldable() returns correct results. Or, insideLogicalPlanOptimizerTests , create a logical plan with a CASE and only run ConstantFolding, resp. only PartiallyFoldCase, on it, rather than the whole optimizer; currently, PartiallyFoldCase jumps to the rescue, even if ConstantFolding can't do its job correctly, so we don't see this bug take effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the list be empty? if so, is that the same as false?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think lists occur unless they have at least 2 elements.

I think the intention is to treat any multivalue the same as false, which I think is correct, but the implementation in these lines is a bug.

More interestingly, there could be lists with 1 element which is true. I'd say that's a bug because that should simply be a boolean literal, but I'm not sure we never create such literals, either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty lists can't occur, not. If it's empty it's null instead.

Comment on lines 270 to 274
Object o = condition.condition.fold();
if (o instanceof List) {
// multivalued field folds to null which folds to false
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a helper method (or two) on Condition to tell us whether the condition is always true/false? That could be reused in Case.foldable and would avoid the foldability logic to drift apart.

Comment on lines +327 to +335
* We treat failures as null just like any other failure.
* It's just that we then *immediately* convert it to
* true or false using the tri-valued boolean logic stuff.
* And that makes it into false. This is, *exactly* what
* happens in PostgreSQL and MySQL and SQLite:
* > SELECT CASE WHEN null THEN 1 ELSE 2 END;
* 2
* Rather than go into depth about this in the warning message,
* we just say "false".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @costin and @astefan : Nik is nicely making the MV semantics for CASE consistent.

Just FYI because we could also choose to define CASE([true, true], ... ) == null similarly to functions, which applied to multivalues just become null. I think Nik's approach is more useful, because making CASE on multivalues always return null would probably also imply that CASE(true, a, [true, true], b) should return null, even though it doesn't matter that the second condition is multivalued.

Copy link
Member Author

@nik9000 nik9000 Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Nik's approach is more useful

It's also the same approach as PostgreSQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Case makes the equivalence of null == false rather it checks if the condition is true.
That is SELECT CASE WHEN null THEN 1 ELSE 2 END returns 2 because the condition null is not true.
It's a subtle difference but it means CASE(true, a, [true, true], b) should return a because the condition is true so the else branch is not even evaluated.

FunctionDefinition definition = definition(name);
if (definition != null) {
EsqlFunctionRegistry.FunctionDescription description = EsqlFunctionRegistry.description(definition);
if (name.equals("case")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should we use a constant for case here and in RailRoadDiagram.java? Maybe also in the function registry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh.

public void testPartialFoldMv() {
Case c = new Case(
Source.synthetic("case"),
new Literal(Source.EMPTY, List.of(true, true), DataType.BOOLEAN),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Maybe let's randomize the booleans and the number of MVs?
  • Maybe let's also consider cases with multiple multivalued conditions, not just one?
  • Also maybe let's add cases where partiallyFold can fully fold, and returns the same as fold?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two I'm doing now. The last one I'll have a think about. I think I'd like to make a method to test partiallFold inside CaseTests so it can use the test cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a bunch of cases.

suppliers.add(threeArgCase(null, false, type, List.of()));
suppliers.add(
threeArgCase(
List.of(true, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's randomize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit out of scope, but: shouldn't we also at least test the 4 and 5 parameter cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We get those tests from the Extra tests right now. Not great.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be overlooking some implementation details however I would focus on defining the contract of Case as the condition having to be a single-value true; anything else (multi value, false, null, etc...) would cause the else branch to be evaluated which would repeat the above process.
There's a question on whether we should give a warning when checking for true - which we could add (better be safe than sorry) however to your point, this is not about null == false rather about the value not being true.

Comment on lines 228 to 233
if (o instanceof List) {
/*
* multivalued fields fold to null which folds to false.
* So they *are* foldable if the value is foldable.
*/
return condition.value.foldable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the list be empty? if so, is that the same as false?

Comment on lines +327 to +335
* We treat failures as null just like any other failure.
* It's just that we then *immediately* convert it to
* true or false using the tri-valued boolean logic stuff.
* And that makes it into false. This is, *exactly* what
* happens in PostgreSQL and MySQL and SQLite:
* > SELECT CASE WHEN null THEN 1 ELSE 2 END;
* 2
* Rather than go into depth about this in the warning message,
* we just say "false".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Case makes the equivalence of null == false rather it checks if the condition is true.
That is SELECT CASE WHEN null THEN 1 ELSE 2 END returns 2 because the condition null is not true.
It's a subtle difference but it means CASE(true, a, [true, true], b) should return a because the condition is true so the else branch is not even evaluated.

@nik9000 nik9000 mentioned this pull request Sep 5, 2024
86 tasks
Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alex-spies wanted to have another look at this one with all the new tests.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Nik, don't wait for another review from me! I was just curious about the added testing that you showed me the other day, but you had my thumbs-up already before that :)

}
Object o = condition.condition.fold();
if (o instanceof List) {
if (Boolean.TRUE.equals(condition.condition.fold())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

if (Boolean.TRUE.equals(condition.condition.fold())) {
/*
* multivalued fields fold to null which folds to false.
* So they *are* foldable if the value is foldable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment here means the converse.

Suggested change
* So they *are* foldable if the value is foldable.
* So they *are* foldable if remaining list of conditions is foldable.

Same in line 271.

@nik9000 nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 6, 2024
@nik9000 nik9000 requested a review from a team as a code owner September 9, 2024 15:30
@elasticsearchmachine elasticsearchmachine merged commit ef3a5a1 into elastic:main Sep 9, 2024
14 checks passed
@nik9000 nik9000 deleted the esql_fixup_case branch September 9, 2024 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: CASE bugs with multivalued conditions
4 participants