-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix categorize csv test #113089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix categorize csv test #113089
Changes from 4 commits
64454dd
c089300
7f515dd
bf22956
cb64f98
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,15 @@ | ||
categorize | ||
required_capability: categorize | ||
|
||
# Drop the category ID, because it's non-deterministic | ||
FROM sample_data | ||
| STATS count=COUNT(), values=VALUES(message) BY category=CATEGORIZE(message) | ||
| STATS count=COUNT(), values=MV_SORT(VALUES(message)) BY category=CATEGORIZE(message) | ||
| SORT count DESC, category ASC | ||
| DROP category | ||
; | ||
|
||
count:long | values:keyword | category:integer | ||
3 | [Connected to 10.1.0.1, Connected to 10.1.0.2, Connected to 10.1.0.3] | 0 | ||
3 | [Connection error] | 1 | ||
1 | [Disconnected] | 2 | ||
count:long | values:keyword | ||
3 | [Connected to 10.1.0.1, Connected to 10.1.0.2, Connected to 10.1.0.3] | ||
3 | [Connection error] | ||
1 | [Disconnected] | ||
; |
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -304,7 +304,12 @@ public enum Cap { | |||||||||||||||
/** | ||||||||||||||||
* Support explicit casting from string literal to DATE_PERIOD or TIME_DURATION. | ||||||||||||||||
*/ | ||||||||||||||||
CAST_STRING_LITERAL_TO_TEMPORAL_AMOUNT; | ||||||||||||||||
CAST_STRING_LITERAL_TO_TEMPORAL_AMOUNT, | ||||||||||||||||
|
||||||||||||||||
/** | ||||||||||||||||
* Supported the text categorization function "CATEGORIZE". | ||||||||||||||||
*/ | ||||||||||||||||
CATEGORIZE(true); | ||||||||||||||||
|
||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: does this piece of code work as intended? If you add a non-existing required capability to a CSV test, it always passes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a capability presence check here: elasticsearch/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/CsvTests.java Lines 255 to 261 in bb872e6
It was moved to snapshot-only for reasons I don't remember. It probably shouldn't. I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the CI runs snapshot, and therefore it succeeded There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a tag you can add to make the tests run in release mode if you want. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've run the test locally with and without snapshot and both work, so I think we're good. |
||||||||||||||||
private final boolean snapshotOnly; | ||||||||||||||||
private final FeatureFlag featureFlag; | ||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -179,14 +179,9 @@ public class EsqlFeatures implements FeatureSpecification { | |
*/ | ||
public static final NodeFeature RESOLVE_FIELDS_API = new NodeFeature("esql.resolve_fields_api"); | ||
|
||
/** | ||
* Support categorize | ||
*/ | ||
public static final NodeFeature CATEGORIZE = new NodeFeature("esql.categorize"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yikes. I hadn't noticed the feature vs capability. You can't really remove these once they've left snapshot-land. But this one stayed in snapshot-land forever so I think it's safe to remove it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And I missed the pretty explicit:
|
||
|
||
private Set<NodeFeature> snapshotBuildFeatures() { | ||
assert Build.current().isSnapshot() : Build.current(); | ||
return Set.of(METRICS_SYNTAX, CATEGORIZE); | ||
return Set.of(METRICS_SYNTAX); | ||
} | ||
|
||
@Override | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not deterministic because of the order of the rows? What about sorting the data before it enters the STATS?
Dumping the tested function output is weird 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-added the
category
column and aSORT message
inbetween. Thanks for the suggestion.For my understanding: does the additional
SORT
step mean that everyCATEGORIZE
is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Your instincts for where we split are almost certainly right. We'll just run the agg after the sort. It's not ideal testing, but it's fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, the CSV tests are executed in different ways. The
CsvTests
class, I think it executes like an unit test, like a single node. There are also tests like https://github.com/elastic/elasticsearch/blob/c1daf18bf5f5178d43dc17d3a3d1f5db9773098e/x-pack/plugin/esql/qa/server/multi-clusters/src/javaRestTest/java/org/elasticsearch/xpack/esql/ccq/MultiClusterSpecIT.java, which use a full cluster.But now thinking about the categorize function and how it works, I don't really know how is the sort executed. And specially as this is the first grouping function depending on order, maybe it makes sense in fact to not check the category column...
Sorry, I think the original solution was correct 🙏
Let's see if @nik9000 can confirm this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failing test suggests that data does not arrive at the categorizer in sorted order.
@nik9000 do you understand why?
I'll leave the test muted for now. Once we replace the (arbitrary) IDs by regexes, the output will be deterministic.