Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 0 additions & 6 deletions muted-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -221,12 +221,6 @@ tests:
- class: org.elasticsearch.xpack.ml.integration.MlJobIT
method: testCreateJob_WithClashingFieldMappingsFails
issue: https://github.com/elastic/elasticsearch/issues/113046
- class: org.elasticsearch.xpack.esql.qa.multi_node.EsqlSpecIT
method: test {categorize.Categorize SYNC}
issue: https://github.com/elastic/elasticsearch/issues/113054
- class: org.elasticsearch.xpack.esql.qa.multi_node.EsqlSpecIT
method: test {categorize.Categorize ASYNC}
issue: https://github.com/elastic/elasticsearch/issues/113055
- class: org.elasticsearch.xpack.sql.qa.security.JdbcSqlSpecIT
method: test {case-functions.testUcaseInline1}
issue: https://github.com/elastic/elasticsearch/issues/112641
Expand Down
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
categorize
required_capability: categorize

# Drop the category ID, because it's non-deterministic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not deterministic because of the order of the rows? What about sorting the data before it enters the STATS?
Dumping the tested function output is weird 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-added the category column and a SORT message inbetween. Thanks for the suggestion.

For my understanding: does the additional SORT step mean that every CATEGORIZE is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: does the additional SORT step mean that every CATEGORIZE is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?

Yes. Your instincts for where we split are almost certainly right. We'll just run the agg after the sort. It's not ideal testing, but it's fine for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, the CSV tests are executed in different ways. The CsvTests class, I think it executes like an unit test, like a single node. There are also tests like https://github.com/elastic/elasticsearch/blob/c1daf18bf5f5178d43dc17d3a3d1f5db9773098e/x-pack/plugin/esql/qa/server/multi-clusters/src/javaRestTest/java/org/elasticsearch/xpack/esql/ccq/MultiClusterSpecIT.java, which use a full cluster.
But now thinking about the categorize function and how it works, I don't really know how is the sort executed. And specially as this is the first grouping function depending on order, maybe it makes sense in fact to not check the category column...

Sorry, I think the original solution was correct 🙏
Let's see if @nik9000 can confirm this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failing test suggests that data does not arrive at the categorizer in sorted order.

@nik9000 do you understand why?

I'll leave the test muted for now. Once we replace the (arbitrary) IDs by regexes, the output will be deterministic.

FROM sample_data
| STATS count=COUNT(), values=VALUES(message) BY category=CATEGORIZE(message)
| STATS count=COUNT(), values=MV_SORT(VALUES(message)) BY category=CATEGORIZE(message)
| SORT count DESC, category ASC
| DROP category
;

count:long | values:keyword | category:integer
3 | [Connected to 10.1.0.1, Connected to 10.1.0.2, Connected to 10.1.0.3] | 0
3 | [Connection error] | 1
1 | [Disconnected] | 2
count:long | values:keyword
3 | [Connected to 10.1.0.1, Connected to 10.1.0.2, Connected to 10.1.0.3]
3 | [Connection error]
1 | [Disconnected]
;
Original file line number Diff line number Diff line change
Expand Up @@ -304,7 +304,12 @@ public enum Cap {
/**
* Support explicit casting from string literal to DATE_PERIOD or TIME_DURATION.
*/
CAST_STRING_LITERAL_TO_TEMPORAL_AMOUNT;
CAST_STRING_LITERAL_TO_TEMPORAL_AMOUNT,

/**
* Supported the text categorization function "CATEGORIZE".
*/
CATEGORIZE(true);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: does this piece of code work as intended?

If you add a non-existing required capability to a CSV test, it always passes.

Copy link
Contributor

@ivancea ivancea Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a capability presence check here:

if (Build.current().isSnapshot()) {
assertThat(
"Capability is not included in the enabled list capabilities on a snapshot build. Spelling mistake?",
testCase.requiredCapabilities,
everyItem(in(EsqlCapabilities.CAPABILITIES))
);
} else {

It was moved to snapshot-only for reasons I don't remember. It probably shouldn't.
I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process

Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CI runs snapshot, and therefore it succeeded

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a tag you can add to make the tests run in release mode if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the test locally with and without snapshot and both work, so I think we're good.

private final boolean snapshotOnly;
private final FeatureFlag featureFlag;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -179,14 +179,9 @@ public class EsqlFeatures implements FeatureSpecification {
*/
public static final NodeFeature RESOLVE_FIELDS_API = new NodeFeature("esql.resolve_fields_api");

/**
* Support categorize
*/
public static final NodeFeature CATEGORIZE = new NodeFeature("esql.categorize");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yikes. I hadn't noticed the feature vs capability. You can't really remove these once they've left snapshot-land. But this one stayed in snapshot-land forever so I think it's safe to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I missed the pretty explicit:

Don't make more of those - add them to {@link EsqlCapabilities} instead.


private Set<NodeFeature> snapshotBuildFeatures() {
assert Build.current().isSnapshot() : Build.current();
return Set.of(METRICS_SYNTAX, CATEGORIZE);
return Set.of(METRICS_SYNTAX);
}

@Override
Expand Down
Loading