Skip to content

Conversation

@jan-elastic
Copy link
Contributor

No description provided.

@jan-elastic jan-elastic added >non-issue :ml Machine learning Team:ML Meta label for the ML team v8.18.0 labels Nov 28, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

COUNT():long | category:keyword
1 | .*?\.\*\?Connected\.\+\?to\.\*\?.*?
1 | .*?\.\*\?Connection\.\+\?error\.\*\?.*?
1 | .*?\.\*\?Disconnected\.\*\?.*?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The categorization analyzer is more aggressive w.r.t. specials chars, hence these changes.

;

COUNT():long | category:keyword
2 | .*?DB.+?servers.*?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB is parsed as a hex number and dropped. Obviously suboptimal, but unrelated to this PR.

@jan-elastic jan-elastic force-pushed the esql-categorize-analyzer branch 2 times, most recently from ef76db0 to 7f61b4e Compare November 28, 2024 15:22
Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jan-elastic ! I have mostly minor comments; the only comment I care a bit more about is whether we need to pass the analysis registry as far down as we currently do, but even so this could be merged as-is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: To show that the analysis registry is used correctly, could we maybe add one more test that uses a different analysis registry that leads to a different categorization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to write such test. How do I construct a different analysis registry, that contains different versions of the necessary filters/tokenizers. Sounds like quite some effort with low reward.

Comment on lines -291 to +308
1 | .*?😊👍🏽.+?detcennocsiD.*?
3 | .*?😊👍🏽.+?ot.+?detcennoC.*?
3 | .*?😊👍🏽.+?rorre.+?noitcennoC.*?
1 | .*?detcennocsiD.*?
3 | .*?ot.+?detcennoC.*?
3 | .*?rorre.+?noitcennoC.*?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivancea , this change will solve the recent failing tests, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, pretty sure it does

enrichLookupService,
lookupFromIndexService,
new EsPhysicalOperationProviders(contexts)
new EsPhysicalOperationProviders(contexts, searchService.getIndicesService().getAnalysis())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could actually already build the categorizer here, to avoid unnecessary dependencies on the analysis registry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in a different comment: in the future the analyzer should be configurable, and to achieve I think that it should be wired like this. (I agree that for now we could have created the fixed analyzer here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not so sure anymore, see #117695 (comment)

@jan-elastic jan-elastic force-pushed the esql-categorize-analyzer branch from 7f61b4e to 8ebf21e Compare November 29, 2024 09:00
@jan-elastic jan-elastic merged commit d729558 into main Nov 29, 2024
17 checks passed
@jan-elastic jan-elastic deleted the esql-categorize-analyzer branch November 29, 2024 10:00
jan-elastic added a commit that referenced this pull request Nov 29, 2024
* Correct categorization analyzer in ES|QL categorize

* close categorizer if constructing analyzer fails

* Rename capability CATEGORIZE_V4

* add comments
elasticsearchmachine pushed a commit that referenced this pull request Nov 29, 2024
* Correct categorization analyzer in ES|QL categorize

* close categorizer if constructing analyzer fails

* Rename capability CATEGORIZE_V4

* add comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v8.18.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants