Skip to content

Conversation

@jan-elastic
Copy link
Contributor

No description provided.

@jan-elastic jan-elastic added >non-issue :ml Machine learning v9.0.0 v8.18.0 Team:ML Meta label for the ML team >feature and removed >non-issue labels Dec 6, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic jan-elastic marked this pull request as draft December 6, 2024 16:02
@jan-elastic jan-elastic requested a review from nik9000 December 6, 2024 16:03
@alex-spies alex-spies self-requested a review December 6, 2024 16:04
@jan-elastic jan-elastic force-pushed the esql-categorize-multiple-groupings branch 2 times, most recently from 099e143 to 7409695 Compare December 9, 2024 09:05
@jan-elastic jan-elastic force-pushed the esql-categorize-multiple-groupings branch from 7409695 to 35e9811 Compare December 9, 2024 14:22
@jan-elastic jan-elastic marked this pull request as ready for review December 9, 2024 14:32
@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

@jan-elastic jan-elastic requested a review from nik9000 December 9, 2024 16:17
@ivancea ivancea self-requested a review December 10, 2024 11:52
@jan-elastic jan-elastic added the auto-backport Automatically create backport pull requests when merged label Dec 10, 2024
Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👀
Some nits and new tests in comments; the impl looks quite solid to me

if (id == 0) {
builder.appendNull();
} else {
builder.appendBytesRef(regexes.getBytesRef(id + idsOffset, scratch));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now: We're repeating, potentially, a lot of bytesref values here. I wonder if there is or it would make sense to have a BytesRefBlock that instead of all the bytesrefs, stores every value just once, and then a reference per position:

AAAAAA
BBBBBBB
AAAAAA
AAAAAA

->

// 1: AAAAAA
// 2: BBBBBBB
1
2
1
1

@nik9000 Something to consider for later? Maybe it's too specific for this. And anyway, the next EVAL or whatever will duplicate the value again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a nice thing to have, but definitely out of scope for this PR.

However, the next EVAL should not duplicate the value again.

If you have:

// 1: AAAAAA
// 2: BBBBBBB
1
2
1
1

then an efficient EVAL x=SUBSTRING(x, 1, 2) should give

// 1: AA
// 2: BB
1
2
1
1

without ever duplicating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that SUBSTRING to not duplicate, we would need to add that "hashtable" strategy in the BytesRefBlockBuilder. It looks goo (?), but I wonder if using that by default could perform negatively in some scenarios. Something to try eventually probably

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like worth trying in the future. Are you making a note (issue) of this, so that the idea doesn't get lost?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I'll comment it with Nik, just in case it was considered and discarded already, and then I'll document it in an issue somewhere

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya, this generally looks very good, thank you @jan-elastic !

I only have one observation about potential untracked memory that I think we should ponder at least a bit to ensure we're safe; see below.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait with merging until we had a chance to look at this subtle point.

@jan-elastic
Copy link
Contributor Author

jan-elastic commented Dec 11, 2024

@alex-spies Thanks for the review! Processed all your comments, except the OOM one, which needs more discussion. PTAL

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jan-elastic !

Let's take care of the memory accounting in a follow-up.

Maybe we want to add a couple more block hash test cases, but otherwise this LGTM and should be unblocked :)

@jan-elastic jan-elastic merged commit d270158 into main Dec 12, 2024
17 checks passed
@jan-elastic jan-elastic deleted the esql-categorize-multiple-groupings branch December 12, 2024 15:50
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 118173

jan-elastic added a commit that referenced this pull request Dec 12, 2024
* ES|QL categorize with multiple groupings.

* Fix VerifierTests

* Close stuff when constructing CategorizePackedValuesBlockHash fails

* CategorizePackedValuesBlockHashTests

* Improve categorize javadocs

* Update docs/changelog/118173.yaml

* Create CategorizePackedValuesBlockHash's deletegate page differently

* Double check in BlockHash builder for single categorize

* Reuse blocks array

* More CSV tests

* Remove assumeTrue categorize_v5

* Rename test

* Two more verifier tests

* more CSV tests

* Add JavaDocs/comments

* spotless

* Refactor/unify recategorize

* Better memory accounting

* fix csv test

* randomize CategorizePackedValuesBlockHashTests

* Add TODO
elasticsearchmachine pushed a commit that referenced this pull request Dec 13, 2024
* ES|QL categorize with multiple groupings.

* Fix VerifierTests

* Close stuff when constructing CategorizePackedValuesBlockHash fails

* CategorizePackedValuesBlockHashTests

* Improve categorize javadocs

* Update docs/changelog/118173.yaml

* Create CategorizePackedValuesBlockHash's deletegate page differently

* Double check in BlockHash builder for single categorize

* Reuse blocks array

* More CSV tests

* Remove assumeTrue categorize_v5

* Rename test

* Two more verifier tests

* more CSV tests

* Add JavaDocs/comments

* spotless

* Refactor/unify recategorize

* Better memory accounting

* fix csv test

* randomize CategorizePackedValuesBlockHashTests

* Add TODO

Co-authored-by: Elastic Machine <[email protected]>
maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024
…8590)

* ES|QL categorize with multiple groupings.

* Fix VerifierTests

* Close stuff when constructing CategorizePackedValuesBlockHash fails

* CategorizePackedValuesBlockHashTests

* Improve categorize javadocs

* Update docs/changelog/118173.yaml

* Create CategorizePackedValuesBlockHash's deletegate page differently

* Double check in BlockHash builder for single categorize

* Reuse blocks array

* More CSV tests

* Remove assumeTrue categorize_v5

* Rename test

* Two more verifier tests

* more CSV tests

* Add JavaDocs/comments

* spotless

* Refactor/unify recategorize

* Better memory accounting

* fix csv test

* randomize CategorizePackedValuesBlockHashTests

* Add TODO

Co-authored-by: Elastic Machine <[email protected]>
maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024
…8590)

* ES|QL categorize with multiple groupings.

* Fix VerifierTests

* Close stuff when constructing CategorizePackedValuesBlockHash fails

* CategorizePackedValuesBlockHashTests

* Improve categorize javadocs

* Update docs/changelog/118173.yaml

* Create CategorizePackedValuesBlockHash's deletegate page differently

* Double check in BlockHash builder for single categorize

* Reuse blocks array

* More CSV tests

* Remove assumeTrue categorize_v5

* Rename test

* Two more verifier tests

* more CSV tests

* Add JavaDocs/comments

* spotless

* Refactor/unify recategorize

* Better memory accounting

* fix csv test

* randomize CategorizePackedValuesBlockHashTests

* Add TODO

Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending >feature :ml Machine learning Team:ML Meta label for the ML team v8.18.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants