Conversation

@kderusso
Member

@kderusso kderusso commented Sep 8, 2025

Adds a new function, CHUNK, that takes text from a field and returns chunks based on the requested chunking strategy.

For this PR, the function takes a size, which corresponds to the default number of words per chunk in a sentence-based chunking strategy. Planned follow-up PRs will add support for explicit chunking settings or an inference ID on top of these defaults. Future optimizations could also include supporting a max chunk size of LIMIT and optimizations for semantic text fields.

Examples of how to call this function:

FROM wikipedia
 | WHERE MATCH(content, "churchill")
 | EVAL chunks = chunk(content, {"num_chunks": 3, "chunk_size": 20}) 
 | MV_EXPAND chunks
 | KEEP chunks
 | LIMIT 10

FROM wikipedia
 | WHERE MATCH(content, "churchill")
 | EVAL chunks = chunk(content) 
 | MV_EXPAND chunks
 | KEEP chunks
 | LIMIT 10
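
To make the options in the first example concrete, here is a minimal Java sketch of sentence-based chunking with a word budget. It is not the chunker used by this PR. The class and method names (`SentenceChunkSketch`, `chunk`, `wordCount`) are hypothetical, and it assumes that `chunk_size` is a maximum word count per chunk and that `num_chunks` caps how many chunks are returned.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceChunkSketch {

    /** Counts whitespace-separated words in a string. */
    static int wordCount(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Groups whole sentences into chunks of at most chunkSize words and returns
     * at most numChunks chunks. Assumption of this sketch: a single sentence
     * longer than chunkSize is kept whole as its own oversized chunk.
     */
    static List<String> chunk(String text, int numChunks, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);

        StringBuilder current = new StringBuilder();
        int currentWords = 0;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            int words = wordCount(sentence);
            if (words == 0) {
                continue; // skip whitespace-only segments
            }
            // Close the current chunk if adding this sentence would exceed the budget.
            if (currentWords > 0 && currentWords + words > chunkSize) {
                chunks.add(current.toString());
                if (chunks.size() == numChunks) {
                    return chunks; // reached the requested number of chunks
                }
                current.setLength(0);
                currentWords = 0;
            }
            if (currentWords > 0) {
                current.append(' ');
            }
            current.append(sentence);
            currentWords += words;
        }
        // Flush the trailing chunk if there is still room for it.
        if (currentWords > 0 && chunks.size() < numChunks) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Winston Churchill was a British statesman. "
            + "He served as Prime Minister twice. He also wrote many books.";
        for (String c : chunk(text, 2, 8)) {
            System.out.println(c);
        }
    }
}
```

In this sketch each returned list element plays the role of one value in the multivalued `chunks` column, which the queries above then split into separate rows with MV_EXPAND.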

@kderusso kderusso changed the title Kderusso/esql chunk function [ES|QL] Add CHUNK function Sep 8, 2025
@github-actions
Contributor

github-actions bot commented Oct 22, 2025

🔍 Preview links for changed docs

@github-actions
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

@kderusso kderusso added >enhancement :Search Relevance/ES|QL Search functionality in ES|QL labels Oct 22, 2025
@elasticsearchmachine
Collaborator

Hi @kderusso, I've created a changelog YAML for you.

@kderusso kderusso marked this pull request as ready for review October 22, 2025 17:33
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 22, 2025
@kderusso kderusso requested a review from a team October 22, 2025 17:34
Member

@carlosdelest carlosdelest left a comment

This looks really good! 💯

A couple of minor issues on validation and testing.

I think it would be worth adding a VerifierTests case to ensure we catch nulls on the numeric params. I believe the CSV tests cover all the other cases I can think of 👍

testCase.requiredCapabilities.contains(EsqlCapabilities.Cap.MULTI_MATCH_FUNCTION.capabilityName())
);
assumeFalse(
"CSV tests cannot currently handle CHUNK function",
Member

Why is this needed? Are we using something Lucene-specific in the chunker that prevents this from working?

Member Author

I can run the CSV tests via the integration spec test, but CsvTest returns null for all chunked input. Perhaps this is due to the fact that we're using the chunker? I haven't been able to debug the actual CSV test implementation.

@kderusso kderusso requested a review from carlosdelest October 23, 2025 17:51
@kderusso kderusso requested a review from ioanatia October 27, 2025 13:57
Member

@carlosdelest carlosdelest left a comment

LGTM 👍

Minor comments, I think we can simplify the options handling on evaluation.

);
assumeFalse(
"CSV tests cannot currently handle CHUNK function",
testCase.requiredCapabilities.contains(EsqlCapabilities.Cap.CHUNK_FUNCTION.capabilityName())
Member

I'm still confused about why this doesn't work 🤔. We can do that in a follow-up, and we have the EsqlSpecIT fields, but it would be nice to be able to run CSV tests on this.

Member Author

There's a good chance that we won't do it, as a follow-up may also entail performing a Lucene query. TBD.

@kderusso kderusso enabled auto-merge (squash) October 29, 2025 20:23
@kderusso kderusso merged commit 03bc16c into elastic:main Oct 29, 2025
34 checks passed
shmuelhanoch pushed a commit to shmuelhanoch/elasticsearch that referenced this pull request Oct 29, 2025
* Add new function to chunk strings

* Refactor CHUNK function to support multiple values

* Default to returning all chunks

* [CI] Auto commit changes from spotless

* Handle warnings

* Loosen export restrictions to try to get compile error working

* Remove inference dependencies

* Fix compilation errors

* Remove more inference deps

* Fix compile errors from merge

* Fix existing tests

* Exclude from CSV tests

* Add more tests

* Cleanup

* [CI] Auto commit changes from spotless

* Cleanup

* Update docs/changelog/134320.yaml

* PR feedback

* Remove null field constraint

* [CI] Auto commit changes from spotless

* PR feedback: Refactor to use an options map

* Cleanup

* Regenerate docs

* Add test on a concatenated field

* Add multivalued field test

* Don't hardcode strings

* [CI] Auto commit changes from spotless

* PR feedback

---------

Co-authored-by: elasticsearchmachine <[email protected]>
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025

Labels

>enhancement :Search Relevance/ES|QL Search functionality in ES|QL Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.3.0

4 participants