[ES|QL] Add CHUNK function #134320
Conversation
🔍 Preview links for changed docs

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative: all 9.x versions are documented on the same page and published off of the main branch, instead of on separate pages for each minor version. We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:
✅ At the page level, to indicate which products/deployments the content applies to (mandatory).

What NOT to do:
❌ Don't remove or replace information that applies to an older version.
Hi @kderusso, I've created a changelog YAML for you.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
carlosdelest left a comment:
This looks really good! 💯
A couple of minor issues on validation and testing.
I think it would be worth adding a VerifierTests case to ensure we catch nulls in the numeric params, something like the sketch below. I believe CSV tests cover all the other testing I can think of 👍
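A hypothetical illustration of the query shape such a verifier test would reject at analysis time (the index and field names here are made up):

    FROM books
    | EVAL chunks = CHUNK(description, null)

The point is that a null size argument should produce a verifier error during analysis rather than surfacing at evaluation time.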
        testCase.requiredCapabilities.contains(EsqlCapabilities.Cap.MULTI_MATCH_FUNCTION.capabilityName())
    );
    assumeFalse(
        "CSV tests cannot currently handle CHUNK function",
carlosdelest:
Why is this needed? Are we using something specific from Lucene in the chunker that makes this not work?
kderusso:
I can run the CSV tests via the integration spec test, but CsvTests returns null for all chunked input. Perhaps this is because we're using the chunker? I haven't been able to debug the actual CSV test implementation.
carlosdelest left a comment:
LGTM 👍
Minor comments; I think we can simplify the options handling during evaluation.
    );
    assumeFalse(
        "CSV tests cannot currently handle CHUNK function",
        testCase.requiredCapabilities.contains(EsqlCapabilities.Cap.CHUNK_FUNCTION.capabilityName())
carlosdelest:
I'm still confused about why this doesn't work 🤔. We can do that in a follow-up, and we have the EsqlSpecIT tests, but it would be nice to be able to run CSV tests on this.
kderusso:
There's a good chance that we won't do it, as a follow-up may also entail performing a Lucene query. TBD.
Commits:
* Add new function to chunk strings
* Refactor CHUNK function to support multiple values
* Default to returning all chunks
* [CI] Auto commit changes from spotless
* Handle warnings
* Loosen export restrictions to try to get compile error working
* Remove inference dependencies
* Fix compilation errors
* Remove more inference deps
* Fix compile errors from merge
* Fix existing tests
* Exclude from CSV tests
* Add more tests
* Cleanup
* [CI] Auto commit changes from spotless
* Cleanup
* Update docs/changelog/134320.yaml
* PR feedback
* Remove null field constraint
* [CI] Auto commit changes from spotless
* PR feedback: Refactor to use an options map
* Cleanup
* Regenerate docs
* Add test on a concatenated field
* Add multivalued field test
* Don't hardcode strings
* [CI] Auto commit changes from spotless
* PR feedback

Co-authored-by: elasticsearchmachine <[email protected]>
Adds a new function, CHUNK, that takes text from a field and returns chunks based on the requested chunking strategy. For this PR, we're inputting a size, which corresponds to the default number of words in a sentence-based chunking strategy. Future planned PRs will add support for explicit chunking settings or an inference ID on top of these defaults. Future optimizations could also include supporting a max chunk size of LIMIT and optimizations for semantic text fields.
Examples of how to call this function:
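A minimal sketch of what a call might look like, assuming the signature described above (a text field plus a word-count size); the index and field names are hypothetical:

    FROM books
    | EVAL chunks = CHUNK(description, 20)
    | KEEP title, chunks

Since the function defaults to returning all chunks as a multivalued result, MV_EXPAND can fan the chunks out to one row each:

    FROM books
    | EVAL chunks = CHUNK(description, 20)
    | MV_EXPAND chunks
    | KEEP title, chunks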