-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Handle empty input inference #123763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle empty input inference #123763
Conversation
|
@elasticmachine update branch |
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start to this. Can you add a highlighter test where empty input is part of multi-chunk input (ex: ["some test data", " ", "now with chunks"])? The highlighter uses the offsets at query time to recreate the chunks, so this is a good test that the chunk offsets are valid end-to-end.
...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Show resolved
Hide resolved
|
@elasticmachine update branch |
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
|
Pinging @elastic/search-eng (Team:SearchOrg) |
|
Pinging @elastic/search-relevance (Team:Search - Relevance) |
|
Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you. |
kderusso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far Samiul!
| "Empty semantic_text field skips embedding generation": | ||
| - requires: | ||
| cluster_features: "semantic_text.handle_empty_input" | ||
| reason: skips generating embeddings when semantic_text field is contains empty or whitespace only input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: We usually put the reason as when the fix was introduced e.g. 8.19.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only mention 8.19 or we should mention 9.1.0 as well?
Skips embedding generation when semantic_text is empty or contains only whitespace, effective from 8.19 and 9.1.0.. How about this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's perfect!
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great iterations on this 🙌 ! This is coming along nicely. There's a bug when using the legacy format, which you can replicate with:
PUT test-index
{
"settings": {
"index.mapping.semantic_text.use_legacy_format": true
},
"mappings": {
"properties": {
"inference": {
"type": "semantic_text"
}
}
}
}
POST test-index/_doc/1
{
"inference": " "
}
GET test-index/_doc/1
You get a parsing error, when you should get:
{
"_index": "test-index",
"_id": "1",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"inference": {
"text": " ",
"inference": {
"inference_id": ".elser-2-elasticsearch",
"model_settings": null,
"chunks": []
}
}
}
}
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilter.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
|
@elasticmachine update branch |
kderusso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work!
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, this is very close. Just a few things to clean up before this is good to merge :)
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...e/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference_bwc.yml
Outdated
Show resolved
Hide resolved
...src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter_bwc.yml
Show resolved
Hide resolved
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the iterations!
💚 Backport successful
|
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <[email protected]> Co-authored-by: elasticsearchmachine <[email protected]>
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <[email protected]> Co-authored-by: elasticsearchmachine <[email protected]>
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <[email protected]> Co-authored-by: elasticsearchmachine <[email protected]>
This PR skips embedding generation when a sematic_text field is empty or with whitespace only.
doc2anddoc3will return without_inference_fieldsMulti field
doc1anddoc2will return without_inference_fields