Handle empty input inference #123763

Samiul-TheSoccerFan · 2025-02-28T21:13:18Z

This PR skips embedding generation when a semantic_text field is empty or contains only whitespace.

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT index1
{
    "mappings": {
        "properties": {
            "semantic_text_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-model"
            }
        }
    }
}

PUT index1/_doc/doc1
{
  "semantic_text_field": "test value"
}

PUT index1/_doc/doc2
{
  "semantic_text_field": ""
}

PUT index1/_doc/doc3
{
  "semantic_text_field": " "
}

GET index1/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["_inference_fields"]
}

doc2 and doc3 will return without _inference_fields
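
To illustrate the behaviour for doc1, doc2, and doc3, here is a minimal Python sketch of the kind of blank check described above. It is not the PR's Java code: `is_blank` and `inputs_needing_inference` are hypothetical names, and the sketch assumes "blank" means empty after stripping whitespace.

```python
def is_blank(value: str) -> bool:
    """Return True when the value is empty or contains only whitespace."""
    return value.strip() == ""


def inputs_needing_inference(values: list[str]) -> list[tuple[int, str]]:
    """Keep (position, text) pairs for the values that should be sent to the model."""
    return [(i, v) for i, v in enumerate(values) if not is_blank(v)]


# Mirrors the three documents indexed above (hypothetical in-memory stand-in).
docs = {"doc1": ["test value"], "doc2": [""], "doc3": [" "]}
for doc_id, values in docs.items():
    pending = inputs_needing_inference(values)
    print(doc_id, "->", pending if pending else "skip embedding generation")
```

Keeping the original position of each value means a later step could still line results up with the input array, even when some entries are skipped.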

Multi field

PUT index3
{
  "mappings": {
    "properties": {
      "city": {
        "type": "semantic_text",
        "inference_id": "my-elser-model",
        "fields": {
          "sparse": {
            "type":  "semantic_text",
            "inference_id": "my-elser-model"
          }
        }
      }
    }
  }
}

PUT index3/_doc/doc1
{
  "city": ["new york", "new york fries"]
}

PUT index3/_doc/doc2
{
  "city": ["", " "]
}

GET index3/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["_inference_fields"]
}

doc1 and doc2 will return without _inference_fields

Samiul-TheSoccerFan · 2025-03-03T16:11:32Z

@elasticmachine update branch

Mikep86

Good start to this. Can you add a highlighter test where empty input is part of multi-chunk input (ex: ["some test data", " ", "now with chunks"])? The highlighter uses the offsets at query time to recreate the chunks, so this is a good test that the chunk offsets are valid end-to-end.
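
A rough Python sketch of why the requested test is useful: if chunks are stored as offsets and rebuilt at query time, a skipped blank value must not shift the offsets of its neighbours. The `Chunk` layout here (value index plus start/end character offsets) and the function names are assumptions made for illustration, not the representation Elasticsearch actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    value_index: int  # which input value the chunk came from (assumed layout)
    start: int        # inclusive character offset within that value
    end: int          # exclusive character offset within that value


def chunk_offsets(values: list[str]) -> list[Chunk]:
    """One whole-value chunk per non-blank input; blank inputs produce no chunk."""
    return [Chunk(i, 0, len(v)) for i, v in enumerate(values) if v.strip()]


def recreate_chunks(values: list[str], chunks: list[Chunk]) -> list[str]:
    """Rebuild chunk text from the stored offsets, as a highlighter might at query time."""
    return [values[c.value_index][c.start:c.end] for c in chunks]


values = ["some test data", " ", "now with chunks"]
chunks = chunk_offsets(values)
print(recreate_chunks(values, chunks))  # ['some test data', 'now with chunks']
```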

...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml

Samiul-TheSoccerFan · 2025-03-05T15:35:31Z

@elasticmachine update branch

...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml

elasticsearchmachine · 2025-03-05T16:35:53Z

Pinging @elastic/search-eng (Team:SearchOrg)

elasticsearchmachine · 2025-03-05T16:35:53Z

Pinging @elastic/search-relevance (Team:Search - Relevance)

elasticsearchmachine · 2025-03-05T16:35:55Z

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

kderusso

Nice work so far Samiul!

docs/changelog/123763.yaml

kderusso · 2025-03-05T18:11:35Z

...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml

+"Empty semantic_text field skips embedding generation":
+  - requires:
+      cluster_features: "semantic_text.handle_empty_input"
+      reason: skips generating embeddings when semantic_text field is contains empty or whitespace only input


Nitpick: We usually put the reason as when the fix was introduced, e.g. 8.19.

Samiul-TheSoccerFan

Should we mention only 8.19, or 9.1.0 as well?

"Skips embedding generation when semantic_text is empty or contains only whitespace, effective from 8.19 and 9.1.0." How about this one?

kderusso

That's perfect!

Mikep86

Great iterations on this 🙌 ! This is coming along nicely. There's a bug when using the legacy format, which you can replicate with:

PUT test-index
{
  "settings": {
    "index.mapping.semantic_text.use_legacy_format": true
  },
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text"
      }
    }
  }
}

POST test-index/_doc/1
{
  "inference": "   "
}

GET test-index/_doc/1

You get a parsing error, when you should get:

{
    "_index": "test-index",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "inference": {
            "text": "   ",
            "inference": {
                "inference_id": ".elser-2-elasticsearch",
                "model_settings": null,
                "chunks": []
            }
        }
    }
}
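
The expected response can be summarised as: the original text is preserved, and the inference block is present but has no model settings and no chunks. A small Python sketch of that shape follows; `legacy_source_for` is a hypothetical helper written for this illustration and is not part of the Elasticsearch code base.

```python
import json


def legacy_source_for(text: str, inference_id: str) -> dict:
    """Build the legacy-format field value for blank input: text kept, no chunks."""
    if text.strip():
        # Non-blank input would go through the inference service, which this sketch omits.
        raise NotImplementedError("only the blank-input case is sketched here")
    return {
        "text": text,
        "inference": {
            "inference_id": inference_id,
            "model_settings": None,  # serialised as null
            "chunks": [],
        },
    }


print(json.dumps(legacy_source_for("   ", ".elser-2-elasticsearch"), indent=4))
```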

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

...ain/java/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilter.java

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml

...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml

...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml

Samiul-TheSoccerFan · 2025-03-07T12:08:09Z

@elasticmachine update branch

kderusso

LGTM, nice work!

Mikep86

Great work, this is very close. Just a few things to clean up before this is good to merge :)

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

...e/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference_bwc.yml

...src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter_bwc.yml

Mikep86

LGTM, thanks for the iterations!

elasticsearchmachine · 2025-03-08T04:40:13Z

💚 Backport successful

Status	Branch	Result
✅	8.x

* Added check for blank string to skip generating embeddings with unit test
* Adding yaml tests for skipping embedding generation
* dynamic update not required if model_settings stays null
* Updating node feature for handling empty input name and description
* Update yaml tests with refresh=true
* Update unit test to follow more accurate behavior
* Added yaml tests for multu chunks
* [CI] Auto commit changes from spotless
* Adding highlighter yaml tests for empty input
* Update docs/changelog/123763.yaml
* Update changelog and test reason to have more polished documentation
* adding input value into the response source and fixing unit tests by reformating
* Adding highligher test for backward compatibility and refactor existing test
* Added bwc tests for empty input and multi chunks
* Removed reindex for empty input from bwc
* [CI] Auto commit changes from spotless
* Fixing yaml test
* Update unit tests helper function to support both format
* [CI] Auto commit changes from spotless
* Adding cluster features for bwc
* Centralize logic for assertInference helper

---------

Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: elasticsearchmachine <[email protected]>

Samiul-TheSoccerFan added 2 commits February 28, 2025 14:16

Added check for blank string to skip generating embeddings with unit test

9d6a32e

Adding yaml tests for skipping embedding generation

16f0b5a

Samiul-TheSoccerFan added the >enhancement, auto-backport, v8.19.0, and v9.1.0 labels Feb 28, 2025

Merge branch 'main' into handle-empty-input-inference

96605bb

Mikep86 reviewed Mar 3, 2025

Samiul-TheSoccerFan and others added 6 commits March 5, 2025 07:09

dynamic update not required if model_settings stays null

6403aa0

Updating node feature for handling empty input name and description

6e0d484

Update yaml tests with refresh=true

aeaf117

Update unit test to follow more accurate behavior

bb99b3b

Added yaml tests for multu chunks

6509870

[CI] Auto commit changes from spotless

3c4c3ed

elasticmachine and others added 2 commits March 5, 2025 16:35

Merge branch 'main' into handle-empty-input-inference

f085df3

Adding highlighter yaml tests for empty input

f7d9359

Samiul-TheSoccerFan marked this pull request as ready for review March 5, 2025 16:35

Samiul-TheSoccerFan added the :SearchOrg/Relevance label Mar 5, 2025

Update docs/changelog/123763.yaml

285226a

kderusso reviewed Mar 5, 2025

Mikep86 reviewed Mar 6, 2025

Samiul-TheSoccerFan added 3 commits March 6, 2025 15:57

Update changelog and test reason to have more polished documentation

43406db

adding input value into the response source and fixing unit tests by reformating

78c5e12

Adding highligher test for backward compatibility and refactor existing test

33a533a

Samiul-TheSoccerFan and others added 3 commits March 7, 2025 06:39

Added bwc tests for empty input and multi chunks

cd15c9e

Removed reindex for empty input from bwc

2fb0092

[CI] Auto commit changes from spotless

1a275db

Samiul-TheSoccerFan requested review from Mikep86 and kderusso March 7, 2025 12:01

Merge branch 'main' into handle-empty-input-inference

7486fe8

kderusso approved these changes Mar 7, 2025

Mikep86 reviewed Mar 7, 2025

Samiul-TheSoccerFan and others added 6 commits March 7, 2025 14:37

Fixing yaml test

6123d1a

Update unit tests helper function to support both format

d31d281

[CI] Auto commit changes from spotless

78a390c

Adding cluster features for bwc

09a298a

Centralize logic for assertInference helper

72886bf

resolve conflicts from main

1179f84

Mikep86 approved these changes Mar 7, 2025

Samiul-TheSoccerFan merged commit f0d5220 into elastic:main Mar 8, 2025
17 checks passed

Samiul-TheSoccerFan mentioned this pull request Mar 8, 2025

[8.x] Handle empty input inference (#123763) #124396

Merged
