@Samiul-TheSoccerFan
Contributor

This PR skips embedding generation when a semantic_text field is empty or contains only whitespace.

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT index1
{
    "mappings": {
        "properties": {
            "semantic_text_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-model"
            }
        }
    }
}

PUT index1/_doc/doc1
{
  "semantic_text_field": "test value"
}

PUT index1/_doc/doc2
{
  "semantic_text_field": ""
}

PUT index1/_doc/doc3
{
  "semantic_text_field": " "
}

GET index1/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["_inference_fields"]
}

doc2 and doc3 are returned without _inference_fields, because no embeddings were generated for their empty and whitespace-only values.
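The skip rule can be illustrated with a small Python sketch. This is not the Java code from the PR: the helper name needs_inference is hypothetical, and Python's str.strip() is assumed to be a close enough stand-in for the blank-string check the PR adds.

```python
def needs_inference(value):
    """Return True if a semantic_text input should be sent to the inference service.

    Hypothetical helper: empty and whitespace-only strings are skipped.
    """
    # strip() removes leading and trailing whitespace; an empty result means blank
    return isinstance(value, str) and value.strip() != ""


# The three documents indexed into index1 in the example requests
docs = {"doc1": "test value", "doc2": "", "doc3": " "}

# Only documents with non-blank input would get embeddings
with_embeddings = [doc_id for doc_id, text in docs.items() if needs_inference(text)]
print(with_embeddings)
```

Only doc1 passes the check, which matches the search result described for index1.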

Multi field

PUT index3
{
  "mappings": {
    "properties": {
      "city": {
        "type": "semantic_text",
        "inference_id": "my-elser-model",
        "fields": {
          "sparse": {
            "type":  "semantic_text",
            "inference_id": "my-elser-model"
          }
        }
      }
    }
  }
}

PUT index3/_doc/doc1
{
  "city": ["new york", "new york fries"]
}

PUT index3/_doc/doc2
{
  "city": ["", " "]
}

GET index3/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["_inference_fields"]
}

doc2 will return without _inference_fields; doc1 has non-blank values, so embeddings are still generated for it.

@Samiul-TheSoccerFan Samiul-TheSoccerFan added >enhancement auto-backport Automatically create backport pull requests when merged v8.19.0 v9.1.0 labels Feb 28, 2025
@Samiul-TheSoccerFan
Contributor Author

@elasticmachine update branch

Contributor

@Mikep86 Mikep86 left a comment

Good start to this. Can you add a highlighter test where empty input is part of multi-chunk input (ex: ["some test data", " ", "now with chunks"])? The highlighter uses the offsets at query time to recreate the chunks, so this is a good test that the chunk offsets are valid end-to-end.
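To see why the offsets matter, here is a minimal Python sketch of the idea. It assumes the values of a multi-valued field are joined with a single separator character and that each non-blank value becomes exactly one chunk; SEPARATOR and chunk_offsets are hypothetical names, and the real chunking and separator in Elasticsearch may differ.

```python
SEPARATOR = "\n"  # hypothetical separator, chosen only for illustration


def chunk_offsets(values):
    """Return (start, end) offsets into the joined text for each non-blank value."""
    chunks = []
    offset = 0
    for value in values:
        if value.strip():
            # Non-blank value: record where it sits in the joined text
            chunks.append((offset, offset + len(value)))
        # Blank values produce no chunk but still occupy space in the joined text,
        # so the offset must advance past them as well
        offset += len(value) + len(SEPARATOR)
    return chunks


values = ["some test data", " ", "now with chunks"]
joined = SEPARATOR.join(values)
offsets = chunk_offsets(values)

# A highlighter-like consumer recreates the chunk text from the offsets alone
recreated = [joined[start:end] for start, end in offsets]
print(offsets)
print(recreated)
```

If the offset were not advanced past the skipped blank value, the second chunk would be recreated from the wrong position, which is the kind of error an end-to-end highlighter test would catch.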

@Samiul-TheSoccerFan
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan Samiul-TheSoccerFan marked this pull request as ready for review March 5, 2025 16:35
@Samiul-TheSoccerFan Samiul-TheSoccerFan added the :SearchOrg/Relevance Label for the Search (solution/org) Relevance team label Mar 5, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Member

@kderusso kderusso left a comment

Nice work so far Samiul!

"Empty semantic_text field skips embedding generation":
  - requires:
      cluster_features: "semantic_text.handle_empty_input"
      reason: skips generating embeddings when semantic_text field contains empty or whitespace only input
Member

Nitpick: we usually state in the reason when the fix was introduced, e.g. 8.19.

Contributor Author

Should we mention only 8.19, or 9.1.0 as well?

"Skips embedding generation when semantic_text is empty or contains only whitespace, effective from 8.19 and 9.1.0." How about this one?

Member

That's perfect!

Contributor

@Mikep86 Mikep86 left a comment

Great iterations on this 🙌 ! This is coming along nicely. There's a bug when using the legacy format, which you can replicate with:

PUT test-index
{
  "settings": {
    "index.mapping.semantic_text.use_legacy_format": true
  },
  "mappings": {
    "properties": {
      "inference": {
        "type": "semantic_text"
      }
    }
  }
}

POST test-index/_doc/1
{
  "inference": "   "
}

GET test-index/_doc/1

You get a parsing error, when you should get:

{
    "_index": "test-index",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "inference": {
            "text": "   ",
            "inference": {
                "inference_id": ".elser-2-elasticsearch",
                "model_settings": null,
                "chunks": []
            }
        }
    }
}
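The expected legacy-format shape for blank input can be sketched in Python as follows. This is an illustration only: legacy_field_source and run_inference are hypothetical names, the structure is copied from the expected response above, and the non-blank branch is a placeholder rather than how Elasticsearch actually builds the field.

```python
import json


def legacy_field_source(text, inference_id, run_inference=None):
    """Build the legacy-format _source value for one semantic_text field (sketch)."""
    if text.strip() == "":
        # Blank input: keep the original text, skip inference entirely,
        # leave model_settings unset and store no chunks
        return {
            "text": text,
            "inference": {
                "inference_id": inference_id,
                "model_settings": None,
                "chunks": [],
            },
        }
    # Non-blank input: run_inference is a hypothetical callable returning
    # (model_settings, chunks) for the given text
    model_settings, chunks = run_inference(text)
    return {
        "text": text,
        "inference": {
            "inference_id": inference_id,
            "model_settings": model_settings,
            "chunks": chunks,
        },
    }


source = {"inference": legacy_field_source("   ", ".elser-2-elasticsearch")}
print(json.dumps(source, indent=4))
```

The point of the sketch is that blank input should still round-trip through _source with its original text, rather than failing at parse time.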

@Samiul-TheSoccerFan
Contributor Author

@elasticmachine update branch

Member

@kderusso kderusso left a comment

LGTM, nice work!

Contributor

@Mikep86 Mikep86 left a comment

Great work, this is very close. Just a few things to clean up before this is good to merge :)

Contributor

@Mikep86 Mikep86 left a comment

LGTM, thanks for the iterations!

@Samiul-TheSoccerFan Samiul-TheSoccerFan merged commit f0d5220 into elastic:main Mar 8, 2025
17 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.x

Samiul-TheSoccerFan added a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Mar 8, 2025
* Added check for blank string to skip generating embeddings with unit test

* Adding yaml tests for skipping embedding generation

* dynamic update not required if model_settings stays null

* Updating node feature for handling empty input name and description

* Update yaml tests with refresh=true

* Update unit test to follow more accurate behavior

* Added yaml tests for multi chunks

* [CI] Auto commit changes from spotless

* Adding highlighter yaml tests for empty input

* Update docs/changelog/123763.yaml

* Update changelog and test reason to have more polished documentation

* adding input value into the response source and fixing unit tests by reformatting

* Adding highlighter test for backward compatibility and refactor existing test

* Added bwc tests for empty input and multi chunks

* Removed reindex for empty input from bwc

* [CI] Auto commit changes from spotless

* Fixing yaml test

* Update unit tests helper function to support both format

* [CI] Auto commit changes from spotless

* Adding cluster features for bwc

* Centralize logic for assertInference helper

---------

Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Mar 8, 2025
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025