Similarity search accuracy #3360

jais001 · 2022-10-11T09:07:49Z


Hi Team,
I am using Haystack with ElasticsearchDocumentStore and EmbeddingRetriever for semantic search in my application. My question is the following. The content field is built from the Issue and Solution columns of the dataset, e.g.:
Issue: Error code #32
Solution: error occurred due to problem with xyz
content: "{ "Issue: Error code #32", "Solution: error occurred due to problem with xyz"}"
But when we search with the query "Error code #32", this particular document comes back in 8th position. I think this is because the model ignores #32 and returns results related to "Error code". Is there a way to give #32 more weight during the search?

Answered by julian-risch


julian-risch · 2022-10-12T07:23:20Z

Maintainer

Hi @JaisVJ, if some of the queries are not full sentences but something like "Error code #32", then this looks to me more like a keyword search, with "error", "code" and "32" being the keywords. You could check whether the results for this query improve if you use the BM25Retriever instead of the EmbeddingRetriever. If they do, you could change your pipeline so that it contains two retrievers. Our tutorial 11 contains an example; just search for "CustomQueryClassifier" in that tutorial.
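To illustrate why a keyword retriever can rank this document higher, here is a minimal, self-contained sketch of BM25-style scoring in plain Python. It is not Haystack's BM25Retriever and not the tutorial's CustomQueryClassifier: the functions `tokenize`, `bm25_rank` and `looks_like_keyword_query`, the routing heuristic, and every sample document other than the "#32" one are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Illustrative documents; only the "#32" entry comes from the question.
DOCS = [
    "Issue: Error code #12 Solution: error occurred due to problem with abc",
    "Issue: Error code #32 Solution: error occurred due to problem with xyz",
    "Issue: Printer offline Solution: restart the spooler service",
]


def tokenize(text):
    """Lowercase and keep alphanumeric runs, so "#32" becomes the token "32"."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices ordered by a BM25-style score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))
    # Sort by descending score; ties keep the original document order.
    return [idx for score, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


def looks_like_keyword_query(query):
    """Hypothetical heuristic: short queries that are not questions."""
    return len(query.split()) <= 4 and not query.strip().endswith("?")


if __name__ == "__main__":
    query = "Error code #32"
    route = "keyword (BM25)" if looks_like_keyword_query(query) else "embedding"
    print("route:", route)
    print("ranking:", bm25_rank(query, DOCS))
```

Because "32" appears in only one document, it carries a high inverse document frequency and lifts that document above the others that merely share "error" and "code". In a real pipeline the same routing decision would send such queries to the keyword retriever and full-sentence queries to the embedding retriever.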
Another idea would be to store "issue" and "solution" in two different fields of the document when you create it from your dataset. "solution" could still be stored in "content", but "issue" could be stored as metadata instead. When you perform a search, you could then use the metadata field to filter for documents containing "error code 32".
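The metadata idea can be sketched in plain Python as follows. This is an illustration under assumptions, not Haystack code: the documents are ordinary dicts with "content" and "meta" keys, and the helper names (`extract_error_code`, `to_document`, `filter_by_error_code`), the extra `error_code` metadata field, and the "#12" sample row are hypothetical.

```python
import re

# Sample rows; only the "#32" row comes from the question.
ROWS = [
    {"Issue": "Error code #32", "Solution": "error occurred due to problem with xyz"},
    {"Issue": "Error code #12", "Solution": "error occurred due to problem with abc"},
]


def extract_error_code(text):
    """Return the digits following '#', or None if the text has no such code."""
    match = re.search(r"#(\d+)", text)
    return match.group(1) if match else None


def to_document(row):
    """Keep the solution as searchable content and the issue as metadata."""
    return {
        "content": row["Solution"],
        "meta": {
            "issue": row["Issue"],
            # Hypothetical extra field so that filtering is an exact match.
            "error_code": extract_error_code(row["Issue"]),
        },
    }


def filter_by_error_code(documents, query):
    """Keep only documents whose error code matches the one in the query."""
    code = extract_error_code(query)
    if code is None:
        # No code in the query: leave the candidate set unfiltered.
        return list(documents)
    return [doc for doc in documents if doc["meta"]["error_code"] == code]


if __name__ == "__main__":
    documents = [to_document(row) for row in ROWS]
    for doc in filter_by_error_code(documents, "Error code #32"):
        print(doc["meta"]["issue"], "->", doc["content"])
```

Storing the code in its own field turns the match into an exact comparison, so it no longer depends on how an embedding model represents "#32". The filtered set can then be ranked by whichever retriever the pipeline uses.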

1 reply

jais001 Oct 12, 2022
Author

Hey @julian-risch! Thank you for your support, I will try these options.
