Problems choosing score_threshold in similarity_search_with_relevance_scores with FAISS storage and distance == DistanceStrategy.MAX_INNER_PRODUCT


Why? Two functions in the call chain use score_threshold for opposite purposes. First,
[similarity_search_with_score_by_vector](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/vectorstores/faiss.py#L387)
computes the scalar (inner product) similarity and keeps the docs whose similarity is at least score_threshold. The piece of code behind the link:
```
if score_threshold is not None:
            cmp = (
                operator.ge
                if self.distance_strategy
                in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
                else operator.le
            )
            docs = [
                (doc, similarity)
                for doc, similarity in docs
                if cmp(similarity, score_threshold)
            ]
```
[similarity_search_with_relevance_scores](https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/vectorstores/base.py#L534)
Then relevance_score_fn converts the similarity into a distance (1.0 - similarity if the value is > 0), and only the elements whose score is at least score_threshold are kept:
```
docs_and_similarities = [
                (doc, similarity)
                for doc, similarity in docs_and_similarities
                if similarity >= score_threshold
            ]
```
If a doc has similarity 0.8 and score_threshold is 0.6, it is selected in the first step, but then dropped in the second one because 0.2 (1 - 0.8) is less than 0.6.
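To make the conflict concrete, here is a minimal sketch of the two checks as described above. It is a toy model, not the library code: the helper names `first_stage_keep`, `relevance_score` and `second_stage_keep` are hypothetical, and the handling of non-positive values in `relevance_score` is an assumption.

```python
import operator


def first_stage_keep(similarity, score_threshold):
    # MAX_INNER_PRODUCT branch of the snippet quoted above: cmp is operator.ge.
    return operator.ge(similarity, score_threshold)


def relevance_score(similarity):
    # Conversion as described in this issue: 1.0 - similarity for positive values.
    # The branch for non-positive values is an assumption made for this sketch.
    if similarity > 0:
        return 1.0 - similarity
    return -1.0 * similarity


def second_stage_keep(similarity, score_threshold):
    # Second check: the converted score must again be >= score_threshold.
    return relevance_score(similarity) >= score_threshold


similarity, score_threshold = 0.8, 0.6
passes_first = first_stage_keep(similarity, score_threshold)
passes_second = second_stage_keep(similarity, score_threshold)
print(passes_first, passes_second)  # True False
```

The same threshold is read once as "similarity must be high" and once as "1 - similarity must be high", so a doc with high similarity cannot satisfy both for a threshold above 0.5.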

With score_threshold = 0, my example code (further below) outputs:
Sidewinder has used PowerShell to drop and execute malware loaders. {} 0.34206724
Sidewinder has used JavaScript to drop and execute malware loaders. {} 0.38181686
Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054
Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054

But if you change score_threshold to 0.4, the most similar docs (the first 2) are dropped.
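The effect can be checked by hand on the printed scores. The sketch below applies the `>= score_threshold` filter to the four scores from the output and, assuming the score is 1.0 - similarity, back-computes the implied similarities. The short labels are just shorthand for the sentences.

```python
# Relevance scores printed by the example for score_threshold = 0.
printed = [
    ("PowerShell loaders", 0.34206724),
    ("JavaScript loaders", 0.38181686),
    ("anti-virus (Internet)", 0.83190054),
    ("anti-virus (internet)", 0.83190054),
]

score_threshold = 0.4
kept = [name for name, score in printed if score >= score_threshold]
dropped = [name for name, score in printed if score < score_threshold]

# Assumption: score = 1.0 - similarity, so similarity = 1.0 - score.
implied_similarity = {name: round(1.0 - score, 4) for name, score in printed}

print("kept:", kept)
print("dropped:", dropped)
print(implied_similarity)
```

Under that assumption the two "malware loaders" sentences have the highest implied similarity to the query, yet they are the ones the threshold removes.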

Another question: why is there a warning when normalize_L2 is set to True? It seems to be a good way to turn the scalar product into cosine similarity.
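For reference, a small standalone check of that claim in plain Python (no FAISS involved; the vectors and helper names are made up for illustration): after L2 normalization, the inner product of two vectors equals their cosine similarity.

```python
import math


def dot(a, b):
    # Plain inner (scalar) product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Scale v to unit Euclidean length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

inner_after_norm = dot(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)
print(math.isclose(inner_after_norm, cos))
```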


Example code to reproduce in Colab:
!pip install -q faiss-cpu
!pip install -q langchain-huggingface
!pip install langchain-community -q
``` python

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.faiss import DistanceStrategy
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
import torch

# Run the embedding model on the GPU when one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

mn = 'basel/ATTACK-BERT'

embed_wrapper = HuggingFaceEmbeddings(model_name=mn,
                                      model_kwargs={'device': device})

query = " execute malware loaders"

# sents = base_df['sentence'].tolist()
sents = ['Sidewinder has used JavaScript to drop and execute malware loaders.',
 'Sidewinder has used PowerShell to drop and execute malware loaders.',
 'It includes a module on internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'It includes a module on Internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'regexp_url (accessed Apr. 25, 2023).',
 'regexp_url (accessed Apr. 28, 2023).',
 'It’s unclear whether Victim 1 was impacted by Trigona.',
 'It’s unclear whether Victim 2 was impacted by Trigona.',
 'Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile.',
 'Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile.']


N = 100  # upper bound on indexed sentences; the list above has only 10
db = FAISS.from_texts(sents[:N], embed_wrapper,
                      distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT,
                      normalize_L2=True)

# With score_threshold = 0, all four hits are returned.
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                          score_threshold=0)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)

print('\n\n\n')
# With score_threshold = 0.4, the two most similar docs are dropped.
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                          score_threshold=0.4)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)

```
