Problems choosing score_threshold in similarity_search_with_relevance_scores with FAISS storage and distance == DistanceStrategy.MAX_INNER_PRODUCT #175

@dtanalytic

Why is this a problem? Two functions in the call chain use score_threshold for opposite purposes. First,
similarity_search_with_score_by_vector
computes the inner-product similarity and keeps the docs whose similarity is at least score_threshold. The piece of code under the link:

if score_threshold is not None:
    cmp = (
        operator.ge
        if self.distance_strategy
        in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
        else operator.le
    )
    docs = [
        (doc, similarity)
        for doc, similarity in docs
        if cmp(similarity, score_threshold)
    ]

Second, in
similarity_search_with_relevance_scores
relevance_score_fn converts the similarity into a distance (1.0 - similarity when the value is > 0), and then only the elements whose score is at least score_threshold are kept:

docs_and_similarities = [
    (doc, similarity)
    for doc, similarity in docs_and_similarities
    if similarity >= score_threshold
]

If you have a doc with similarity 0.8 and score_threshold 0.6, the first step keeps it, but then its score becomes 0.2 (1 - 0.8), which is less than 0.6, so the doc is dropped.
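The two steps can be reproduced without FAISS. This is a minimal sketch, assuming the relevance function for MAX_INNER_PRODUCT is 1.0 - s for positive s and -s otherwise, as described above. The helper names first_filter, relevance_from_inner_product and second_filter are hypothetical and are not LangChain APIs, and the two sample docs are made up.

```python
import operator


def first_filter(docs, score_threshold, max_inner_product=True):
    """Mimics the first check: keep docs with similarity >= threshold
    for inner product (or <= threshold for distance-based strategies)."""
    cmp = operator.ge if max_inner_product else operator.le
    return [(doc, s) for doc, s in docs if cmp(s, score_threshold)]


def relevance_from_inner_product(similarity):
    """Assumed conversion: 1.0 - s for positive s, otherwise -s."""
    if similarity > 0.0:
        return 1.0 - similarity
    return -1.0 * similarity


def second_filter(docs, score_threshold):
    """Mimics the second check: keep docs with relevance >= threshold."""
    return [(doc, r) for doc, r in docs if r >= score_threshold]


# Made-up similarities: one close match and one weak match.
docs = [("close match", 0.8), ("weak match", 0.3)]
threshold = 0.6

step1 = first_filter(docs, threshold)  # keeps only the 0.8 doc
rescored = [(doc, relevance_from_inner_product(s)) for doc, s in step1]
step2 = second_filter(rescored, threshold)  # the 0.8 doc is now ~0.2

print("after first filter: ", step1)
print("after rescoring:    ", rescored)
print("after second filter:", step2)
```

With these numbers the close match survives the first filter and is then removed by the second one, so the final list is empty. No single threshold above 0.5 can pass both checks under this assumed conversion.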

My example code (further below) outputs:
Sidewinder has used PowerShell to drop and execute malware loaders. {} 0.34206724
Sidewinder has used JavaScript to drop and execute malware loaders. {} 0.38181686
Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054
Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054

But if you change score_threshold to 0.4, the most similar docs (the first 2) are dropped.

Another question: why is there a warning when you set normalize_L2 to True? It seems like a good way to turn the scalar product into cosine similarity.
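The equivalence this question relies on can be checked with plain Python: once both vectors are scaled to unit L2 norm, their inner product equals their cosine similarity. This is a small sketch with made-up vectors. dot, l2_normalize and cosine are illustrative helpers, not FAISS or LangChain functions.

```python
import math


def dot(a, b):
    """Plain inner (scalar) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors with norms 5 and 3.
query = [3.0, 4.0, 0.0]
doc = [1.0, 2.0, 2.0]

raw_ip = dot(query, doc)                                # depends on vector lengths
unit_ip = dot(l2_normalize(query), l2_normalize(doc))   # bounded by [-1, 1]
cos = cosine(query, doc)

print(raw_ip, unit_ip, cos)
```

Here the raw inner product depends on the vector lengths, while the normalized inner product matches the cosine similarity up to floating-point error.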

Example code in Colab:
!pip install -q faiss-cpu
!pip install -q langchain-huggingface
!pip install langchain-community -q

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_community.vectorstores.faiss import DistanceStrategy


from langchain_huggingface import HuggingFaceEmbeddings
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'


mn = 'basel/ATTACK-BERT'

embed_wrapper = HuggingFaceEmbeddings(model_name=mn,
                                      model_kwargs={'device': device})

query = " execute malware loaders"

# sents = base_df['sentence'].tolist()
sents = ['Sidewinder has used JavaScript to drop and execute malware loaders.',
 'Sidewinder has used PowerShell to drop and execute malware loaders.',
 'It includes a module on internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'It includes a module on Internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'regexp_url (accessed Apr. 25, 2023).',
 'regexp_url (accessed Apr. 28, 2023).',
 'It’s unclear whether Victim 1 was impacted by Trigona.',
 'It’s unclear whether Victim 2 was impacted by Trigona.',
 'Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile.',
 'Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile.']


N = 100
db = FAISS.from_texts(sents[:N], embed_wrapper,
                      distance_strategy = DistanceStrategy.MAX_INNER_PRODUCT, normalize_L2=True)

# score_threshold = 0: all four hits are printed (output shown above)
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                          score_threshold=0)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)

print('\n\n\n')
# score_threshold = 0.4: the two most similar docs (scores 0.34 and 0.38) are dropped
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                          score_threshold=0.4)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)
