Description
Why? There are two functions in the call chain that use score_threshold for opposite purposes. First,
similarity_search_with_score_by_vector
computes the scalar similarity and keeps the docs whose similarity is at least score_threshold. Piece of code under the link:
if score_threshold is not None:
    cmp = (
        operator.ge
        if self.distance_strategy
        in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
        else operator.le
    )
    docs = [
        (doc, similarity)
        for doc, similarity in docs
        if cmp(similarity, score_threshold)
    ]
similarity_search_with_relevance_scores
Then relevance_score_fn converts the similarity into a distance-like score (1.0 - similarity when the value is > 0) and keeps only the elements whose converted score is at least score_threshold:
docs_and_similarities = [
    (doc, similarity)
    for doc, similarity in docs_and_similarities
    if similarity >= score_threshold
]
If you have a doc with similarity 0.8 and score_threshold is 0.6, the first step keeps it, but the second step then drops it, because 0.2 (1 - 0.8) is less than 0.6.
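The double filtering described above can be reproduced without FAISS. This is only a minimal sketch in plain Python: the helper names (first_stage, relevance_from_inner_product, second_stage) and the sample data are hypothetical, and the conversion rule is the one described in this issue, not a copy of the library code.

```python
import operator


def first_stage(scored, threshold, higher_is_better=True):
    # Mirrors the quoted filter: ">=" for similarity-like scores, "<=" for distances.
    cmp = operator.ge if higher_is_better else operator.le
    return [(doc, s) for doc, s in scored if cmp(s, threshold)]


def relevance_from_inner_product(score):
    # Conversion as described in this issue: 1.0 - score when score > 0, else -score.
    return 1.0 - score if score > 0 else -1.0 * score


def second_stage(scored, threshold):
    # Second filter: keep converted scores that are at least the threshold.
    return [(doc, r) for doc, r in scored if r >= threshold]


# Hypothetical raw inner-product similarities for two documents.
raw = [("close match", 0.8), ("weak match", 0.3)]
threshold = 0.6

kept = first_stage(raw, threshold)
converted = [(doc, relevance_from_inner_product(s)) for doc, s in kept]
final = second_stage(converted, threshold)

print(kept)       # the close match survives the first filter
print(converted)  # its score is now roughly 0.2
print(final)      # nothing survives the second filter
```

With the same threshold applied to both the raw similarity and the converted score, a doc has to satisfy similarity >= 0.6 and 1 - similarity >= 0.6 at once, which is impossible for any threshold above 0.5.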
Output of my example code (the code itself is a bit below):
Sidewinder has used PowerShell to drop and execute malware loaders. {} 0.34206724
Sidewinder has used JavaScript to drop and execute malware loaders. {} 0.38181686
Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054
Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054
but if you change score_threshold to 0.4, the most similar docs (the first 2) are dropped.
Another question: why is there a warning when you set normalize_L2 to True? It seems like a good way to turn the inner product into cosine similarity.
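To make the normalize_L2 point concrete, here is a small sketch in plain Python (no FAISS or LangChain involved) checking that the inner product of L2-normalized vectors equals the cosine similarity of the original vectors. The helper names and the two sample vectors are made up for illustration.

```python
import math


def dot(a, b):
    # Plain inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Scale the vector to unit Euclidean length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    # Cosine similarity computed directly from the unnormalized vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]

ip_normalized = dot(l2_normalize(a), l2_normalize(b))
cos = cosine(a, b)

# The two values agree up to floating-point error.
print(ip_normalized, cos, math.isclose(ip_normalized, cos))
```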
Example code in Colab:
!pip install -q faiss-cpu
!pip install -q langchain-huggingface
!pip install langchain-community -q
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_community.vectorstores.faiss import DistanceStrategy
from langchain_huggingface import HuggingFaceEmbeddings
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
mn = 'basel/ATTACK-BERT'
embed_wrapper = HuggingFaceEmbeddings(model_name=mn,
model_kwargs={'device': device})
query = " execute malware loaders"
# sents = base_df['sentence'].tolist()
sents = ['Sidewinder has used JavaScript to drop and execute malware loaders.',
'Sidewinder has used PowerShell to drop and execute malware loaders.',
'It includes a module on internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
'It includes a module on Internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
'regexp_url (accessed Apr. 25, 2023).',
'regexp_url (accessed Apr. 28, 2023).',
'It’s unclear whether Victim 1 was impacted by Trigona.',
'It’s unclear whether Victim 2 was impacted by Trigona.',
'Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile.',
'Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile.']
N = 100
# Build the index with inner product on L2-normalized vectors
db = FAISS.from_texts(sents[:N], embed_wrapper,
                      distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT,
                      normalize_L2=True)
# Threshold 0: all four results are returned
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                         score_threshold=0)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)
print('\n\n\n')
# Threshold 0.4: the two most similar docs are dropped
docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                         score_threshold=0.4)
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)