choosing score_threshold in similarity_search_with_relevance_scores and faiss storage with distance==DistanceStrategy.MAX_INNER_PRODUCT #32045

@dtanalytic

Checked other resources

  • I added a very descriptive title to this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

Example of code in colab:
!pip install -q faiss-cpu
!pip install -q langchain-huggingface
!pip install langchain-community -q

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_community.vectorstores.faiss import DistanceStrategy


from langchain_huggingface import HuggingFaceEmbeddings
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'


mn = 'basel/ATTACK-BERT'

embed_wrapper = HuggingFaceEmbeddings(model_name=mn,
                                   model_kwargs={'device': device})


# sents = base_df['sentence'].tolist()
sents = ['Sidewinder has used JavaScript to drop and execute malware loaders.',
 'Sidewinder has used PowerShell to drop and execute malware loaders.',
 'It includes a module on internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'It includes a module on Internet threats designed to help end users learn how to identify and protect themselves from various types of phishing attacks.',
 'regexp_url (accessed Apr. 25, 2023).',
 'regexp_url (accessed Apr. 28, 2023).',
 'It’s unclear whether Victim 1 was impacted by Trigona.',
 'It’s unclear whether Victim 2 was impacted by Trigona.',
 'Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile.',
 'Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile.']


N = 100
db = FAISS.from_texts(sents[:N], embed_wrapper,
                      distance_strategy = DistanceStrategy.MAX_INNER_PRODUCT, normalize_L2=True)

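# NOTE: `query` was not defined in the original report; the first sentence
# is used here only as a placeholder so the snippet runs.
query = sents[0]

docs_scores = db.similarity_search_with_relevance_scores(query, k=4,
                                                          score_threshold=0)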
for doc, score in docs_scores:
    print(doc.page_content, doc.metadata, score)

Error Message and Stack Trace (if applicable)

No response

Description

Problems with choosing score_threshold in similarity_search_with_relevance_scores when using FAISS storage with distance_strategy = DistanceStrategy.MAX_INNER_PRODUCT.

  1. Why? Two functions in the call chain use score_threshold for opposite purposes. First,

similarity_search_with_score_by_vector
computes the inner-product similarity and keeps documents whose similarity is at least score_threshold. The relevant piece of code:

if score_threshold is not None:
    cmp = (
        operator.ge
        if self.distance_strategy
        in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
        else operator.le
    )
    docs = [
        (doc, similarity)
        for doc, similarity in docs
        if cmp(similarity, score_threshold)
    ]

Then relevance_score_fn converts the similarity into a distance-like score (1.0 - similarity when the value is positive), and only documents whose converted score is at least score_threshold are kept:

docs_and_similarities = [
    (doc, similarity)
    for doc, similarity in docs_and_similarities
    if similarity >= score_threshold
]

If a document has similarity 0.8 and score_threshold is 0.6, it passes the first step, but its converted score 0.2 (1 - 0.8) is less than 0.6, so it is dropped in the second step.

My example code outputs:
Sidewinder has used PowerShell to drop and execute malware loaders. {} 0.34206724
Sidewinder has used JavaScript to drop and execute malware loaders. {} 0.38181686
Use a reputed anti-virus and Internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054
Use a reputed anti-virus and internet security software package on your connected devices, including PC, laptop, and mobile. {} 0.83190054

But if you change score_threshold to 0.4, the two most similar documents (the first two) are dropped.
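
The double filtering described in point 1 can be sketched in plain Python, without LangChain or FAISS. This is only a minimal model of the behaviour as reported here, not the library's code: the function names `relevance_from_inner_product` and `two_stage_filter` are hypothetical, and the handling of non-positive similarities is an assumption.

```python
import operator


def relevance_from_inner_product(similarity: float) -> float:
    # Assumed conversion, following the issue text: 1.0 - similarity when the
    # value is positive. The non-positive branch is an assumption.
    if similarity > 0.0:
        return 1.0 - similarity
    return -1.0 * similarity


def two_stage_filter(similarities, score_threshold):
    # Stage 1: keep raw similarities that are >= the threshold
    # (operator.ge, as in the MAX_INNER_PRODUCT branch quoted above).
    kept = [s for s in similarities if operator.ge(s, score_threshold)]
    # Stage 2: convert each similarity, then apply the same threshold again.
    scored = [(s, relevance_from_inner_product(s)) for s in kept]
    return [(s, r) for s, r in scored if r >= score_threshold]


# Threshold 0.6: 0.8 passes stage 1 but its converted score (about 0.2) fails stage 2.
survivors_06 = two_stage_filter([0.8, 0.5, 0.3], 0.6)
# Threshold 0.4: only similarities between 0.4 and 0.6 can survive both stages.
survivors_04 = two_stage_filter([0.8, 0.5, 0.3], 0.4)

print(survivors_06)
print(survivors_04)
```

Under this model a document must satisfy both similarity >= t and 1 - similarity >= t, so for any threshold above 0.5 nothing can be returned, and the most similar documents are always the first to be lost.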

  2. Another question: why is there a warning when normalize_L2 is set to True? It seems like a good way to turn the inner product into cosine similarity.
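
The claim in point 2, that the inner product of L2-normalized vectors equals their cosine similarity, can be checked with a small standard-library sketch. The helper names and the toy vectors are illustrative only and do not come from FAISS or LangChain.

```python
import math


def dot(a, b):
    # Plain inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Scale the vector to unit Euclidean length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Textbook definition: dot product divided by the product of the norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

ip_normalized = dot(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)

print(ip_normalized, cos)
```

Both values agree up to floating-point rounding, which is why normalizing before a MAX_INNER_PRODUCT search yields scores in the cosine range [-1, 1].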

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
Python Version: 3.11.13 (main, Jun 4 2025, 08:57:29) [GCC 11.4.0]

Package Information

langchain_core: 0.3.68
langchain: 0.3.26
langchain_community: 0.3.27
langsmith: 0.4.4
langchain_huggingface: 0.3.0
langchain_text_splitters: 0.3.8

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
huggingface-hub>=0.30.2: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.51: Installed. No version info available.
langchain-core<1.0.0,>=0.3.65: Installed. No version info available.
langchain-core<1.0.0,>=0.3.66: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-perplexity;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.8: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.26: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith>=0.1.125: Installed. No version info available.
langsmith>=0.1.17: Installed. No version info available.
langsmith>=0.3.45: Installed. No version info available.
numpy>=1.26.2;: Installed. No version info available.
numpy>=2.1.0;: Installed. No version info available.
openai-agents: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.18
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.7
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic>=2.7.4: Installed. No version info available.
pytest: 8.3.5
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 13.9.4
sentence-transformers>=2.6.0;: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tokenizers>=0.19.1: Installed. No version info available.
transformers>=4.39.0;: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

Metadata

Labels: help wanted (Good issue for contributors), investigate (Flagged for investigation)
