Vector Store similaritySearchWithScore (NodeJS) vs similarity_search_with_relevance_scores (Python) #4894

asif-git-hub · 2024-03-27T02:49:20Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

NodeJS Code


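// Import paths are assumed for langchain ^0.1.x; adjust them to your installed packages.
import { OpenAIEmbeddings } from "@langchain/openai"
import { PGVectorStore, PGVectorStoreArgs } from "@langchain/community/vectorstores/pgvector"

export function getVectorStoreConfig(tableName: string): PGVectorStoreArgs {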
  const config: PGVectorStoreArgs = {
    postgresConnectionOptions: getConnectionConfig(),
    tableName,
    columns: {
      metadataColumnName: "cmetadata",
      idColumnName: "uuid",
      contentColumnName: "document",
      vectorColumnName: "embedding",
    },
    verbose: true,
  }

  return config
}

async function similaritySearch(question: string, k = 25) {
  const embedding = new OpenAIEmbeddings({
    modelName: "text-embedding-ada-002",
    timeout: 5 * 1000,
    maxRetries: 3,
    verbose: true,
    onFailedAttempt: (e) => {
      console.log(e)
    },
  })
  console.log("embedding created")
  const vectorstore = await PGVectorStore.initialize(embedding, getVectorStoreConfig("langchain_pg_embedding"))
  console.log("vectorstore initialized")
  try {
    console.log("performing similarity search with langchain")
    // Resolves to an array of [document, score] pairs
    return await vectorstore.similaritySearchWithScore(question, k)
  } catch (e) {
    console.error("similaritySearch failed. ", e)
    return []
  } finally {
    // Always release the connection pool, even when the search fails
    await vectorstore.end()
  }
}

Python code:

from typing import List, Tuple

from langchain_community.vectorstores.pgvector import PGVector

# CONNECTION_STRING and EMBEDDINGS are not shown here.
def load_vectorstore(collection_name):
    vectorstore = PGVector(
        collection_name=collection_name,
        connection_string=CONNECTION_STRING,
        embedding_function=EMBEDDINGS
    )
    return vectorstore

vectorstore = load_vectorstore("chatbot_v1")

def find_relevant_documents(query: str, vectorstore) -> List[Tuple]:
    # Fetch 25 documents based on the query and perform similarity search
    relevant_docs = vectorstore.similarity_search_with_relevance_scores(query, k=25)
    print('Fetched 25 documents')
    return relevant_docs

Description

I am playing around with the similarity search functions in both Python and NodeJS.

I have a PGVectorStore and am doing a simple similarity search with score.

I noticed that the average score returned by the Python script is easily above 0.82 👍
But the same functionality in the NodeJS script returns scores of around 0.12 to 0.15 👎

Both experiments use the same query, which is a simple string question.

I am not sure whether similarity_search_with_relevance_scores (Python) is equivalent to similaritySearchWithScore() in NodeJS. I am just looking for the same functionality in NodeJS.

🤔 The thought that the score in NodeJS needs to be subtracted from 1 did cross my mind, but I'm not sure if that is right.

Any help will be appreciated 🙏

System Info

NodeJS specs:

"@langchain/openai": "^0.0.21",
"langchain": "^0.1.29",

node version: 18

Python specs:

langchain-community==0.0.29
langchain-core==0.1.33
langchain-openai==0.1.1
langchain-text-splitters==0.0.1
langchain==0.1.13

@dosu-bot · 2024-03-27T02:49:40Z

The discrepancy in the average scores returned by the similarity search functions in Python and NodeJS when using a PGVectorStore with the same query could stem from several factors, including differences in the implementation of the similarity search functions, the handling of floating-point arithmetic, or the configuration of the PGVectorStore (e.g., distance strategy).

The Python function similarity_search_with_relevance_scores and the NodeJS function similaritySearchWithScore() conceptually aim to achieve similar outcomes by returning documents and their relevance scores based on a similarity search. However, without the Python code for similarity_search_with_relevance_scores, it's challenging to confirm their exact equivalence. The exact equivalence in terms of implementation details, scoring, and performance can only be determined by comparing the specific implementations in both languages.

The similaritySearchWithScore() function in LangChain JS computes a score from the embedding generated for the query and the document embeddings stored in the vector store (here, PGVectorStore). This involves embedding the query, searching the vector store for similar document embeddings, scoring them with a metric such as cosine similarity or cosine distance, and returning the results together with their scores.

Without the Python implementation for comparison, pinpointing the exact cause of the discrepancy is challenging. It would be helpful to provide the Python implementation of similarity_search_with_relevance_scores for a more accurate comparison.
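For reference, the relationship between a cosine similarity and a cosine distance can be sketched in a few lines of TypeScript. This is only a minimal illustration of the metric, not the LangChain or pgvector implementation, and the function names are hypothetical:

```typescript
// Hypothetical helpers for illustration only; not part of LangChain.

// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal, -1 for opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  const normA = Math.sqrt(dot(a, a));
  const normB = Math.sqrt(dot(b, b));
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot(a, b) / (normA * normB);
}

// Cosine distance is the complement: 0 for identical directions, up to 2 for opposite.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}
```

With these definitions a high similarity and a low distance describe the same pair of vectors, so two libraries can rank documents identically while reporting very different numbers.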

fallard84 · 2024-05-01T01:28:29Z

I was actually looking at this today and noticed that PGVector and some other vector stores in the JS implementation return the distance instead of the similarity score, so a lower value means a higher similarity. When using the retriever on its own this doesn't really matter (the top documents are still returned), but it makes it impossible to use something like the ScoreThresholdRetriever, which assumes that the returned score is a similarity score and can only filter documents by a minimum score (not by a maximum score, which is what distance values would require).
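As a workaround, the distances can be converted before thresholding. The sketch below assumes the store is using cosine distance, in which case similarity = 1 - distance; that would be consistent with the numbers in the question, since 1 - 0.12 = 0.88 and 1 - 0.15 = 0.85. The `Doc` type and both function names are hypothetical stand-ins, not LangChain APIs:

```typescript
// Minimal stand-in for a LangChain document (hypothetical, for illustration).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// A [document, score] pair, the shape returned by similaritySearchWithScore.
type Scored = [Doc, number];

// Assumption: scores are cosine distances, so similarity = 1 - distance.
function toSimilarityScores(results: Scored[]): Scored[] {
  return results.map(([doc, distance]): Scored => [doc, 1 - distance]);
}

// Keep only documents whose converted similarity reaches the minimum.
function filterByMinSimilarity(results: Scored[], minSimilarity: number): Scored[] {
  return toSimilarityScores(results).filter(
    ([, similarity]) => similarity >= minSimilarity,
  );
}
```

It could be applied to the output of the search in the question, for example `filterByMinSimilarity(await vectorstore.similaritySearchWithScore(question, k), 0.8)`. For other distance strategies (Euclidean, inner product) the conversion is different, so the `1 - distance` step should not be reused blindly.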

I just got started working with LangChain, so I don't know whether there is a reason for returning the distance for some vector stores instead of standardizing on the similarity score. Since the score is exposed and many people are probably using it, though, switching from the distance to the similarity score would probably be a breaking change for some people.
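One way to avoid a breaking change would be to leave the store untouched and make the conversion opt-in at the call site. The following is only a sketch of that idea: `withSimilarityScores`, `ScoredSearch`, and `Doc` are hypothetical names, and the default converter again assumes cosine distance:

```typescript
// Minimal stand-in for a LangChain document (hypothetical, for illustration).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Any function with the same shape as similaritySearchWithScore.
type ScoredSearch = (query: string, k: number) => Promise<[Doc, number][]>;

// Wrap a distance-returning search so that callers who opt in get similarity
// scores, while existing callers of the original function are unaffected.
function withSimilarityScores(
  search: ScoredSearch,
  // Assumption: cosine distance; pass a different converter for other metrics.
  toSimilarity: (distance: number) => number = (d) => 1 - d,
): ScoredSearch {
  return async (query, k) => {
    const results = await search(query, k);
    return results.map(([doc, distance]): [Doc, number] => [doc, toSimilarity(distance)]);
  };
}
```

Usage would look like `const search = withSimilarityScores((q, k) => vectorstore.similaritySearchWithScore(q, k))`, after which `search(question, 25)` returns scores where higher means more similar.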
