Finding if a string/data is similar to the corpus #9950
-
I have around 15,000 text documents, all on a single specific topic. My aim is to create a system where, if I give a string as input, it tells me how semantically similar that string is to the corpus, as a probability. (I am not interested in 15,000 comparison values; I need a single probability value that says how similar the string is to the corpus as a whole.) I have found examples where one string is compared with another string using spaCy to obtain a similarity value. How can this be done for a string vs. a corpus?
-
OK, sounds like a combination of topic modelling and semantic similarity to me. Basically I'd do this problem in 2 steps:

1. Build a single vector that represents the whole corpus (a corpus vector).
2. Calculate the cosine similarity between your input string's vector and that corpus vector.
The second part is calculating cosine similarity, so let's focus on the first part. If we're sure the corpus won't exceed 15,000 documents, another simple but working :) method is to look at the distribution of the similarities between the string and the individual documents. If the average semsim is high, the median is high, and most of the documents have semsim > 0.5, then the corpus and the keyword are a good match.

There's also another way, which is more search-engine inspired (I like search engines). Basically you view your string as a query, and your corpus as a collection of documents you'd like to match to this query. Then for each document, we calculate a match score. The difference to the above methods is that we use an IR algorithm called BM25.
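A minimal sketch of the BM25 scoring with the `rank_bm25` package (one implementation of BM25; the three-document corpus and the query are toy placeholders for your own data):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy corpus standing in for your 15,000 documents
corpus = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

# BM25 works on tokenized text; plain whitespace splitting as a placeholder tokenizer
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "how windy is it in London"
tokenized_query = query.lower().split()

# One BM25 relevance score per document in the corpus
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
```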
After scoring your keyword against the documents, you can again compute some quick statistics to see for how many of the docs you got a match. Overall, it depends on how you want to view this problem, and I'm sure there are other approaches. For this problem, I think I'd start by trying the first approach; it's computationally cheap and not so difficult to implement (rough sketch below). Cheers and good luck 👋
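A rough sketch of the distribution idea with spaCy (assuming a model that ships with word vectors, such as `en_core_web_md`; again a toy corpus in place of the real 15,000 documents):

```python
import statistics

import spacy

# Needs a model with vectors: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

corpus_texts = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

query = nlp("how windy is it in London")
sims = [query.similarity(nlp(text)) for text in corpus_texts]

# Summary statistics over the similarity distribution
print("mean:  ", statistics.mean(sims))
print("median:", statistics.median(sims))
print("share of docs with semsim > 0.5:", sum(s > 0.5 for s in sims) / len(sims))
```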
-
One quick question. Here you specified an example like this:
So do we have to generate a corpus vector for each sentence? And what would be the best way to store a corpus vector when deploying the whole system in production (15,000 documents)? The second method is better suited for me, as I am going after something similar to this: I need to find the URLs producing similar kinds of data. So when I compare my query with the corpus, there will be around 15,000 score values that I need to handle, if my understanding is correct. Could that become a problem with this method? Also, the doc_scores generated in the above example gives a score for every document in the corpus.
> OK, sounds like a combination of topic modelling and semantic similarity to me. Basically I'd do this problem in 2 steps:
> The second part is calculating cosine similarity, so let's focus on the first part.
Simplest way to get a corpus vector is just to average the vectors of the docs that are included in the corpus. I know it sounds rough, but trust me, it usually works 🙂 (Think of it as calculating a sentence vector from word vectors: good old averaging usually works fine. Transformer embeddings or sentence encoders are of course more refined.)
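A minimal sketch of that averaging idea, using spaCy vectors and numpy (the model name and file path are just examples; for production you'd precompute the corpus vector once and persist it, e.g. with `np.save`, rather than rebuilding it per request):

```python
import numpy as np
import spacy

# Assumes a model with word vectors, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")

# Toy corpus standing in for your 15,000 documents
corpus_texts = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

# Average the per-document vectors into a single corpus vector
doc_vectors = np.array([nlp(text).vector for text in corpus_texts])
corpus_vector = doc_vectors.mean(axis=0)

# Precompute once and persist; load it back at serving time
np.save("corpus_vector.npy", corpus_vector)

def similarity_to_corpus(text: str) -> float:
    """Cosine similarity between a query string and the corpus vector."""
    v = nlp(text).vector
    denom = np.linalg.norm(v) * np.linalg.norm(corpus_vector)
    return float(v @ corpus_vector / denom) if denom else 0.0

print(similarity_to_corpus("how windy is it in London"))
```

This way you also get a single number per query instead of 15,000 per-document values, which is what you asked for originally.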