Finding if a string/data is similar to the corpus #9950
-
I have around 15,000 text documents, all on a single specific topic. My aim is to create a system where, if I give a string as input, it tells me how semantically similar that string is to the corpus, as a probability. (I am not interested in 15,000 comparison values; I need a single probability value that says how similar the string is to the corpus as a whole.) I have found examples where one string is compared with another string using spaCy to obtain a similarity value. How can this be done for a string vs. a corpus?
-
OK, sounds like a combination of topic modelling and semantic similarity to me. Basically I'd do this problem in 2 steps:

1. Build a single vector that represents the whole corpus (a corpus vector).
2. Calculate the cosine similarity between your input string's vector and that corpus vector.
The second part is calculating cosine similarity, so let's focus on the first part. If we're sure the corpus won't exceed 15,000 documents, another simple but working :) method is to look at the distribution of the similarities between the string and the individual documents. If the average semsim is high, the median is high, and most of the documents have semsim > 0.5, then the corpus and the keyword are a good match.

There's also another way, which is more search-engine inspired (I like search engines). Basically you view your string as a query, and your corpus as a collection of documents you'd like to match to this query. Then for each document, we calculate a match score. The difference to the above methods is that we use an IR algorithm called BM25.
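A minimal sketch of the BM25 scoring with the `rank_bm25` package (one implementation of BM25; the three-document corpus and the query are toy placeholders for your own data):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy corpus standing in for your 15,000 documents
corpus = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

# BM25 works on tokenized text; plain whitespace splitting as a placeholder tokenizer
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "how windy is it in London"
tokenized_query = query.lower().split()

# One BM25 relevance score per document in the corpus
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
```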
After scoring your keyword against the documents, you can again compute some quick statistics to see for how many of the docs you got a match. Overall, it depends on how you want to view this problem, and I'm sure there are other approaches. For this problem, I think I'd start by trying the first approach; it's computationally cheap and not so difficult to implement (rough sketch below). Cheers and good luck 👋
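A rough sketch of the distribution idea with spaCy (assuming a model that ships with word vectors, such as `en_core_web_md`; again a toy corpus in place of the real 15,000 documents):

```python
import statistics

import spacy

# Needs a model with vectors: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

corpus_texts = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

query = nlp("how windy is it in London")
sims = [query.similarity(nlp(text)) for text in corpus_texts]

# Summary statistics over the similarity distribution
print("mean:  ", statistics.mean(sims))
print("median:", statistics.median(sims))
print("share of docs with semsim > 0.5:", sum(s > 0.5 for s in sims) / len(sims))
```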
-
One quick question. Here you specified an example like this:
So do we have to generate a corpus vector for each sentence? And what would be the best way to store a corpus vector when deploying the whole system in production (15,000 documents)? The second method is better suited for me, as I am going after something similar to this: I need to find the URLs producing similar kinds of data. So when I compare my query with the corpus, there will be around 15,000 score values that I need to handle, if my understanding is correct. Could that become a problem with this method? Also, the doc_scores generated in the above example gives a score for every document in the corpus.
> OK, sounds like a combination of topic modelling and semantic similarity to me. Basically I'd do this problem in 2 steps:
> The second part is calculating cosine similarity, so let's focus on the first part.
Simplest way to get a corpus vector is just to average the vectors of the docs that are included in the corpus. I know it sounds rough, but trust me, it usually works 🙂 (Think of it as calculating a sentence vector from word vectors: good old averaging usually works fine. Transformer embeddings or sentence encoders are of course more refined.)
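A minimal sketch of that averaging idea, using spaCy vectors and numpy (the model name and file path are just examples; for production you'd precompute the corpus vector once and persist it, e.g. with `np.save`, rather than rebuilding it per request):

```python
import numpy as np
import spacy

# Assumes a model with word vectors, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")

# Toy corpus standing in for your 15,000 documents
corpus_texts = [
    "It is quite windy in London",
    "The weather in London is rainy today",
    "I love baking chocolate cookies",
]

# Average the per-document vectors into a single corpus vector
doc_vectors = np.array([nlp(text).vector for text in corpus_texts])
corpus_vector = doc_vectors.mean(axis=0)

# Precompute once and persist; load it back at serving time
np.save("corpus_vector.npy", corpus_vector)

def similarity_to_corpus(text: str) -> float:
    """Cosine similarity between a query string and the corpus vector."""
    v = nlp(text).vector
    denom = np.linalg.norm(v) * np.linalg.norm(corpus_vector)
    return float(v @ corpus_vector / denom) if denom else 0.0

print(similarity_to_corpus("how windy is it in London"))
```

This way you also get a single number per query instead of 15,000 per-document values, which is what you asked for originally.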