Doc Similarity: Tf-idf weighting? #2125
Replies: 7 comments
-
SIF (smooth inverse frequency) has also been shown to be effective for document embeddings.
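For reference, SIF (Arora et al., 2017) weights each word vector by a / (a + p(w)), where p(w) is the word's corpus probability and a is a small constant such as 1e-3; the full method additionally removes the first principal component of the resulting document vectors. Below is only a minimal sketch of the weighting step, not existing spaCy functionality: `word_probs` is an assumed user-supplied word-to-probability table and the doc is assumed to come from a model with word vectors.

```python
import numpy as np

def sif_vector(doc, word_probs, a=1e-3):
    # Weight each token vector by a / (a + p(w)); rarer words get more weight.
    # word_probs: {lowercased word: corpus probability}, supplied by the user.
    vectors, weights = [], []
    for token in doc:
        if token.has_vector:
            p = word_probs.get(token.lower_, 1e-6)  # crude fallback for unseen words
            vectors.append(token.vector)
            weights.append(a / (a + p))
    if not vectors:
        return np.zeros((doc.vocab.vectors_length,), dtype="float32")
    vectors = np.asarray(vectors)
    weights = np.asarray(weights, dtype="float32")[:, None]
    return (weights * vectors).sum(axis=0) / weights.sum()
```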
-
@mralexpopa What do you mean by long documents? Could you quantify this? 500-700 words?
-
Would love to see this implemented in spaCy. Are there any pointers on how we could do this (using TF-IDF with embeddings) ourselves?
-
I am actually facing the same problem. I need to map some fairly large documents to vector space. Some specific keywords are very common and would end up biasing the average, bringing all documents fairly close together in the embedding space. Instead, I would like to give more weight to words that are rarely found in the given corpus, so my choice would be to use the tf-idf score of each token to weight the average. I can implement the weighted average myself, of course, but I was wondering if there is some fundamental reason why this is not done already: are there known drawbacks to this approach? Thanks for your help, Andrea.
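As a rough sketch of the weighted average described here (not built into spaCy; `tfidf_scores` is assumed to be a precomputed dict mapping lowercased word forms to their tf-idf scores for the document):

```python
import numpy as np

def tfidf_weighted_vector(doc, tfidf_scores):
    # tfidf_scores: {lowercased word: tf-idf score}, precomputed by the user.
    total = np.zeros((doc.vocab.vectors_length,), dtype="float32")
    weight_sum = 0.0
    for token in doc:
        if not token.has_vector:
            continue
        w = tfidf_scores.get(token.lower_, 0.0)
        total += w * token.vector
        weight_sum += w
    if weight_sum == 0.0:
        return doc.vector  # fall back to spaCy's plain average
    vec = total / weight_sum
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec
```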
-
Are there any consistent evaluations or benchmarks we could use for judging these heuristics? If we can get a benchmark in place, I'd be happy to include a heuristic for something like this in the default strategy, i.e. if the document is long we use a weighting scheme. But doing this blindly seems pretty bad. In the meantime, you can always customise the similarity via the user hooks.
-
@honnibal You are correct, the custom similarity function + user_hook would solve the problem, but to me it seems like a feature that would be beneficial to the whole community. I believe the default behavior should be normal averaging, but it would be cool to have such functionality available in case it is needed (up to the user whether they want it or not). With no benchmarking available, it shouldn't be the default behavior. Nevertheless, I was suggesting adding a new optional parameter to doc.similarity(). For example, it could be a dict with values containing each word's score, or a matrix with the same type of information. This should be calculated separately by the user and only fed in as an optional parameter if needed. Time permitting, I will create a pull request.
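Until something like that exists, the user-hook route can be wired up roughly as follows. This is only a sketch under assumptions: it targets the spaCy 3 component API, `nlp` is an already-loaded pipeline, `WEIGHTS` is a hypothetical user-filled score table, and `tfidf_weighted_vector` is the helper sketched in an earlier comment.

```python
import numpy as np
from spacy.language import Language

WEIGHTS = {}  # hypothetical {word: tf-idf score} table, filled by the user

def weighted_similarity(doc, other):
    # Cosine similarity between tf-idf weighted document vectors
    # (tfidf_weighted_vector is the helper sketched above).
    v1 = tfidf_weighted_vector(doc, WEIGHTS)
    v2 = tfidf_weighted_vector(other, WEIGHTS)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

@Language.component("weighted_similarity_hook")
def weighted_similarity_hook(doc):
    # Installing a user hook overrides Doc.similarity for this doc.
    doc.user_hooks["similarity"] = weighted_similarity
    return doc

nlp.add_pipe("weighted_similarity_hook", last=True)
```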
-
Alternate intra-document term weighting apparently already exists: https://github.com/boudinfl/pke#implemented-models (see also jboynyc/textnets#33).
-
I am currently using a German model (loaded with fasttext embeddings) to compute similarities between documents.
I am finding that as the documents get longer, the similarity values stop making sense (basically every long document becomes very similar to every other long document). I suppose this is due to the averaging method used in the similarity method. It is not due to preprocessing, as I'm removing stopwords, lemmatizing, etc., following the common procedure.
Intuitively, as a document's vocabulary increases, the document vector loses its "originality" due to averaging, so I guess this behavior is expected. In order to fix this, I was thinking of weighting each word embedding with its respective tf-idf score, summing up the word vectors in the Doc, and normalizing the result. Hopefully, this would provide a better Doc representation and a more relevant similarity score.
Are there any efforts towards integrating such a feature? If not, is it for lack of time or because the suggested solution does not make sense?
In order to calculate the tf-idf scores, the concept of a Corpus would need to be implemented, but a quick solution could be to allow feeding in a precomputed score table for each word in the corpus.
(not really an issue, but rather a suggestion)
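As a sketch of what such a precomputed score table could look like, built outside spaCy with a recent scikit-learn (`texts` is assumed to be the list of raw document strings making up the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_score_tables(texts):
    # texts: raw document strings making up the corpus (assumed input).
    vectorizer = TfidfVectorizer(lowercase=True)
    tfidf_matrix = vectorizer.fit_transform(texts)
    vocab = vectorizer.get_feature_names_out()
    tables = []
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i].tocoo()
        # One {word: tf-idf score} dict per document, ready to be fed into a
        # weighted-average helper like the ones sketched above.
        tables.append({vocab[j]: score for j, score in zip(row.col, row.data)})
    return tables
```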