Doc Similarity: Tf-idf weighting? #2125
Replies: 7 comments
-
SIF (smooth inverse frequency) has also been shown to be effective for document embeddings.
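For reference, SIF (Arora et al., 2017) weights each word vector by a / (a + p(w)), where p(w) is the word's corpus probability and a is a small constant such as 1e-3; the full method additionally removes the first principal component of the resulting document vectors. Below is only a minimal sketch of the weighting step, not existing spaCy functionality: `word_probs` is an assumed user-supplied word-to-probability table and the doc is assumed to come from a model with word vectors.

```python
import numpy as np

def sif_vector(doc, word_probs, a=1e-3):
    # Weight each token vector by a / (a + p(w)); rarer words get more weight.
    # word_probs: {lowercased word: corpus probability}, supplied by the user.
    vectors, weights = [], []
    for token in doc:
        if token.has_vector:
            p = word_probs.get(token.lower_, 1e-6)  # crude fallback for unseen words
            vectors.append(token.vector)
            weights.append(a / (a + p))
    if not vectors:
        return np.zeros((doc.vocab.vectors_length,), dtype="float32")
    vectors = np.asarray(vectors)
    weights = np.asarray(weights, dtype="float32")[:, None]
    return (weights * vectors).sum(axis=0) / weights.sum()
```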
-
@mralexpopa What do you mean by long documents? Could you quantify this? 500-700 words?
-
Would love to see this implemented in spaCy. Are there any pointers on how we could do this (using TF-IDF with embeddings) ourselves?
-
I am actually facing the same problem. I need to map some fairly large documents to vector space. Some specific keywords are very common and would end up biasing the average, bringing all documents fairly close together in the embedding space. Instead, I would like to give more weight to words that are rarely found in the given corpus, so my choice would be to use the tf-idf score of each token to weight the average. I can implement the weighted average myself, of course, but I was wondering if there is some fundamental reason why this is not done already: are there known drawbacks to this approach? Thanks for your help, Andrea.
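As a rough sketch of the weighted average described here (not built into spaCy; `tfidf_scores` is assumed to be a precomputed dict mapping lowercased word forms to their tf-idf scores for the document):

```python
import numpy as np

def tfidf_weighted_vector(doc, tfidf_scores):
    # tfidf_scores: {lowercased word: tf-idf score}, precomputed by the user.
    total = np.zeros((doc.vocab.vectors_length,), dtype="float32")
    weight_sum = 0.0
    for token in doc:
        if not token.has_vector:
            continue
        w = tfidf_scores.get(token.lower_, 0.0)
        total += w * token.vector
        weight_sum += w
    if weight_sum == 0.0:
        return doc.vector  # fall back to spaCy's plain average
    vec = total / weight_sum
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec
```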
-
Are there any consistent evaluations or benchmarks we could use for judging these heuristics? If we can get a benchmark in place, I'd be happy to include a heuristic for something like this in the default strategy, i.e. if the document is long we use a weighting scheme. But doing this blindly seems pretty bad. In the meantime, you can always customise the similarity via the user hooks.
-
@honnibal You are correct, the custom similarity function + user_hook would solve the problem, but to me it seems like a feature that would be beneficial to the whole community. I believe the default behavior should be normal averaging, but it would be cool to have such functionality available in case it is needed (up to the user whether they want it or not). With no benchmarking available, it shouldn't be the default behavior. Nevertheless, I was suggesting adding a new optional parameter to doc.similarity(). For example, it could be a dict with values containing each word's score, or a matrix with the same type of information. This should be calculated separately by the user and only fed in as an optional parameter if needed. Time permitting, I will create a pull request.
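Until something like that exists, the user-hook route can be wired up roughly as follows. This is only a sketch under assumptions: it targets the spaCy 3 component API, `nlp` is an already-loaded pipeline, `WEIGHTS` is a hypothetical user-filled score table, and `tfidf_weighted_vector` is the helper sketched in an earlier comment.

```python
import numpy as np
from spacy.language import Language

WEIGHTS = {}  # hypothetical {word: tf-idf score} table, filled by the user

def weighted_similarity(doc, other):
    # Cosine similarity between tf-idf weighted document vectors
    # (tfidf_weighted_vector is the helper sketched above).
    v1 = tfidf_weighted_vector(doc, WEIGHTS)
    v2 = tfidf_weighted_vector(other, WEIGHTS)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

@Language.component("weighted_similarity_hook")
def weighted_similarity_hook(doc):
    # Installing a user hook overrides Doc.similarity for this doc.
    doc.user_hooks["similarity"] = weighted_similarity
    return doc

nlp.add_pipe("weighted_similarity_hook", last=True)
```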
-
Alternate intra-document term weighting apparently already exists: https://github.com/boudinfl/pke#implemented-models (see also jboynyc/textnets#33).
-
I am currently using a German model (loaded with fasttext embeddings) to compute similarities between documents.
I am finding that as the documents get longer, the similarity values stop making sense (basically every long document becomes very similar to every other long document). I suppose this is due to the averaging method used in the similarity method. It is not due to preprocessing, as I'm removing stopwords, lemmatizing, etc., following the common procedure.
Intuitively, as a document's vocabulary increases, the document vector loses its "originality" due to averaging, so I guess this behavior is expected. In order to fix this, I was thinking of weighting each word embedding with its respective tf-idf score, summing up the word vectors in the Doc, and normalizing the result. Hopefully, this would provide a better Doc representation and a more relevant similarity score.
Are there any efforts towards integrating such a feature? If not, is it for lack of time or because the suggested solution does not make sense?
In order to calculate the tf-idf scores, the concept of a Corpus would need to be implemented, but a quick solution could be to allow feeding in a precomputed score table for each word in the corpus.
(not really an issue, but rather a suggestion)
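As a sketch of what such a precomputed score table could look like, built outside spaCy with a recent scikit-learn (`texts` is assumed to be the list of raw document strings making up the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_score_tables(texts):
    # texts: raw document strings making up the corpus (assumed input).
    vectorizer = TfidfVectorizer(lowercase=True)
    tfidf_matrix = vectorizer.fit_transform(texts)
    vocab = vectorizer.get_feature_names_out()
    tables = []
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i].tocoo()
        # One {word: tf-idf score} dict per document, ready to be fed into a
        # weighted-average helper like the ones sketched above.
        tables.append({vocab[j]: score for j, score in zip(row.col, row.data)})
    return tables
```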