OK, this sounds like a combination of topic modelling and semantic similarity to me. I'd approach the problem in two steps:

  • get a vector representation for the corpus (call it corpus vector)
  • calculate semantic similarity between the corpus vector and the keyword's vector

The second part is calculating cosine similarity, so let's focus on the first part.
The simplest way to get a corpus vector is to average the vectors of the docs in the corpus. I know it sounds rough, but trust me, it usually works 🙂 (Think of it as calculating a sentence vector from word vectors: good old averaging usually works fine. Transformer embeddings or sentence encoders are of course more refine…
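To make the two steps concrete, here is a minimal sketch in plain Python. The toy 3-dimensional vectors and the helper names `average_vectors` and `cosine_similarity` are made up for illustration; in practice the document and keyword vectors would come from whatever embedding model you use (for example, a library that exposes one vector per document), and all of them must share the same dimensionality.

```python
import math


def average_vectors(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document vectors and keyword vector (illustrative values only).
doc_vectors = [
    [1.0, 0.0, 1.0],
    [3.0, 0.0, 1.0],
]
keyword_vector = [2.0, 0.0, 1.0]

# Step 1: corpus vector = average of the document vectors.
corpus_vector = average_vectors(doc_vectors)

# Step 2: similarity between the corpus vector and the keyword's vector.
score = cosine_similarity(corpus_vector, keyword_vector)
print(corpus_vector, score)
```

A plain average weights every document equally, so one very long or off-topic document can pull the corpus vector around; whether that matters depends on your data.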

Answer selected by svlandeg
@DuyguA
Labels: feat / vectors (Feature: Word vectors and similarity)