Hello all,
I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0, which can be reproduced by the following snippet:
from rank_bm25 import BM25Okapi

# doc.split() is case-sensitive, so "keyword1" and "Keyword2" are distinct tokens.
corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "This is a question about keyword1 & term1"
tokenized_query = query.split()
doc_scores = bm25.get_scores(tokenized_query)
> array([0. , 1.52856224, 0. , 0. ])
In the above example, two documents that share tokens with the query receive a score of 0. I believe this is unexpected behavior.
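The zero seems to follow directly from the classic BM25 IDF, log((N - n + 0.5) / (n + 0.5)), assuming that is what the library computes (an assumption on my part, though it is consistent with the output above). When n = N/2, the numerator and denominator are equal and the log is 0. A minimal standard-library sketch, where `classic_idf` is a hypothetical helper name and not part of rank_bm25:

```python
import math

def classic_idf(n_docs: int, doc_freq: int) -> float:
    """Classic BM25 IDF. Hypothetical helper, not part of rank_bm25."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

N = 4  # corpus size used in the snippet above
# IDF for every possible document frequency in a 4-document corpus.
idf_by_df = {df: classic_idf(N, df) for df in range(1, N + 1)}
for df, idf in idf_by_df.items():
    print(f"df={df}: idf={idf:+.4f}")
```

With N = 4, a document frequency of 2 gives an IDF of exactly 0, and anything above 2 gives a negative value.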
More generally, any token whose IDF is zero or a small positive value can contribute less to the score than a token whose IDF is negative, because negative IDFs are replaced with epsilon * average_idf. I suggest adopting the IDF calculation given here, which should make the per-token scores better calibrated:
https://en.wikipedia.org/wiki/Okapi_BM25
log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
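To illustrate what this variant would change, here is a small sketch of the formula above using only the standard library. `smoothed_idf` is a hypothetical name for illustration, not an existing rank_bm25 function, and the corpus size of 4 is taken from the snippet in this issue.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """IDF variant quoted above: log((N - n + 0.5) / (n + 0.5) + 1).

    Hypothetical helper for illustration, not part of rank_bm25.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)

N = 4  # corpus size used in the snippet above
# IDF for document frequencies 1..N, in that order.
smoothed = [smoothed_idf(N, df) for df in range(1, N + 1)]
for df, idf in enumerate(smoothed, start=1):
    print(f"df={df}: idf={idf:.4f}")
```

Because the argument of the log is always greater than 1, the result stays positive for every document frequency and still decreases as a token becomes more common, so no special handling of negative values would be needed.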