Score is 0 when a token is in exactly 50% of the documents.

Hello all,

I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0, which can be reproduced by the following snippet:

```
from rank_bm25 import BM25Okapi

corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]

# Plain whitespace tokenization: no lowercasing, no punctuation stripping
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "This is a question about keyword1 & term1"
tokenized_query = query.split()

doc_scores = bm25.get_scores(tokenized_query)
# Observed output:
# array([0.        , 1.52856224, 0.        , 0.        ])
```

In the above example, two documents that contain query keywords (keyword1 or term1) have a score of 0, because each of those tokens appears in exactly 2 of the 4 documents. I believe this is unexpected behavior.
In fact, any token whose IDF is zero or a small positive value will score lower than a token whose IDF is negative, because negative IDFs are replaced with epsilon * average_idf. I suggest calibrating the per-token score distribution by adopting the IDF calculation given here:
https://en.wikipedia.org/wiki/Okapi_BM25

log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
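
To make the difference concrete, here is a minimal sketch comparing the two IDF formulas on a toy corpus of 6 documents. It does not call the library; the function names, the token names, and the `epsilon=0.25` value are assumptions made for illustration, and `floor_negatives` only mimics the epsilon * average_idf replacement described above.

```python
import math


def idf_classic(n_docs, doc_freq):
    """IDF without the +1: equals 0 when doc_freq == n_docs / 2, negative beyond."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_plus_one(n_docs, doc_freq):
    """IDF with +1 inside the log (the suggested formula): always positive."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


def floor_negatives(idfs, epsilon=0.25):
    """Replace negative IDFs with epsilon * average IDF (epsilon value assumed)."""
    average_idf = sum(idfs.values()) / len(idfs)
    return {tok: (epsilon * average_idf if idf < 0 else idf)
            for tok, idf in idfs.items()}


N_DOCS = 6
# Hypothetical tokens and the number of documents each one appears in
doc_freqs = {"rare": 1, "half": 3, "common": 4}

classic = floor_negatives({t: idf_classic(N_DOCS, n) for t, n in doc_freqs.items()})
plus_one = {t: idf_plus_one(N_DOCS, n) for t, n in doc_freqs.items()}

for tok in doc_freqs:
    print(f"{tok:>6}: classic+floor={classic[tok]:.4f}  plus_one={plus_one[tok]:.4f}")
```

With the first formula plus the floor, the token in exactly half of the documents gets weight 0 while the more common token gets a small positive weight, so the ordering is inverted. With the +1 variant, every weight is positive and the weights decrease as document frequency grows.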



