Score is 0 when a token is in exactly 50% of the documents. #39

@nkarahan-ing

Hello all,

I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0. The following snippet reproduces this:

from rank_bm25 import BM25Okapi

corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]

# Whitespace tokenization, no lowercasing
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "This is a question about keyword1 & term1"
tokenized_query = query.split()

doc_scores = bm25.get_scores(tokenized_query)
# doc_scores -> array([0.        , 1.52856224, 0.        , 0.        ])

In the above example, two documents that share tokens with the query have a score of 0. I believe this is unexpected behavior.
More generally, any token whose IDF is zero or a small positive value ends up with a lower score than a token whose IDF is negative, because negative IDFs are replaced with epsilon * average_idf. I suggest adopting the IDF calculation given here, which should make the distribution of scores per token better calibrated:
https://en.wikipedia.org/wiki/Okapi_BM25

log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
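
To illustrate the difference, here is a minimal sketch comparing the two IDF formulas for the 4-document corpus above. The function names `idf_current` and `idf_smoothed` are hypothetical and not part of the library; `idf_current` is my reading of the IDF before the epsilon floor is applied, so treat it as an assumption rather than the exact implementation.

```python
import math


def idf_current(n_docs: int, doc_freq: int) -> float:
    """IDF without the +1 term: log((N - n + 0.5) / (n + 0.5)).

    Assumed to match the library's value before negative IDFs are floored.
    It is exactly 0 when n == N / 2 and negative when n > N / 2.
    """
    return math.log(n_docs - doc_freq + 0.5) - math.log(doc_freq + 0.5)


def idf_smoothed(n_docs: int, doc_freq: int) -> float:
    """Proposed IDF: log((N - n + 0.5) / (n + 0.5) + 1).

    The argument of the log is always greater than 1, so the result is
    strictly positive for any 0 <= n <= N.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


N = 4  # number of documents in the example corpus
for n in range(1, N + 1):
    print(f"n={n}: current={idf_current(N, n):+.4f}  "
          f"smoothed={idf_smoothed(N, n):+.4f}")
```

With N = 4, a token that appears in 2 documents gets an IDF of exactly 0 under the first formula, while the second formula gives log(2). The second formula also stays positive and decreases as the document frequency grows, so no epsilon floor would be needed.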
