BM25 should not consider repeated query tokens #19

@Witiko

@dorianbrown In the seminal paper for this package, the Okapi at TREC-3 paper, and most other places, BM25 is defined over query terms rather than tokens, which would indicate that repeated query tokens should not impact the score. However, that does not seem to be the case in the rank-bm25 library:

for q in query:

The user can easily work around this by passing set(query) [1] rather than query to the get_scores() method, but deduplication seems like something the user would expect to happen automatically. At the very least, we may want to document this.

[1] Alternatively, list(dict.fromkeys(query)) for a reproducible ordering, since floating-point summation is not associative and the iteration order of a set of strings can differ between interpreter runs.
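
To make the effect concrete, here is a minimal sketch. It does not call rank-bm25; `toy_bm25_scores` and `unique_terms` are hypothetical helpers written for this illustration, and the IDF variant and the `k1`/`b` defaults are assumptions. The only thing it shares with the loop quoted above is that it iterates over query tokens, so a repeated token contributes to the score once per occurrence.

```python
import math
from collections import Counter


def toy_bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Toy BM25-style scorer (illustration only, not rank-bm25's code).

    It loops over query *tokens*, so duplicates are scored repeatedly.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    term_freqs = [Counter(doc) for doc in corpus]
    # Number of documents containing each term.
    doc_freq = Counter(term for freqs in term_freqs for term in freqs)

    scores = [0.0] * n_docs
    for q in query:  # a repeated token is visited once per occurrence
        n_q = doc_freq.get(q, 0)
        # Assumed IDF variant; the +1 keeps it positive.
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
        for i, doc in enumerate(corpus):
            tf = term_freqs[i][q]
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / (tf + norm)
    return scores


def unique_terms(query):
    """Drop repeated tokens while keeping first-occurrence order."""
    return list(dict.fromkeys(query))


corpus = [
    "hello there good man".split(),
    "it is quite windy in london".split(),
    "how is the weather today".split(),
]
query = "windy windy london".split()

with_repeats = toy_bm25_scores(corpus, query)
deduplicated = toy_bm25_scores(corpus, unique_terms(query))

# Floating-point addition is not associative, so the order in which
# per-term contributions are summed can change the last bits of a score.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
```

In this toy corpus, "windy" and "london" contribute equally to the second document, so the query with the repeated token scores it 1.5 times higher than the deduplicated query. `unique_terms` keeps the order fixed at `['windy', 'london']`, whereas iterating over `set(query)` gives no such guarantee across runs.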
