BM25 should not consider repeated query tokens

@dorianbrown In [the seminal paper for this package][2], [the Okapi at TREC-3 paper][1], and most other places, BM25 is defined over query *terms* rather than tokens, which would indicate that repeated query tokens should not impact the score. However, that does not seem to be the case in the rank-bm25 library:

https://github.com/dorianbrown/rank_bm25/blob/329b794e726fd513eb96d9e28dcf4db8de399ea7/rank_bm25.py#L117

Users can work around this by passing `set(query)`<sup>1</sup> rather than `query` to the `get_scores()` method, but deduplication seems like something users would expect to happen automatically. At the very least, we may want to document this behavior.
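To illustrate the effect, here is a minimal sketch with a toy additive scorer. `toy_score`, `dedupe_tokens`, and the weights in `idf` are made up for this example and are not rank-bm25's code; the only assumption carried over from the library is that each query token contributes one term to the sum.

```python
from collections import Counter


def dedupe_tokens(query):
    """Drop repeated tokens while keeping first-occurrence order."""
    return list(dict.fromkeys(query))


def toy_score(query, doc, idf):
    """Toy additive scorer (hypothetical): one contribution per query token."""
    tf = Counter(doc)
    return sum(idf.get(tok, 0.0) * tf[tok] for tok in query)


doc = ["hello", "world", "hello"]
idf = {"hello": 0.5, "world": 1.5}  # made-up weights, for illustration only
query = ["hello", "hello", "world"]

raw = toy_score(query, doc, idf)                      # "hello" is counted twice
deduped = toy_score(dedupe_tokens(query), doc, idf)   # each term is counted once

print(raw, deduped)
```

With rank-bm25 itself, the same workaround would presumably be `bm25.get_scores(dedupe_tokens(query))`.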

***

<sup>1</sup> Alternatively, [`list(dict.fromkeys(query))`][3], which preserves token order and so keeps scores reproducible: floating-point addition is not associative, and the iteration order of a `set` of strings can vary between runs.
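As a quick sanity check of the floating-point point above, this sketch (plain Python, independent of rank-bm25) shows that grouping the same three addends differently changes the result:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # add a and b first
right = a + (b + c)   # add b and c first

# The two groupings differ in the last bit under IEEE 754 doubles.
order_matters = left != right
print(left, right, order_matters)
```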

 [2]: http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
 [1]: https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/
 [3]: https://stackoverflow.com/a/17016257/657401
