Description
In the PR I submitted in #5615, I mistakenly omitted the smoothing constant (+1) from the Inverse Document Frequency (IDF) calculation (line 218 in BM25InvertedIndex.java). Without the +1, the IDF formula in the BM25 algorithm can produce zero or negative scores for common terms, because the value inside the logarithm drops to 1 or below once a term appears in half or more of the documents in the corpus. Zero or negative IDF scores distort the relevance ranking: common terms either have no effect on the final score or reduce it, which produces inaccurate search results. Adding the +1 keeps the value inside the logarithm above 1, so every term, even a frequent one, contributes positively to the score.
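A minimal sketch of the difference, assuming the commonly used BM25 IDF form log((N - n + 0.5) / (n + 0.5)), with the +1 added inside the logarithm in the smoothed variant. The class and method names here are hypothetical and are not taken from BM25InvertedIndex.java; the actual code at line 218 may be written differently.

```java
// Illustrative only: names and structure are assumptions, not the repository's code.
class IdfSmoothingSketch {

    // IDF without the smoothing constant: zero or negative once docFreq >= totalDocs / 2.
    static double idfWithoutSmoothing(int totalDocs, int docFreq) {
        return Math.log((totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // IDF with the +1 inside the logarithm: the argument stays above 1, so the result is positive.
    static double idfWithSmoothing(int totalDocs, int docFreq) {
        return Math.log(1.0 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int totalDocs = 10; // hypothetical corpus size
        // A rare term, a term in exactly half the documents, and a very common term.
        for (int docFreq : new int[] {1, 5, 9}) {
            System.out.printf(
                "docFreq=%d unsmoothed=%.4f smoothed=%.4f%n",
                docFreq,
                idfWithoutSmoothing(totalDocs, docFreq),
                idfWithSmoothing(totalDocs, docFreq));
        }
    }
}
```

In this sketch, a term found in 9 of 10 documents gets a negative unsmoothed IDF, and a term found in exactly 5 of 10 gets an IDF of zero, while the smoothed variant stays positive in both cases.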
Steps to reproduce
- Go to BM25InvertedIndexTest.java
- Run the function testSearchRanking()
- The test passes with the movie It's a Wonderful Life (docId: 6) ranked first by relevance score. According to the search algorithm, however, The Shawshank Redemption (docId: 1) should be ranked first.
- The expected values in the test case appear to be wrong as well.
- The same applies to the other movies in the search results.
Expected behavior
The movie The Shawshank Redemption (docId: 1) should be ranked first instead of It's a Wonderful Life (docId: 6).
Screenshots
Additional context
Required PR: #5696