-
Notifications
You must be signed in to change notification settings - Fork 102
Open
Description
Hi! I noticed something unexpected while writing unit tests for BM25Okapi. When I lowercase both the corpus and the query, the scores come back as all zeros, even though the tokens match. The same code with capitalized text works fine. Let me know if I did something wrong!
from rank_bm25 import BM25Okapi
# Capitalized version
corpus = ["Hello world", "foo bar", "hello foo"]
query = "Hello"
tokenized_corpus = [doc.split() for doc in corpus]
tokenized_query = query.split(" ")
bm25 = BM25Okapi(tokenized_corpus)
print("Capitalized:", bm25.get_scores(tokenized_query))Capitalized: [0.51082562 0. 0. ]
The exact same but lowercase only returns a zero score:
# Lowercased version
corpus = ["hello world", "foo bar", "hello foo"]
query = "hello"
tokenized_query = query.lower().split(" ")
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
print("Lowercased:", bm25.get_scores(tokenized_query))Lowercased: [0. 0. 0.]
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels