Change default preprocessor for BM25 #24875

ssamt · 2024-07-31T09:39:02Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Improve the default preprocessor for BM25. Currently it is just Python's "split" method, which only splits on whitespace.

Motivation

I noticed this while experimenting with multiple frameworks: LangChain performed far worse than the others on a simple task. It turned out the default preprocessor was the cause.

Currently, the default preprocessing function looks like this:
https://github.com/langchain-ai/langchain/blob/99eb31ec4121e146e007ea584b2d5f4aff2a4337/libs/community/langchain_community/retrievers/bm25.py#L11

def default_preprocessing_func(text: str) -> List[str]:
    return text.split()
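
To make the problem concrete, here is a small sketch of what whitespace-only splitting does to a query and a document that clearly match. The example strings are made up for illustration; the function body is the same as the default quoted above.

```python
from typing import List


def default_preprocessing_func(text: str) -> List[str]:
    # Same behavior as the current default: split on whitespace only.
    return text.split()


# Made-up example strings, purely for illustration.
query_tokens = default_preprocessing_func("bm25 ranking?")
doc_tokens = default_preprocessing_func("BM25 is a ranking function.")

# Case and trailing punctuation stay attached to the tokens, so the
# query and the document end up sharing no terms at all.
shared = set(query_tokens) & set(doc_tokens)
print(query_tokens)  # ['bm25', 'ranking?']
print(doc_tokens)    # ['BM25', 'is', 'a', 'ranking', 'function.']
print(shared)        # set()
```

Since BM25 scores are built from exact term matches, a document like this would get no credit for the query even though it is obviously relevant.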

For comparison, this is what Haystack does:

https://github.com/deepset-ai/haystack/blob/3d1ad10385e5545abef14b811d72a405c2f5c967/haystack/document_stores/in_memory/document_store.py#L63

bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
...
self.tokenizer = re.compile(bm25_tokenization_regex).findall
...
text = text.lower()
return self.tokenizer(text)

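Basically, it converts the text to lowercase and strips punctuation. The regex also drops single-character tokens, since it only matches runs of two or more word characters.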

These features seem basic and lightweight enough that I think it's worth adding them as the default.
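
As a rough sketch of what such a default could look like, the following mirrors the Haystack snippet quoted above. The function name `regex_preprocessing_func` is hypothetical and not something that exists in LangChain; only the regex is taken from the Haystack code.

```python
import re
from typing import List

# Same pattern as Haystack's bm25_tokenization_regex quoted above:
# runs of two or more word characters, delimited by word boundaries.
_TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def regex_preprocessing_func(text: str) -> List[str]:
    # Hypothetical name. Lowercase first, then keep only regex matches,
    # which drops punctuation and single-character tokens.
    return _TOKEN_RE.findall(text.lower())


print(regex_preprocessing_func("bm25 ranking?"))
# ['bm25', 'ranking']
print(regex_preprocessing_func("BM25 is a ranking function."))
# ['bm25', 'is', 'ranking', 'function']
```

As far as I can tell from bm25.py, the retriever's constructors already take a preprocess_func argument, so a function like this can probably be passed in as a workaround today. The request here is only about changing what happens when nothing is passed.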

Proposal (If applicable)

No response

Replies: 0 comments