I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it
Feature request
Enhance the default preprocessor for BM25. Currently it is just Python's built-in "split" function, which only breaks the text on whitespace.
Motivation
I noticed this while experimenting with multiple frameworks and saw that LangChain performed far worse on a simple task. It turned out this was the cause.
Currently, the default preprocessing function looks like this:
https://github.com/langchain-ai/langchain/blob/99eb31ec4121e146e007ea584b2d5f4aff2a4337/libs/community/langchain_community/retrievers/bm25.py#L11
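To make the problem concrete, here is a small sketch of whitespace-only tokenization. It assumes the linked default behaves like a plain `text.split()`, as described above; the name `whitespace_tokenize` and the sample strings are made up for illustration.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace only; case and punctuation are left untouched."""
    return text.split()


doc_tokens = whitespace_tokenize("LangChain supports BM25.")
query_tokens = whitespace_tokenize("bm25")

# "BM25." and "bm25" differ in case and trailing punctuation, so the two
# token sets share nothing and a term-matching scorer sees no overlap.
overlap = set(doc_tokens) & set(query_tokens)
print(doc_tokens, query_tokens, overlap)
```

With this kind of tokenization, a query only matches a document term when the casing and any attached punctuation happen to be identical.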
For comparison, this is what Haystack does:
https://github.com/deepset-ai/haystack/blob/3d1ad10385e5545abef14b811d72a405c2f5c967/haystack/document_stores/in_memory/document_store.py#L63
Basically, it seems to convert the text to lowercase and remove punctuation.
These features seem basic and lightweight enough that I think it's worth adding them as the default.
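A rough sketch of what such a default could look like, using only the standard library. This is not Haystack's implementation and not an existing LangChain function; the name `normalize_and_split` is hypothetical, and stripping only ASCII punctuation is an assumption made to keep the example short.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_and_split(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, then split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


tokens = normalize_and_split("LangChain supports BM25.")
print(tokens)  # ['langchain', 'supports', 'bm25']
```

One trade-off to be aware of: deleting punctuation outright joins hyphenated words ("state-of-the-art" becomes "stateoftheart"), so replacing punctuation with spaces may be preferable. If your installed `BM25Retriever` accepts a custom preprocessing function (I believe it takes a `preprocess_func` argument, but please check your version), a function like this could be passed in as a workaround until the default changes.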
Proposal (If applicable)
No response