The corpus you pass to BM25 is optional; it is only used during retrieval and when saving or loading the model. You can initialize the retriever without it:

retriever = bm25s.BM25()

That said, if you still want to pass the jsonl file to the retriever's initialization call but don't have enough memory to load the whole file, take a look at the bm25s.utils.corpus.JsonlCorpus class, which reads a jsonl file on demand through memory mapping:

bm25s/bm25s/utils/corpus.py

Lines 101 to 184 in 73c7dea

class JsonlCorpus:
"""
A class to read a jsonl file line by line using mmap, allowing extremely fast
access to any line in the file. For example, y…
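
To illustrate the memory-mapping idea described above, here is a minimal standard-library sketch. It is not the bm25s implementation: the class name MmapJsonl and the demo function are hypothetical, and the sketch assumes a non-empty file with one JSON object per line. It scans the file once to record where each line starts, then parses only the line that is requested.

```python
import json
import mmap
import os
import tempfile


class MmapJsonl:
    """Hypothetical helper (not bm25s.utils.corpus.JsonlCorpus):
    random access to the lines of a JSONL file through mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the file read-only; the OS pages data in on demand.
        # Note: mmap raises ValueError on an empty file.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Record the byte offset at which each line starts.
        self._offsets = []
        size = len(self._mm)
        pos = 0
        while pos < size:
            self._offsets.append(pos)
            newline = self._mm.find(b"\n", pos)
            pos = size if newline == -1 else newline + 1

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        if i < 0:
            i += len(self._offsets)
        if not 0 <= i < len(self._offsets):
            raise IndexError("line index out of range")
        start = self._offsets[i]
        last = i + 1 == len(self._offsets)
        end = len(self._mm) if last else self._offsets[i + 1]
        # Only the bytes of the requested line are decoded and parsed.
        return json.loads(self._mm[start:end])

    def close(self):
        self._mm.close()
        self._file.close()


def demo():
    """Write a tiny JSONL file, then read one line back by index."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for i in range(3):
                f.write(json.dumps({"id": i, "text": f"document {i}"}) + "\n")
        corpus = MmapJsonl(path)
        try:
            return len(corpus), corpus[2]["text"]
        finally:
            corpus.close()
```

In this sketch only the list of line offsets is held in memory, so the cost grows with the number of lines rather than with the size of the file.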

Answer selected by xhluca