Replies: 1 comment
-
Good question! I've thought about this problem in the past, but the main issue is that I need to use scipy to construct the CSR matrix internally, which is impossible to memory-map (or at least hard enough that I haven't been successful). Because the calculations run over all documents (e.g. term frequency), it's hard to chunk them in a way that stays as fast as the current implementation. Out of curiosity: how many docs can you fit so far with 200 GB of RAM? btw you can have …
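To make the chunking difficulty concrete, here is a minimal sketch of what a chunked construction could look like: the CSR term-frequency matrix itself can be built chunk by chunk, but global statistics such as document frequency (for IDF) would still require a full prior pass over the corpus. The function name `chunked_tf_csr` and the fixed-vocabulary setup are illustrative assumptions, not the library's actual code path.

```python
from collections import Counter

from scipy.sparse import csr_matrix


def chunked_tf_csr(doc_token_lists, vocab, chunk_size=1000):
    """Yield per-chunk CSR term-frequency matrices (illustrative sketch).

    The CSR arrays (data, indices, indptr) for each chunk are built
    independently, so only one chunk's documents are in memory at a time.
    Corpus-wide stats (e.g. document frequency) still need a separate
    first pass over all documents, which is the hard part to keep fast.
    """
    term_to_col = {t: j for j, t in enumerate(vocab)}
    for start in range(0, len(doc_token_lists), chunk_size):
        chunk = doc_token_lists[start:start + chunk_size]
        data, indices, indptr = [], [], [0]
        for tokens in chunk:
            counts = Counter(t for t in tokens if t in term_to_col)
            for term, tf in counts.items():
                indices.append(term_to_col[term])
                data.append(tf)
            indptr.append(len(indices))
        yield csr_matrix((data, indices, indptr),
                         shape=(len(chunk), len(vocab)))
```

The per-chunk matrices could in principle be written to disk and concatenated later; the open question in this thread is doing that without losing the speed of the single in-memory build.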
-
I have around 20 million documents in 500 jsonl files, totalling around 1.2 TB.
I do have memory restrictions: 200 GB of RAM.
If you could give me any idea of how to approach building a BM25 index under these constraints, that would be helpful.
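One hedged starting point under these constraints: a streaming first pass over the jsonl files that collects only the corpus-level BM25 statistics (document frequencies and document lengths), never holding the documents themselves in memory. The `text` field name and whitespace tokenization below are placeholder assumptions about the file layout.

```python
import json
from collections import Counter


def corpus_stats(jsonl_paths, text_field="text"):
    """Streaming first pass: BM25 global statistics only.

    Keeps just a term -> document-frequency Counter and a list of
    document lengths in memory, so a corpus much larger than RAM
    (e.g. 1.2 TB of jsonl) can be scanned file by file, line by line.
    `text_field` and .split() tokenization are illustrative assumptions.
    """
    df = Counter()      # term -> number of documents containing it
    doc_lengths = []    # tokens per document, for BM25 length norm
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = json.loads(line)[text_field].split()
                doc_lengths.append(len(tokens))
                df.update(set(tokens))  # set(): count each doc once
    return df, doc_lengths
```

With those statistics on hand, a second pass could score or index documents chunk by chunk; whether that stays fast enough is exactly the open question in the reply above.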