Replies: 1 comment
-
Good question! I've thought about this problem in the past, but the main issue is that I need to use scipy to construct the CSR matrix internally, which is impossible to memory-map (or at least hard enough that I haven't been successful). Because the calculations run over all documents (e.g. term frequency), it's hard to chunk them in a way that stays as fast as the current implementation. Out of curiosity: how many docs can you fit so far with 200 GB of RAM? btw you can have …
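To make the chunking difficulty concrete, here is a minimal sketch of what a chunked construction could look like: the CSR term-frequency matrix itself can be built chunk by chunk, but global statistics such as document frequency (for IDF) would still require a full prior pass over the corpus. The function name `chunked_tf_csr` and the fixed-vocabulary setup are illustrative assumptions, not the library's actual code path.

```python
from collections import Counter

from scipy.sparse import csr_matrix


def chunked_tf_csr(doc_token_lists, vocab, chunk_size=1000):
    """Yield per-chunk CSR term-frequency matrices (illustrative sketch).

    The CSR arrays (data, indices, indptr) for each chunk are built
    independently, so only one chunk's documents are in memory at a time.
    Corpus-wide stats (e.g. document frequency) still need a separate
    first pass over all documents, which is the hard part to keep fast.
    """
    term_to_col = {t: j for j, t in enumerate(vocab)}
    for start in range(0, len(doc_token_lists), chunk_size):
        chunk = doc_token_lists[start:start + chunk_size]
        data, indices, indptr = [], [], [0]
        for tokens in chunk:
            counts = Counter(t for t in tokens if t in term_to_col)
            for term, tf in counts.items():
                indices.append(term_to_col[term])
                data.append(tf)
            indptr.append(len(indices))
        yield csr_matrix((data, indices, indptr),
                         shape=(len(chunk), len(vocab)))
```

The per-chunk matrices could in principle be written to disk and concatenated later; the open question in this thread is doing that without losing the speed of the single in-memory build.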
-
I have around 20 million documents in 500 jsonl files, totalling around 1.2 TB.
I do have memory restrictions: 200 GB of RAM.
If you could give me any idea of how to approach building a BM25 index under these constraints, that would be helpful.
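One hedged starting point under these constraints: a streaming first pass over the jsonl files that collects only the corpus-level BM25 statistics (document frequencies and document lengths), never holding the documents themselves in memory. The `text` field name and whitespace tokenization below are placeholder assumptions about the file layout.

```python
import json
from collections import Counter


def corpus_stats(jsonl_paths, text_field="text"):
    """Streaming first pass: BM25 global statistics only.

    Keeps just a term -> document-frequency Counter and a list of
    document lengths in memory, so a corpus much larger than RAM
    (e.g. 1.2 TB of jsonl) can be scanned file by file, line by line.
    `text_field` and .split() tokenization are illustrative assumptions.
    """
    df = Counter()      # term -> number of documents containing it
    doc_lengths = []    # tokens per document, for BM25 length norm
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = json.loads(line)[text_field].split()
                doc_lengths.append(len(tokens))
                df.update(set(tokens))  # set(): count each doc once
    return df, doc_lengths
```

With those statistics on hand, a second pass could score or index documents chunk by chunk; whether that stays fast enough is exactly the open question in the reply above.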