Hi, I have a question about large-scale LSH indexing. If I have billions of documents, I suppose even 1 TB of RAM is not enough to build an in-memory LSH index. Is there a recommended way to use datasketch for this scenario? Thank you.
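One option worth checking is datasketch's external storage support: `MinHashLSH` accepts a `storage_config` argument that can point the hash tables at a backend such as Redis instead of in-process dictionaries, so the bucket data does not have to fit in the Python process's RAM. A minimal sketch, assuming a Redis server at `localhost:6379` (the host/port here are placeholders for your deployment):

```python
# Hedged sketch: point datasketch's MinHashLSH at Redis via storage_config.
# The host/port below are assumptions; adjust for your environment.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# With a live Redis instance, the index would be built roughly like this:
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.5, num_perm=128, storage_config=storage_config)
# lsh.insert("doc-1", minhash)  # buckets now live in Redis, not process RAM
```

This moves the index out of process memory, though Redis itself still needs enough memory (or a disk-backed alternative) to hold the buckets, so for billions of documents you may also need to shard across machines.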
I also opened issue #206: for a small subset on my local machine (a 6 GB pickle file of pre-computed MinHashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM; using LeanMinHash instead takes 26 GB. By simple extrapolation, indexing the full 600 GB of pre-computed MinHashes would take about 3 TB of RAM, which is just too much. Mapping the index to disk might be a viable solution. Looking forward to any suggestions, thank you.
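The extrapolation above is linear in the input size; spelling it out with the numbers from this report:

```python
# Back-of-envelope check of the memory extrapolation above.
pickle_gb = 6          # pre-computed MinHashes on disk (local subset)
ram_gb_minhash = 31    # RAM after inserting that subset with MinHash
ram_gb_lean = 26       # same subset, using LeanMinHash
full_corpus_gb = 600   # full set of pre-computed MinHashes

scale = full_corpus_gb / pickle_gb              # 100x more data
tb_minhash = scale * ram_gb_minhash / 1024      # ~3.0 TB projected
tb_lean = scale * ram_gb_lean / 1024            # ~2.5 TB projected
print(tb_minhash, tb_lean)
```

Even with LeanMinHash the projected footprint stays in the terabyte range, which is why an external or disk-backed storage layer looks necessary here.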