Hi, I have a question about large-scale LSH indexing. If I have billions of documents, I suppose even 1 TB of RAM is not enough to build an in-memory LSH index. Is there a recommended way to use datasketch for this scenario? Thank you.
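One option worth checking is datasketch's external storage support: `MinHashLSH` accepts a `storage_config` argument that can point the hash tables at a backend such as Redis instead of in-process dictionaries, so the bucket data does not have to fit in the Python process's RAM. A minimal sketch, assuming a Redis server at `localhost:6379` (the host/port here are placeholders for your deployment):

```python
# Hedged sketch: point datasketch's MinHashLSH at Redis via storage_config.
# The host/port below are assumptions; adjust for your environment.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# With a live Redis instance, the index would be built roughly like this:
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.5, num_perm=128, storage_config=storage_config)
# lsh.insert("doc-1", minhash)  # buckets now live in Redis, not process RAM
```

This moves the index out of process memory, though Redis itself still needs enough memory (or a disk-backed alternative) to hold the buckets, so for billions of documents you may also need to shard across machines.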
I also opened issue #206: for a small subset on my local machine (a 6 GB pickle file of pre-computed MinHashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM; using LeanMinHash instead takes 26 GB. By simple extrapolation, indexing the full 600 GB of pre-computed MinHashes would take about 3 TB of RAM, which is just too much. Mapping the index to disk might be a viable solution. Looking forward to any suggestions, thank you.
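The extrapolation above is linear in the input size; spelling it out with the numbers from this report:

```python
# Back-of-envelope check of the memory extrapolation above.
pickle_gb = 6          # pre-computed MinHashes on disk (local subset)
ram_gb_minhash = 31    # RAM after inserting that subset with MinHash
ram_gb_lean = 26       # same subset, using LeanMinHash
full_corpus_gb = 600   # full set of pre-computed MinHashes

scale = full_corpus_gb / pickle_gb              # 100x more data
tb_minhash = scale * ram_gb_minhash / 1024      # ~3.0 TB projected
tb_lean = scale * ram_gb_lean / 1024            # ~2.5 TB projected
print(tb_minhash, tb_lean)
```

Even with LeanMinHash the projected footprint stays in the terabyte range, which is why an external or disk-backed storage layer looks necessary here.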