Thanks a lot for your work :) amazing job!
Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?
More context:
I have a large file with millions of lines of text, and each of those lines is indexed into the LSH because I'm trying to remove duplicate lines.
Once I have inserted all of those lines into the LSH, I'd love to parallelize the deduplication process operating on the same LSH object.
In each of the workers I query whether there are candidate sentences, and if there are and their keys differ from the current line's key, I remove the current line from the LSH.
How can I do that in parallel using your lib?
Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory: the index is copied into each process, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.
Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize the processes' access to the database, right? Any hints here? :)
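In case it helps frame the question: my understanding is that `MinHashLSH` accepts a `storage_config` argument that can point the index at an external store, which would let workers share one index without each holding a copy. A sketch of what I have in mind, assuming a Redis server on `localhost:6379` (untested, the exact config keys may be off):

```python
# Hypothetical sketch: back the LSH index with Redis so that multiple
# worker processes can share it instead of copying 70 GB each.
# Assumes a Redis server is running on localhost:6379.
from datasketch import MinHashLSH

lsh = MinHashLSH(
    threshold=0.8,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)
```

If that is the intended pattern, is the remaining concern only making the concurrent query/remove calls from different processes safe?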