
Process-safe, no-mem-bloat implementation of LSH #231

@gordicaleksa

Description


Thanks a lot for your work :) amazing job!

Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?

More context:
I have a large file with millions of lines of text; each line is indexed into the LSH, as I'm trying to remove duplicate lines.

Once I've inserted all of those lines into the LSH, I'd love to parallelize the deduplication process operating on the same LSH object.

In each worker I query for candidate sentences; if there are any whose keys differ from the current line's, I remove the current line from the LSH.

How can I do that in parallel using your lib?
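The per-worker logic above can be sketched single-process with a toy banded LSH. Everything below (`ToyLSH`, the 3-character shingling, the md5-based "permutations", the band count) is a hypothetical stand-in for illustration, not the library's API — a real pipeline would use `datasketch.MinHash`/`MinHashLSH`:

```python
import hashlib
from collections import defaultdict

def minhash(text, num_perm=8):
    # Toy MinHash over 3-character shingles; a real pipeline would use
    # datasketch.MinHash instead of md5-seeded "permutations".
    shingles = {text[i:i + 3] for i in range(max(1, len(text) - 2))}
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm))

class ToyLSH:
    # Banded LSH index: keys whose signatures agree on any band collide.
    def __init__(self, num_bands=4):
        self.num_bands = num_bands
        self.buckets = defaultdict(set)
        self.keys = {}

    def _bands(self, sig):
        r = len(sig) // self.num_bands
        return [(b, sig[b * r:(b + 1) * r]) for b in range(self.num_bands)]

    def insert(self, key, sig):
        self.keys[key] = sig
        for band in self._bands(sig):
            self.buckets[band].add(key)

    def query(self, sig):
        out = set()
        for band in self._bands(sig):
            out |= self.buckets.get(band, set())
        return out

    def remove(self, key):
        for band in self._bands(self.keys.pop(key)):
            self.buckets[band].discard(key)

lines = {"a": "the quick brown fox",
         "b": "the quick brown fox",   # exact duplicate of "a"
         "c": "something else"}
lsh = ToyLSH()
sigs = {k: minhash(v) for k, v in lines.items()}
for k, s in sigs.items():
    lsh.insert(k, s)

# Dedup pass: if any candidate with a *different* key exists, drop this line.
for k in list(sigs):
    if k in lsh.keys and (lsh.query(sigs[k]) - {k}):
        lsh.remove(k)

kept = sorted(lsh.keys)  # one representative per duplicate group survives
```

Run sequentially, this keeps one representative per duplicate group; the open question in this thread is exactly how to run that loop safely across processes without copying the index.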

Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory: the index is copied into each process, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.

Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize process access to the database. Any hints here? :)
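On the Redis/Cassandra question: datasketch's `MinHashLSH` does accept a `storage_config` that keeps the index in Redis (and, in more recent releases, Cassandra), so workers share one server-side copy instead of each unpickling 70 GB. A minimal sketch, assuming a Redis server on `localhost:6379` and the optional `redis` client installed; the `threshold`/`num_perm` values are placeholders:

```python
# Assumed setup: pip install datasketch redis, plus a running Redis server.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# With a live Redis server, the index lives server-side and forked workers
# hold only a connection, not a copy of the 70 GB index:
#
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config=storage_config)
```

Concurrent writers would still need coordination (e.g. a Redis lock, or partitioning keys across workers); the exact `storage_config` fields supported depend on the installed datasketch version, so check its docs.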
