
Process-safe, no-mem-bloat implementation of LSH #231

@gordicaleksa

Description


Thanks a lot for your work :) amazing job!

Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?

More context:
I have a large file with millions of lines of text; each line is indexed into the LSH, as I'm trying to remove duplicate lines.

Once I've inserted all of those lines into the LSH, I'd love to parallelize the deduplication process operating on the same LSH object.

In each worker I query for candidate sentences; if there are any whose keys differ from the current line's, I remove the current line from the LSH.

How can I do that in parallel using your lib?
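The per-worker logic above can be sketched single-process with a toy banded LSH. Everything below (`ToyLSH`, the 3-character shingling, the md5-based "permutations", the band count) is a hypothetical stand-in for illustration, not the library's API — a real pipeline would use `datasketch.MinHash`/`MinHashLSH`:

```python
import hashlib
from collections import defaultdict

def minhash(text, num_perm=8):
    # Toy MinHash over 3-character shingles; a real pipeline would use
    # datasketch.MinHash instead of md5-seeded "permutations".
    shingles = {text[i:i + 3] for i in range(max(1, len(text) - 2))}
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm))

class ToyLSH:
    # Banded LSH index: keys whose signatures agree on any band collide.
    def __init__(self, num_bands=4):
        self.num_bands = num_bands
        self.buckets = defaultdict(set)
        self.keys = {}

    def _bands(self, sig):
        r = len(sig) // self.num_bands
        return [(b, sig[b * r:(b + 1) * r]) for b in range(self.num_bands)]

    def insert(self, key, sig):
        self.keys[key] = sig
        for band in self._bands(sig):
            self.buckets[band].add(key)

    def query(self, sig):
        out = set()
        for band in self._bands(sig):
            out |= self.buckets.get(band, set())
        return out

    def remove(self, key):
        for band in self._bands(self.keys.pop(key)):
            self.buckets[band].discard(key)

lines = {"a": "the quick brown fox",
         "b": "the quick brown fox",   # exact duplicate of "a"
         "c": "something else"}
lsh = ToyLSH()
sigs = {k: minhash(v) for k, v in lines.items()}
for k, s in sigs.items():
    lsh.insert(k, s)

# Dedup pass: if any candidate with a *different* key exists, drop this line.
for k in list(sigs):
    if k in lsh.keys and (lsh.query(sigs[k]) - {k}):
        lsh.remove(k)

kept = sorted(lsh.keys)  # one representative per duplicate group survives
```

Run sequentially, this keeps one representative per duplicate group; the open question in this thread is exactly how to run that loop safely across processes without copying the index.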

Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory: the index is copied into each process, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.

Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize process access to the database. Any hints here? :)
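On the Redis/Cassandra question: datasketch's `MinHashLSH` does accept a `storage_config` that keeps the index in Redis (and, in more recent releases, Cassandra), so workers share one server-side copy instead of each unpickling 70 GB. A minimal sketch, assuming a Redis server on `localhost:6379` and the optional `redis` client installed; the `threshold`/`num_perm` values are placeholders:

```python
# Assumed setup: pip install datasketch redis, plus a running Redis server.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# With a live Redis server, the index lives server-side and forked workers
# hold only a connection, not a copy of the 70 GB index:
#
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config=storage_config)
```

Concurrent writers would still need coordination (e.g. a Redis lock, or partitioning keys across workers); the exact `storage_config` fields supported depend on the installed datasketch version, so check its docs.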
