Parallelize arroy again* #130
Conversation
@irevoire just curious, what tool are you using to profile this?
It's on Linux; I was using valgrind + cachegrind and visualizing the output with kcachegrind.
Once merged I'll need to re-implement the progress properly. An idea I just had is to precompute the total number of items we must insert (nb_trees * to_insert) and then decrement this number in parallel every time we write a descendant. |
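A minimal sketch of that idea, assuming a plain atomic counter shared between the worker threads (the `Progress` type and its method names are hypothetical, not arroy's actual API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical progress tracker: the total is precomputed as
/// nb_trees * to_insert, and every thread decrements it whenever it
/// writes a descendant.
struct Progress {
    total: u64,
    remaining: AtomicU64,
}

impl Progress {
    fn new(nb_trees: u64, to_insert: u64) -> Self {
        let total = nb_trees * to_insert;
        Self { total, remaining: AtomicU64::new(total) }
    }

    /// Called from any worker thread, no lock needed.
    fn on_descendant_written(&self) {
        self.remaining.fetch_sub(1, Ordering::Relaxed);
    }

    /// Read from a reporting thread to display the progress.
    fn done_ratio(&self) -> f64 {
        let remaining = self.remaining.load(Ordering::Relaxed);
        1.0 - remaining as f64 / self.total as f64
    }
}
```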



What changed
Since this is, once again, almost a complete rewrite of the indexing algorithm, I'll try to detail everything I did here to help the reviewer; more information on the "why" is available below.
The main idea of this rework is to remove all the writes to the database while we're indexing. This means we don't have to refresh the leaves and tree nodes during the whole indexing process.
Since we no longer lose access to the leaves and tree nodes, it also means we can now parallelize arroy.
An almost exhaustive list of the new components I had to develop to make the algorithm work:
How does an indexing process work now:
What about giving memory/nb_threads RAM to each thread and then letting them do their loops on their side without any synchronization? It means we would load way fewer elements than we actually could and multiply the number of times we have to traverse our tree, but maybe it's not an issue?
Some notes:
Context
After reducing the number of writes we do to LMDB, here's a view of where the time is lost during an indexing process.
In this PR, I'll get rid of all the writes we do in LMDB and instead read from my own TmpFile. That means we don't have to refresh the leaves anymore, saving 27% of the time. It should probably also remove the 6% spent creating and removing temp files, as we will have only one file per thread instead.
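As an illustration only, here is roughly what a per-thread temporary file could look like; it borrows the `TmpFile` name from above, assumes the `tempfile` crate, and is a sketch rather than the actual implementation in this PR:

```rust
use std::fs::File;
use std::io::{self, BufWriter, Read, Seek, SeekFrom, Write};

/// Hypothetical per-thread buffer: serialized tree nodes are appended to one
/// temporary file instead of being written to LMDB mid-build, so the read
/// transaction on the leaves never needs to be refreshed.
struct TmpFile {
    writer: BufWriter<File>,
    /// (offset, length) of every node appended so far.
    offsets: Vec<(u64, u64)>,
    position: u64,
}

impl TmpFile {
    fn new() -> io::Result<Self> {
        let file = tempfile::tempfile()?;
        Ok(Self { writer: BufWriter::new(file), offsets: Vec::new(), position: 0 })
    }

    /// Append one serialized node; nothing touches the database here.
    fn push(&mut self, node: &[u8]) -> io::Result<()> {
        self.writer.write_all(node)?;
        self.offsets.push((self.position, node.len() as u64));
        self.position += node.len() as u64;
        Ok(())
    }

    /// At the end of the build, read everything back so the single write
    /// transaction can copy the nodes into LMDB in one pass.
    fn into_bytes(self) -> io::Result<(Vec<(u64, u64)>, Vec<u8>)> {
        let mut file = self.writer.into_inner().map_err(|e| e.into_error())?;
        file.seek(SeekFrom::Start(0))?;
        let mut bytes = Vec::with_capacity(self.position as usize);
        file.read_to_end(&mut bytes)?;
        Ok((self.offsets, bytes))
    }
}
```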
That should also open the way to better parallelization later on, as we won't have any common state between the threads. Only the queue of large descendants to explode will be shared behind a mutex, where each thread can pop and push, I guess 🤔
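For instance, that shared queue could be as simple as a Vec behind a Mutex (again a hypothetical sketch, not the code in this PR):

```rust
use std::sync::Mutex;

/// Hypothetical shared state: the only thing the worker threads share is the
/// queue of descendants that turned out too large and must be split again.
struct LargeDescendants<T> {
    queue: Mutex<Vec<T>>,
}

impl<T> LargeDescendants<T> {
    fn new() -> Self {
        Self { queue: Mutex::new(Vec::new()) }
    }

    /// Any thread can push a descendant that still needs to be exploded.
    fn push(&self, descendant: T) {
        self.queue.lock().unwrap().push(descendant);
    }

    /// Any thread can pop the next descendant to work on, if one is left.
    fn pop(&self) -> Option<T> {
        self.queue.lock().unwrap().pop()
    }
}
```

Each rayon worker would then loop on `pop()` until the queue is empty, pushing back any split that is still too large.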
After first implementation
I ran some benchmarks on my MacBook Pro with unlimited RAM and here are the results:
It needs more profiling
First investigation
The more chunks we index, the more time is spent in rayon... waiting.
Side quest
Before each chunk starts processing, we see a "tail" where we're barely doing anything.
That's the time it takes to check whether any of the updated items have been removed. The bigger the database is, the more time it takes. On the last batch, it takes 1.6s to insert 10k items into a 90k-item database where there is actually nothing to remove.
Fixed in 9c70222
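I can't tell from here exactly what 9c70222 does, but one plausible way to avoid the per-item lookups is a single bitmap intersection (a sketch using the `roaring` crate, not necessarily what the commit does):

```rust
use roaring::RoaringBitmap;

/// Hypothetical short-circuit: instead of probing the database once per
/// updated item, intersect the updated-items bitmap with the deleted-items
/// bitmap; when nothing was deleted, the result is empty and we move on.
fn items_to_remove(updated: &RoaringBitmap, deleted: &RoaringBitmap) -> RoaringBitmap {
    updated & deleted
}
```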

The bigger the database, the bigger the gain, basically.
In the end, why are we still worse than v0.6.1?
After a lot of profiling where nothing really stood out, I finally noticed the main difference between 0.6.1 and this PR.
It's just the number of trees generated.
In #105 (which was merged after 0.6.1 but was never officially released), I tried to guess the number of trees we would generate ahead of time.
Looks like I was bad at it: I now end up generating 2 to 3 times more trees than 0.6, which also explains why the relevancy was better, I guess, even if I don't understand why it's also better than main.
After fixing the number of trees to, let's say, 300, here are the results of this PR against 0.6 with 10 cores:
Both are really close to each other, with no clear winner.
It's hard to see on the chart, but it seems that 0.6.1 is still twice as fast as this PR when it comes to inserting very few elements in the database with a lot of threads.
With 2 threads available it's similar, and this PR ends up being quicker at processing very few elements:

In conclusion, I'll rebase on main, implement the error handling, and we'll be able to merge as-is.
Then we'll absolutely need to work on #134