
Conversation

@irevoire commented on May 28, 2025

What changed

Since this is, once again, almost a complete rewrite of the indexing algorithm, I'll try to detail everything I did here to help the reviewer; more information on the "why" is available below.

The main idea of this rework is to remove all the writes to the database while we're indexing. This means we don't have to refresh the leaves and tree nodes during the whole indexing process.
Since we no longer lose access to the leaves and tree nodes, it also means we can now parallelize arroy.

An almost exhaustive list of the new components I had to develop to make the algorithm work:

  • We can now read from and write to a temporary file (a rough sketch of this pattern follows the list):
    • The writes always happen at the end of the file and are therefore efficient with a BufWriter.
    • The reads always look at the last tree built, so by reading the file from the end, the worst case is reaching an element at the end of the tree we just built.
    • Reading is a plain seek + read; there is no trick involved to make it faster, but it never showed up in my benchmarks, so I believe it's efficient enough.
  • Instead of creating one temporary file per tree like we were doing before, we now create one file per thread (thanks to the thread_local crate).
  • Instead of parallelizing at the tree level like before, all the threads now enqueue a task for each of the large descendants. Rayon is used a lot.
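
To illustrate the append-only write / seek-read pattern, here is a minimal sketch in Rust. It assumes the tempfile crate and a made-up TmpNodes wrapper; arroy's real types differ, and in the actual code one such file would live in a thread-local per rayon thread:

```rust
use std::fs::File;
use std::io::{self, BufWriter, Read, Seek, SeekFrom, Write};

/// Hypothetical per-thread scratch file: nodes are only ever appended,
/// and reads seek back to the offset of a previously written node.
struct TmpNodes {
    // Writes are buffered because they always happen at the end of the file.
    writer: BufWriter<File>,
    // (offset, length) of every node written so far, newest last.
    offsets: Vec<(u64, u64)>,
    next_offset: u64,
}

impl TmpNodes {
    fn new() -> io::Result<Self> {
        let file = tempfile::tempfile()?;
        Ok(Self { writer: BufWriter::new(file), offsets: Vec::new(), next_offset: 0 })
    }

    /// Append a serialized node at the end of the file and remember where it is.
    fn push(&mut self, bytes: &[u8]) -> io::Result<usize> {
        self.writer.write_all(bytes)?;
        self.offsets.push((self.next_offset, bytes.len() as u64));
        self.next_offset += bytes.len() as u64;
        Ok(self.offsets.len() - 1)
    }

    /// Read a node back: flush the pending writes, then seek + read.
    fn get(&mut self, id: usize) -> io::Result<Vec<u8>> {
        self.writer.flush()?;
        let (offset, len) = self.offsets[id];
        let file = self.writer.get_mut();
        file.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0; len as usize];
        file.read_exact(&mut buf)?;
        // Return the cursor to the end so the next append stays sequential.
        file.seek(SeekFrom::End(0))?;
        Ok(buf)
    }
}

fn main() -> io::Result<()> {
    let mut nodes = TmpNodes::new()?;
    let a = nodes.push(b"first node")?;
    let b = nodes.push(b"second node")?;
    assert_eq!(nodes.get(a)?, b"first node".to_vec());
    assert_eq!(nodes.get(b)?, b"second node".to_vec());
    Ok(())
}
```

Because writes are strictly sequential, the BufWriter stays effective, and a read only costs one flush plus a seek back to the end of the file.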

How an indexing process works now:

  1. Retrieve the items to remove and remove them immediately with the rwtxn.
  2. Insert the new items into the current trees without creating new nodes. The descendants to update are stored in RAM. The parallelization is done at the tree level:
     • We start by taking as many elements as fit in RAM.
     • Every thread inserts these elements into its trees.
     • We wait for all threads to finish, then select a new batch of items to insert.
     • Should we change that to instead give memory/nb_threads of RAM to each thread and let them do their loops on their own without any synchronization? It means we would load far fewer elements than we actually could and multiply the number of times we have to traverse our tree, but maybe it's not an issue?
  3. From this insertion, we've extracted a list of descendants to update.
  4. We loop over these descendants, and when a descendant is too large to be stored as-is, we spawn a new rayon task that'll be in charge of creating a tree from this descendant.
  5. All the threads start processing the oversized descendants as soon as possible. The steps to process a descendant are the following (a rough sketch of this loop is shown after the list):
     • Create a new tree from this descendant in a tmp file, using as much memory as possible (total memory divided by the number of threads).
     • Insert all the remaining items that are too large, in batches, into the tree while reading the tree from the tmp file. The descendants are stored in RAM.
     • Iterate over the list of descendants: when a descendant is large, enqueue a new task in rayon; otherwise, store it in the file.
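
As referenced in step 5, here is a rough, non-authoritative sketch of the "process a large descendant" recursion with rayon. The constant, types, and helpers (build_tree_in_tmp_file, write_descendant_to_tmp_file) are placeholders, not arroy's actual API:

```rust
use rayon::prelude::*;

// Illustrative constant and types; not arroy's real ones.
const MAX_DESCENDANT_SIZE: usize = 64;
type ItemId = u32;

/// Process one oversized descendant: build a sub-tree from its items,
/// keep small descendants in the per-thread tmp file, and re-enqueue
/// the ones that are still too large as new rayon tasks.
fn process_large_descendant(items: Vec<ItemId>) {
    // Build a sub-tree for these items; the helper returns the descendants
    // (leaf groups) produced along the way.
    let sub_descendants = build_tree_in_tmp_file(&items);

    // Small descendants go straight to the file; large ones become new tasks.
    let (large, small): (Vec<Vec<ItemId>>, Vec<Vec<ItemId>>) = sub_descendants
        .into_iter()
        .partition(|d| d.len() > MAX_DESCENDANT_SIZE);

    for d in &small {
        write_descendant_to_tmp_file(d);
    }

    // Every oversized descendant is enqueued as its own rayon task so that
    // any idle thread can pick it up as soon as possible.
    large.into_par_iter().for_each(process_large_descendant);
}

// Stubs standing in for the real tree-building and file-writing code.
fn build_tree_in_tmp_file(items: &[ItemId]) -> Vec<Vec<ItemId>> {
    items.chunks(MAX_DESCENDANT_SIZE / 2).map(|c| c.to_vec()).collect()
}
fn write_descendant_to_tmp_file(_descendant: &[ItemId]) {}

fn main() {
    // Pretend the insertion phase produced one huge descendant of 10k items.
    process_large_descendant((0..10_000).collect());
}
```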

Some notes:

  • Everything we see below was done before merging Add bias info to SplitPlaneNormal #132, to keep the benchmarks fair.
  • I want to merge this PR as-is, but there will be a few follow-ups before the release that should improve the performance even more: https://github.com/meilisearch/arroy/milestone/5
  • To handle errors in rayon: when an error occurs, the thread encountering it writes it to a channel, and all the other threads stop working as soon as the channel contains something. I had to use a crossbeam channel for that, but it doesn't add any dependency since rayon already uses crossbeam. A sketch of this pattern follows the list.
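
A minimal sketch of that error-propagation pattern, assuming the crossbeam-channel crate (already a transitive dependency of rayon); the error type and the work function are illustrative:

```rust
use crossbeam_channel::{bounded, Receiver, Sender};
use rayon::prelude::*;

#[derive(Debug)]
struct IndexingError(String);

/// Every task checks the channel before doing real work: if another
/// thread already reported an error, it stops immediately.
fn process_chunk(
    chunk: &[u32],
    err_snd: &Sender<IndexingError>,
    err_rcv: &Receiver<IndexingError>,
) {
    if !err_rcv.is_empty() {
        return; // another thread failed, stop all work
    }
    if let Err(e) = do_work(chunk) {
        // Ignore the send error: the channel is bounded to one message,
        // we only care that *an* error is recorded.
        let _ = err_snd.try_send(e);
    }
}

fn do_work(_chunk: &[u32]) -> Result<(), IndexingError> {
    Ok(())
}

fn main() {
    let (err_snd, err_rcv) = bounded::<IndexingError>(1);
    let data: Vec<Vec<u32>> = (0..100).map(|i| vec![i]).collect();

    data.par_iter()
        .for_each(|chunk| process_chunk(chunk, &err_snd, &err_rcv));

    // After the parallel section, surface the first recorded error, if any.
    if let Ok(e) = err_rcv.try_recv() {
        eprintln!("indexing failed: {e:?}");
    }
}
```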

Context

After reducing the number of writes we do to LMDB, here's a view of where the time is lost during an indexing process.

[profiler screenshot: time breakdown of an indexing process]

  • 47% of the time is spent building the trees. In this function, almost all of the time goes to computing the Cosine distance. We can't really optimize that further, but it's the part we could multi-thread in the future.
  • 27% of the time is spent refreshing our leaves after a write to LMDB. Those are also synchronization points where we cannot properly multi-thread.
  • An additional 4.5% + 2.1% is lost creating and removing TmpFiles during the loop.

In this PR, I'll get rid of all the writes we do to LMDB and instead read from my own TmpFile.
That means we no longer have to refresh the leaves, saving 27% of the time, and probably also the ~6% spent creating and removing temp files, as we will have only one file per thread instead.

That should also open the way to better parallelization later on, since there won't be any state shared between the threads. Only the queue of large descendants to explode will be shared, behind a mutex from which each thread can pop and push, I guess 🤔

After first implementation

I ran some benchmarks on my MacBook Pro with unlimited RAM and here are the results:

version   100k vectors             10k * 10 vectors
v0.6.1    [benchmark screenshot]   [benchmark screenshot]
This PR   [benchmark screenshot]   [benchmark screenshot]
  • Sadly, I didn't save the previous benchmarks I had done, but this PR is about 5 times faster than right before it, which means my 10 cores are being used, though not at their full potential.
  • I would say we're around twice as slow as v0.6.1 (which is the fastest version we've made yet).
  • The relevancy increased a bit, but I don't know why yet.

It needs more profiling.

First investigation

[profiler screenshot]

The more chunks we index, the more time is spent in rayon... waiting.

Side quest

[profiler screenshot]

Before each chunk starts processing, we see a "tail" where we're barely doing anything.
That's the time it takes to check whether any of the updated items have been removed. The bigger the database, the more time it takes. On the last batch, it takes 1.6s to insert 10k items into a 90k-item database where there is actually nothing to remove.

Fixed in 9c70222
[profiler screenshot after the fix]

  => Vectors:         77108,   154216,   231324,   308432,   385540,   462648,   539756,   616864,   693972,   771080,   771089
  => Before:          93.10s,   62.29s,   62.89s,   67.26s,   65.28s,   74.95s,   78.85s,   84.32s,   85.17s,   89.61s,   34.91s
  => After:           87.72s,   57.98s,   57.70s,   65.85s,   58.20s,   66.44s,   70.30s,   68.51s,   66.90s,   75.74s,   14.60s

The bigger the database, the bigger the gain, basically.

[screenshot]

In the end, why are we still worse than v0.6.1?

After a lot of profiling where nothing really stood out, I finally noticed the main difference between 0.6.1 and this PR:
it's simply the number of trees generated.
In #105 (which was merged after 0.6.1 but never officially released), I tried to guess ahead of time the number of trees we would generate.
It looks like I was bad at it: I now end up generating 2-3 times more trees than 0.6, which also explains why the relevancy was better, I guess, even if I don't understand why it's also better than main.
Fixing the number of trees to, let's say, 300, here are the results of this PR against 0.6 with 10 cores:

[benchmark chart]

Both are really close to each other, with no clear winner.
It's hard to see on the chart, but it seems that 0.6.1 is still twice as fast as this PR when inserting very few elements into the database with a lot of threads.

With 2 threads available, it's similar, and this PR ends up being quicker at processing very few elements:
[benchmark chart]


In conclusion, I'll rebase on main, implement the error handling, and we'll be able to merge as-is.
Then we'll absolutely need to work on #134.

@irevoire irevoire changed the title Remove a first batch of useless writes to tmp files and LMDB Remove writes to LMDB until the very end of the indexing process May 28, 2025
@nnethercott commented:

@irevoire just curious, what tool are you using to profile this?

@irevoire replied:

It's Instruments, but it's only available on macOS.
Here's the view it gives:

[Instruments screenshot]

On Linux, I was using valgrind+cachegrind and was visualizing the output with kcachegrind.

@irevoire commented:

Something else I noticed is that 12% of the time is spent on this function:
[profiler screenshots]

We can probably optimize that a bit by sharing the allocation between successive calls, or at least by specifying the capacity if that helps (but Rust should be able to guess it in this case 🤔).
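
A quick sketch of the two options mentioned above (reusing the allocation vs. pre-sizing it); the function bodies are placeholders, not the real arroy code:

```rust
/// Allocating a fresh Vec on every call: simple, but it allocates in a hot loop.
fn split_ids_fresh(ids: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len()); // at least size the allocation
    out.extend(ids.iter().copied().filter(|id| *id % 2 == 0));
    out
}

/// Sharing the allocation between successive calls: the caller owns the
/// buffer, we only clear it, so the backing memory is reused.
fn split_ids_reuse(ids: &[u32], out: &mut Vec<u32>) {
    out.clear(); // keeps the capacity acquired in previous calls
    out.extend(ids.iter().copied().filter(|id| *id % 2 == 0));
}

fn main() {
    let ids: Vec<u32> = (0..1000).collect();
    let mut buf = Vec::new();
    for _ in 0..10 {
        let _fresh = split_ids_fresh(&ids);   // allocates every iteration
        split_ids_reuse(&ids, &mut buf);      // no new allocation after the first pass
    }
}
```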

@irevoire commented:

Once merged, I'll need to re-implement the progress reporting properly.

An idea I just had is to precompute the total number of items we must insert (nb_trees * to_insert) and then decrement that number in parallel every time we write a descendant.
It's easy; it won't really help us understand precisely what's going on, but it will at least give the user a rough idea that something is happening.
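
A rough sketch of that idea with an atomic counter; nb_trees and to_insert are placeholder values here, not taken from the real index:

```rust
use rayon::prelude::*;
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // Hypothetical sizes: the real values would come from the index.
    let nb_trees: u64 = 300;
    let to_insert: u64 = 10_000;
    let total = nb_trees * to_insert;

    // Precomputed once, then decremented from every thread.
    let remaining = AtomicU64::new(total);

    (0..nb_trees).into_par_iter().for_each(|_| {
        for _ in 0..to_insert {
            // ... write one descendant, then record the progress ...
            let left = remaining.fetch_sub(1, Ordering::Relaxed) - 1;
            if left % (total / 10) == 0 {
                println!("{:.0}% done", 100.0 * (total - left) as f64 / total as f64);
            }
        }
    });
}
```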

@irevoire irevoire changed the title Remove writes to LMDB until the very end of the indexing process Parallelize arroy Jun 12, 2025
@irevoire irevoire changed the title Parallelize arroy Parallelize arroy again* Jun 12, 2025
@irevoire irevoire added this to the v1.7.0 milestone Jun 12, 2025
@irevoire irevoire marked this pull request as ready for review June 16, 2025 14:41
@irevoire irevoire added indexing Everything related to indexing performance relevancy labels Jun 16, 2025
@irevoire irevoire requested a review from ManyTheFish June 17, 2025 15:38
Co-authored-by: Many the fish <[email protected]>
@irevoire irevoire requested a review from ManyTheFish June 18, 2025 09:17
@irevoire irevoire added this pull request to the merge queue Jun 18, 2025
Merged via the queue into main with commit d8caf55 Jun 18, 2025
15 checks passed
@irevoire irevoire deleted the remove-sync-point branch June 18, 2025 09:50