Description
We made a huge rework of the arroy architecture to allow us to index large datasets with little RAM, but it's still not enough.
Context
A bit of context on why we did all that and how we did it, then we'll look at the issue I'm still encountering.
Before
What we were doing was basically:
1. Get the pointer to all the vectors in a memory-mapped region that won't change during the whole indexing process
2. Every time we update or create a node, we instead write it to a temporary file that will be written to the database at the end of the indexing process
3. Insert the new elements in the current tree; each thread can work on its own tree in parallel (the split decision is sketched right after this list)
   1. If a descendant becomes too big in the process, we split it and create a new sub-tree from there
4. We have to scan the whole database to remove the deleted elements
5. If we need more trees, we can build them in parallel at the tree level
6. We retrieve all the temporary files and write them to the database
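To make the split decision in step 3 concrete, here is a minimal sketch in Rust. The types and names (`SplitNode`, `side_of`) are hypothetical, not arroy's actual API; the point is only that every left/right decision needs the item's full vector, which is why the items must either be resident in RAM or re-read from disk.

```rust
// Illustrative sketch only: routing an item by comparing it against a split
// node's normal. Hypothetical types, not arroy's real API.

/// A split node: a hyperplane defined by its normal vector.
struct SplitNode {
    normal: Vec<f32>,
}

enum Side {
    Left,
    Right,
}

/// Decide on which side of the hyperplane an item falls by taking the dot
/// product of the item's vector with the split node's normal.
fn side_of(item: &[f32], split: &SplitNode) -> Side {
    let dot: f32 = item.iter().zip(&split.normal).map(|(a, b)| a * b).sum();
    if dot <= 0.0 { Side::Left } else { Side::Right }
}

fn main() {
    let split = SplitNode { normal: vec![1.0, -0.5, 0.25] };
    let item = [0.2, 0.9, -0.1];
    match side_of(&item, &split) {
        Side::Left => println!("item goes to the left child"),
        Side::Right => println!("item goes to the right child"),
    }
}
```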
When the whole database doesn't fit in RAM, that process causes a few issues:
- During 3, to know if an item must go to the left or the right of a split node, we have to compare it against the normal of that node, which means we have to load all the items in RAM and compare them against the split node. If the items don't fit in RAM, that's an issue because for each split node we're going to read from disk, even though it's done in parallel on all cores (a back-of-envelope estimate follows this list).
- During 3.1, the issue is exacerbated: even if all the "new items to insert" fit in RAM, we're now pulling new random vectors from the database and using them to create split nodes, which generates a ton of random reads as well.
- Finally, during 4, even if it happens less often than the rest, we're reading the whole database at once.
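To give a feel for the volume of I/O step 3 can generate, here is a back-of-envelope estimate written as a small Rust program. All the numbers (dataset size, dimensions, number of trees, leaf size) are made up for illustration; they are not measurements from arroy, and the worst case assumed here is that nothing stays in the page cache.

```rust
// Back-of-envelope sketch of why step 3 degenerates into random I/O when the
// items don't fit in RAM. All numbers are hypothetical.

fn main() {
    let items: u64 = 100_000_000; // vectors in the dataset (hypothetical)
    let dims: u64 = 768;          // f32 dimensions per vector (hypothetical)
    let trees: u64 = 50;          // number of trees being built (hypothetical)
    let leaf_size: u64 = 400;     // max items per descendant before a split (hypothetical)

    // A roughly balanced tree compares each item against one split node per
    // level, i.e. about log2(items / leaf_size) times per tree.
    let depth = ((items / leaf_size) as f64).log2().ceil() as u64;

    // Worst case (nothing stays in the page cache): every comparison re-reads
    // the item's vector from the memory-mapped file.
    let vector_bytes = dims * 4;
    let random_reads = items * trees * depth;
    let bytes_read = random_reads * vector_bytes;

    println!("tree depth   ≈ {depth}");
    println!("random reads ≈ {random_reads}");
    println!("data re-read ≈ {} TiB", bytes_read / (1 << 40));
}
```

Even with generous caching the order of magnitude doesn't change much: every level of every tree wants to touch every routed vector once.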
Now
Our new strategy to avoid all these issues works like this:
1. Get the pointer to all the vectors in a memory-mapped region that won't change during the whole indexing process
2. Same as before, when we have to write something, we write it to a temporary file that will be written to the database at the end of the indexing
3. First, we want to insert the new elements in the current trees:
   1. Compute ahead of time how many vectors fit in RAM and retrieve their pointers
   2. Insert them in the trees. We have to compare them against the normals as before, but now, since everything fits in RAM, they should only be read once. Multiple threads can do this work in parallel for all the trees.
   3. When a leaf becomes too large and must be split into a subtree, we store all the IDs it contains in RAM and don't split it now.
   4. Come back to 3.1 until there are no vectors left
4. We now have a list of leaves that are too large and must be exploded into subtrees:
   - For each missing tree, we add a new leaf to this list that contains the IDs of the whole database
   - At this point, we won't ever read a tree node from the database again
5. We explode the leaves to create all the required subtrees (a minimal sketch of this work loop follows the list):
   - Each thread pops one large leaf and does step 3 on it
   - Then it iterates over the list of leaves this generated; the ones that are too large are pushed back onto the list of large leaves and will be processed by this thread or another
6. We retrieve all the temporary files and write them to the database
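Here is a minimal sketch, using only the Rust standard library, of the shared "large leaves" work list described in steps 4 and 5: worker threads pop a large leaf, explode it, and push any still-too-large result back onto the list. Everything in it (type names, the toy `explode` that just splits a leaf in half, the leaf-size threshold) is made up for illustration; arroy's real code builds proper split nodes and writes the results to the temporary files.

```rust
// Illustrative sketch of the shared work list of large leaves; not arroy's code.
use std::sync::{Arc, Mutex};
use std::thread;

type ItemId = u32;

/// A leaf that grew past the threshold and still has to be turned into a subtree.
struct LargeLeaf {
    item_ids: Vec<ItemId>,
}

const MAX_LEAF_SIZE: usize = 4; // hypothetical threshold

/// Toy "explode" step: split a large leaf in half; halves that are still too
/// big go back onto the shared list to be processed by this thread or another.
fn explode(leaf: LargeLeaf, pending: &Mutex<Vec<LargeLeaf>>) {
    let mid = leaf.item_ids.len() / 2;
    let (left, right) = leaf.item_ids.split_at(mid);
    for half in [left, right] {
        if half.len() > MAX_LEAF_SIZE {
            pending.lock().unwrap().push(LargeLeaf { item_ids: half.to_vec() });
        } else {
            // Small enough: this is where the leaf would go to the temporary file.
            println!("final leaf with {} items", half.len());
        }
    }
}

fn main() {
    // The large leaves collected during step 3, plus one per missing tree.
    let pending = Arc::new(Mutex::new(vec![LargeLeaf { item_ids: (0..32).collect() }]));

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let pending = Arc::clone(&pending);
            thread::spawn(move || loop {
                // Pop one large leaf at a time; the lock is released before exploding.
                let next = pending.lock().unwrap().pop();
                match next {
                    Some(leaf) => explode(leaf, &pending),
                    None => break,
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
}
```

Note that the termination here is naive: a worker that finds the list momentarily empty simply exits, whereas a real implementation needs proper coordination so threads don't bail out while others are still producing work.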
Most of the work was finished in #130, where I parallelized the code, and a few bug fixes were done afterwards.
The issue
When the size of the database exceeds the RAM, we're once again doing a ton of random reads and nothing else.
Here's what it looks like:
- RAM usage increases by a few megabytes, then we start swapping 2 GiB of RAM and spending 99% of our time in IO wait
- During this time, we're barely writing anything to the disk, just reading and refreshing our pages over and over again
It's kinda expected that we have to do a lot of random reads from the disk, but here it's at a level where absolutely nothing makes any progress.
