Bypass HNSW graph building for tiny segments #14963
Conversation
benwtrent left a comment
Some minor ideas.
```java
/**
 * When enabled, segments with fewer than the threshold number of vectors will store only flat
 * vectors, significantly improving indexing performance for workloads with frequent flushes.
 */
private final boolean bypassTinySegments;
```
If we allow this to be a parameter, it should be a threshold that refers to the typical k used when querying.
Makes sense
I do wonder if we would want to expose this as a parameter though? Maybe it should just be a fixed value? I would have thought about setting it based on a threshold where exhaustive search is no more, or only slightly more, expensive than HNSW search. I would expect this to be related to the M of the graph, maybe?
I'd like to have it at least as a parameter to a pkg-protected constructor, so that we can pass random values in TestLucene99HnswVectorsFormat to make sure that we exercise properly both the case when there is a graph and the case when there is no graph.
It may be good to be able to pass random values in RandomCodec as well to help with the test coverage of the approximate case (we have very few tests that index more than 10k vectors).
```java
boolean doHnsw =
    knnCollector.k() < scorer.maxOrd()
        && (bypassTinySegments == false
            || fieldEntry.size() > Lucene99HnswVectorsFormat.HNSW_GRAPH_THRESHOLD);
```
The reader should just look to see if there is a graph.
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java (outdated, resolved)
I think 10K is likely way too large. If a user typically searches for 10 nearest neighbors, the graph should be built at around 90 vectors.
I can think of a circumstance where we might create small segments that will probably never get searched at all, but will very quickly be merged. In that case we might want to allow a larger threshold?
As far as the tests are concerned I'm confused: wouldn't they fail with a high threshold, since we wouldn't build a graph until there are many documents? Maybe I didn't understand the meaning of the threshold though.
@msokolov Actually they do; I had missed setting it. I'll try to find some clean way to override those checks in the failing unit tests.
I wonder how this interacts with how …
I think that is fine. I am thinking of semi-NRT with lots of updates; in cases like that, 10k is way too big a default. I think the value should be used as an input to expectedVisitedNodes that takes into account the potential graph size. Additionally, I would assume users would want to scale quantized formats vs. non-quantized differently (as their vector ops can be much cheaper than floating point ops).

I would hope the format just does the right thing, and searches everything, knowing that there isn't a graph.
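The expectedVisitedNodes idea mentioned above can be sketched roughly. The `k * log(size)` form follows the visited-nodes estimate discussed later in this thread; the exact formula and names here are assumptions for illustration, not Lucene's implementation:

```java
/**
 * Hypothetical sketch: HNSW search visits on the order of k * log(size) nodes,
 * so building a graph only pays off when that estimate is well below a
 * brute-force scan over all `size` vectors.
 */
class VisitEstimate {
  static int expectedVisitedNodes(int k, int size) {
    if (size <= 1) {
      return size;
    }
    // Natural log, matching the size/log(size) ratios discussed in this thread.
    return (int) Math.ceil(k * Math.log(size));
  }

  static boolean graphWorthBuilding(int k, int size) {
    // If the estimated graph search visits as many nodes as a full scan,
    // there is no point building the graph.
    return expectedVisitedNodes(k, size) < size;
  }

  public static void main(String[] args) {
    System.out.println(expectedVisitedNodes(10, 10_000)); // ~ 10 * ln(10000) ≈ 93
    System.out.println(graphWorthBuilding(100, 100));     // exhaustive is cheaper here
  }
}
```

Such a function would also give quantized formats a natural hook: they could scale the threshold differently, since their per-vector comparisons are cheaper.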
@jpountz Ahh, I see what you are pointing towards; here is what I think we could try:
Let me know your thoughts or if I'm missing something here. Thanks!
My recommendation would be to move the logic of switching to an exact search when the filter is selective to …
This makes me wonder if the knn search method should accept a ScorerSupplier and the live docs Bits instead of a fully realized bit set that represents both the filter and live docs...
Or some higher-level abstraction that can either be consumed in a random-access fashion or as an iterator:

```java
abstract class AcceptDocs {
  /** Random access to the accepted documents. */
  abstract Bits getBits();

  /** Get an iterator of accepted docs. */
  abstract DocIdSetIterator getIterator();

  /** Return an approximation of the number of accepted documents. */
  abstract long cost();
}
```
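As a standalone sketch of that idea, using `java.util.BitSet` and `IntStream` as stand-ins for Lucene's `Bits` and `DocIdSetIterator` (the names and types here are illustrative, not the eventual API):

```java
import java.util.BitSet;
import java.util.stream.IntStream;

/** Simplified stand-in for the proposed AcceptDocs abstraction. */
class SimpleAcceptDocs {
  private final BitSet accepted;

  SimpleAcceptDocs(BitSet accepted) {
    this.accepted = accepted;
  }

  /** Random access: is this doc accepted? */
  boolean get(int docId) {
    return accepted.get(docId);
  }

  /** Iterator-style consumption of accepted docs in increasing order. */
  IntStream iterator() {
    return accepted.stream();
  }

  /** Approximation of the number of accepted documents. */
  long cost() {
    return accepted.cardinality();
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet();
    bits.set(2);
    bits.set(5);
    SimpleAcceptDocs docs = new SimpleAcceptDocs(bits);
    System.out.println(docs.get(2)); // true
    System.out.println(docs.cost()); // 2
  }
}
```

The point of the abstraction is that an exhaustive (flat) search can consume the iterator view, while graph search can keep the random-access view, without the caller materializing both up front.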
vigyasharma
left a comment
How do we handle backward compatibility in this change? I noticed we don't write any metadata (e.g. in FieldEntry) about bypassTinySegments or whether a graph was built or not. The flag gets configured when the format is initialized from the codec.
What happens if I create an index with bypassTinySegments=true, but later read it in an application with the flag set to false? I think we need to persist information about whether a graph was built for the segment.
```java
this.bypassTinySegments = bypassTinySegments;
this.flatFieldVectorsWriter = Objects.requireNonNull(flatFieldVectorsWriter);
if (bypassTinySegments) {
  this.bufferedVectors = new ArrayList<>();
```
Since we only store up to HNSW_GRAPH_THRESHOLD vectors, beyond which we resume the regular flow of adding them to the graph, could we use an array here instead of an ArrayList?
Makes sense
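A minimal sketch of the buffer-then-replay flow under discussion (the types and names here are illustrative stand-ins; the real writer buffers into the flat vectors writer and replays into an HnswGraphBuilder):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch: buffer vectors flat until the threshold is crossed,
 * then replay every buffered vector into the (stand-in) graph builder.
 */
class TinySegmentBuffer {
  private final int threshold;
  private final List<float[]> buffered = new ArrayList<>();
  private final List<float[]> graph = new ArrayList<>(); // stand-in for an HNSW builder
  private boolean graphStarted = false;

  TinySegmentBuffer(int threshold) {
    this.threshold = threshold;
  }

  void addVector(float[] vector) {
    if (graphStarted) {
      graph.add(vector);
      return;
    }
    buffered.add(vector);
    if (buffered.size() > threshold) {
      // Threshold crossed: replay buffered vectors into the graph builder
      // and switch to the regular indexing flow.
      graphStarted = true;
      graph.addAll(buffered);
      buffered.clear();
    }
  }

  boolean hasGraph() {
    return graphStarted;
  }
}
```

Since the buffer is bounded by the threshold, a plain array of that size would indeed work in place of the list, as the review comment suggests.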
```java
  replayBufferedVectors();
  bufferedVectors.clear();
}
if (hnswGraphBuilder != null) {
```
Does `hnswGraphBuilder != null` do the same thing as `graphBuilderInisialized`? If so, do we need `graphBuilderInisialized`?
Oh yes, we can drop this check
Maybe we could use one of the existing fields that describe the graph. Like set …
Thanks for the review @vigyasharma!
We write the vectorIndexLength here as the meta info, which would be 0 in the case there is no HNSW graph, so we could use this to determine whether we want to do graph search or not. Let me know if I'm missing something here.
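A hedged sketch of that reader-side check: `FieldEntry` below is a simplified stand-in for the reader's per-field metadata (the real one lives in Lucene99HnswVectorsReader), and the idea is simply that a zero vectorIndexLength means no graph was written, so the reader falls back to exact search regardless of how the writing codec was configured:

```java
/** Simplified stand-in for the reader's per-field metadata. */
record FieldEntry(long vectorIndexLength, int size) {}

class GraphPresenceCheck {
  /** A zero-length vector index section means no graph was written. */
  static boolean hasGraph(FieldEntry entry) {
    return entry.vectorIndexLength() > 0;
  }

  /** Use HNSW only when a graph exists and k does not cover the whole segment. */
  static boolean doHnsw(FieldEntry entry, int k) {
    return k < entry.size() && hasGraph(entry);
  }

  public static void main(String[] args) {
    System.out.println(doHnsw(new FieldEntry(0L, 50), 10));        // false: no graph
    System.out.println(doHnsw(new FieldEntry(4096L, 50_000), 10)); // true
  }
}
```

This addresses the backward-compatibility concern raised above: the decision is driven by what is persisted in the segment, not by the flag of the reading application.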
```java
 * vectors will use flat storage only, improving indexing performance when having frequent
 * flushes.
 */
public static final int HNSW_GRAPH_THRESHOLD = 10_000;
```
I think that the comment should try to expand a bit more on this value to help future readers think through whether it's still right or whether it should be updated.
One thing we discussed on the linked issue is that the number of visited nodes is in the order of log(size) * k. So having a graph only helps if log(size) * k << size <=> size / log(size) >> k. If we arbitrarily choose k = 100, 10,000 is the first power of 10 so that size / log(size) is one order of magnitude greater than k (10/log(10) ~= 4.3, 100/log(100) ~= 22, 1000/log(1000) ~= 144, 10000 / log(10000) ~= 1085).
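The arithmetic behind that choice can be checked directly (natural log, matching the ratios quoted above):

```java
class ThresholdMath {
  public static void main(String[] args) {
    // size / ln(size): having a graph only helps once this ratio is
    // much larger than a typical k (e.g. k = 100).
    for (int size : new int[] {10, 100, 1_000, 10_000}) {
      System.out.printf("size=%d -> size/log(size) = %.1f%n", size, size / Math.log(size));
    }
  }
}
```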
Sure, I'll elaborate on this.
Force-pushed 461a053 to 210d593 (compare)
ooh, it's nice to avoid force-push -- the recommended way to keep up with main is …
Agreed, sometimes I end up doing …
OK, I ran the benchmarks. Note: the improvement in latency or CPU time seems to be driven by the lower number of segments, which also very slightly impacts the recall, as we know. CC @benwtrent @msokolov @jpountz @vigyasharma
@benwtrent I fixed the failing tests in the latest commit. They were assuming that we always do HNSW, and since many of them don't create high-cardinality graphs, that was causing failures. One option would have been to just increase the number of docs for some of the tests that create many vectors (but that might have increased their running time; I think I did that with just one test). The others I mostly updated to check whether a graph would be present or not. There are some refactoring opportunities in the tests I could think of, like testing more random behaviors (this optimization OFF vs ON; the default is on) and some further simplifications, but for now I kept it to making all tests pass in the most straightforward way. Maybe we can make these tests more random in a follow-up change? Looking for your feedback. I'll also post the latest upstream numbers shortly with the latest changes.
benwtrent
left a comment
@shubhamvishu could you confirm the benchmark results?
When I tried benchmarking, I didn't get anywhere near the numbers you got. It would be good to know the exact settings etc. that were tested, so we can replicate and see where the benefits of this are.
```java
public static boolean hasGraphPresent(int k, int size) {
  int expectedVisitedNodes = expectedVisitedNodes(k, size);
  return size > expectedVisitedNodes && expectedVisitedNodes > 0;
}
```
This is very trappy. This sort of assumes that the test case knows the logic that makes a graph exist or not.
Something better might be to adjust TestUtil.getDefaultCodec() to set the threshold for building the graph to 0 for now.
In that case we would not be testing with this optimization at all, even though in general it would be on by default. Maybe we can add a separate test case to test this and leave it disabled on the codec for the others, as you mentioned? I'll make those changes.
> Maybe we can add a separate test case to test this and leave others to be disabled on the codec as you mentioned? I'll make those changes.
I think maybe for tests that assume there MUST be a graph, they have codec settings that enforce such. For all other tests, it can be the default.
lucene/CHANGES.txt (outdated)

```
* GITHUB#15169: Add codecs for 4 and 8 bit Optimized Scalar Quantization vectors (Trevor McCulloch)
* GITHUB#15169, GITHUB#15223: Add codecs for 4 and 8 bit Optimized Scalar Quantization vectors. The new format
  `Lucene104HnswScalarQuantizedVectorsFormat` replaces the now legacy `Lucene99HnswScalarQuantizedVectorsFormat`
```
There was a bit of an issue with the main branch recently and it needed to be restored, so the commits might be out of alignment. I think your branch needs to have main merged back in or something to fix the changes.
@benwtrent Sure, I'll share the results with the latest changes soon.
Below are the settings I tried for the above results, if I'm not mistaken. Also, as we discussed earlier, I tried with a slightly higher threshold (mentioned below), so that is one more thing whose impact on the numbers I would like to confirm, now that we have changed it. Let me know if there is anything else I might be missing. Thanks!
Dataset: Cohere
Graph Threshold: …
This reverts commit d8a8164.
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
@benwtrent I reran the benchmarks and somehow I don't see the old performance difference; the results are similar to the baseline. Maybe the recent developments included some indexing improvements (e.g. I see 4-bit recall is much better now due to OSQ?). Net-net, against the current main we are not creating that many tiny segments in our luceneutil benchmarks by default. Maybe we would still see higher benefits on older Lucene versions? However, I also simulated a more real-time update scenario by changing luceneutil:

Baseline:
Candidate (HNSW_GRAPH_THRESHOLD = 100) [DEFAULT]:
Candidate (HNSW_GRAPH_THRESHOLD = 1000):
benwtrent
left a comment
This is looking really good.
Thank you for doing the more active flushing test, that is exactly what I am looking for!
I think keeping it simple and allowing users to set k to their "expected values", or indicating that they may want it to be higher (to be lazier), is the right approach.
Could you add a CHANGES.txt entry for 10.4 Optimizations?
Resolved review threads:
- ...ava/org/apache/lucene/backward_codecs/lucene99/Lucene99HnswScalarQuantizedVectorsFormat.java (outdated)
- ...e/src/java/org/apache/lucene/codecs/lucene102/Lucene102HnswBinaryQuantizedVectorsFormat.java (outdated)
- ...e/src/java/org/apache/lucene/codecs/lucene104/Lucene104HnswScalarQuantizedVectorsFormat.java
- lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java
- lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java (outdated)
- lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java (outdated)
Thanks for the review @benwtrent. I addressed the comments and added the CHANGES entry now.
Resolved review threads:
- lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java
- ...e/src/java/org/apache/lucene/codecs/lucene104/Lucene104HnswScalarQuantizedVectorsFormat.java (outdated)
…104HnswScalarQuantizedVectorsFormat.java Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
ChrisHegarty
left a comment
LGTM
Description
This change avoids creating a HNSW graph if the segment is small (here we have taken the threshold for the number of vectors as `10000`, based on the conversation here, for now).

Some of the points I'm not sure how we would want to go about:
- All the tests pass currently since the option to enable the optimization is `false` by default, but setting it to `true` reveals some failing unit tests which inherently assume that the HNSW graph is created and KNN search is triggered (do we have some idea of how to bypass those in a good, clean way?)
- I understand we might want to always keep this optimization on (it is also a less invasive change), but for now in this PR I made it configurable and enabled it on the KNN format, just to be cautious (I wasn't sure if it would affect back-compat in some unknown way); happy to make it the default behaviour.

TODOs:
Closes #13447