Description
MergedByteVectorValues represents a unified view of byte vector values from multiple underlying Lucene segments. It provides an iterator interface that allows advancing to a specific document ID. Under this contract, a caller that advances five times using the KnnVectorValues iterator and then loads the next byte[] should get the vector at the iterator's current position.
However, this expected behavior does not hold for MergedByteVectorValues. After advancing N times and attempting to load the next byte[], an IllegalStateException is consistently thrown. MergedFloat32VectorValues does not have this problem: there, the same sequence works correctly and reliably.
The root cause is that MergedByteVectorValues does not update the internal lastOrd field when advancing, unlike MergedFloat32VectorValues. As a result, when attempting to load the next vector, the code checks the current ord against lastOrd, which remains zero, causing an exception to be thrown.
lucene/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java
Lines 450 to 461 in 7fc9fd3
    @Override
    public int nextDoc() throws IOException {
      current = docIdMerger.next();
      if (current == null) {
        docId = NO_MORE_DOCS;
        index = NO_MORE_DOCS;
      } else {
        docId = current.mappedDocID;
        ++index;
      }
      return docId;
    }
To fix this, MergedByteVectorValues should be updated to increment lastOrd during advancement, mirroring the behavior of MergedFloat32VectorValues.
lucene/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java
Lines 340 to 352 in 7fc9fd3
    @Override
    public int nextDoc() throws IOException {
      current = docIdMerger.next();
      if (current == null) {
        docId = NO_MORE_DOCS;
        index = NO_MORE_DOCS;
      } else {
        docId = current.mappedDocID;
        ++lastOrd;
        ++index;
      }
      return docId;
    }
In OpenSearch, for fast vector index construction, we upload KNN vectors to a remote builder component and trigger the index build. To speed up uploads, vectors are logically partitioned and uploaded in multiple parts, a process that relies on the advance-then-load pattern. Because lastOrd is not updated correctly in MergedByteVectorValues, this multipart upload currently fails, which blocks fast uploads of byte vectors.
For more details on the OpenSearch use case, please refer to opensearch-project/k-NN#2803
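To make the advance-then-load pattern concrete, here is a minimal sketch of a partitioned read. It is not the k-NN plugin's code: `ForwardVectors`, `ArrayVectors`, `partition`, and `readPart` are hypothetical names invented for illustration, and the array-backed iterator deliberately has no ordinal check. Each part takes a fresh iterator, skips to the start of its range without loading anything, and only then starts loading vectors.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionedReadSketch {
  /** Hypothetical minimal view of a forward-only vector iterator (not a Lucene type). */
  interface ForwardVectors {
    int nextDoc();               // returns Integer.MAX_VALUE once exhausted
    int index();                 // ordinal of the current vector
    byte[] vectorValue(int ord); // loads the vector for the given ordinal
  }

  /** Array-backed stand-in with no ordinal check, for illustration only. */
  static final class ArrayVectors implements ForwardVectors {
    private final byte[][] vectors;
    private int index = -1;

    ArrayVectors(byte[][] vectors) {
      this.vectors = vectors;
    }

    public int nextDoc() {
      if (index != Integer.MAX_VALUE) {
        index = (index + 1 < vectors.length) ? index + 1 : Integer.MAX_VALUE;
      }
      return index;
    }

    public int index() {
      return index;
    }

    public byte[] vectorValue(int ord) {
      return vectors[ord];
    }
  }

  /** Splits [0, total) into at most `parts` contiguous {start, end} ranges; `parts` must be positive. */
  static List<int[]> partition(int total, int parts) {
    List<int[]> ranges = new ArrayList<>();
    int size = (total + parts - 1) / parts; // ceiling division
    for (int start = 0; start < total; start += size) {
      ranges.add(new int[] {start, Math.min(start + size, total)});
    }
    return ranges;
  }

  /** Advances a fresh iterator to range[0] without loading, then loads up to range[1]. */
  static List<byte[]> readPart(ForwardVectors values, int[] range) {
    List<byte[]> out = new ArrayList<>();
    for (int i = 0; i < range[0]; i++) {
      values.nextDoc(); // skip: advance only, no vectorValue call
    }
    for (int ord = range[0]; ord < range[1]; ord++) {
      values.nextDoc();
      out.add(values.vectorValue(values.index())); // first load happens after several advances
    }
    return out;
  }
}
```

With an iterator that rejects a load after more than one advance, every part except the one starting at ordinal 0 would fail on its first `vectorValue` call.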
Solution
The solution is straightforward: increment lastOrd during advancement, and ensure the ord check is made against lastOrd, not lastOrd + 1.
    @Override
    public byte[] vectorValue(int ord) throws IOException {
      // Previously: if (ord != lastOrd + 1) {
      // The check now compares against lastOrd, not lastOrd + 1.
      if (ord != lastOrd) {
        throw new IllegalStateException(
            "only supports forward iteration: ord=" + ord + ", lastOrd=" + lastOrd);
      } else {
        lastOrd = ord;
      }
      return current.values.vectorValue(current.index());
    }

    @Override
    public int nextDoc() throws IOException {
      current = docIdMerger.next();
      if (current == null) {
        docId = NO_MORE_DOCS;
        index = NO_MORE_DOCS;
      } else {
        docId = current.mappedDocID;
        ++lastOrd; // This line should be added.
        ++index;
      }
      return docId;
    }
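The effect of the change can be checked against a toy model. The sketch below is not Lucene code: `ToyMergedValues` is a hypothetical array-backed stand-in, and it assumes that both `index` and `lastOrd` start at -1, which the excerpts above do not show. A flag switches between the current byte behavior (no `lastOrd` update in `nextDoc`, check against `lastOrd + 1`) and the proposed one (`++lastOrd` in `nextDoc`, check against `lastOrd`).

```java
public class LastOrdSketch {
  /** Toy stand-in for a merged vector view; hypothetical, not the Lucene class. */
  static final class ToyMergedValues {
    private final byte[][] vectors;
    private final boolean trackOrdInNextDoc; // true = proposed fix, false = current behavior
    private int index = -1;   // assumed initial value
    private int lastOrd = -1; // assumed initial value

    ToyMergedValues(byte[][] vectors, boolean trackOrdInNextDoc) {
      this.vectors = vectors;
      this.trackOrdInNextDoc = trackOrdInNextDoc;
    }

    int nextDoc() {
      if (index == Integer.MAX_VALUE || index + 1 >= vectors.length) {
        index = Integer.MAX_VALUE; // mirrors NO_MORE_DOCS
        return index;
      }
      if (trackOrdInNextDoc) {
        ++lastOrd; // the line the issue proposes to add
      }
      return ++index; // in this toy, doc ID and ordinal coincide
    }

    int index() {
      return index;
    }

    byte[] vectorValue(int ord) {
      // Current behavior expects lastOrd + 1; the proposed fix expects lastOrd.
      int expected = trackOrdInNextDoc ? lastOrd : lastOrd + 1;
      if (ord != expected) {
        throw new IllegalStateException(
            "only supports forward iteration: ord=" + ord + ", lastOrd=" + lastOrd);
      }
      lastOrd = ord;
      return vectors[ord];
    }
  }

  /** Advances `skips` times, then tries to load the vector at the current ordinal. */
  static boolean advanceThenLoadSucceeds(boolean trackOrdInNextDoc, int skips) {
    ToyMergedValues values = new ToyMergedValues(new byte[8][4], trackOrdInNextDoc);
    for (int i = 0; i < skips; i++) {
      values.nextDoc();
    }
    try {
      values.vectorValue(values.index());
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("current behavior, 5 advances: " + advanceThenLoadSucceeds(false, 5));
    System.out.println("proposed fix, 5 advances: " + advanceThenLoadSucceeds(true, 5));
  }
}
```

In this model the unfixed variant only succeeds when each `nextDoc` is immediately followed by a load, while the fixed variant succeeds after any number of advances.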
Version and environment details
releases/lucene/10.2.2