Skip to content

Commit 9e7381c

Browse files
authored
Help collectors take advantage of bulk-retrieval of doc values. (#15173)
We added `NumericDocValues#longValues` to retrieve multiple values in a single call and better amortize the cost of virtual function calls in #15149. But it's not easy today to take advantage of this API in collectors, the best option is to buffer doc IDs in `LeafCollector#collect`, but then it's not great since this `LeafCollector#collect` call itself may be virtual, so we may still be doing 1+ virtual call per collected doc. This PR adds `DocIdStream#intoArray` to help collectors collect batches of doc IDs at once and retrieve numeric doc values for them. This way, you can compute facets/aggregations with less than one virtual call per collected document. I benchmarked on the geonames dataset, computing the average of the elevation field with the following collector: ```java class AverageCollector implements Collector { long sum; int count; @OverRide public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { SortedNumericDocValues sortedValues = context.reader().getSortedNumericDocValues("elevation"); if (sortedValues == null) { throw new CollectionTerminatedException(); } NumericDocValues values = DocValues.unwrapSingleton(sortedValues); if (values == null) { // the field is single-valued throw new Error(); } return new LeafCollector() { int[] docBuffer = new int[64]; long[] valueBuffer = new long[docBuffer.length]; @OverRide public void setScorer(Scorable scorer) throws IOException { } @OverRide public void collect(DocIdStream stream) throws IOException { for (int count = stream.intoArray(docBuffer); count != 0; count = stream.intoArray(docBuffer)) { values.longValues(count, docBuffer, valueBuffer, Long.MIN_VALUE); for (int i = 0; i < count; ++i) { long v = valueBuffer[i]; sum += v; if (v != Long.MIN_VALUE) { AverageCollector.this.count++; } } } } @OverRide public void collect(int doc) throws IOException { if (values.advanceExact(doc)) { sum += values.longValue(); count += 1; } } }; } @OverRide public ScoreMode scoreMode() { return ScoreMode.COMPLETE_NO_SCORES; } } ``` The benchmark runs a few queries with different collectors first, so that `LeafCollector#collect` is polymorphic. I get the following numbers: | Query | | Latency without DocIdStream#intoArray (ms) | Latency with DocIdStream#intoArray (ms) | | - | - | - | - | | `MatchAllDocsQuery` | Uses `RangeDocIdStream` under the hood | 81 | 65 | | `featureClass:(S P)` | Matches spots or cities (5M docs), uses `BitSetDocIdStream` under the hood | 54 | 43 | Note that `NumericDocValues#advanceExact` and `NumericDocValues#longValue` are not polymorphic in this benchmark since all segments use the same impl on the `elevation` field. The difference would be bigger if they were. My profiler suggests that retrieving doc values is the bottleneck when computing the average on a `MatchAllDocsQuery`, while the bottleneck is a mix of retrieving doc values and extracting set bits from the bit set when computing the average on `featureClass:(S P)`.
1 parent 15b8d54 commit 9e7381c

File tree

18 files changed

+402
-325
lines changed

18 files changed

+402
-325
lines changed

lucene/CHANGES.txt

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,10 @@ API Changes
115115
* GITHUB#15149: Introduce NumericDocValues#longValues to help speed up the
116116
retrieval of many doc values at once. (Adrien Grand)
117117

118+
* GITHUB#15173: Introduce DocIdStream#intoArray to help collect batches of doc
119+
IDs at once, in particular by taking advantage of the new
120+
NumericDocValues#longValues API. (Adrien Grand)
121+
118122
New Features
119123
---------------------
120124
* GITHUB#14565: Add ParentsChildrenBlockJoinQuery that supports parent and child filter in the same query

lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene103/BitSetUtil.java

Lines changed: 0 additions & 104 deletions
This file was deleted.

lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene103/Lucene103PostingsReader.java

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1039,7 +1039,8 @@ public void nextPostings(int upTo, DocAndFloatFeatureBuffer buffer) throws IOExc
10391039
}
10401040

10411041
// Only return docs from the current block
1042-
// +1 to make BitSetUtil#denseBitsetToArray work, see its java doc.
1042+
// +1 to make FixedBitSet#intoArray perform a bit faster, it uses a slower path if it doesn't
1043+
// have a free slot after the last element
10431044
buffer.growNoCopy(BLOCK_SIZE + 1);
10441045
upTo = (int) Math.min(upTo, level0LastDocID + 1L);
10451046

@@ -1055,8 +1056,9 @@ public void nextPostings(int upTo, DocAndFloatFeatureBuffer buffer) throws IOExc
10551056
break;
10561057
case UNARY:
10571058
buffer.size =
1058-
BitSetUtil.denseBitsetToArray(
1059-
docBitSet, doc - docBitSetBase, upTo - docBitSetBase, docBitSetBase, buffer.docs);
1059+
docBitSet.intoArray(
1060+
doc - docBitSetBase, upTo - docBitSetBase, docBitSetBase, buffer.docs);
1061+
assert buffer.size < buffer.docs.length; // buffer's capacity is BLOCK_SIZE+1
10601062
break;
10611063
}
10621064

lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene103/TestBitSetUtil.java

Lines changed: 0 additions & 53 deletions
This file was deleted.

lucene/core/src/java/org/apache/lucene/codecs/lucene104/BitSetUtil.java

Lines changed: 0 additions & 104 deletions
This file was deleted.

lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104PostingsReader.java

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1047,7 +1047,8 @@ public void nextPostings(int upTo, DocAndFloatFeatureBuffer buffer) throws IOExc
10471047
}
10481048

10491049
// Only return docs from the current block
1050-
// +1 to make BitSetUtil#denseBitsetToArray work, see its java doc.
1050+
// +1 to make FixedBitSet#intoArray perform a bit faster, it uses a slower path if it doesn't
1051+
// have a free slot after the last element
10511052
buffer.growNoCopy(BLOCK_SIZE + 1);
10521053
upTo = (int) Math.min(upTo, level0LastDocID + 1L);
10531054

@@ -1063,8 +1064,9 @@ public void nextPostings(int upTo, DocAndFloatFeatureBuffer buffer) throws IOExc
10631064
break;
10641065
case UNARY:
10651066
buffer.size =
1066-
BitSetUtil.denseBitsetToArray(
1067-
docBitSet, doc - docBitSetBase, upTo - docBitSetBase, docBitSetBase, buffer.docs);
1067+
docBitSet.intoArray(
1068+
doc - docBitSetBase, upTo - docBitSetBase, docBitSetBase, buffer.docs);
1069+
assert buffer.size < buffer.docs.length; // buffer's capacity is BLOCK_SIZE+1
10681070
break;
10691071
}
10701072

lucene/core/src/java/org/apache/lucene/search/BitSetDocIdStream.java

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -59,4 +59,18 @@ public int count(int upTo) throws IOException {
5959
return 0;
6060
}
6161
}
62+
63+
@Override
64+
public int intoArray(int upTo, int[] array) {
65+
if (upTo > this.upTo) {
66+
upTo = Math.min(upTo, max);
67+
int count = bitSet.intoArray(this.upTo - offset, upTo - offset, offset, array);
68+
if (count == array.length) { // The whole range of doc IDs may not have been copied
69+
upTo = array[array.length - 1] + 1;
70+
}
71+
this.upTo = upTo;
72+
return count;
73+
}
74+
return 0;
75+
}
6276
}

lucene/core/src/java/org/apache/lucene/search/BooleanScorerSupplier.java

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -266,6 +266,11 @@ public void setScorer(Scorable scorer) throws IOException {
266266
public void collect(int doc) throws IOException {
267267
collector.collect(doc);
268268
}
269+
270+
@Override
271+
public void collect(DocIdStream stream) throws IOException {
272+
collector.collect(stream);
273+
}
269274
};
270275
return scorer.score(noScoreCollector, acceptDocs, min, max);
271276
}

lucene/core/src/java/org/apache/lucene/search/DocIdStream.java

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,22 @@ public int count() throws IOException {
5858
// so would defeat the purpose of collecting hits via a DocIdStream.
5959
public abstract int count(int upTo) throws IOException;
6060

61+
/**
62+
* Copy some matching doc IDs into the provided array and return the number of copied elements. A
63+
* return value of {@code 0} indicates that there are no remaining doc IDs. The given array must
64+
* not be empty.
65+
*/
66+
public int intoArray(int[] array) {
67+
return intoArray(DocIdSetIterator.NO_MORE_DOCS, array);
68+
}
69+
70+
/**
71+
* Copy some matching doc IDs under {@code upTo} (exclusive) into the provided array and return
72+
* the number of copied elements. A return value of {@code 0} indicates that there are no matching
73+
* doc IDs under {@code upTo} anymore. The given array must not be empty.
74+
*/
75+
public abstract int intoArray(int upTo, int[] array);
76+
6177
/**
6278
* Return {@code true} if this stream may have remaining doc IDs. This must eventually return
6379
* {@code false} when the stream is exhausted.

0 commit comments

Comments
 (0)