docs/changelog/138524.yaml (15 additions, 5 deletions)
@@ -4,13 +4,23 @@ area: Mapping
 type: feature
 issues: []
 highlight:
-  title: Remove feature flag to enable binary doc value compression
+  title: Add compression for binary doc values
   body: |-
     Add compression for binary doc values using Zstd and blocks with a variable number of values.

-    Block-wise LZ4 compression was previously added to Lucene in LUCENE-9211 and removed in LUCENE-9378 due to query performance issues. This approach stored a constant number of values per block (specifically 32 values). This made it easy to map a given value index (e.g., docId) to the block containing it by doing blockId = docId / 32.
-    Unfortunately, if values are very large, we must still have exactly 32 values per block, and (de)compressing a block could cause very high memory usage. As a result, we had to keep the number of values small, meaning that in the average case, a block was much smaller than ideal.
-    To overcome the issues of blocks with a constant number of values, this PR adds block-wise compression with a variable number of values per block. It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold or the number of values exceeds a threshold.
-    Like the previous version, it stores an array of addresses for the start of each block. Additionally, it stores a parallel array with the docId at the start of each block. When looking up a given docId, if it is not in the current block, we binary search the array of docId starts to find the blockId containing the value. We then look up the address of the block. After this, decompression works very similarly to the code from LUCENE-9211; the main difference being that Zstd(1) is used instead of LZ4.
+    Block-wise LZ4 compression was previously added to Lucene in LUCENE-9211 and removed in LUCENE-9378 due to query performance issues.
+    This approach stored a constant number of values per block (specifically 32 values).
+    This made it easy to map a given value index (e.g., docId) to the block containing it by doing blockId = docId / 32.
+    Unfortunately, if values are very large, we must still have exactly 32 values per block, and (de)compressing a block could cause very high memory usage.
+    As a result, we had to keep the number of values small, meaning that in the average case, a block was much smaller than ideal.
+
+    To overcome the issues of blocks with a constant number of values, this PR adds block-wise compression with a variable number of values per block.
+    It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold or the number of values exceeds a threshold.
+    Like the previous version, it stores an array of addresses for the start of each block.
+    Additionally, it stores a parallel array with the docId at the start of each block.
+    When looking up a given docId, if it is not in the current block, we binary search the array of docId starts to find the blockId containing the value.
+    We then look up the address of the block.
+    After this, decompression works very similarly to the code from LUCENE-9211; the main difference being that Zstd(1) is used instead of LZ4.

+    The introduction of binary doc value compression transparently affects wildcard field types, like URLs that are common in access logs, which will now compress much better.
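
For illustration, here is a minimal Java sketch of the write-side scheme the changelog body describes: variable-size blocks that stop growing once either a byte-size threshold or a value-count threshold is exceeded, with parallel per-block arrays for the block address and the first docId of the block. This is not the actual Lucene/Elasticsearch codec code; the class name, the threshold names, the `Compressor` interface, and the in-memory output are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: accumulate values into a block until either threshold is crossed,
 * then compress and flush the block, recording its address and its first docId.
 * Per-value offsets within a block (needed to slice out a single value after
 * decompression) are omitted for brevity.
 */
final class BlockWriterSketch {
    interface Compressor {
        byte[] compress(byte[] uncompressed); // e.g. Zstd at level 1 (hypothetical stand-in)
    }

    private final int maxBlockBytes;  // flush once the block grows past this many bytes...
    private final int maxBlockDocs;   // ...or past this many values
    private final Compressor compressor;

    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    private int docsInCurrentBlock = 0;
    private int firstDocIdOfCurrentBlock = -1;

    // Parallel per-block metadata: where each block starts, and the first docId it holds.
    private final List<Long> blockAddresses = new ArrayList<>();
    private final List<Integer> blockDocStarts = new ArrayList<>();
    private final List<byte[]> compressedBlocks = new ArrayList<>(); // stand-in for the data file
    private long nextAddress = 0;

    BlockWriterSketch(int maxBlockBytes, int maxBlockDocs, Compressor compressor) {
        this.maxBlockBytes = maxBlockBytes;
        this.maxBlockDocs = maxBlockDocs;
        this.compressor = compressor;
    }

    void addValue(int docId, byte[] value) {
        if (firstDocIdOfCurrentBlock == -1) {
            firstDocIdOfCurrentBlock = docId; // every block holds at least one value
        }
        current.writeBytes(value);
        docsInCurrentBlock++;
        if (current.size() >= maxBlockBytes || docsInCurrentBlock >= maxBlockDocs) {
            flushBlock();
        }
    }

    void finish() {
        if (docsInCurrentBlock > 0) {
            flushBlock(); // flush the trailing partial block
        }
    }

    private void flushBlock() {
        byte[] compressed = compressor.compress(current.toByteArray());
        blockAddresses.add(nextAddress);
        blockDocStarts.add(firstDocIdOfCurrentBlock);
        compressedBlocks.add(compressed);
        nextAddress += compressed.length;
        current.reset();
        docsInCurrentBlock = 0;
        firstDocIdOfCurrentBlock = -1;
    }
}
```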
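A matching sketch of the read-side lookup described in the body: binary search the array of block start docIds to find the block containing a given docId, look up that block's address, and decompress it (the change uses Zstd at level 1; the `Decompressor` interface and the in-memory arrays below are hypothetical stand-ins for the on-disk structures).

```java
import java.util.Arrays;

/**
 * Sketch only: map a docId to its block via binary search over block start docIds,
 * then fetch and decompress the block at the corresponding address.
 */
final class BlockReaderSketch {
    interface Decompressor {
        byte[] decompress(long address); // decode the compressed block stored at this address
    }

    private final int[] blockDocStarts;   // first docId in each block, ascending
    private final long[] blockAddresses;  // start address of each block (parallel array)
    private final Decompressor decompressor;

    // Cache the most recently decompressed block; lookups are often in the current block.
    private int currentBlock = -1;
    private byte[] currentBlockBytes;

    BlockReaderSketch(int[] blockDocStarts, long[] blockAddresses, Decompressor decompressor) {
        this.blockDocStarts = blockDocStarts;
        this.blockAddresses = blockAddresses;
        this.decompressor = decompressor;
    }

    byte[] blockFor(int docId) {
        if (currentBlock >= 0 && contains(currentBlock, docId)) {
            return currentBlockBytes; // still inside the current block, nothing to do
        }
        // Arrays.binarySearch returns the index if found, otherwise -(insertionPoint) - 1;
        // in the not-found case the containing block is the one just before the insertion point.
        int idx = Arrays.binarySearch(blockDocStarts, docId);
        int blockId = idx >= 0 ? idx : -idx - 2;
        currentBlock = blockId;
        currentBlockBytes = decompressor.decompress(blockAddresses[blockId]);
        return currentBlockBytes;
    }

    private boolean contains(int blockId, int docId) {
        int nextStart = blockId + 1 < blockDocStarts.length
                ? blockDocStarts[blockId + 1]
                : Integer.MAX_VALUE;
        return docId >= blockDocStarts[blockId] && docId < nextStart;
    }
}
```

Because blocks no longer hold a fixed number of values, the old constant-size shortcut (blockId = docId / 32) no longer applies, which is why the start-docId array and the binary search are needed.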