
Commit f2e524a

Update changelog highlights for 188524 (enable binary doc value compression) (elastic#139353)
1 parent 682cfe5 commit f2e524a

File tree: 1 file changed (+15 −5 lines)


docs/changelog/138524.yaml

Lines changed: 15 additions & 5 deletions
@@ -4,13 +4,23 @@ area: Mapping
 type: feature
 issues: []
 highlight:
-  title: Remove feature flag to enable binary doc value compression
+  title: Add compression for binary doc values
   body: |-
     Add compression for binary doc values using Zstd and blocks with a variable number of values.

-    Block-wise LZ4 compression was previously added to Lucene in LUCENE-9211 and removed in LUCENE-9378 due to query performance issues. This approach stored a constant number of values per block (specifically 32 values). This made it easy to map a given value index (e.g., docId) to the block containing it by doing blockId = docId / 32.
-    Unfortunately, if values are very large, we must still have exactly 32 values per block, and (de)compressing a block could cause very high memory usage. As a result, we had to keep the number of values small, meaning that in the average case, a block was much smaller than ideal.
-    To overcome the issues of blocks with a constant number of values, this PR adds block-wise compression with a variable number of values per block. It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold or the number of values exceeds a threshold.
-    Like the previous version, it stores an array of addresses for the start of each block. Additionally, it stores a parallel array with the docId at the start of each block. When looking up a given docId, if it is not in the current block, we binary search the array of docId starts to find the blockId containing the value. We then look up the address of the block. After this, decompression works very similarly to the code from LUCENE-9211; the main difference being that Zstd(1) is used instead of LZ4.
+    Block-wise LZ4 compression was previously added to Lucene in LUCENE-9211 and removed in LUCENE-9378 due to query performance issues.
+    This approach stored a constant number of values per block (specifically 32 values).
+    This made it easy to map a given value index (e.g., docId) to the block containing it by doing blockId = docId / 32.
+    Unfortunately, if values are very large, we must still have exactly 32 values per block, and (de)compressing a block could cause very high memory usage.
+    As a result, we had to keep the number of values small, meaning that in the average case, a block was much smaller than ideal.
+
+    To overcome the issues of blocks with a constant number of values, this PR adds block-wise compression with a variable number of values per block.
+    It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold or the number of values exceeds a threshold.
+    Like the previous version, it stores an array of addresses for the start of each block.
+    Additionally, it stores a parallel array with the docId at the start of each block.
+    When looking up a given docId, if it is not in the current block, we binary search the array of docId starts to find the blockId containing the value.
+    We then look up the address of the block.
+    After this, decompression works very similarly to the code from LUCENE-9211; the main difference being that Zstd(1) is used instead of LZ4.

+    The introduction of binary doc value compression transparently affects wildcard field types, like URLs that are common in access logs, which will now compress much better.
 notable: true
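
The variable-size block scheme described in the changelog body can be sketched as follows. This is a minimal illustration, not the actual Lucene/Elasticsearch code: the class, field names, and thresholds are all hypothetical, and the real codec writes Zstd-compressed bytes to an index file rather than just tracking offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: blocks hold a variable number of values; a block is closed when
// either its byte size or its value count exceeds a threshold, but it always
// holds at least one value, so a single huge value cannot blow up a block.
class VariableBlockWriter {
    static final int MAX_DOCS_PER_BLOCK = 128;    // hypothetical count threshold
    static final int MAX_BLOCK_BYTES = 16 * 1024; // hypothetical size threshold

    // Parallel arrays, as in the changelog: one file address and one starting
    // docId per block.
    final List<Integer> blockStartDocIds = new ArrayList<>();
    final List<Long> blockAddresses = new ArrayList<>();

    private int docsInBlock = 0;
    private int bytesInBlock = 0;
    private long nextAddress = 0;

    /** Appends one value; closes the current block when a threshold is hit. */
    void addDoc(int docId, byte[] value) {
        if (docsInBlock == 0) { // first value of a new block
            blockStartDocIds.add(docId);
            blockAddresses.add(nextAddress);
        }
        docsInBlock++;
        bytesInBlock += value.length;
        if (docsInBlock >= MAX_DOCS_PER_BLOCK || bytesInBlock >= MAX_BLOCK_BYTES) {
            nextAddress += bytesInBlock; // here the real codec would compress and write
            docsInBlock = 0;
            bytesInBlock = 0;
        }
    }

    /** Binary search over the docId-starts array: greatest start <= docId. */
    int findBlock(int docId) {
        int lo = 0, hi = blockStartDocIds.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (blockStartDocIds.get(mid) <= docId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo; // blockAddresses.get(lo) is where decompression would begin
    }
}
```

With dense docIds and small values, the count threshold fires first, so blocks start at docIds 0, 128, 256, and so on; `findBlock` then recovers the block index without the fixed `docId / 32` arithmetic that the constant-size scheme relied on.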
