Open
Description
A while back, Lucene changed the way that it encodes doc IDs from PFOR-delta to FOR-delta, which is a bit faster but less space-efficient. To avoid introducing space-efficiency regressions (especially on dense postings lists, which are common in logging datasets), @iverase moved Elasticsearch to a copy of the Lucene postings format that still uses PFOR-delta for compression. (#103601)
But Lucene 9.12 introduced a new postings format that has better skipping logic (in general). It would be nice to take advantage of it. I would suggest the following plan:
- Use 'Lucene912PostingsFormat' when storage efficiency isn't critical #119051
- Create a new postings format that is a copy of `Lucene912PostingsFormat` but with a more space-efficient encoding of doc deltas. @dnhatn and I played with it earlier this year; there is room for significant improvement by storing exceptions (the P from PFOR stands for "patched") more efficiently and allowing more exceptions per block.
- Move indexes whose storage efficiency is important to this new postings format instead of `ES812PostingsFormat`.
- Disallow using `ES812PostingsFormat` on new indexes.
- Move the write logic of `ES812PostingsFormat` to the test folder.
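
To make the space trade-off between FOR-delta and PFOR-delta concrete, here is a rough cost model for one block of doc deltas. It is only a sketch and not the encoder used by Lucene or Elasticsearch: the class `PforCostSketch`, its method names, and the flat per-exception cost are all hypothetical, and the model ignores block headers and any limit on how many high bits an exception can carry.

```java
import java.util.Arrays;

// Hypothetical cost model, for illustration only (not Lucene's encoder).
final class PforCostSketch {

    // Number of bits needed to represent a non-negative value.
    static int bitsRequired(long v) {
        return v == 0 ? 0 : 64 - Long.numberOfLeadingZeros(v);
    }

    // FOR: every delta in the block is packed with the width of the largest delta.
    static int forCostBits(long[] deltas) {
        int maxBits = 0;
        for (long d : deltas) {
            maxBits = Math.max(maxBits, bitsRequired(d));
        }
        return maxBits * deltas.length;
    }

    // PFOR: pack all deltas with a narrower width and "patch" the few values
    // that overflow it. Assumption: each exception costs a fixed number of
    // extra bits (bitsPerException), e.g. a position plus the missing high bits.
    static int pforCostBits(long[] deltas, int maxExceptions, int bitsPerException) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int best = Integer.MAX_VALUE;
        for (int e = 0; e <= Math.min(maxExceptions, n - 1); e++) {
            // With e exceptions, the e largest deltas are patched, so the
            // packed width is set by the largest remaining delta.
            int width = bitsRequired(sorted[n - 1 - e]);
            best = Math.min(best, width * n + e * bitsPerException);
        }
        return best;
    }
}
```

Under this model, a block of 128 deltas where 127 are equal to 1 and one is equal to 1000 costs 1280 bits with FOR (10 bits per value), but 144 bits with PFOR when an exception is assumed to cost 16 bits (1 bit per value plus one patch). Dense postings lists with occasional large gaps are the case where this difference matters most, and where cheaper or more numerous exceptions could help further.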