Move to Lucene 9.12's new PostingsFormat. #115021

@jpountz

A while back, Lucene changed the way it encodes doc IDs from PFOR-delta to FOR-delta, which is a bit faster but less space-efficient. To avoid introducing space-efficiency regressions (especially on dense postings lists, which are common in logging datasets), @iverase moved Elasticsearch to a copy of the Lucene postings format that still uses PFOR-delta for compression (#103601).

However, Lucene 9.12 introduced a new postings format with generally better skipping logic. It would be nice to take advantage of it. I suggest the following plan:

  • Use Lucene912PostingsFormat when storage efficiency isn't critical #119051
  • Create a new postings format that is a copy of Lucene912PostingsFormat but with a more space-efficient encoding of doc deltas. @dnhatn and I played with this earlier this year: there is room for significant improvement by storing exceptions (the P in PFOR stands for "patched") more efficiently and by allowing more exceptions per block.
  • Move indexes whose storage efficiency is important to this new postings format instead of ES812PostingsFormat.
  • Disallow using ES812PostingsFormat on new indexes.
  • Move the write logic of ES812PostingsFormat to the test folder.
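
To illustrate the space trade-off between FOR-delta and PFOR-delta discussed above, here is a minimal sketch with a simplified cost model. It is not Lucene's or Elasticsearch's actual encoder: the class name `ForVsPforSketch`, the block size of 128, the flat per-exception cost of 16 bits, and the limit of 7 exceptions are all assumptions chosen for illustration, and per-block header bytes are ignored.

```java
import java.util.Arrays;

// Hypothetical sketch: estimates the encoded size of one block of doc ID deltas
// under FOR (one bit width for the whole block) and PFOR (a narrower bit width
// plus a few "patched" exceptions). Not taken from Lucene or Elasticsearch.
public class ForVsPforSketch {
    static final int BLOCK_SIZE = 128; // assumed block size

    /** Number of bits needed to represent v (0 for v == 0). */
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    /** FOR: every delta is packed with the bit width of the largest delta. */
    static int forBits(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max) * deltas.length;
    }

    /**
     * PFOR: treat the e largest deltas as exceptions (0 <= e <= maxExceptions),
     * pack all values with the bit width of the largest remaining delta, and
     * pay a flat, assumed cost per exception. Returns the cheapest option.
     */
    static int pforBits(long[] deltas, int maxExceptions, int exceptionCostBits) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int best = Integer.MAX_VALUE;
        for (int e = 0; e <= maxExceptions && e < sorted.length; e++) {
            long largestPacked = sorted[sorted.length - 1 - e];
            int bits = bitsRequired(largestPacked) * sorted.length + e * exceptionCostBits;
            best = Math.min(best, bits);
        }
        return best;
    }

    public static void main(String[] args) {
        // A dense postings list: mostly consecutive doc IDs (delta 1), two outliers.
        long[] deltas = new long[BLOCK_SIZE];
        Arrays.fill(deltas, 1);
        deltas[10] = 200;
        deltas[77] = 130;

        System.out.println("FOR bits:  " + forBits(deltas));         // 1024
        System.out.println("PFOR bits: " + pforBits(deltas, 7, 16)); // 160
    }
}
```

In this toy block, two outliers force FOR to spend 8 bits on every delta, while PFOR packs the block at 1 bit per delta and patches the two outliers separately. Under the same model, lowering `maxExceptions` to 1 removes the benefit entirely, which hints at why allowing more exceptions per block and storing them more cheaply could matter for dense postings lists.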
