Skip to content

Should Lucene offer a helper class for primary key lookups?Β #15520

@mikemccand

Description

@mikemccand

Primary key lookup is such a common use case for search engines -- Lucene's nightly benchmarks track the performance of PKLookup (primary key lookup), finding the one document that contains a unique term. It's a stress test of Lucene's terms index/dictionary impl, and it has been surprisingly volatile over the past ~15 years.

This task (PKLookupTask) is somewhat bespoke: it first (on init -- not counted in the benchy stopwatch timer) creates a batch of 4000 randomly picked ids (all valid/existing -- maybe we should mix in some non-existing ids?) in the index. Then, when the concurrent benchy search threads execute this task, they iterate through each id, reusing per-segment state (TermsEnum) and use seekExact to see if the term exists in that segment. (Hmm, the ids are not pre-sorted -- that would make benchy a bit faster maybe -- terms dict likes it if you go forward only in your seeks. Also, the segments are visited in whatever order the DirectoryReader has them in ... I think if it sorted by descending maxDoc that also might make things faster -- best bang for your buck ish? Anyway:). As soon as it finds a live document for the id, it early terminates (doesn't check for more docs from the PostingsEnum, nor any of the remaining segments).

I wonder if Lucene should offer a utility API, to look up one PK? It'd early terminate when it found just the one doc. It might even be able to optimize for cases where most lookups are not in the index, if the users expresses that. The class would be reusable (hang onto the TermsEnum for each leaf), and thread bound (well, only one thread can do the lookup at a time, at least). Users today would normally just use a TermQuery, or maybe TermInSetQuery for a batch of PK lookups, but those are missing the optimizations to early terminate. It could be a PrimaryKeyInSetQuery or so? Or maybe lower level PrimaryKeyTermStates for the single PK lookup case.

I suspect retrieving docs by their unique id is a common use case for Lucene/OpenSearch/Solr/Elasticsearch users. At Amazon's customer facing product search we have many queries / clauses / other things (doc version tracking for out-of-order updates during indexing) that seem to do this, at quite a bit aggregated cost, and we haven't optimized them well but thinking about it. This is also related to my dev email about visualizing bloom filter tradeoffs, since in theory Lucene's BloomFilteringPostingsFormat should maybe help this use case.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions