Should Lucene offer a helper class for primary key lookups?

Primary key lookup is such a common use case for search engines -- Lucene's nightly benchmarks track the performance of `PKLookup` (primary key lookup), finding the one document that contains a unique term.  It's a stress test of Lucene's terms index/dictionary impl, and it has been [surprisingly volatile over the past ~15 years](https://benchmarks.mikemccandless.com/PKLookup.html).

This task ([`PKLookupTask`](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java)) is somewhat bespoke: it first (on init -- not counted in the benchy stopwatch timer) [creates a batch of 4000 randomly picked ids (all valid/existing -- maybe we should mix in some non-existing ids?)](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java#L56-L62) in the index.   Then, when the concurrent benchy search threads execute this task, they iterate through each id, reusing per-segment state (`TermsEnum`) and use `seekExact` to see if the term exists in that segment.  (Hmm, the ids are not pre-sorted -- that would make benchy a bit faster maybe -- terms dict likes it if you go forward only in your seeks.  Also, the segments are visited in whatever order the `DirectoryReader` has them in ... I think if it sorted by descending `maxDoc` that also might make things faster -- best bang for your buck ish?  Anyway:).  As soon as it finds a live document for the id, it early terminates (doesn't check for more docs from the `PostingsEnum`, nor any of the remaining segments).

I wonder if Lucene should offer a utility API, to look up one PK?  It'd early terminate when it found just the one doc.  It might even be able to optimize for cases where most lookups are not in the index, if the users expresses that.  The class would be reusable (hang onto the `TermsEnum` for each leaf), and thread bound (well, only one thread can do the lookup at a time, at least).  Users today would normally just use a `TermQuery`, or maybe `TermInSetQuery` for a batch of PK lookups, but those are missing the optimizations to early terminate.  It could be a `PrimaryKeyInSetQuery` or so?  Or maybe lower level `PrimaryKeyTermStates` for the single PK lookup case.

I suspect retrieving docs by their unique id is a common use case for Lucene/OpenSearch/Solr/Elasticsearch users.  At Amazon's customer facing product search we have many queries / clauses / other things (doc version tracking for out-of-order updates during indexing) that seem to do this, at quite a bit aggregated cost, and we haven't optimized them well but thinking about it.  This is also related to my dev email about [visualizing bloom filter tradeoffs](https://lists.apache.org/thread/f5sbhcpt7s6cmt318wfc6dvnc756f27n), since in theory Lucene's `BloomFilteringPostingsFormat` should maybe help this use case.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Should Lucene offer a helper class for primary key lookups? #15520

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Should Lucene offer a helper class for primary key lookups? #15520

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions