@gf2121
Contributor

@gf2121 gf2121 commented Mar 6, 2025

Context

Proposal

Before the PRs mentioned above, the FST in the block tree index was almost a trie: it had no output prefix sharing and little suffix sharing. Once output prefixes were shared, we saved space but lost performance. IMO, for the tip file, performance matters more than storage size: the tip is usually a very small part of the whole index and is loaded off-heap.
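
To make the "almost a trie" shape concrete, here is a minimal sketch of a byte-keyed trie in which every node stores its own full output and nothing is shared between outputs. The class name `ByteTrieSketch`, the `NO_OUTPUT` sentinel, and the idea of using a block file pointer as the output are assumptions for illustration only; this is not the data structure from the PR.

```java
import java.util.TreeMap;

// Illustrative only: an in-memory trie keyed by unsigned bytes.
// Each node holds its own complete output (no output prefix sharing).
final class ByteTrieSketch {
    static final long NO_OUTPUT = -1L; // hypothetical sentinel for "no output here"

    private static final class Node {
        final TreeMap<Integer, Node> children = new TreeMap<>();
        long output = NO_OUTPUT;
    }

    private final Node root = new Node();

    // Associate a term prefix with an output (for example, a block file pointer).
    void put(byte[] prefix, long output) {
        Node node = root;
        for (byte b : prefix) {
            node = node.children.computeIfAbsent(b & 0xFF, k -> new Node());
        }
        node.output = output;
    }

    // Walk the term byte by byte and return the output of the deepest
    // node on the path that has one, or NO_OUTPUT if none does.
    long longestPrefixOutput(byte[] term) {
        Node node = root;
        long best = node.output;
        for (byte b : term) {
            node = node.children.get(b & 0xFF);
            if (node == null) {
                break;
            }
            if (node.output != NO_OUTPUT) {
                best = node.output;
            }
        }
        return best;
    }
}
```

Because each output is read from a single node, a lookup does not have to accumulate output fragments along the path, which is the cost that output prefix sharing introduces.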

This PR tries to introduce a simple trie specialized for the block tree index:

  • It can assume that keys are bytes, so numbers can be encoded more efficiently.
  • The output byte array can be read from IndexInput without copying.
  • It also improves performance by replacing vint / vlong decoding with an over-read and a mask.
    ...
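
The over-read-and-mask idea from the last bullet can be sketched in plain Java as follows. This is a sketch under stated assumptions, not the PR's on-disk layout: it assumes each value is stored little-endian in 1 to 8 bytes, that the byte count is known from elsewhere (for example, a header), and that the buffer has enough trailing padding for an 8-byte load. The class `OverReadMask` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: fixed-width load plus mask instead of a byte-by-byte
// variable-length decode loop.
final class OverReadMask {
    // Mask that keeps the low numBytes bytes, for numBytes in 1..8.
    static long mask(int numBytes) {
        return numBytes == 8 ? -1L : (1L << (numBytes * 8)) - 1;
    }

    // Minimum number of bytes (1..8) needed to hold value as an unsigned 64-bit number.
    static int bytesNeeded(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1L);
        return (bits + 7) / 8;
    }

    // Write the low numBytes bytes of value at pos, least significant byte first.
    static void write(ByteBuffer buf, int pos, long value, int numBytes) {
        for (int i = 0; i < numBytes; i++) {
            buf.put(pos + i, (byte) (value >>> (8 * i)));
        }
    }

    // Read with a single 8-byte load, then mask away the bytes that belong
    // to whatever follows. buf must be LITTLE_ENDIAN and must have at least
    // 8 readable bytes starting at pos.
    static long read(ByteBuffer buf, int pos, int numBytes) {
        return buf.getLong(pos) & mask(numBytes);
    }
}
```

A caller would allocate the buffer with `ByteBuffer.allocate(n).order(ByteOrder.LITTLE_ENDIAN)` and leave at least 7 spare bytes after the last value so the over-read stays in bounds. The read has no data-dependent loop or branch, unlike a vint / vlong decoder that tests a continuation bit on every byte.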

The PR is still a draft, but the numbers look promising.

Storage

File                  Baseline  Candidate     Diff
_32_Lucene101_0.tip    4425601    3827402  -13.52%
_65_Lucene101_0.tip    4458107    3884702  -12.86%
_98_Lucene101_0.tip    4791217    4185525  -12.64%
_cb_Lucene101_0.tip    4832497    4183043  -13.44%
_fe_Lucene101_0.tip    4807799    4173039  -13.20%
_fp_Lucene101_0.tip     720343     587765  -18.40%
_g0_Lucene101_0.tip     721438     585146  -18.89%
_gb_Lucene101_0.tip     694205     566073  -18.46%
_gm_Lucene101_0.tip     688145     561999  -18.33%
_gx_Lucene101_0.tip     819804     657046  -19.85%
_gy_Lucene101_0.tip     142276     100526  -29.34%
_gz_Lucene101_0.tip     125578      87568  -30.27%
_h0_Lucene101_0.tip     109982      77425  -29.60%
_h1_Lucene101_0.tip     113266      80308  -29.10%
_h2_Lucene101_0.tip     104672      72846  -30.41%
sum                   27554930   23630413  -14.24%
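
For reference, the diff column can be reproduced as the relative change from baseline to candidate. The helper below is a hypothetical illustration (the name `TipSizeDiff` is not from the PR); it only restates the arithmetic.

```java
// Illustrative only: relative size change in percent, negative when the candidate is smaller.
final class TipSizeDiff {
    static double pctDiff(long baseline, long candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }
}
```

Applied to the sum row, `TipSizeDiff.pctDiff(27554930L, 23630413L)` rounds to -14.24. The overall figure is computed on the totals, so the larger segments dominate it; it is not the average of the per-file percentages.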

Search

                            Task  QPS baseline    StdDev  QPS my_modified_version    StdDev              Pct diff p-value
                          IntSet      595.28      (3.3%)      584.37      (3.9%)   -1.8% (  -8% -    5%) 0.110
                      OrHighRare      108.03      (8.7%)      106.53      (9.2%)   -1.4% ( -17% -   18%) 0.624
                  FilteredPhrase       16.86      (2.9%)       16.69      (3.0%)   -1.0% (  -6% -    5%) 0.288
                IntervalsOrdered        7.89      (3.1%)        7.81      (4.9%)   -1.0% (  -8% -    7%) 0.448
                        SpanNear       12.64      (2.1%)       12.56      (2.2%)   -0.6% (  -4% -    3%) 0.359
                    FilteredTerm      128.98      (3.2%)      128.26      (2.9%)   -0.6% (  -6% -    5%) 0.565
                 CountAndHighMed       84.00      (2.3%)       83.55      (1.6%)   -0.5% (  -4% -    3%) 0.389
                AndMedOrHighHigh       54.23      (4.6%)       53.99      (4.2%)   -0.5% (  -8% -    8%) 0.745
             CountFilteredPhrase       12.76      (2.3%)       12.70      (2.4%)   -0.4% (  -4% -    4%) 0.556
                  CountOrHighMed      172.95      (3.8%)      172.23      (3.6%)   -0.4% (  -7% -    7%) 0.724
                CountAndHighHigh       84.36      (2.0%)       84.03      (1.6%)   -0.4% (  -3% -    3%) 0.496
                     TermGroup1M       23.31      (1.9%)       23.24      (2.1%)   -0.3% (  -4% -    3%) 0.635
                          Phrase       50.78      (3.1%)       50.65      (2.9%)   -0.3% (  -6% -    5%) 0.781
                     CountPhrase       18.16      (1.8%)       18.12      (2.0%)   -0.2% (  -4% -    3%) 0.698
                    TermGroup10K       17.28      (1.4%)       17.24      (1.9%)   -0.2% (  -3% -    3%) 0.665
                  FilteredOrMany        7.97      (1.9%)        7.95      (1.8%)   -0.2% (  -3% -    3%) 0.730
                   TermTitleSort      138.40      (3.0%)      138.24      (3.6%)   -0.1% (  -6% -    6%) 0.910
     FilteredAnd2Terms2StopWords      436.82      (3.3%)      436.31      (3.1%)   -0.1% (  -6% -    6%) 0.906
             CountFilteredOrMany       13.23      (2.1%)       13.22      (2.2%)   -0.1% (  -4% -    4%) 0.889
             CombinedAndHighHigh        9.59      (2.7%)        9.59      (2.2%)   -0.1% (  -4% -    5%) 0.930
             FilteredOrStopWords       25.57      (3.9%)       25.56      (3.3%)   -0.0% (  -6% -    7%) 0.966
                 AndHighOrMedMed       15.54      (1.4%)       15.53      (1.4%)   -0.0% (  -2% -    2%) 0.939
         CountFilteredOrHighHigh       37.22      (1.2%)       37.21      (0.9%)   -0.0% (  -2% -    2%) 0.923
                    TermGroup100       16.94      (2.4%)       16.95      (2.3%)    0.0% (  -4% -    4%) 0.989
                     CountOrMany       15.03      (3.4%)       15.03      (3.4%)    0.0% (  -6% -    7%) 0.988
                 CountOrHighHigh       89.26      (1.9%)       89.28      (1.7%)    0.0% (  -3% -    3%) 0.960
             CountFilteredIntNRQ       23.19      (1.6%)       23.20      (1.7%)    0.0% (  -3% -    3%) 0.944
                  FilteredIntNRQ       35.57      (1.8%)       35.60      (2.3%)    0.1% (  -3% -    4%) 0.887
                          IntNRQ       35.72      (1.9%)       35.76      (2.1%)    0.1% (  -3% -    4%) 0.860
                    TermBGroup1M       23.72      (2.4%)       23.75      (2.2%)    0.1% (  -4% -    4%) 0.848
              CombinedAndHighMed       78.87      (3.2%)       79.00      (3.3%)    0.2% (  -6% -    6%) 0.876
               FilteredOrHighMed      147.44      (4.0%)      147.69      (2.8%)    0.2% (  -6% -    7%) 0.876
          CountFilteredOrHighMed       46.61      (1.1%)       46.69      (1.0%)    0.2% (  -1% -    2%) 0.609
              FilteredOrHighHigh       30.50      (3.9%)       30.58      (3.1%)    0.3% (  -6% -    7%) 0.816
                    AndStopWords        6.02      (6.8%)        6.04      (4.4%)    0.3% ( -10% -   12%) 0.881
                       OrHighMed       55.30      (4.4%)       55.45      (5.8%)    0.3% (  -9% -   10%) 0.868
                          OrMany        5.73      (6.5%)        5.75      (7.0%)    0.3% ( -12% -   14%) 0.898
                      AndHighMed      116.16      (3.6%)      116.50      (3.4%)    0.3% (  -6% -    7%) 0.789
                FilteredOr3Terms      106.20      (3.8%)      106.59      (2.4%)    0.4% (  -5% -    6%) 0.713
                    SloppyPhrase       22.88      (3.1%)       22.98      (3.1%)    0.4% (  -5% -    6%) 0.666
                  TermBGroup1M1P       13.89      (2.4%)       13.96      (1.9%)    0.4% (  -3% -    4%) 0.518
            FilteredAndStopWords       26.26      (4.3%)       26.38      (4.0%)    0.5% (  -7% -    9%) 0.719
               CombinedOrHighMed       15.73      (3.4%)       15.81      (1.5%)    0.5% (  -4% -    5%) 0.564
                 DismaxOrHighMed      100.85      (3.4%)      101.37      (3.6%)    0.5% (  -6% -    7%) 0.642
              CombinedOrHighHigh       12.56      (3.3%)       12.64      (1.0%)    0.6% (  -3% -    5%) 0.403
      FilteredOr2Terms2StopWords      192.94      (2.7%)      194.24      (1.9%)    0.7% (  -3% -    5%) 0.362
                      DismaxTerm      719.66      (5.8%)      724.60      (6.9%)    0.7% ( -11% -   14%) 0.733
                       And3Terms      493.05      (5.2%)      496.54      (5.4%)    0.7% (  -9% -   12%) 0.676
              Or2Terms2StopWords      102.67      (4.8%)      103.43      (5.4%)    0.7% (  -9% -   11%) 0.643
                    CombinedTerm       23.30      (3.5%)       23.48      (1.7%)    0.8% (  -4% -    6%) 0.365
             FilteredAndHighHigh       25.26      (4.6%)       25.46      (4.7%)    0.8% (  -8% -   10%) 0.594
                     OrStopWords       37.40      (4.8%)       37.73      (6.2%)    0.9% (  -9% -   12%) 0.614
                DismaxOrHighHigh       21.65      (3.7%)       21.86      (5.5%)    1.0% (  -7% -   10%) 0.521
               FilteredAnd3Terms       99.12      (5.5%)      100.08      (5.3%)    1.0% (  -9% -   12%) 0.573
                      OrHighHigh       24.82      (4.6%)       25.07      (5.8%)    1.0% (  -9% -   12%) 0.546
                     AndHighHigh       83.00      (4.4%)       83.91      (3.1%)    1.1% (  -6% -    9%) 0.369
                            Term      706.24      (4.0%)      713.95      (4.9%)    1.1% (  -7% -   10%) 0.440
                        Or3Terms      253.05      (5.5%)      256.03      (5.2%)    1.2% (  -9% -   12%) 0.487
                      TermDTSort      243.15      (3.3%)      246.59      (5.2%)    1.4% (  -6% -   10%) 0.305
                   TermMonthSort     2863.45      (7.1%)     2908.45      (7.4%)    1.6% ( -12% -   17%) 0.493
                 FilteredPrefix3      144.48      (2.9%)      146.86      (2.3%)    1.6% (  -3% -    7%) 0.046
             And2Terms2StopWords      247.08      (3.1%)      251.49      (3.1%)    1.8% (  -4% -    8%) 0.066
              FilteredAndHighMed      146.72      (7.4%)      149.37      (7.0%)    1.8% ( -11% -   17%) 0.426
                         Prefix3       68.50      (3.3%)       69.91      (1.6%)    2.1% (  -2% -    7%) 0.012
               TermDayOfYearSort      443.75      (5.2%)      453.71      (4.2%)    2.2% (  -6% -   12%) 0.134
                        Wildcard      100.44      (4.4%)      104.04      (3.3%)    3.6% (  -3% -   11%) 0.004
                          Fuzzy2       47.50      (3.4%)       49.31      (3.6%)    3.8% (  -3% -   11%) 0.001
                         Respell       64.88      (2.9%)       68.90      (3.7%)    6.2% (   0% -   13%) 0.000
                          Fuzzy1      121.48      (2.1%)      129.95      (3.1%)    7.0% (   1% -   12%) 0.000
                       CountTerm     7936.16     (10.6%)     8914.66      (8.8%)   12.3% (  -6% -   35%) 0.000
                        PKLookup      269.87      (3.6%)      361.18      (4.9%)   33.8% (  24% -   43%) 0.000

Member

@mikemccand mikemccand left a comment

This looks great! Thank you for persisting @gf2121!


  @SuppressWarnings("NonFinalStaticField")
- static Codec defaultCodec = LOADER.lookup("Lucene101");
+ static Codec defaultCodec = LOADER.lookup("Lucene103");
Member

Yay! This means default Codec is now using trie!

Contributor

I think this is causing the OOMs. See #14487

@gf2121 gf2121 merged commit 878e23f into apache:main Apr 14, 2025
7 checks passed
@gf2121
Contributor Author

gf2121 commented Apr 14, 2025

Thank you @mikemccand and @jpountz for the patient review and all these great suggestions!

I raised mikemccand/luceneutil#369 to switch codec for luceneutil.

@gf2121
Contributor Author

gf2121 commented Apr 15, 2025

Nightly confirmed the speedup.

https://benchmarks.mikemccandless.com/2025.04.14.18.05.20.html
https://benchmarks.mikemccandless.com/PKLookup.html

Annotation PR: mikemccand/luceneutil#370

@mikemccand
Member

> Nightly confirmed the speedup.

That's ~1.2 M primary key lookups per second, yay!

PK lookups are used all the time, not just the "key/value store" use case... e.g. updating or deleting a document by its id.

@msokolov
Contributor

msokolov commented Apr 15, 2025

I saw a message raising concern about some OOM test failures possibly related to this? Am I confused? Is excessive memory usage something we should be concerned about?

Hmm, maybe it was only a test issue: #14487

@stefanvodita
Contributor

If we think the issue last week was with the tests, should we go ahead and back-port this change?

@gf2121
Contributor Author

gf2121 commented Apr 23, 2025

Hi @stefanvodita

We plan to let CI chew on this change for a couple of weeks and backport it if everything goes well. See #14333 (comment).

When backporting, I'll try to catch all changes made to Lucene103Codec, including #14447. So for now, feel free to just merge it into main :)

@stefanvodita
Contributor

You anticipated my concern with #14447 😄
Thank you for handling this!

jpountz pushed a commit to jpountz/lucene that referenced this pull request Apr 24, 2025
gf2121 added a commit to gf2121/lucene that referenced this pull request Apr 29, 2025
gf2121 added a commit to gf2121/lucene that referenced this pull request Apr 29, 2025
gf2121 added a commit to gf2121/lucene that referenced this pull request May 7, 2025
gf2121 added a commit that referenced this pull request May 9, 2025
* A specialized Trie for Block Tree Index (#14333)

* Fix OOM of TestTrie (#14488)

* Compute the doc range more efficiently when flushing doc block (#14447)

* Fix TestForTooMuchCloning (#14547)

* Fix tests: too many calls to IndexInput.clone during merging (#14595)

* Simplify rootCodeFp (#14596)

---------

Co-authored-by: panguixin <[email protected]>
@Coqueue

Coqueue commented May 20, 2025

Thanks for the fantastic change!

Want to share that we adopted and backported this codec to Lucene 912, and ran it against an Amazon Search internal benchmark, from which we observed an increase of 2.7% in Searcher throughput : D

Looking forward to seeing the actual improvements this brings in production!

@mikemccand
Member

Thank you @Coqueue.

> ran it against an Amazon Search internal benchmark, from which we observed an increase of 2.7% in Searcher throughput : D

Small correction: Amazon Product Search internal benchmarks.

Also, we did this backport only to evaluate this change's performance impact on our service, since we use Lucene's terms dictionary so heavily. To actually run it in production, we will upgrade to 10.x.

Hmm this change doesn't have the milestone set -- I'll set it.

@mikemccand mikemccand added this to the 10.3.0 milestone May 21, 2025
@jpountz
Contributor

jpountz commented May 21, 2025

This is great info, thanks for sharing!
