Specialize arc store for continuous label in FST #12748

easyice · 2023-11-02T09:24:42Z

This PR resolves issue #12701. Thanks to @gf2121 for the cool idea.

It needs some more benchmarking.

mikemccand · 2023-11-02T09:48:02Z

This looks cool! Sorry, I caused conflicts w/ the earlier merge -- could you please resolve those @easyice? I'm happy to try benchmarking it, using the new IndexToFST tool in luceneutil :)

easyice · 2023-11-02T10:07:19Z

@mikemccand Thanks for your quick reply! The conflicts have been resolved; any comments are welcome!

easyice · 2023-11-03T07:20:44Z

I ran this with wikimedium10m and wikimediumall. No significant performance improvement or regression was found. The total size of the tip files is slightly smaller:

	baseline	candidate
wikimedium10m	10280673	10275716
wikimediumall	28530090	28496270

I counted the different nodeFlags for wikimedium10m:

strategies	count	percent
ARCS_FOR_DIRECT_ADDRESSING	558555	50.23%
ARCS_FOR_CONTINUOUS	25215	2.26%
ARCS_FOR_BINARY_SEARCH	9	0.00%
Linear search(bytesPerArc:0)	528100	47.49%

It seems that the percentage of nodes hitting this optimization is small for this corpus. To see how it behaves when the arc labels are dense, I generated 10 million random long values as terms:

for (int i = 0; i < 10_000_000; i++) {
    Document doc = new Document();
    doc.add(new StringField("f1", String.valueOf(rand.nextLong()), Store.NO));
    indexWriter.addDocument(doc);
}

This optimization will be hit in most cases:

strategies	count	percent
ARCS_FOR_DIRECT_ADDRESSING	2469	2.58%
ARCS_FOR_CONTINUOUS	78732	82.45%
ARCS_FOR_BINARY_SEARCH	0	0.00%
Linear search(bytesPerArc:0)	14280	14.95%

mikemccand · 2023-11-03T10:31:02Z

I tested this PR using IndexToFST from luceneutil. This just tests construction time and final FST size, on all wikimediumall unique terms, allowing up to 64 MB RAM while building the FST:

main:

  saved FST to "fst.bin": 382070800 bytes; 44.146 sec
  saved FST to "fst.bin": 382070800 bytes; 43.478 sec

PR:

  saved FST to "fst.bin": 381705016 bytes; 42.616 sec
  saved FST to "fst.bin": 381705016 bytes; 42.832 sec

FST size is a wee bit smaller (~0.1%), and curiously the construction time seems to be faster too.
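
As a quick check on the ~0.1% figure, the relative reduction can be recomputed from the byte counts reported above. This is only an arithmetic sketch; the names `FstSizeDelta` and `percentReduction` are made up for illustration and are not part of luceneutil.

```java
final class FstSizeDelta {
  /** Percent by which the candidate is smaller than the baseline (negative if it grew). */
  static double percentReduction(long baselineBytes, long candidateBytes) {
    return 100.0 * (baselineBytes - candidateBytes) / baselineBytes;
  }

  public static void main(String[] args) {
    // Byte counts copied from the IndexToFST runs above (main vs. PR).
    double pct = percentReduction(382_070_800L, 381_705_016L);
    System.out.printf("FST size reduction: %.3f%%%n", pct);
  }
}
```

The same helper can be applied to the tip sizes reported earlier in the thread.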

mikemccand · 2023-11-03T10:33:20Z

I'll run Test2BFST too ... takes a few hours!

mikemccand

This looks good to me! Can we remove the DRAFT status or is there something still missing? Thanks @easyice!

mikemccand · 2023-11-03T10:34:56Z

lucene/core/src/java/org/apache/lucene/util/fst/FST.java

   */
  static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

+  static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;


Could you add a comment explaining this arc optimization case?

+1, it is important.
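
For readers new to the node-header flags, here is a rough standalone sketch of how the new constant relates to the existing ones. It is not the FST.java code. `ARCS_FOR_DIRECT_ADDRESSING` and the sum come from the diff above; the value `1 << 5` for `ARCS_FOR_BINARY_SEARCH` is an assumption (it is not shown in the diff), and `describe` is a hypothetical helper whose strings are only a loose summary of each layout.

```java
final class NodeFlagsSketch {
  // Assumption: the existing binary-search flag is 1 << 5; the diff does not show it.
  static final byte ARCS_FOR_BINARY_SEARCH = 1 << 5;
  // From the diff above.
  static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;
  // From the diff above: the sum of the two older flags, so it equals neither of them.
  static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;

  /** Hypothetical helper: names the arc layout that a node-header flag byte selects. */
  static String describe(byte nodeFlags) {
    switch (nodeFlags) {
      case ARCS_FOR_BINARY_SEARCH:
        return "fixed-length arcs, binary search by label";
      case ARCS_FOR_DIRECT_ADDRESSING:
        return "direct addressing, presence bits for labels in range";
      case ARCS_FOR_CONTINUOUS:
        return "continuous labels, arc index = label - firstLabel";
      default:
        return "variable-length arcs, linear scan";
    }
  }

  public static void main(String[] args) {
    System.out.println(ARCS_FOR_CONTINUOUS + " -> " + describe(ARCS_FOR_CONTINUOUS));
  }
}
```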

mikemccand · 2023-11-03T10:37:48Z

lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java

      int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
      assert labelRange > 0;
-      if (shouldExpandNodeWithDirectAddressing(
+      boolean continuousLable = labelRange == nodeIn.numArcs;


Hmm can we please use the continuousLabel spelling instead :) To be consistent...

Sorry for the typo :)
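
The check in the diff above can be illustrated with a small standalone sketch. It is not the FSTCompiler code: `ContinuousLabelSketch`, `isContinuous`, and `arcIndex` are hypothetical names, and the arcs are reduced to a sorted array of int labels. It shows why `labelRange == numArcs` means the labels have no gaps, so the arc for a label sits at index `label - firstLabel` and no presence table or binary search is needed.

```java
final class ContinuousLabelSketch {
  /** Labels must be sorted ascending and unique, as the arcs leaving one FST node are. */
  static boolean isContinuous(int[] sortedLabels) {
    int numArcs = sortedLabels.length;
    if (numArcs == 0) {
      return false;
    }
    int labelRange = sortedLabels[numArcs - 1] - sortedLabels[0] + 1;
    // With unique sorted labels, range == count holds only when no label is missing.
    return labelRange == numArcs;
  }

  /** Arc index for a label on a continuous node, or -1 if the label is out of range. */
  static int arcIndex(int[] continuousLabels, int label) {
    int index = label - continuousLabels[0];
    return (index >= 0 && index < continuousLabels.length) ? index : -1;
  }

  public static void main(String[] args) {
    int[] digits = {'0', '1', '2', '3'}; // dense, like numeric terms
    int[] gappy = {'a', 'c', 'd'};       // 'b' is missing
    System.out.println(isContinuous(digits) + ", index of '2': " + arcIndex(digits, '2'));
    System.out.println(isContinuous(gappy));
  }
}
```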

mikemccand · 2023-11-03T11:25:15Z

Test2BFSTs is happy:

BUILD SUCCESSFUL in 50m 6s

gf2121

This looks great. Thanks for working on this @easyice !

Maybe we can add an abstraction layer for all these arc strategies in the future to simplify code and make it easier to add strategies :)

.../core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java

easyice · 2023-11-03T12:06:09Z

@mikemccand Thanks for the benchmarking! I also wrote 10 million docs of random long values, then used TermInSetQuery for benchmarking. Here is the result:

The tip file size is reduced by ~2%:

	size
main	1807149
PR	1770259

The query latency is reduced by up to ~7%. termsCount is the number of terms in the TermInSetQuery, and hitRatio is the percentage of those terms that exist in the index. diff is tookMs(PR) as a percentage of tookMs(main). There is a bit of variance across runs, but the results look good overall.

hitRatio	termsCount	tookMs(main)	tookMs(PR)	diff
1%	64	177	164	92.66%
1%	512	1380	1312	95.07%
1%	2048	5225	5022	96.11%
25%	64	222	212	95.50%
25%	512	1462	1391	95.14%
25%	2048	5602	5533	98.77%
50%	64	216	204	94.44%
50%	512	1600	1513	94.56%
50%	2048	6193	5883	94.99%
75%	64	224	213	95.09%
75%	512	1702	1598	93.89%
75%	2048	6565	6289	95.80%
100%	64	233	218	93.56%
100%	512	1752	1736	99.09%
100%	2048	7057	6621	93.82%

crude benchmark code:

public static long doSearch(int termCount, int hitRatio) throws IOException {
        Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/longdata"));
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        searcher.setQueryCachingPolicy(
                new QueryCachingPolicy() {
                    @Override
                    public void onUse(Query query) {
                    }

                    @Override
                    public boolean shouldCache(Query query) throws IOException {
                        return false;
                    }
                });

        long total = 0;
        Query query = getQuery(termCount, hitRatio);
        for (int i = 0; i < 1000; i++) {
            long start = System.currentTimeMillis();
            doQuery(searcher, query);
            long end = System.currentTimeMillis();
            total += end - start;
        }
        //System.out.println("term count: " + termCount + ", took(ms): " + total);
        indexReader.close();
        directory.close();
        return total;
    }

    private static Query getQuery(int termCount, int hitRatio) {
        int hitCount = termCount * hitRatio / 100;
        int notHitCount = termCount - hitCount;
        List<BytesRef> terms = new ArrayList<>();
        for (int i = 0; i < hitCount; i++) {
            // nextInt's bound is exclusive, so longs.size() lets every indexed value be picked
            terms.add(new BytesRef(Long.toString(longs.get(RANDOM.nextInt(longs.size())))));
        }

        Random r = new Random();
        for (int i = 0; i < notHitCount; i++) {
            long v = r.nextLong();
            while (uniqueLongs.contains(v)) {
                v = r.nextLong();
            }
            terms.add(new BytesRef(Long.toString(v)));
        }
        return new TermInSetQuery(FIELD, terms);
    }

    private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
        searcher.search(
                query,
                new Collector() {
                    @Override
                    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
                        return new LeafCollector() {
                            @Override
                            public void setScorer(Scorable scorer) throws IOException {
                            }

                            @Override
                            public void collect(int doc) throws IOException {
                                throw new CollectionTerminatedException();
                            }
                        };
                    }

                    @Override
                    public ScoreMode scoreMode() {
                        return ScoreMode.COMPLETE_NO_SCORES;
                    }
                });
    }
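
The loop in doSearch sums System.currentTimeMillis() deltas over 1000 runs with no warm-up. As a hedged alternative, and not what produced the numbers above, the same measurement could be structured with a warm-up phase and System.nanoTime(). `TimingSketch` and `timeNanos` are hypothetical names, and the Runnable stands in for a call such as doQuery(searcher, query).

```java
final class TimingSketch {
  /** Runs the task warmupIters times untimed, then returns total nanoseconds over measuredIters runs. */
  static long timeNanos(Runnable task, int warmupIters, int measuredIters) {
    for (int i = 0; i < warmupIters; i++) {
      task.run(); // give the JIT a chance to compile the hot path before measuring
    }
    long total = 0;
    for (int i = 0; i < measuredIters; i++) {
      long start = System.nanoTime(); // monotonic, unlike wall-clock milliseconds
      task.run();
      total += System.nanoTime() - start;
    }
    return total;
  }

  public static void main(String[] args) {
    long nanos = timeNanos(() -> Math.sqrt(12345.0), 100, 1000);
    System.out.println("took(ms): " + nanos / 1_000_000.0);
  }
}
```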

mikemccand · 2023-11-04T15:33:14Z

Hello @easyice, I'm sorry but I just merged #12738 which caused conflicts here ... could you please rebase and resolve conflicts? I think this change is ready except for that. Thank you!!

mikemccand · 2023-11-04T15:36:26Z

lucene/core/src/java/org/apache/lucene/util/fst/FST.java

+
  // Increment version to change it
  private static final String FILE_FORMAT_NAME = "FST";
  private static final int VERSION_START = 6;


Hmm shouldn't we bump the VERSION_CURRENT in FST with this change too?

I think so. We expect CodecUtil.checkIndexHeader to throw IndexFormatTooNewException when older code reads the new index format. So should we also bump VERSION_CURRENT in Lucene90BlockTreeTermsReader?

I am not sure: if the change is backward compatible, maybe we could keep VERSION_CURRENT as is? Could you please give me some suggestions? Thank you!

Hmm I'd like to protect against an older version of Lucene (w/o this change) trying to read an FST written with a newer version (with this change). If we bump the version, that older version would throw an understandable error, but if we don't, it'd be some strange assertion error or so?

I realize it'd be hard to even reach such a situation (you'd have to be using FSTs directly or so), but still when we make such changes to our format I think it's good practice to bump the version.

> I think so. We expect CodecUtil.checkIndexHeader to throw IndexFormatTooNewException when older code reads the new index format. So should we also bump VERSION_CURRENT in Lucene90BlockTreeTermsReader?

+1 to also bump the version in Lucene90BlockTreeTermsReader.

Thank you very much for your guidance! It's very helpful!
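
The reasoning in this thread can be sketched as a plain version gate. This is not Lucene's CodecUtil: `VersionGateSketch`, `checkVersion`, and the two exception classes are hypothetical stand-ins, and the version numbers 8 and 9 are illustrative only (just `VERSION_START = 6` comes from the diff above). The point is that a reader built before the bump rejects a newer file with a clear error instead of misreading an arc layout it does not know.

```java
final class VersionGateSketch {
  /** Hypothetical stand-in for an "index format too new" error. */
  static final class FormatTooNewException extends RuntimeException {
    FormatTooNewException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for an "index format too old" error. */
  static final class FormatTooOldException extends RuntimeException {
    FormatTooOldException(String message) {
      super(message);
    }
  }

  /** Returns the version if this reader supports it, otherwise throws a descriptive error. */
  static int checkVersion(int actual, int minSupported, int maxSupported) {
    if (actual < minSupported) {
      throw new FormatTooOldException("version " + actual + " is older than " + minSupported);
    }
    if (actual > maxSupported) {
      throw new FormatTooNewException("version " + actual + " is newer than " + maxSupported);
    }
    return actual;
  }

  public static void main(String[] args) {
    int versionStart = 6;     // VERSION_START from the diff above
    int oldReaderCurrent = 8; // illustrative: what a reader without this change supports
    int newWriterCurrent = 9; // illustrative: the value written after the bump
    try {
      checkVersion(newWriterCurrent, versionStart, oldReaderCurrent);
    } catch (FormatTooNewException e) {
      System.out.println("old reader rejects new file: " + e.getMessage());
    }
    System.out.println(
        "new reader accepts version "
            + checkVersion(newWriterCurrent, versionStart, newWriterCurrent));
  }
}
```

Without the bump, the old reader's check would pass and the failure would surface later, while decoding a node flag it has never seen.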

mikemccand · 2023-11-06T16:04:57Z

This might be a nice bump in PKLookup performance in the nightly benchmarks -- it uses compact integers encoded as BytesRef in the id field.

gf2121

LGTM.
I can help merge this in and backport if there is no objection in 48h.

lucene/CHANGES.txt

mikemccand

Thank you @easyice -- what a nice optimization. I'll merge shortly.

mikemccand · 2023-11-09T11:01:27Z

> I can help merge this in and backport if there is no objection in 48h.

Thanks @gf2121 -- we should backport all these recent exciting FST changes in the right order as a batch to avoid scary hairy conflicts. I plan to do this in the next few days -- they have been baking in main quite well recently.

easyice · 2023-11-09T11:24:18Z

@mikemccand @gf2121 Thanks for reviewing and merging it ;-)

jpountz · 2023-11-20T11:10:16Z

@mikemccand is now a good time to backport these changes?

mikemccand · 2023-11-20T15:48:06Z

> @mikemccand is now a good time to backport these changes?

Yes I think so. I'll try to tackle this this week <-- one of the rare sentences where having the same word twice in a row is correct/grammatical.

* init * review fix and reuse duplicate code * rebase * tidy * CHANGES.txt * bump version * rebase * CHANGES.txt

easyice force-pushed the continuous_arcs branch from f09590e to bf534ee Compare November 2, 2023 10:02

mikemccand approved these changes Nov 3, 2023

easyice marked this pull request as ready for review November 3, 2023 10:56

gf2121 approved these changes Nov 3, 2023

.../core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java (outdated)

mikemccand reviewed Nov 4, 2023

easyice force-pushed the continuous_arcs branch from 9960e12 to 008ee46 Compare November 5, 2023 06:30

easyice added 7 commits November 7, 2023 10:40

init

d7528d8

review fix and reuse duplicate code

3854010

rebase

71a2d5a

tidy

8948453

CHANGES.txt

0a6261b

bump version

1ebb658

rebase

61e89ed

easyice force-pushed the continuous_arcs branch from 9de1e4d to 61e89ed Compare November 7, 2023 02:55

gf2121 approved these changes Nov 9, 2023

gf2121 reviewed Nov 9, 2023

lucene/CHANGES.txt (outdated)

easyice added 2 commits November 9, 2023 17:34

Merge branch 'main' into continuous_arcs

e3b0657

CHANGES.txt

178151f

mikemccand approved these changes Nov 9, 2023

mikemccand merged commit 570832e into apache:main Nov 9, 2023

mikemccand pushed a commit that referenced this pull request Nov 20, 2023

Specialize arc store for continuous label in FST (#12748)

b2d6736

* init * review fix and reuse duplicate code * rebase * tidy * CHANGES.txt * bump version * rebase * CHANGES.txt

mikemccand added this to the 9.9.0 milestone Nov 20, 2023
