@easyice
Contributor

@easyice easyice commented Nov 2, 2023

This PR resolves issue #12701. Thanks to @gf2121 for the cool idea!

It needs some more benchmarking.

@mikemccand
Member

This looks cool! Sorry, I caused conflicts w/ the earlier merge -- could you please resolve those @easyice? I'm happy to try benchmarking it, using the new IndexToFST tool in luceneutil :)

@easyice
Contributor Author

easyice commented Nov 2, 2023

@mikemccand Thanks for your quick reply! The conflicts have been resolved; any comments are welcome!

@easyice
Contributor Author

easyice commented Nov 3, 2023

I ran this with wikimedium10m and wikimediumall and found no significant performance improvement or regression. The total size of the tip files is slightly reduced:

| dataset | baseline | candidate |
|---|---|---|
| wikimedium10m | 10280673 | 10275716 |
| wikimediumall | 28530090 | 28496270 |

I counted the different nodeFlags for wikimedium10m:

| strategy | count | percent |
|---|---|---|
| ARCS_FOR_DIRECT_ADDRESSING | 558555 | 50.23% |
| ARCS_FOR_CONTINUOUS | 25215 | 2.26% |
| ARCS_FOR_BINARY_SEARCH | 9 | 0.00% |
| Linear search (bytesPerArc:0) | 528100 | 47.49% |

Only a small percentage of nodes hit this optimization here, since it depends on how dense the arc labels are, so I generated 10 million random long values as terms:

for (int i = 0; i < 10_000_000; i++) {
    Document doc = new Document();
    doc.add(new StringField("f1", String.valueOf(rand.nextLong()), Store.NO));
    indexWriter.addDocument(doc);
}

With this data, the optimization is hit in most cases:

| strategy | count | percent |
|---|---|---|
| ARCS_FOR_DIRECT_ADDRESSING | 2469 | 2.58% |
| ARCS_FOR_CONTINUOUS | 78732 | 82.45% |
| ARCS_FOR_BINARY_SEARCH | 0 | 0.00% |
| Linear search (bytesPerArc:0) | 14280 | 14.95% |
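
For readers who want to re-derive the percent columns in the two strategy tables, here is a minimal sketch. The class and method names are made up for illustration, and truncating rather than rounding to two decimals is an assumption inferred from the reported figures (for example 25215 out of 1111879 nodes is listed as 2.26%).

```java
final class NodeStrategyShare {
  /** Returns count/total as a percentage with two decimals, truncated rather than rounded. */
  static String truncatedPercent(long count, long total) {
    // Work in hundredths of a percent; integer division drops the remainder.
    long hundredths = count * 10_000 / total;
    return String.format("%d.%02d%%", hundredths / 100, hundredths % 100);
  }

  public static void main(String[] args) {
    // Node counts for the random-long index, copied from the table above.
    String[] names = {
      "ARCS_FOR_DIRECT_ADDRESSING", "ARCS_FOR_CONTINUOUS", "ARCS_FOR_BINARY_SEARCH", "Linear search"
    };
    long[] counts = {2469, 78732, 0, 14280};
    long total = 0;
    for (long c : counts) {
      total += c;
    }
    for (int i = 0; i < counts.length; i++) {
      System.out.println(names[i] + " " + truncatedPercent(counts[i], total));
    }
  }
}
```

Because each share is truncated, the four percentages in a table can add up to slightly less than 100%.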

@mikemccand
Member

I tested this PR using IndexToFST from luceneutil. This just tests construction time and final FST size, on all wikimediumall unique terms, allowing up to 64 MB RAM while building the FST:

main:

  saved FST to "fst.bin": 382070800 bytes; 44.146 sec
  saved FST to "fst.bin": 382070800 bytes; 43.478 sec

PR:

  saved FST to "fst.bin": 381705016 bytes; 42.616 sec
  saved FST to "fst.bin": 381705016 bytes; 42.832 sec

FST size is a wee bit smaller (~0.1%), and curiously the construction time seems to be faster too.

@mikemccand
Member

I'll run Test2BFST too ... takes a few hours!

Member

@mikemccand mikemccand left a comment

This looks good to me! Can we remove the DRAFT status or is there something still missing? Thanks @easyice!

*/
static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;
Member

Could you add a comment explaining this arc optimization case?

Contributor Author

+1, it is important.
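
As a rough illustration of the constant under review: the new flag is defined as the sum of the two existing array-strategy flags, so it is a third value distinct from either. In the sketch below only `ARCS_FOR_DIRECT_ADDRESSING = 1 << 6` comes from the diff shown in this thread. The value used for `ARCS_FOR_BINARY_SEARCH` is an assumption, and the class and the `strategy` helper are hypothetical names, not the FST reader's actual dispatch code.

```java
final class ArcFlagsSketch {
  // Taken from the diff in this thread.
  static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

  // ASSUMPTION: this value is not shown in the thread; 1 << 5 is used for illustration only.
  static final byte ARCS_FOR_BINARY_SEARCH = 1 << 5;

  // Same definition as in the diff: both bits set marks the continuous-label strategy.
  static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;

  /** Hypothetical helper: maps a node's leading flag byte to the strategy names used in the tables. */
  static String strategy(byte nodeFlags) {
    if (nodeFlags == ARCS_FOR_CONTINUOUS) {
      return "ARCS_FOR_CONTINUOUS";
    } else if (nodeFlags == ARCS_FOR_DIRECT_ADDRESSING) {
      return "ARCS_FOR_DIRECT_ADDRESSING";
    } else if (nodeFlags == ARCS_FOR_BINARY_SEARCH) {
      return "ARCS_FOR_BINARY_SEARCH";
    }
    return "Linear search";
  }

  public static void main(String[] args) {
    System.out.println(strategy(ARCS_FOR_CONTINUOUS));
  }
}
```

Under these assumed values the three flags never collide, which is what lets a reader tell the strategies apart from a single byte.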

int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
assert labelRange > 0;
if (shouldExpandNodeWithDirectAddressing(
boolean continuousLable = labelRange == nodeIn.numArcs;
Member

Hmm can we please use the continuousLabel spelling instead :) To be consistent...

Contributor Author

Sorry for the typo :)
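
The check in the snippet above (`labelRange == nodeIn.numArcs`) can be sketched in isolation. This is a simplified illustration on a plain sorted array of labels, not the FST compiler's code; the class and method names are hypothetical. When the labels have no gaps, an arc's position follows directly from its label, so no search is needed.

```java
final class ContinuousArcsSketch {
  /** True when the sorted, distinct, non-empty labels leave no gaps, i.e. labelRange == numArcs. */
  static boolean isContinuous(int[] sortedLabels) {
    int labelRange = sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    return labelRange == sortedLabels.length;
  }

  /** For continuous labels only: returns the arc index for a label, or -1 if it is out of range. */
  static int arcIndex(int[] sortedLabels, int label) {
    int offset = label - sortedLabels[0];
    return (offset >= 0 && offset < sortedLabels.length) ? offset : -1;
  }

  public static void main(String[] args) {
    // A node whose outgoing arcs cover every decimal digit, as can happen with numeric terms.
    int[] digits = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
    System.out.println(isContinuous(digits));
    System.out.println(arcIndex(digits, '7'));
  }
}
```

Terms made of decimal digits draw their labels from ten consecutive byte values, which would explain why the random-long index hits this case so often.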

@easyice easyice marked this pull request as ready for review November 3, 2023 10:56
@mikemccand
Member

Test2BFSTs is happy:

BUILD SUCCESSFUL in 50m 6s

Contributor

@gf2121 gf2121 left a comment

This looks great. Thanks for working on this @easyice !

Maybe we can add an abstraction layer for all these arc strategies in the future to simplify code and make it easier to add strategies :)

@easyice
Contributor Author

easyice commented Nov 3, 2023

@mikemccand Thanks for the benchmarking! I also wrote 10 million docs of random long values and then used TermInSetQuery for benchmarking. Here are the results:

The tip file size was reduced by ~2%:

| | size |
|---|---|
| main | 1807149 |
| PR | 1770259 |

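Query latency dropped by up to ~7% (roughly 5% on average across the rows below). termsCount is the number of terms in the TermInSetQuery, and hitRatio is the percentage of those terms that exist in the index. diff is tookMs(PR) as a percentage of tookMs(main). There is a bit of variance across runs, but the results look good overall.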

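| hitRatio | termsCount | tookMs(main) | tookMs(PR) | diff |
|---|---|---|---|---|
| 1% | 64 | 177 | 164 | 92.66% |
| 1% | 512 | 1380 | 1312 | 95.07% |
| 1% | 2048 | 5225 | 5022 | 96.11% |
| 25% | 64 | 222 | 212 | 95.50% |
| 25% | 512 | 1462 | 1391 | 95.14% |
| 25% | 2048 | 5602 | 5533 | 98.77% |
| 50% | 64 | 216 | 204 | 94.44% |
| 50% | 512 | 1600 | 1513 | 94.56% |
| 50% | 2048 | 6193 | 5883 | 94.99% |
| 75% | 64 | 224 | 213 | 95.09% |
| 75% | 512 | 1702 | 1598 | 93.89% |
| 75% | 2048 | 6565 | 6289 | 95.80% |
| 100% | 64 | 233 | 218 | 93.56% |
| 100% | 512 | 1752 | 1736 | 99.09% |
| 100% | 2048 | 7057 | 6621 | 93.82% |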
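
The diff column can be recomputed from the two timing columns. The sketch below copies the tookMs values from the table and takes the unweighted mean of the per-row ratios; the class and method names are made up for illustration, and an unweighted mean is only one reasonable way to summarize the rows.

```java
final class LatencyRatio {
  // tookMs values copied from the table above, in row order.
  static final long[] MAIN_MS = {
    177, 1380, 5225, 222, 1462, 5602, 216, 1600, 6193, 224, 1702, 6565, 233, 1752, 7057
  };
  static final long[] PR_MS = {
    164, 1312, 5022, 212, 1391, 5533, 204, 1513, 5883, 213, 1598, 6289, 218, 1736, 6621
  };

  /** PR time as a percentage of main time; values below 100 mean the PR is faster. */
  static double ratioPercent(long mainMs, long prMs) {
    return 100.0 * prMs / mainMs;
  }

  /** Unweighted mean of the per-row ratios. */
  static double meanRatioPercent(long[] mainMs, long[] prMs) {
    double sum = 0;
    for (int i = 0; i < mainMs.length; i++) {
      sum += ratioPercent(mainMs[i], prMs[i]);
    }
    return sum / mainMs.length;
  }

  public static void main(String[] args) {
    System.out.printf("mean PR/main: %.2f%%%n", meanRatioPercent(MAIN_MS, PR_MS));
  }
}
```

On these numbers the mean ratio comes out at about 95.2%, and the best single row (1% hitRatio, 64 terms) is about 92.7%.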

Crude benchmark code:

    // Runs the same TermInSetQuery 1000 times with query caching disabled
    // and returns the total elapsed time in milliseconds.
    public static long doSearch(int termCount, int hitRatio) throws IOException {
        // try-with-resources closes the reader and directory even if a query throws.
        try (Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/longdata"));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(indexReader);
            // Never cache, so no iteration is served from the query cache.
            searcher.setQueryCachingPolicy(
                    new QueryCachingPolicy() {
                        @Override
                        public void onUse(Query query) {
                        }

                        @Override
                        public boolean shouldCache(Query query) throws IOException {
                            return false;
                        }
                    });

            long total = 0;
            Query query = getQuery(termCount, hitRatio);
            for (int i = 0; i < 1000; i++) {
                long start = System.currentTimeMillis();
                doQuery(searcher, query);
                long end = System.currentTimeMillis();
                total += end - start;
            }
            return total;
        }
    }

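    // Builds a TermInSetQuery in which hitRatio percent of the terms exist in the index.
    // longs, uniqueLongs, RANDOM and FIELD are defined elsewhere in the benchmark (not shown).
    private static Query getQuery(int termCount, int hitRatio) {
        int hitCount = termCount * hitRatio / 100;
        int notHitCount = termCount - hitCount;
        List<BytesRef> terms = new ArrayList<>();
        for (int i = 0; i < hitCount; i++) {
            // nextInt's bound is exclusive, so longs.size() covers every index.
            terms.add(new BytesRef(Long.toString(longs.get(RANDOM.nextInt(longs.size())))));
        }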

        Random r = new Random();
        for (int i = 0; i < notHitCount; i++) {
            long v = r.nextLong();
            while (uniqueLongs.contains(v)) {
                v = r.nextLong();
            }
            terms.add(new BytesRef(Long.toString(v)));
        }
        return new TermInSetQuery(FIELD, terms);
    }

    private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
        searcher.search(
                query,
                new Collector() {
                    @Override
                    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
                        return new LeafCollector() {
                            @Override
                            public void setScorer(Scorable scorer) throws IOException {
                            }

                            @Override
                            public void collect(int doc) throws IOException {
                                // Stop collecting this segment at the first matching doc.
                                throw new CollectionTerminatedException();
                            }
                        };
                    }

                    @Override
                    public ScoreMode scoreMode() {
                        return ScoreMode.COMPLETE_NO_SCORES;
                    }
                });
    }

@mikemccand
Member

Hello @easyice, I'm sorry but I just merged #12738 which caused conflicts here ... could you please rebase and resolve conflicts? I think this change is ready except for that. Thank you!!


// Increment version to change it
private static final String FILE_FORMAT_NAME = "FST";
private static final int VERSION_START = 6;
Member

Hmm shouldn't we bump the VERSION_CURRENT in FST with this change too?

Contributor Author

I think so. We expect CodecUtil.checkIndexHeader to throw IndexFormatTooNewException when older code reads the new index format, so should we also bump VERSION_CURRENT in Lucene90BlockTreeTermsReader?

Contributor Author

I am not sure: if the change is backward compatible, maybe we can keep VERSION_CURRENT as it is? Could you please give me some suggestions? Thank you!

Member

Hmm I'd like to protect against an older version of Lucene (w/o this change) trying to read an FST written with a newer version (with this change). If we bump the version, that older version would throw an understandable error, but if we don't, it'd be some strange assertion error or so?

I realize it'd be hard to even reach such a situation (you'd have to be using FSTs directly or so), but still when we make such changes to our format I think it's good practice to bump the version.

Member

> I think so. We expect CodecUtil.checkIndexHeader to throw IndexFormatTooNewException when older code reads the new index format, so should we also bump VERSION_CURRENT in Lucene90BlockTreeTermsReader?

+1 to also bump the version in Lucene90BlockTreeTermsReader.

Contributor Author

Thank you very much for your guidance! It's very helpful!
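
To make the argument in this thread concrete, here is a minimal, library-free sketch of a header version check. It is not Lucene's `CodecUtil` code: the class, the exception types and the version numbers 6 and 7 are hypothetical, chosen only to show why a bumped version lets an older reader fail with a clear error instead of misreading the data.

```java
final class FormatVersionCheck {
  /** Hypothetical stand-in for a "written by a newer release" error. */
  static final class FormatTooNewException extends RuntimeException {
    FormatTooNewException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for a "written by a release that is no longer supported" error. */
  static final class FormatTooOldException extends RuntimeException {
    FormatTooOldException(String message) {
      super(message);
    }
  }

  /** Returns the version if this reader supports it; otherwise throws a descriptive error. */
  static int checkVersion(int actual, int minSupported, int maxSupported) {
    if (actual < minSupported) {
      throw new FormatTooOldException("version " + actual + " is older than " + minSupported);
    }
    if (actual > maxSupported) {
      throw new FormatTooNewException("version " + actual + " is newer than " + maxSupported);
    }
    return actual;
  }

  public static void main(String[] args) {
    // A reader that knows the bumped version (7 here, purely illustrative) accepts the file.
    System.out.println(checkVersion(7, 6, 7));
    // A reader that predates the bump rejects the same file with a clear message.
    try {
      checkVersion(7, 6, 6);
    } catch (FormatTooNewException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Without the bump, both readers would see the same version number and the older one would go on to interpret a node layout it does not know about.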

@mikemccand
Member

This might be a nice bump in PKLookup performance in the nightly benchmarks -- it uses compact integers encoded as BytesRef in the id field.

Contributor

@gf2121 gf2121 left a comment

LGTM.
I can help merge this in and backport if there is no objection in 48h.

Member

@mikemccand mikemccand left a comment

Thank you @easyice -- what a nice optimization. I'll merge shortly.

@mikemccand
Member

> I can help merge this in and backport if there is no objection in 48h.

Thanks @gf2121 -- we should backport all these recent exciting FST changes in the right order as a batch to avoid scary hairy conflicts. I plan to do this in the next few days -- they have been baking in main quite well recently.

@mikemccand mikemccand merged commit 570832e into apache:main Nov 9, 2023
@easyice
Contributor Author

easyice commented Nov 9, 2023

@mikemccand @gf2121 Thanks for reviewing and merging it ;-)

@jpountz
Contributor

jpountz commented Nov 20, 2023

@mikemccand is now a good time to backport these changes?

@mikemccand
Member

> @mikemccand is now a good time to backport these changes?

Yes I think so. I'll try to tackle this this week <-- one of the rare sentences where having the same word twice in a row is correct/grammatical.

mikemccand pushed a commit that referenced this pull request Nov 20, 2023
* init

* review fix and reuse duplicate code

* rebase

* tidy

* CHANGES.txt

* bump version

* rebase

* CHANGES.txt
@mikemccand mikemccand added this to the 9.9.0 milestone Nov 20, 2023