Specialize arc store for continuous label in FST #12748
Conversation
This looks cool! Sorry, I caused conflicts with the earlier merge -- could you please resolve those @easyice? I'm happy to try benchmarking it, using the new
Force-pushed from f09590e to bf534ee
@mikemccand Thanks for your quick reply! The conflicts have been resolved; any comments are welcome!
I ran this with wikimedium10m and wikimediumall; no significant performance improvement or regression was found. The total size of the .tip files was slightly reduced.
It seems the percentage of nodes hitting this optimization is small for that corpus, but the optimization helps when arc labels are dense, so I generated 10 million random long values as terms: with that data the optimization is hit in most cases.
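The "continuous" case applies when a node's outgoing arc labels form a gap-free range. A minimal standalone sketch of that check (the method name and array layout are illustrative, not Lucene's internals):

```java
public class ContinuousLabelCheck {
    // Returns true when sorted, distinct labels form a gap-free range,
    // i.e. labelRange == numArcs. In that case only the first label and
    // the arc count need to be stored, not every label.
    static boolean isContinuous(int[] sortedLabels) {
        int numArcs = sortedLabels.length;
        int labelRange = sortedLabels[numArcs - 1] - sortedLabels[0] + 1;
        return labelRange == numArcs;
    }

    public static void main(String[] args) {
        System.out.println(isContinuous(new int[] {3, 4, 5, 6})); // true: gap-free
        System.out.println(isContinuous(new int[] {3, 4, 6, 7})); // false: gap at 5
    }
}
```

With uniformly random long terms, sibling arcs at the deeper FST levels tend to cover dense byte ranges, which is why this condition is hit so often in the random-longs benchmark.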
I tested this PR using
PR: FST size is a wee bit smaller (~0.1%), and curiously the construction time seems to be faster too.
I'll run
mikemccand
left a comment
This looks good to me! Can we remove the DRAFT status or is there something still missing? Thanks @easyice!
```java
static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;
```
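For context, the sum produces a third marker byte distinct from both existing encodings. A standalone sketch, assuming `ARCS_FOR_BINARY_SEARCH` is `1 << 5` as in Lucene's `FST` (that value is an assumption here, inferred from the surrounding diff):

```java
public class ArcEncodingFlags {
    // Stand-ins for Lucene's node-encoding marker bytes. The value of
    // ARCS_FOR_BINARY_SEARCH (1 << 5) is an assumption for illustration.
    static final byte ARCS_FOR_BINARY_SEARCH = 1 << 5;      // 32
    static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;  // 64

    // The sum (96) cannot collide with either existing marker, so a reader
    // inspecting the node header can distinguish all three arc encodings.
    static final byte ARCS_FOR_CONTINUOUS =
        ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;

    public static void main(String[] args) {
        System.out.println(ARCS_FOR_CONTINUOUS); // 96
    }
}
```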
Could you add a comment explaining this arc optimization case?
+1, it is important.
```java
int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
assert labelRange > 0;
if (shouldExpandNodeWithDirectAddressing(
boolean continuousLable = labelRange == nodeIn.numArcs;
```
Hmm can we please use the continuousLabel spelling instead :) To be consistent...
sorry for the typo :)
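A rough sense of why the continuous case saves space over direct addressing: direct addressing keeps a presence bit per slot in the label range, while the continuous encoding needs only the first label and the arc count. A simplified sizing sketch (this is an illustration, not Lucene's exact on-disk layout):

```java
public class ArcStoreSizing {
    // Direct addressing: one presence bit per slot in [firstLabel, lastLabel],
    // rounded up to whole bytes. (Simplified model of the bit table cost.)
    static int directAddressingBitTableBytes(int labelRange) {
        return (labelRange + 7) / 8;
    }

    public static void main(String[] args) {
        int labelRange = 256; // a node whose arcs cover a full byte range
        System.out.println(directAddressingBitTableBytes(labelRange)); // 32
        // Continuous encoding: labels are implied by firstLabel + arc index,
        // so the per-node bit table is not needed at all.
    }
}
```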
gf2121
left a comment
This looks great. Thanks for working on this @easyice !
Maybe we can add an abstraction layer for all these arc strategies in the future to simplify code and make it easier to add strategies :)
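One possible shape for such an abstraction, sketched very loosely (every name here is invented for illustration; this is not Lucene's actual API):

```java
public class ArcStrategyDemo {
    // Hypothetical strategy interface: each arc-storage scheme decides
    // whether it applies to a node, and the writer picks the first match.
    interface ArcStoreStrategy {
        boolean supports(int numArcs, int labelRange);
        String name();
    }

    static final ArcStoreStrategy CONTINUOUS = new ArcStoreStrategy() {
        public boolean supports(int numArcs, int labelRange) {
            return labelRange == numArcs; // labels form a gap-free range
        }
        public String name() { return "continuous"; }
    };

    static final ArcStoreStrategy BINARY_SEARCH = new ArcStoreStrategy() {
        public boolean supports(int numArcs, int labelRange) {
            return true; // always-available fallback
        }
        public String name() { return "binarySearch"; }
    };

    // Try strategies in priority order; the fallback always matches.
    static ArcStoreStrategy pick(int numArcs, int labelRange) {
        for (ArcStoreStrategy s : new ArcStoreStrategy[] {CONTINUOUS, BINARY_SEARCH}) {
            if (s.supports(numArcs, labelRange)) return s;
        }
        throw new AssertionError("unreachable: fallback always matches");
    }

    public static void main(String[] args) {
        System.out.println(pick(4, 4).name());  // continuous
        System.out.println(pick(4, 10).name()); // binarySearch
    }
}
```

Centralizing the selection like this would keep the per-strategy eligibility rules next to their encodings, rather than spread through the writer.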
.../core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java
@mikemccand Thanks for the benchmarking. I also wrote 10 million docs of random long values, then ran queries against them. The .tip file size was reduced by ~2%, and the query latency by ~7%.
Crude benchmark code:
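The linked benchmark code is not reproduced above; as a stand-in, a crude sketch of just the data-generation step (method and class names are invented, and the actual indexing/search code is omitted):

```java
import java.util.Random;

public class RandomLongTerms {
    // Generate `count` random long values rendered as term strings,
    // with a fixed seed for reproducibility. The benchmark indexed
    // 10 million of these as documents.
    static String[] generateTerms(long seed, int count) {
        Random random = new Random(seed);
        String[] terms = new String[count];
        for (int i = 0; i < count; i++) {
            terms[i] = Long.toString(random.nextLong());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Print a small sample (the real run used 10 million values).
        for (String term : generateTerms(42L, 5)) {
            System.out.println(term);
        }
    }
}
```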
```java
// Increment version to change it
private static final String FILE_FORMAT_NAME = "FST";
private static final int VERSION_START = 6;
```
Hmm shouldn't we bump the VERSION_CURRENT in FST with this change too?
I think so; we expect to throw IndexFormatTooNewException in CodecUtil.checkIndexHeader when old-version code reads the new index format. So should we also bump the VERSION_CURRENT in Lucene90BlockTreeTermsReader?
I am not sure: if the change is backward compatible, maybe we can keep VERSION_CURRENT as-is? Could you please give me some suggestions? Thank you!
Hmm I'd like to protect against an older version of Lucene (w/o this change) trying to read an FST written with a newer version (with this change). If we bump the version, that older version would throw an understandable error, but if we don't, it'd be some strange assertion error or so?
I realize it'd be hard to even reach such a situation (you'd have to be using FSTs directly or so), but still when we make such changes to our format I think it's good practice to bump the version.
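The failure mode being protected against can be sketched in isolation. In this simplified stand-in (the constants and exception type are placeholders, not Lucene's CodecUtil), an older reader whose maximum understood version is below the header's version fails with a clear "too new" error instead of misreading the new arc encoding:

```java
public class FormatVersionCheck {
    static final int VERSION_START = 6;   // oldest readable format
    static final int VERSION_CURRENT = 8; // placeholder: bumped by a format change

    // Simplified header-version check: reject formats outside the
    // [VERSION_START, readerMaxVersion] window with an explicit error.
    static void checkHeaderVersion(int headerVersion, int readerMaxVersion) {
        if (headerVersion < VERSION_START) {
            throw new IllegalStateException("format too old: " + headerVersion);
        }
        if (headerVersion > readerMaxVersion) {
            throw new IllegalStateException(
                "format too new: " + headerVersion + " > " + readerMaxVersion);
        }
    }

    public static void main(String[] args) {
        checkHeaderVersion(VERSION_CURRENT, VERSION_CURRENT); // same version: ok
        try {
            // An older reader (max version 7) opening a newer file (version 8):
            checkHeaderVersion(VERSION_CURRENT, 7);
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage()); // format too new: 8 > 7
        }
    }
}
```

Without the bump, both files would carry the same version number and the older reader would hit the unknown ARCS_FOR_CONTINUOUS marker with only an obscure assertion or decode error.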
> I think so; we expect to throw IndexFormatTooNewException in CodecUtil.checkIndexHeader when old-version code reads the new index format. So should we also bump the VERSION_CURRENT in Lucene90BlockTreeTermsReader?
+1 to also bump the version in Lucene90BlockTreeTermsReader.
Thank you very much for your guidance! It's very helpful!
Force-pushed from 9960e12 to 008ee46
This might be a nice bump in
Force-pushed from 9de1e4d to 61e89ed
gf2121
left a comment
LGTM.
I can help merge this in and backport if there is no objection in 48h.
mikemccand
left a comment
Thank you @easyice -- what a nice optimization. I'll merge shortly.
Thanks @gf2121 -- we should backport all these recent exciting FST changes in the right order, as a batch, to avoid scary hairy conflicts. I plan to do this in the next few days -- they have been baking in
@mikemccand @gf2121 Thanks for the review and merge ;-)
@mikemccand Is now a good time to backport these changes?
Yes I think so. I'll try to tackle this this week <-- one of the rare sentences where having the same word twice in a row is correct/grammatical.
Squashed commits:
* init
* review fix and reuse duplicate code
* rebase
* tidy
* CHANGES.txt
* bump version
* rebase
* CHANGES.txt
This PR resolves issue #12701. Thanks for the cool idea from @gf2121!
It needs some more benchmarking.