Use value-based LRU cache in NodeHash #12738
Conversation
mikemccand left a comment
I love this approach! Thank you for tackling it so quickly!
It's wonderful that it passes all tests :)
We just need to fix the nocommits (switch to ByteBlockPool to hold the copied per-node byte[] values)!
Thanks @dungba88
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java (outdated; resolved)
        throws IOException {
      // nocommit: this is non-optimal, we should have a BytesReader that wraps and read the
      // ByteBlockPool directly
      byte[] bytes = table.getBytes(address);
I think we can change address from the "global" address (in the FST's growing append-only byte[]), to the more compact address of the ByteBlockPool belonging to each of two (primary and fallback) hash sets?
I ended up storing the relative difference between the global address and the ByteBlockPool address. It retains the existing behavior of FST operations (which always rely on the global address). See the ByteBlockPoolReverseBytesReader.
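The delta idea can be sketched like this. All names below are invented for illustration; this is not the actual ByteBlockPoolReverseBytesReader. The reader keeps the difference between the node's global FST address and its address inside the local copy, so callers keep using global positions while reads land in the copied bytes, stepping backwards because FST nodes are read in reverse:

```java
// Hedged sketch of global-to-local address translation for a reverse reader.
// `delta` = globalAddress - localAddress for the copied node.
class ReverseLocalReaderSketch {
    private final byte[] local; // local copy of one node's bytes
    private final long delta;   // difference between the two address spaces
    private long globalPos;     // current read position, in global coordinates

    ReverseLocalReaderSketch(byte[] local, long delta) {
        this.local = local;
        this.delta = delta;
    }

    void setPosition(long globalPos) {
        this.globalPos = globalPos;
    }

    byte readByte() {
        // translate the global position into the local copy, then step backwards
        byte b = local[(int) (globalPos - delta)];
        globalPos--;
        return b;
    }
}
```

For example, a node copied to local offsets 0..2 but logically living at global addresses 100..102 would be read with delta = 100, starting from global position 102.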
Ahhh that's right, the hash map must retain the true FST offset since that's what future added nodes must link to!
lucene/core/src/java/org/apache/lucene/util/fst/ReverseBytesReader.java (outdated; resolved)
I solved most of the nocommits (only 1 left). But I ended up using a [...]
The downside is that it's limited to 2 billion nodes (which I think should suffice). This can also be overcome by using a nested list with a fixed inner list size.
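The "nested list with fixed inner list size" workaround can be sketched as follows. This is an illustration of the idea, not code from this PR: a list of fixed-size pages indexed by a long, so the total element count is no longer capped by a single list's int index.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a long-indexable list of byte[] built from fixed-size inner pages.
class PagedByteArrayList {
    private static final int PAGE_BITS = 10;             // 1024 entries per page (arbitrary)
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final List<byte[][]> pages = new ArrayList<>();
    private long size;

    void add(byte[] value) {
        if ((size & PAGE_MASK) == 0) {
            pages.add(new byte[PAGE_SIZE][]); // start a new fixed-size inner page
        }
        pages.get((int) (size >>> PAGE_BITS))[(int) (size & PAGE_MASK)] = value;
        size++;
    }

    byte[] get(long index) {
        // split the long index into page number and offset within the page
        return pages.get((int) (index >>> PAGE_BITS))[(int) (index & PAGE_MASK)];
    }

    long size() {
        return size;
    }
}
```

Only the outer list is bounded by int, and it grows a thousand times slower than the element count, so the practical limit moves far past 2 billion.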
dungba88 left a comment
Added some explanation
    public PagedGrowableHash() {
      entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);
      copiedOffsets = new PagedGrowableWriter(32, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);
We are doubling the size because we need to write both the node's global address and the local copiedNodes offset (in adjacent positions). The global address is written at the even position.
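A simplified sketch of that layout, using a plain long[] in place of PagedGrowableWriter (so this illustrates the interleaving only, not the PR's code): each logical entry occupies two adjacent slots, the global address at the even index and the copied-bytes offset at the odd index.

```java
// Sketch of the doubled, interleaved layout: two longs per logical entry.
class InterleavedEntriesSketch {
    private final long[] data;

    InterleavedEntriesSketch(int capacity) {
        this.data = new long[capacity * 2]; // doubled: two slots per entry
    }

    void set(int entry, long globalAddress, long copiedOffset) {
        data[entry * 2] = globalAddress;    // even position: global FST address
        data[entry * 2 + 1] = copiedOffset; // odd position: local copied-bytes offset
    }

    long globalAddress(int entry) {
        return data[entry * 2];
    }

    long copiedOffset(int entry) {
        return data[entry * 2 + 1];
    }
}
```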
Hmm, I see: we need two long values stored per entry? One for the true FST appending byte[] offset, and another for the local pool'd copy of just the byte[] for this node? Maybe rename entries to fstNodeAddress and then copiedNodeAddress for the pool'd offsets?
Ok, it's ready for review. I'll add the CHANGES.txt entry once it's approved.
Hmm -- this is sizable added RAM overhead per entry. Added array header (16 or 20 bytes), added pointers to these arrays (4 or 8 bytes), tiny objects for GC to crawl, though maybe they mostly die young if the RAM budget is low enough. Yes, we can share some of these
We are paying 4 bytes per entry for this :) And it seems a shame since FST's node encoding is already self-delimiting, and, as a side effect of looking up that node we will already have computed that length.
I don't think this added cost is so much. We freeze the node into the true FST appending byte store, then we copy those "last N bytes" over to the primary hash. Eventually it's moved to fallback, and maybe it never gets promoted back (single copy), or maybe it does (+1 copy), but that's "worth it" since we achieve some minimization, i.e. the cost correlates nicely with the win (the whole point of this double LRU hash).
There is actually already one copy before this, which is where we read from the BytesStore into the temporary byte[]. So in case it never gets promoted it's 2 copies, and when it is, it's 4 copies (one to read from the fallback table into a temporary byte[] and one to write to the primary table). Before, when a node got promoted, we also had the side effect that the bytes could be shared between the primary and fallback tables. I see there is a tradeoff here. If we don't care too much about CPU then we can use BytesRefArray like @gf2121 suggested.
mikemccand left a comment
I like where this is going! But I sure hope we can pack each copied node's byte[] into a single ByteBlockPool instead of separate byte[] ... the added RAM overhead of the latter is high.
              PackedInts.bitsRequired(lastNodeAddress),
              PackedInts.COMPACT);
      mask = size - 1;
      offsetMask = size * 2 - 1;
Hmm -- why are entries and copiedOffsets not simply side-by-side values arrays for the hash map? Why is copiedOffsets 2X the size? It seems like we could have them precisely match (side by side value arrays for the same hash entry)?
The index into entries is the hash of the node's arcs, while the index into copiedOffsets is the hash of the address, hence their positions do not match.
      long pos = Long.hashCode(pointer) & mask;
      // find an empty slot
      while (fstNodeAddress.get(pos) != 0) {
        pos = (pos + 1) & mask;
Hmm we found linear probing to be slower than quadratic in a prior PR. But if we consolidate down to the two parallel PagedGrowableWriter we can just use the quadratic hash we already use.
Is it this PR (#12716)? That's interesting. I also got the linear probing from that PR. Lemme change it to quadratic.
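For illustration, here is a generic open-addressing probe using triangular-number ("quadratic") steps, the common alternative to the linear probe in the snippet above. This is a sketch, not the code from #12716; as in the snippet, 0 marks an empty slot. With a power-of-two table, step sizes 1, 2, 3, ... are guaranteed to visit every slot.

```java
// Sketch of quadratic (triangular-number) probing over a power-of-two table.
class QuadraticProbeSketch {
    static int findSlot(long[] table, long pointer) {
        int mask = table.length - 1; // table length must be a power of two
        int pos = (int) (Long.hashCode(pointer) & mask);
        int probe = 0;
        while (table[pos] != 0) {
            // step sizes grow by 1 each collision: pos, pos+1, pos+3, pos+6, ...
            pos = (pos + ++probe) & mask;
        }
        return pos;
    }
}
```

Compared to linear probing, the growing step breaks up the long runs of occupied slots ("primary clustering") that make linear probes slow at high load factors.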
     * (long), returning the local copiedNodes start address if the two nodes are matched, or -1
     * otherwise
     */
    private int getMatchedNodeLength(FSTCompiler.UnCompiledNode<T> node, long address)
Is this duplicating the nodesEqual code? Instead of that, could we have an instance variable that sets the length as a side effect of nodeEquals? A bit messy, but ... I think worth it?
This is actually renamed from nodesEqual (it was removed), so there is no duplication. The old behavior is essentially getMatchedNodeLength != -1
Thanks @dungba88 -- I will review! But first I tried running [...] Seems like a [...]
Yes, I just noticed that, and pushed out a fix. Seems like I was using the primary table pos instead of the fallback pos. And I added an assertion to catch it earlier. Let me also re-run the test with my local GitHub.
mikemccand left a comment
This is looking closer! I left a bunch of small comments. Thanks @dungba88!
lucene/core/src/java/org/apache/lucene/util/fst/ByteBlockPoolReverseBytesReader.java (outdated; resolved)
      // at 0:
      assert node != 0;
      assert node != FST.FINAL_END_NODE && node != FST.NON_FINAL_END_NODE;
      byte[] buf = new byte[Math.toIntExact(node - startAddress + 1)];
Hmm why the +1 here?
node is the last address of the node and startAddress is its starting address, hence we compute the length as end - start + 1 (e.g., if end == start then the length is 1).
Brain hurts! Reverse ob1 error in my brain :)
What confuses me here is: if fstCompiler.addNode writes 3 bytes, won't we compute a length of 4?
Oh, I see! FSTCompiler#addNode has this at the end:

    final long thisNodeAddress = bytes.getPosition() - 1;
    bytes.reverse(startAddress, thisNodeAddress);
    nodeCount++;
    return thisNodeAddress;

So the last byte address it returns is inclusive, since it did the -1, so we have to +1 to undo that. OK I think it makes sense ;)
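The arithmetic above can be written out as a tiny sketch (hypothetical helper names, not Lucene's code): the writer's end position is exclusive, addNode subtracts 1 to return the inclusive last-byte address, and the copy length adds the 1 back.

```java
// Sketch of the inclusive-address off-by-one discussed above.
class NodeLengthSketch {
    /** What addNode effectively returns: the address of the node's last byte, inclusive. */
    static long lastByteInclusive(long startAddress, int bytesWritten) {
        long endExclusive = startAddress + bytesWritten; // writer position after the node
        return endExclusive - 1;
    }

    /** Length of the node copied out later: end - start + 1 undoes the -1. */
    static long nodeLength(long startAddress, long lastByteInclusive) {
        return lastByteInclusive - startAddress + 1;
    }
}
```

So a 3-byte node starting at address 10 ends at inclusive address 12, and 12 - 10 + 1 recovers the 3 bytes, never 4.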
mikemccand left a comment
Very close!
    }

    @Override
    public boolean reversed() {
I wonder why FST.BytesReader even has this method? It might be a holdover (now dead?) from the pack days (long ago removed). But we should not try to fix it here ... this change is awesome enough already!
I'll open a spinoff issue for this -- it seems at quick glance to be dead/pointless code.
      fstNodeAddress.set(hashSlot, nodeAddress);
      count++;
      copiedNodes.append(bytes);
      copiedBytes += bytes.length;
Hmm couldn't we just use copiedBytes.getPosition() instead? Isn't that the same?
I didn't see this method in ByteBlockPool; maybe it got removed? It's also strange that this class does not have a method to get its size, although we might be able to compute it with bufferUpTo and byteUpTo.
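Assuming bufferUpTo counts the fully written earlier blocks and byteUpTo the bytes used in the current one (names taken from the discussion above), the position could be derived as a one-liner. This is a sketch of the computation, not the method that was eventually added.

```java
// Sketch: total written position of a pool of fixed-size blocks filled in order.
class PoolPositionSketch {
    static long position(int bufferUpTo, int byteUpTo, int blockSize) {
        // all earlier blocks are full, plus the bytes used in the current block
        return (long) bufferUpTo * blockSize + byteUpTo;
    }
}
```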
I could add that method in ByteBlockPool as well.
Huh, it was just there, in 9.x, and disappeared when I git pulled! Something must've just removed it? Found it:
LUCENE-10560: Faster merging of TermsEnum (#1052)
Maybe it was just dead code and @jpountz removed it? But let's bring it back from the dead here?
@mikemccand This change did not touch ByteBlockPool? Or did I misunderstand what you said?
> @mikemccand This change did not touch ByteBlockPool? Or did I misunderstand what you said?

Woops, sorry, you are right! Your commit did not touch ByteBlockPool. Now I am really confused about what I thought I saw early this AM. I must've been hallucinating. Maybe I am just an LLM.
Sorry for the false accusation!
> I could add that method in ByteBlockPool as well.

+1 to add it LOL.
Yeah, I added it, and found there is a potential inconsistency in block size between the Allocator and ByteBlockPool; added a TODO to fix that later.
      copiedBytes += bytes.length;
      // write the offset, which points to the last byte of the node we copied since we later read
      // this node in reverse
      copiedNodeAddress.set(hashSlot, copiedBytes - 1);
Can we assert this slot is 0 before we set it?
Thanks @dungba88! I confirmed that [...] It's a bit slower than [...]
This PR: [...]
  public abstract static class Allocator {
    // TODO: ByteBlockPool assume the blockSize is always {@link BYTE_BLOCK_SIZE}, but this class
    // allow arbitrary value of blockSize. We should make them consistent.
    protected final int blockSize;
Hmm do we have any if or assert that confirms Allocator's blockSize == ByteBlockPool.BYTE_BLOCK_SIZE when passed to ByteBlockPool?
We don't have one, and this Allocator seems to be used only by ByteBlockPool, so maybe we don't need it to support a custom block size?
      totalBytes += size;

      // make sure we report the correct position
      assertEquals(totalBytes, pool.getPosition());
Thanks!
mikemccand left a comment
This change looks great to me! I think it's ready! I'll merge today.
We'll let this bake in main for a few days, and given that the prior big change (capping RAM used by NodeHash) seems to have baked successfully on main as well, let's backport these recent FST improvements next week?
I merged to main, thank you @dungba88 for the fast iterations! I could barely keep up just reviewing :) After all this FST dust settles, let's remember to add your CHANGES.txt entry summarizing all the progress with capping RAM usage of FSTs. I think we can make one entry (something like "FST Compiler can now build arbitrarily large FSTs with capped/controllable RAM usage"), linking to the N GitHub issues/PRs it took to accomplish. Maybe after we backport to 9.x?
Thank you @mikemccand! Agree we should have a single CHANGES entry summarizing all the different PRs.
* Use value-based LRU cache in NodeHash (#12714)
* tidy code
* Add a nocommit about OffsetAndLength
* Fix the readBytes method
* Use List<byte[]> instead of ByteBlockPool
* Move nodesEqual to PagedGrowableHash
* Add generic type
* Fix the count variable
* Fix the RAM usage measurement
* Use PagedGrowableWriter instead of HashMap
* Remove unused generic type
* Update the ramBytesUsed formula
* Retain the FSTCompiler.addNode signature
* Switch back to ByteBlockPool
* Remove the unnecessary assertion
* Remove fstHashAddress
* Add some javadoc
* Fix the address offset when reading from fallback table
* tidy code
* Address comments
* Add assertions
Description
Fix #12714
First attempt to introduce value-based LRU cache in NodeHash. There are some inefficiencies, but the functionalities work.
See this comment for the main idea