
Conversation

@dungba88 (Contributor) commented Oct 31, 2023

Description

Fix #12714

First attempt to introduce a value-based LRU cache in NodeHash. There are some inefficiencies, but the functionality works.

See this comment for the main idea

> I think we can use ByteBlockPool to store the byte[] slices, just appending a new byte[] slice when we store a new suffix. We never delete individual suffixes, but rather discard the entire "secondary" hash map in the double barrel cache, so we could just drop/recycle the ByteBlockPool at that point too.
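
For illustration, here is a minimal sketch of that idea (hypothetical names, with a plain ByteArrayOutputStream standing in for ByteBlockPool; not the actual NodeHash code): each generation of the double-barrel cache owns its own pool, so discarding the fallback generation frees all of its copied suffixes at once.

  import java.io.ByteArrayOutputStream;
  import java.util.HashMap;
  import java.util.Map;

  class DoubleBarrelNodeCacheSketch {
    // node hash -> offset of the copied suffix inside that generation's pool
    private Map<Long, Integer> primary = new HashMap<>();
    private Map<Long, Integer> fallback = new HashMap<>();
    private ByteArrayOutputStream primaryPool = new ByteArrayOutputStream();
    private ByteArrayOutputStream fallbackPool = new ByteArrayOutputStream();
    private final int maxPrimaryEntries;

    DoubleBarrelNodeCacheSketch(int maxPrimaryEntries) {
      this.maxPrimaryEntries = maxPrimaryEntries;
    }

    void put(long nodeHash, byte[] nodeBytes) {
      if (primary.size() >= maxPrimaryEntries) {
        // individual suffixes are never deleted; the old fallback map and its
        // entire pool are dropped (or recycled) in one shot
        fallback = primary;
        fallbackPool = primaryPool;
        primary = new HashMap<>();
        primaryPool = new ByteArrayOutputStream();
      }
      primary.put(nodeHash, primaryPool.size());          // remember where this suffix starts
      primaryPool.write(nodeBytes, 0, nodeBytes.length);  // append the copied suffix
    }
  }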

@dungba88 marked this pull request as draft on October 31, 2023 07:48
@mikemccand (Member) left a comment

I love this approach! Thank you for tackling it so quickly!

It's wonderful that it passes all tests :)

We just need to fix the nocommits (switch to ByteBlockPool to hold the copied per-node byte[] values)!

Thanks @dungba88

throws IOException {
// nocommit: this is non-optimal, we should have a BytesReader that wraps and read the
// ByteBlockPool directly
byte[] bytes = table.getBytes(address);

Member:

I think we can change address from the "global" address (in the FST's growing append-only byte[]), to the more compact address of the ByteBlockPool belonging to each of two (primary and fallback) hash sets?

Contributor (author):

I ended up storing the relative difference between the global address and the ByteBlockPool address. It retains the existing behavior of FST operations (which always rely on the global address). See the ByteBlockPoolReverseBytesReader.

Member:

Ahhh that's right, the hash map must retain the true FST offset since that's what future added nodes must link to!
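
To illustrate that translation (a toy sketch with made-up names, not the PR's ByteBlockPoolReverseBytesReader): the FST code keeps asking for bytes by their global address, and the reader subtracts a per-node delta to land on the locally copied bytes, walking backwards because FST nodes are stored reversed.

  class ReverseCopiedNodeReaderSketch {
    private final byte[] copiedNode;  // this node's bytes as copied into the hash's local storage
    private final long delta;         // global end address minus local end offset for this node
    private long pos;                 // current position in the global FST address space

    ReverseCopiedNodeReaderSketch(byte[] copiedNode, long delta) {
      this.copiedNode = copiedNode;
      this.delta = delta;
    }

    void setPosition(long globalPos) {
      this.pos = globalPos;
    }

    byte readByte() {
      // translate the global address to a local index, then step backwards
      byte b = copiedNode[(int) (pos - delta)];
      pos--;
      return b;
    }
  }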

@dungba88 (author) commented Oct 31, 2023

I solved most of the nocommits (only 1 left). But I ended up using a List<byte[]> where each item is a node instead of ByteBlockPool due to the following reasons:

  • With ByteBlockPool we have 2 unavoidable double byte-copies: (1) when writing from the BytesStore to the primary table and (2) when promoting an entry from the fallback table to the primary table. In both situations we need to first write into a temporary byte[].
  • Some additional side benefits:
    • The fallback and primary tables can share some of the copied nodes (when a node is promoted), thus reducing memory usage
    • We automatically get the length of the node without any traversing
    • When creating the BytesReader for FST operations, we just use the byte[] as is and pass it to the ReverseBytesReader, no copy needed here

The downside is that it's limited to 2 billion nodes (which I think should suffice). This can also be overcome by using a nested list with a fixed inner list size.
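
A minimal sketch of that List<byte[]> layout (hypothetical names, not the PR code): each node is its own array, so promotion can reuse the same reference and the node length comes for free.

  import java.util.ArrayList;
  import java.util.List;

  class ListBackedNodeStoreSketch {
    private final List<byte[]> nodes = new ArrayList<>();  // int index => ~2 billion node limit

    // store a copied node and return its index
    int add(byte[] nodeBytes) {
      nodes.add(nodeBytes);
      return nodes.size() - 1;
    }

    // the primary and fallback tables share the same byte[]; no copy needed on promotion
    int promoteFrom(ListBackedNodeStoreSketch fallback, int fallbackIndex) {
      return add(fallback.nodes.get(fallbackIndex));
    }

    // known without traversing the node's arcs
    int nodeLength(int index) {
      return nodes.get(index).length;
    }
  }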

@dungba88 (author) left a comment

Added some explanation


public PagedGrowableHash() {
entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);
copiedOffsets = new PagedGrowableWriter(32, BLOCK_SIZE_BYTES, 8, PackedInts.COMPACT);

Contributor (author):

We are doubling the size because we need to write both the node's global address and the local copiedNodes offset (in adjacent positions). The global address is written at the even position.
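
As an illustration of that layout (class and method names are assumptions; only the PagedGrowableWriter constructor mirrors the diff above): hash slot i keeps the global FST address at index 2*i and the local copiedNodes offset at index 2*i+1, which is why the writer holds twice as many values as there are slots.

  import org.apache.lucene.util.packed.PackedInts;
  import org.apache.lucene.util.packed.PagedGrowableWriter;

  class InterleavedOffsetsSketch {
    private final PagedGrowableWriter copiedOffsets;

    InterleavedOffsetsSketch(long numSlots, int blockSizeBytes) {
      // twice as many values as hash slots, hence the doubled initial size in the diff
      copiedOffsets = new PagedGrowableWriter(2 * numSlots, blockSizeBytes, 8, PackedInts.COMPACT);
    }

    void store(long hashSlot, long globalAddress, long localOffset) {
      copiedOffsets.set(2 * hashSlot, globalAddress);    // even position: global FST address
      copiedOffsets.set(2 * hashSlot + 1, localOffset);  // odd position: offset into copiedNodes
    }
  }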

Member:

Hmm, I see: we need two long values stored per entry? One for the true FST appending byte[] offset, and another for the local pool'd copy of just the byte[] for this node? Maybe rename entries to fstNodeAddress and then copiedNodeAddress for the pool'd offsets?

@dungba88 marked this pull request as ready for review on November 1, 2023 04:52
@dungba88 (author) commented Nov 1, 2023

Ok, it's ready for review. I'll add the CHANGES.txt entry once it's approved.

@mikemccand (Member) commented:

> But I ended up using a List<byte[]> where each item is a node instead of ByteBlockPool due to the following reasons:

Hmm -- this is sizable added RAM overhead per entry. Added array header (16 or 20 bytes), added pointers to these arrays (4 or 8 bytes), tiny objects for GC to crawl, though maybe they mostly die young if the RAM budget is low enough. Yes, we can share some of these byte[] with both hashes but I suspect that's not a net/net win.

> We automatically get the length of the node without any traversing

We are paying 4 bytes per entry for this :) And it seems a shame since FST's node encoding is already self-delimiting, and, as a side effect of looking up that node we will already have computed that length.

> With ByteBlockPool we have 2 unavoidable double byte-copies: (1) when writing from the BytesStore to the primary table and (2) when promoting an entry from the fallback table to the primary table. In both situations we need to first write into a temporary byte[].

I don't think this added cost is so much. We freeze the node into the true FST appending byte store, then, we copy those "last N bytes" over to primary hash. Eventually it's moved to fallback, and, maybe it never gets promoted back (single copy), or, maybe it does (+1 copy) but that's "worth it" since we achieve some minimization, i.e. the cost correlates nicely with the win (the whole point of this double LRU hash).

@dungba88 (author) commented Nov 1, 2023

> Eventually it's moved to fallback, and, maybe it never gets promoted back (single copy), or, maybe it does (+1 copy)

There is actually already one copy before this, where we read from the BytesStore into a temporary byte[]. So if the node never gets promoted it's 2 copies, and if it does, it's 4 copies (one more to read from the fallback table into a temporary byte[] and one more to write to the primary table).

When it gets promoted, we also get one side benefit: the byte[] can be shared between the primary and fallback tables.

I see there is a tradeoff here. If we don't care too much about CPU then we can use BytesRefArray like @gf2121 suggested.

@mikemccand (Member) left a comment

I like where this is going! But I sure hope we can pack each copied node's byte[] into a single ByteBlockPool instead of separate byte[] ... the added RAM overhead of the latter is high.


PackedInts.bitsRequired(lastNodeAddress),
PackedInts.COMPACT);
mask = size - 1;
offsetMask = size * 2 - 1;

Member:

Hmm -- why are entries and copiedOffsets not simply side-by-side values arrays for the hash map? Why is copiedOffsets 2X the size? It seems like we could have them precisely match (side by side value arrays for the same hash entry)?

Contributor (author):

The index of entries is the hash of the node arcs, while the index of copiedOffsets is the hash of the address, hence their positions do not match.

long pos = Long.hashCode(pointer) & mask;
// find an empty slot
while (fstNodeAddress.get(pos) != 0) {
pos = (pos + 1) & mask;

Member:

Hmm we found linear probing to be slower than quadratic in a prior PR. But if we consolidate down to the two parallel PagedGrowableWriter we can just use the quadratic hash we already use.

Contributor (author):

Is it this PR (#12716)? That's interesting. I also got the linear probing from that PR. Lemme change it to quadratic.
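
For reference, a sketch of the quadratic (triangular-stride) probe next to the linear one in the diff (findSlot is a hypothetical helper; it assumes a power-of-two table where an empty slot holds 0):

  import org.apache.lucene.util.packed.PagedGrowableWriter;

  class QuadraticProbeSketch {
    static long findSlot(PagedGrowableWriter fstNodeAddress, long mask, long pointer) {
      long pos = Long.hashCode(pointer) & mask;
      int probe = 0;
      while (fstNodeAddress.get(pos) != 0) {
        // triangular stride: +1, +2, +3, ... instead of the linear +1 step
        pos = (pos + (++probe)) & mask;
      }
      return pos;
    }
  }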

* (long), returning the local copiedNodes start address if the two nodes are matched, or -1
* otherwise
*/
private int getMatchedNodeLength(FSTCompiler.UnCompiledNode<T> node, long address)

Member:

Is this duplicating the nodesEqual code? Instead of that, could we have an instance variable that sets the length as a side effect of nodeEquals? A bit messy, but ... I think worth it?

@dungba88 (author) commented Nov 1, 2023

This is actually renamed from nodesEqual (it was removed), so there is no duplication. The old behavior is essentially getMatchedNodeLength != -1

@mikemccand (Member) commented:

Thanks @dungba88 -- I will review!

But first I tried running IndexToFST (recently born helper tool, now in luceneutil) on a wikimediumall index, creating the FST from all of its body field terms, but hit this exciting doozie:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 350                                                                                                             
        at org.apache.lucene.util.ByteBlockPool.readBytes(ByteBlockPool.java:250)                                                                                                                                      
        at org.apache.lucene.util.fst.NodeHash$PagedGrowableHash.getBytes(NodeHash.java:248)                                                                                                                           
        at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:121)                                                                                                                                                  
        at org.apache.lucene.util.fst.FSTCompiler.compileNode(FSTCompiler.java:294)                                                                                                                                    
        at org.apache.lucene.util.fst.FSTCompiler.freezeTail(FSTCompiler.java:702)                                                                                                                                     
        at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:776)                                                                                                                                            
        at IndexToFST.main(IndexToFST.java:65)

Seems like a lastFallbackNodeLength was -1 (node not found in fallback) sort of situation? Not sure...

@dungba88 (author) commented Nov 2, 2023

Yes, I just noticed that and pushed out a fix.

Seems like I was using the primary table pos instead of the fallback pos. I added an assertion to catch it earlier.

Let me also re-run the test with my local GitHub.

@mikemccand (Member) left a comment

This is looking closer! I left a bunch of small comments. Thanks @dungba88!

// at 0:
assert node != 0;
assert node != FST.FINAL_END_NODE && node != FST.NON_FINAL_END_NODE;
byte[] buf = new byte[Math.toIntExact(node - startAddress + 1)];

Member:

Hmm why the +1 here?

Contributor (author):

node is the last address of the node and startAddress is its starting address, hence we compute the length as end - start + 1 (e.g. if end == start then the length must be 1).

Member:

Brain hurts! Reverse ob1 error in my brain :)

What confuses me here is if fstCompiler.addNode writes 3 bytes, won't we compute a length of 4?

Oh, I see! FSTCompiler#addNode has this at the end:

    final long thisNodeAddress = bytes.getPosition() - 1;
    bytes.reverse(startAddress, thisNodeAddress);
    nodeCount++;
    return thisNodeAddress;

So that last byte address it returns is inclusive, since it did the -1, so we have to +1 to undo that. OK, I think it makes sense ;)
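
A tiny worked example of that off-by-one, assuming (as described above) that addNode returns the inclusive address of the node's last byte:

  class NodeLengthExample {
    public static void main(String[] args) {
      long startAddress = 10;  // address of the node's first byte in the FST byte store
      long node = 12;          // inclusive address of its last byte (3 bytes: 10, 11, 12)
      byte[] buf = new byte[Math.toIntExact(node - startAddress + 1)];
      System.out.println(buf.length);  // 3, not 4
    }
  }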

@mikemccand (Member) left a comment

Very close!

}

@Override
public boolean reversed() {

Member:

I wonder why FST.BytesReader even has this method? It might be a holdover (now dead?) from the pack days (long ago removed). But we should not try to fix it here ... this change is awesome enough already!

Member:

I'll open a spinoff issue for this -- it seems at quick glance to be dead/pointless code.

fstNodeAddress.set(hashSlot, nodeAddress);
count++;
copiedNodes.append(bytes);
copiedBytes += bytes.length;

Member:

Hmm couldn't we just use copiedBytes.getPosition() instead? Isn't that the same?

Contributor (author):

I didn't see this method in ByteBlockPool, maybe it got removed? It's also strange that this class does not have a method to get its size, although we might be able to compute it with bufferUpTo and byteUpTo.
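
A rough sketch of how such a position could be derived from those two counters (a toy stand-in; the field names follow the comment above and are assumptions about ByteBlockPool's internals, not its actual API):

  class PoolPositionSketch {
    static final int BLOCK_SIZE = 1 << 15;  // ByteBlockPool.BYTE_BLOCK_SIZE (32768)
    int bufferUpTo;                         // number of blocks already filled
    int byteUpTo;                           // write offset inside the current block

    long getPosition() {
      return (long) bufferUpTo * BLOCK_SIZE + byteUpTo;
    }
  }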

Contributor (author):

I could add that method in ByteBlockPool as well.

Member:

Huh, it was just there, in 9.x, and disappeared when I git pull'd! Something must've just removed it? Found it:

LUCENE-10560: Faster merging of TermsEnum (#1052)

Maybe it was just dead code that @jpountz removed? But let's bring it back from the dead here?

Contributor:

@mikemccand This change did not touch ByteBlockPool? Or did I misunderstand what you said?

Member:

> @mikemccand This change did not touch ByteBlockPool? Or did I misunderstand what you said?

Whoops, sorry, you are right! Your commit did not touch ByteBlockPool. Now I am really confused about what I thought I saw early this AM. I must've been hallucinating. Maybe I am just an LLM.

Sorry for the false accusation!

Member:

> I could add that method in ByteBlockPool as well.

+1 to add it LOL.

Contributor (author):

Yeah, I added it, and found there is a potential inconsistency in block size between the Allocator and ByteBlockPool; I added a TODO to fix that later.

copiedBytes += bytes.length;
// write the offset, which points to the last byte of the node we copied since we later read
// this node in reverse
copiedNodeAddress.set(hashSlot, copiedBytes - 1);

Member:

Can we assert this slot is 0 before we set it?

@mikemccand (Member) commented:

Thanks @dungba88!

I confirmed that IndexToFST now works again, and, when given "up to" inf RAM to use, it produces the same sized minimal fst.bin as main at 367244208 bytes.

It's a bit slower than main, but that's to be expected and I think OK. We can optimize later ... being able to fully cap RAM usage, no matter how big an FST you produce, is worth this tradeoff. (Note: we still need to fix FST writing to spool to disk to achieve RAM capping, the other PR @dungba88 is tackling -- thank you!)

main:

  saved FST to "fst.bin": 367244208 bytes; 47.758 sec
  saved FST to "fst.bin": 367244208 bytes; 48.134 sec

This PR:

  saved FST to "fst.bin": 367244208 bytes; 54.765 sec
  saved FST to "fst.bin": 367244208 bytes; 58.616 sec

@mikemccand (Member) commented:

Test2BFST is happy, yay!

BUILD SUCCESSFUL in 56m 36s

public abstract static class Allocator {
// TODO: ByteBlockPool assume the blockSize is always {@link BYTE_BLOCK_SIZE}, but this class
// allow arbitrary value of blockSize. We should make them consistent.
protected final int blockSize;

Member:

Hmm do we have any if or assert that confirms Allocator's blockSize == ByteBlockPool.BYTE_BLOCK_SIZE when passed to ByteBlockPool?

Contributor (author):

We don't, and this Allocator seems to be used only by ByteBlockPool, so maybe we don't need it to support a custom block size?

totalBytes += size;

// make sure we report the correct position
assertEquals(totalBytes, pool.getPosition());

Member:

Thanks!

@mikemccand (Member) left a comment

This change looks great to me! I think it's ready! I'll merge today.

We'll let this bake in main for a few days, and given that the prior big change (capping RAM used by NodeHash) seems to have baked successfully on main as well, let's backport these recent FST improvements next week?

@mikemccand merged commit b8a9b0a into apache:main on Nov 4, 2023
@mikemccand (Member) commented:

I merged to main, thank you @dungba88 for the fast iterations! I could barely keep up just reviewing :)

After all this FST dust settles let's remember to add your CHANGES.txt entry summarizing all the progress with capping RAM usage of FSTs. I think we can make one entry (something like "FST Compiler can now build arbitrarily large FSTs with capped/controllable RAM usage"), linking to the N GitHub issues/PRs it took to accomplish. Maybe after we backport to 9.x?

@dungba88 (author) commented Nov 4, 2023

Thank you @mikemccand! Agreed, we should have a single CHANGES.txt entry summarizing all the different PRs.

mikemccand pushed a commit that referenced this pull request Nov 20, 2023
* Use value-based LRU cache in NodeHash (#12714)

* tidy code

* Add a nocommit about OffsetAndLength

* Fix the readBytes method

* Use List<byte[]> instead of ByteBlockPool

* Move nodesEqual to PagedGrowableHash

* Add generic type

* Fix the count variable

* Fix the RAM usage measurement

* Use PagedGrowableWriter instead of HashMap

* Remove unused generic type

* Update the ramBytesUsed formula

* Retain the FSTCompiler.addNode signature

* Switch back to ByteBlockPool

* Remove the unnecessary assertion

* Remove fstHashAddress

* Add some javadoc

* Fix the address offset when reading from fallback table

* tidy code

* Address comments

* Add assertions
Merging this pull request may close: FSTCompiler's NodeHash should fully duplicate byte[] slices from the growing FST