-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Optimize FST on-heap BytesReader #12879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 58 commits
a6259fc
e0e1517
9b986a8
f2d8234
13c9359
fa08d51
c6fb4b5
f1e8f89
2f9c730
847828d
0913062
0de0d26
ef3fdc6
f00d24f
12963df
f1e81b8
b7ceec9
c8165ad
d8cb927
096fa97
9a002c0
1c201d4
cfdeeff
1420cfc
7efcde0
8f98e7b
b140b91
dbc1918
a5c7e14
50de8f7
d90902f
5818fe2
e88f452
2587fc7
f0b78d2
94913bb
fd458c4
606fe45
dc45d9a
36685cc
1c4f68d
34aabf2
7ae16bb
4823bc1
817ae08
e0d56f5
14c977b
b7b7b0b
6f7a8c5
dba57ff
5552271
6cc31c9
ea3d588
e14a909
013b42b
c9f230a
34efa96
cbceb85
ef9eae6
9fccfbc
3c388fe
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,8 +16,12 @@ | |
| */ | ||
| package org.apache.lucene.util.fst; | ||
|
|
||
| import static org.apache.lucene.store.ByteBuffersDataOutput.ALLOCATE_BB_ON_HEAP; | ||
| import static org.apache.lucene.store.ByteBuffersDataOutput.NO_REUSE; | ||
|
|
||
| import java.io.IOException; | ||
| import org.apache.lucene.store.ByteBuffersDataInput; | ||
| import java.nio.ByteBuffer; | ||
| import java.util.List; | ||
| import org.apache.lucene.store.ByteBuffersDataOutput; | ||
| import org.apache.lucene.store.DataOutput; | ||
|
|
||
|
|
@@ -28,13 +32,19 @@ | |
| final class ReadWriteDataOutput extends DataOutput implements FSTReader { | ||
|
|
||
| private final ByteBuffersDataOutput dataOutput; | ||
| // the DataInput to read from once we finish writing | ||
| private ByteBuffersDataInput dataInput; | ||
| private final int blockBits; | ||
| private final int blockSize; | ||
| private final int blockMask; | ||
| private List<ByteBuffer> byteBuffers; | ||
| // whether this DataOutput is already frozen | ||
| private boolean frozen; | ||
|
|
||
| public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { | ||
| this.dataOutput = dataOutput; | ||
| public ReadWriteDataOutput(int blockBits) { | ||
| this.dataOutput = | ||
| new ByteBuffersDataOutput(blockBits, blockBits, ALLOCATE_BB_ON_HEAP, NO_REUSE); | ||
| this.blockBits = blockBits; | ||
| this.blockSize = 1 << blockBits; | ||
| this.blockMask = blockSize - 1; | ||
| } | ||
|
|
||
| @Override | ||
|
|
@@ -56,14 +66,59 @@ public long ramBytesUsed() { | |
|
|
||
| public void freeze() { | ||
| frozen = true; | ||
| // this operation are costly, so we want to compute it once and cache | ||
| dataInput = dataOutput.toDataInput(); | ||
| // this operation is costly, so we want to compute it once and cache | ||
| this.byteBuffers = dataOutput.toWriteableBufferList(); | ||
| } | ||
|
|
||
| @Override | ||
| public FST.BytesReader getReverseBytesReader() { | ||
| assert dataInput != null; // freeze() must be called first | ||
| return new ReverseRandomAccessReader(dataInput); | ||
| assert byteBuffers != null; // freeze() must be called first | ||
| if (byteBuffers.size() == 1) { | ||
| // use a faster implementation for single-block case | ||
| return new ReverseBytesReader(byteBuffers.get(0).array()); | ||
| } | ||
| return new FST.BytesReader() { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do tests exercise this path sometimes? We would need tests to either make large FSTs, or, randomize the block size so sometimes it's tiny?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm I'm still worried about test coverage of this. Conditionals like this are dangerous for Lucene, since our tests only test tiny FSTs, this code path would be rarely/never executed, likely even by our nightly benchmarks. Ideally we would find a way to randomize the page size in many tests, or at least, one test?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry I meant TestFSTs.testRealTerms. The default block size is 32KB and the FST is 55-80KB, thus 2-3 blocks. |
||
| private byte[] current = byteBuffers.get(0).array(); | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| private int nextBuffer = -1; | ||
| private int nextRead = 0; | ||
|
||
|
|
||
| @Override | ||
| public byte readByte() { | ||
| if (nextRead == -1) { | ||
| current = byteBuffers.get(nextBuffer--).array(); | ||
| nextRead = blockSize - 1; | ||
| } | ||
| return current[nextRead--]; | ||
| } | ||
|
|
||
| @Override | ||
| public void skipBytes(long count) { | ||
| setPosition(getPosition() - count); | ||
| } | ||
|
|
||
| @Override | ||
| public void readBytes(byte[] b, int offset, int len) { | ||
| for (int i = 0; i < len; i++) { | ||
| b[offset + i] = readByte(); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public long getPosition() { | ||
| return ((long) nextBuffer + 1) * blockSize + nextRead; | ||
| } | ||
|
|
||
| @Override | ||
| public void setPosition(long pos) { | ||
| int bufferIndex = (int) (pos >> blockBits); | ||
| if (nextBuffer != bufferIndex - 1) { | ||
| nextBuffer = bufferIndex - 1; | ||
| current = byteBuffers.get(bufferIndex).array(); | ||
| } | ||
| nextRead = (int) (pos & blockMask); | ||
| assert getPosition() == pos : "pos=" + pos + " getPos()=" + getPosition(); | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| @Override | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we
assert hasArray()before this?.array()is optional (only valid forDirectByteBufferI think).Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we call toWriteableBufferList the array should be accessible. I think if the array is somehow not accessible HeapByteBuffer (the one being used) would throw an exception and thus would be caught by tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I realize the code is safe but it takes some thinking to confirm it -- maybe add the assert with your awesome above explanation? So future code readers know why this
ByteBufferis for sure backed by an array.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. I've added the assertion. I decided to put it right after the ByteBuffer list is created (in
freeze()) and assert all ByteBuffer in the list.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks, looks great @dungba88!