Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions lucene/CHANGES.txt
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,8 @@ Optimizations

* GITHUB#12381: Skip docs with DocValues in NumericLeafComparator. (Lu Xugang, Adrien Grand)

* GITHUB#12748: Specialize arc store for continuous label in FST. (Guo Feng, Chao Zhang)

Changes in runtime behavior
---------------------

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -86,8 +86,11 @@ public final class Lucene90BlockTreeTermsReader extends FieldsProducer {
*/
public static final int VERSION_MSB_VLONG_OUTPUT = 1;

/** The version that specialize arc store for continuous label in FST. */
public static final int VERSION_FST_CONTINUOUS_ARCS = 2;

/** Current terms format. */
public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;
public static final int VERSION_CURRENT = VERSION_FST_CONTINUOUS_ARCS;

/** Extension of terms index file */
static final String TERMS_INDEX_EXTENSION = "tip";
Expand Down
84 changes: 71 additions & 13 deletions lucene/core/src/java/org/apache/lucene/util/fst/FST.java
Original file line number Diff line number Diff line change
Expand Up @@ -98,11 +98,19 @@ public enum INPUT_TYPE {
*/
static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

/**
* Value of the arc flags to declare a node with continuous arcs designed for pos the arc directly
* with labelToPos - firstLabel. like {@link #ARCS_FOR_BINARY_SEARCH} we use flag combinations
* that will not occur at the same time.
*/
static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining this arc optimization case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 , It is important.


// Increment version to change it
private static final String FILE_FORMAT_NAME = "FST";
private static final int VERSION_START = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm shouldn't we bump the VERSION_CURRENT in FST with this change too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, we expect to throw IndexFormatTooNewException in CodecUtil.checkIndexHeader when the old version code reading new index format. so we should also bump the VERSION_CURRENT in Lucene90BlockTreeTermsReader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, if the change is backward compatible, maybe we can also keep the VERSION_CURRENT? could you please give me some suggestion? Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'd like to protect against an older version of Lucene (w/o this change) trying to read an FST written with a newer version (with this change). If we bump the version, that older version would throw an understandable error, but if we don't, it'd be some strange assertion error or so?

I realize it'd be hard to even reach such a situation (you'd have to be using FSTs directly or so), but still when we make such changes to our format I think it's good practice to bump the version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, we expect to throw IndexFormatTooNewException in CodecUtil.checkIndexHeader when the old version code reading new index format. so we should also bump the VERSION_CURRENT in Lucene90BlockTreeTermsReader?

+1 to also bump the version in Lucene90BlockTreeTermsReader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your guidance! it's very helpful!

private static final int VERSION_LITTLE_ENDIAN = 8;
static final int VERSION_CURRENT = VERSION_LITTLE_ENDIAN;
private static final int VERSION_CONTINUOUS_ARCS = 9;
static final int VERSION_CURRENT = VERSION_CONTINUOUS_ARCS;

// Never serialized; just used to represent the virtual
// final node w/ no arcs:
Expand Down Expand Up @@ -243,7 +251,10 @@ public String toString() {
.append(numArcs())
.append(")")
.append("(")
.append(nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING ? "da" : "bs")
.append(
nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING
? "da"
: nodeFlags() == ARCS_FOR_CONTINUOUS ? "cs" : "bs")
.append(")");
}
return b.toString();
Expand Down Expand Up @@ -285,8 +296,8 @@ public int arcIdx() {

/**
* Node header flags. Only meaningful to check if the value is either {@link
* #ARCS_FOR_BINARY_SEARCH} or {@link #ARCS_FOR_DIRECT_ADDRESSING} (other value when bytesPerArc
* == 0).
* #ARCS_FOR_BINARY_SEARCH} or {@link #ARCS_FOR_DIRECT_ADDRESSING} or {@link
* #ARCS_FOR_CONTINUOUS} (other value when bytesPerArc == 0).
*/
public byte nodeFlags() {
return nodeFlags;
Expand Down Expand Up @@ -318,7 +329,7 @@ public int numArcs() {

/**
* First label of a direct addressing node. Only valid if nodeFlags == {@link
* #ARCS_FOR_DIRECT_ADDRESSING}.
* #ARCS_FOR_DIRECT_ADDRESSING} or {@link #ARCS_FOR_CONTINUOUS}.
*/
int firstLabel() {
return firstLabel;
Expand Down Expand Up @@ -653,7 +664,9 @@ Arc<T> readLastTargetArc(Arc<T> follow, Arc<T> arc, BytesReader in) throws IOExc
} else {
in.setPosition(follow.target());
byte flags = arc.nodeFlags = in.readByte();
if (flags == ARCS_FOR_BINARY_SEARCH || flags == ARCS_FOR_DIRECT_ADDRESSING) {
if (flags == ARCS_FOR_BINARY_SEARCH
|| flags == ARCS_FOR_DIRECT_ADDRESSING
|| flags == ARCS_FOR_CONTINUOUS) {
// Special arc which is actually a node header for fixed length arcs.
// Jump straight to end to find the last arc.
arc.numArcs = in.readVInt();
Expand All @@ -664,10 +677,14 @@ Arc<T> readLastTargetArc(Arc<T> follow, Arc<T> arc, BytesReader in) throws IOExc
arc.firstLabel = readLabel(in);
arc.posArcsStart = in.getPosition();
readLastArcByDirectAddressing(arc, in);
} else {
} else if (flags == ARCS_FOR_BINARY_SEARCH) {
arc.arcIdx = arc.numArcs() - 2;
arc.posArcsStart = in.getPosition();
readNextRealArc(arc, in);
} else {
arc.firstLabel = readLabel(in);
arc.posArcsStart = in.getPosition();
readLastArcByContinuous(arc, in);
}
} else {
arc.flags = flags;
Expand Down Expand Up @@ -740,7 +757,9 @@ private void readFirstArcInfo(long nodeAddress, Arc<T> arc, final BytesReader in
in.setPosition(nodeAddress);

byte flags = arc.nodeFlags = in.readByte();
if (flags == ARCS_FOR_BINARY_SEARCH || flags == ARCS_FOR_DIRECT_ADDRESSING) {
if (flags == ARCS_FOR_BINARY_SEARCH
|| flags == ARCS_FOR_DIRECT_ADDRESSING
|| flags == ARCS_FOR_CONTINUOUS) {
// Special arc which is actually a node header for fixed length arcs.
arc.numArcs = in.readVInt();
arc.bytesPerArc = in.readVInt();
Expand All @@ -749,6 +768,8 @@ private void readFirstArcInfo(long nodeAddress, Arc<T> arc, final BytesReader in
readPresenceBytes(arc, in);
arc.firstLabel = readLabel(in);
arc.presenceIndex = -1;
} else if (flags == ARCS_FOR_CONTINUOUS) {
arc.firstLabel = readLabel(in);
}
arc.posArcsStart = in.getPosition();
} else {
Expand All @@ -773,7 +794,9 @@ boolean isExpandedTarget(Arc<T> follow, BytesReader in) throws IOException {
} else {
in.setPosition(follow.target());
byte flags = in.readByte();
return flags == ARCS_FOR_BINARY_SEARCH || flags == ARCS_FOR_DIRECT_ADDRESSING;
return flags == ARCS_FOR_BINARY_SEARCH
|| flags == ARCS_FOR_DIRECT_ADDRESSING
|| flags == ARCS_FOR_CONTINUOUS;
}
}

Expand Down Expand Up @@ -801,16 +824,18 @@ int readNextArcLabel(Arc<T> arc, BytesReader in) throws IOException {

in.setPosition(arc.nextArc());
byte flags = in.readByte();
if (flags == ARCS_FOR_BINARY_SEARCH || flags == ARCS_FOR_DIRECT_ADDRESSING) {
if (flags == ARCS_FOR_BINARY_SEARCH
|| flags == ARCS_FOR_DIRECT_ADDRESSING
|| flags == ARCS_FOR_CONTINUOUS) {
// System.out.println(" nextArc fixed length arc");
// Special arc which is actually a node header for fixed length arcs.
int numArcs = in.readVInt();
in.readVInt(); // Skip bytesPerArc.
if (flags == ARCS_FOR_BINARY_SEARCH) {
in.readByte(); // Skip arc flags.
} else {
} else if (flags == ARCS_FOR_DIRECT_ADDRESSING) {
in.skipBytes(getNumPresenceBytes(numArcs));
}
} // Nothing to do for ARCS_FOR_CONTINUOUS
}
} else {
switch (arc.nodeFlags()) {
Expand All @@ -826,6 +851,8 @@ int readNextArcLabel(Arc<T> arc, BytesReader in) throws IOException {
int nextIndex = BitTable.nextBitSet(arc.arcIdx(), arc, in);
assert nextIndex != -1;
return arc.firstLabel() + nextIndex;
case ARCS_FOR_CONTINUOUS:
return arc.firstLabel() + arc.arcIdx() + 1;
default:
// Variable length arcs - linear search.
assert arc.bytesPerArc() == 0;
Expand All @@ -849,6 +876,20 @@ public Arc<T> readArcByIndex(Arc<T> arc, final BytesReader in, int idx) throws I
return readArc(arc, in);
}

/**
* Reads a Continuous node arc, with the provided index in the label range.
*
* @param rangeIndex The index of the arc in the label range. It must be within the label range.
*/
public Arc<T> readArcByContinuous(Arc<T> arc, final BytesReader in, int rangeIndex)
throws IOException {
assert rangeIndex >= 0 && rangeIndex < arc.numArcs();
in.setPosition(arc.posArcsStart() - rangeIndex * (long) arc.bytesPerArc());
arc.arcIdx = rangeIndex;
arc.flags = in.readByte();
return readArc(arc, in);
}

/**
* Reads a present direct addressing node arc, with the provided index in the label range.
*
Expand Down Expand Up @@ -888,6 +929,11 @@ public Arc<T> readLastArcByDirectAddressing(Arc<T> arc, final BytesReader in) th
return readArcByDirectAddressing(arc, in, arc.numArcs() - 1, presenceIndex);
}

/** Reads the last arc of a continuous node. */
public Arc<T> readLastArcByContinuous(Arc<T> arc, final BytesReader in) throws IOException {
return readArcByContinuous(arc, in, arc.numArcs() - 1);
}

/** Never returns null, but you should never call this if arc.isLast() is true. */
public Arc<T> readNextRealArc(Arc<T> arc, final BytesReader in) throws IOException {

Expand All @@ -896,6 +942,7 @@ public Arc<T> readNextRealArc(Arc<T> arc, final BytesReader in) throws IOExcepti

switch (arc.nodeFlags()) {
case ARCS_FOR_BINARY_SEARCH:
case ARCS_FOR_CONTINUOUS:
assert arc.bytesPerArc() > 0;
arc.arcIdx++;
assert arc.arcIdx() >= 0 && arc.arcIdx() < arc.numArcs();
Expand Down Expand Up @@ -924,7 +971,7 @@ public Arc<T> readNextRealArc(Arc<T> arc, final BytesReader in) throws IOExcepti
* positioned just after the arc flags byte.
*/
private Arc<T> readArc(Arc<T> arc, BytesReader in) throws IOException {
if (arc.nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING) {
if (arc.nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING || arc.nodeFlags() == ARCS_FOR_CONTINUOUS) {
arc.label = arc.firstLabel() + arc.arcIdx();
} else {
arc.label = readLabel(in);
Expand Down Expand Up @@ -1067,6 +1114,17 @@ public Arc<T> findTargetArc(int labelToMatch, Arc<T> follow, Arc<T> arc, BytesRe
}
}
return null;
} else if (flags == ARCS_FOR_CONTINUOUS) {
arc.numArcs = in.readVInt();
arc.bytesPerArc = in.readVInt();
arc.firstLabel = readLabel(in);
arc.posArcsStart = in.getPosition();
int arcIndex = labelToMatch - arc.firstLabel();
if (arcIndex < 0 || arcIndex >= arc.numArcs()) {
return null; // Before or after label range.
}
arc.arcIdx = arcIndex - 1;
return readNextRealArc(arc, in);
}

// Linear scan
Expand Down
27 changes: 19 additions & 8 deletions lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
package org.apache.lucene.util.fst;

import static org.apache.lucene.util.fst.FST.ARCS_FOR_BINARY_SEARCH;
import static org.apache.lucene.util.fst.FST.ARCS_FOR_CONTINUOUS;
import static org.apache.lucene.util.fst.FST.ARCS_FOR_DIRECT_ADDRESSING;
import static org.apache.lucene.util.fst.FST.BIT_ARC_HAS_FINAL_OUTPUT;
import static org.apache.lucene.util.fst.FST.BIT_ARC_HAS_OUTPUT;
Expand Down Expand Up @@ -113,6 +114,7 @@ public class FSTCompiler<T> {
long nodeCount;
long binarySearchNodeCount;
long directAddressingNodeCount;
long continuousNodeCount;

final boolean allowFixedLengthArcs;
final float directAddressingMaxOversizingFactor;
Expand Down Expand Up @@ -445,9 +447,15 @@ long addNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws IOException {

int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
assert labelRange > 0;
if (shouldExpandNodeWithDirectAddressing(
boolean continuousLabel = labelRange == nodeIn.numArcs;
if (continuousLabel) {
writeNodeForDirectAddressingOrContinuous(
nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange, true);
continuousNodeCount++;
} else if (shouldExpandNodeWithDirectAddressing(
nodeIn, maxBytesPerArc, maxBytesPerArcWithoutLabel, labelRange)) {
writeNodeForDirectAddressing(nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange);
writeNodeForDirectAddressingOrContinuous(
nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange, false);
directAddressingNodeCount++;
} else {
writeNodeForBinarySearch(nodeIn, startAddress, maxBytesPerArc);
Expand Down Expand Up @@ -578,18 +586,19 @@ private void writeNodeForBinarySearch(
bytes.writeBytes(startAddress, fixedLengthArcsBuffer.getBytes(), 0, headerLen);
}

private void writeNodeForDirectAddressing(
private void writeNodeForDirectAddressingOrContinuous(
FSTCompiler.UnCompiledNode<T> nodeIn,
long startAddress,
int maxBytesPerArcWithoutLabel,
int labelRange) {
int labelRange,
boolean continuous) {
// Expand the arcs backwards in a buffer because we remove the labels.
// So the obtained arcs might occupy less space. This is the reason why this
// whole method is more complex.
// Drop the label bytes since we can infer the label based on the arc index,
// the presence bits, and the first label. Keep the first label.
int headerMaxLen = 11;
int numPresenceBytes = getNumPresenceBytes(labelRange);
int numPresenceBytes = continuous ? 0 : getNumPresenceBytes(labelRange);
long srcPos = bytes.getPosition();
int totalArcBytes = numLabelBytesPerArc[0] + nodeIn.numArcs * maxBytesPerArcWithoutLabel;
int bufferOffset = headerMaxLen + numPresenceBytes + totalArcBytes;
Expand Down Expand Up @@ -620,7 +629,7 @@ private void writeNodeForDirectAddressing(
// metadata.
fixedLengthArcsBuffer
.resetPosition()
.writeByte(ARCS_FOR_DIRECT_ADDRESSING)
.writeByte(continuous ? ARCS_FOR_CONTINUOUS : ARCS_FOR_DIRECT_ADDRESSING)
.writeVInt(labelRange) // labelRange instead of numArcs.
.writeVInt(
maxBytesPerArcWithoutLabel); // maxBytesPerArcWithoutLabel instead of maxBytesPerArc.
Expand All @@ -642,8 +651,10 @@ private void writeNodeForDirectAddressing(
writeOffset += headerLen;

// Write the presence bits
writePresenceBits(nodeIn, writeOffset, numPresenceBytes);
writeOffset += numPresenceBytes;
if (continuous == false) {
writePresenceBits(nodeIn, writeOffset, numPresenceBytes);
writeOffset += numPresenceBytes;
}

// Write the first label and the arcs.
bytes.writeBytes(writeOffset, fixedLengthArcsBuffer.getBytes(), bufferOffset, totalArcBytes);
Expand Down
Loading