
Commit 2ca906d

Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (#12633)
* tweak comments; change if to switch
* remove old SOPs, minor comment styling, fixed silly performance bug on rehash using the wrong bitsRequired (count vs node)
* first raw cut; some nocommits added; some tests fail
* tests pass!
* fix silly fallback hash bug
* remove SOPs; add some temporary debugging metrics
* add temporary tool to test FST performance across differing NodeHash sizes
* remove (now deleted) shouldShareNonSingletonNodes call from Lucene90BlockTreeTermsWriter
* add simple tool to render results table to GitHub MD
* add simple temporary tool to iterate all terms from a provided luceneutil wikipedia index and build an FST from them
* first cut at using packed ints for hash table again
* add some nocommits; tweak test_all_sizes.py to new RAM usage approach; when half of the double barrel is full, allocate new primary hash at full size to save cost of continuously rehashing for a large FST (see the sketch below)
* switch to limit suffix hash by RAM usage not count (more intuitive for users); clean up some stale nocommits
* switch to more intuitive approximate RAM (mb) limit for allowed size of NodeHash
* nuke a few nocommits; a few more remain
* remove DO_PRINT_HASH_RAM
* no more FST pruning
* remove final nocommit: randomly change allowed NodeHash suffix RAM size in TestFSTs.testRealTerms
* remove SOP
* tidy
* delete temp utility tools
* remove dead (FST pruning) code
* add CHANGES entry; fix one missed fst.addNode -> fstCompiler.addNode during merge conflict resolution
* remove a mal-formed nocommit
* fold PR feedback
* fold feedback
* add gradle help test details on how to specify heap size for the test JVM; fix bogus assert (uncovered by Test2BFST); add TODO to Test2BFST anticipating building massive FSTs in small bounded RAM
* suppress sysout checks for Test2BFSTs; add helpful comment showing how to run it directly
* tidy
1 parent 13d1a19 commit 2ca906d

14 files changed: +390 / -624 lines
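The "double barrel" hashing mentioned in the commit message is roughly this idea: keep two generations of the suffix hash and retire the older one when the newer fills up, so RAM stays bounded at the cost of occasionally re-adding a suffix node that was forgotten. The following is a minimal conceptual sketch only, not the actual org.apache.lucene.util.fst.NodeHash code (which uses open-addressed packed-int tables); the class and method names here are made up for illustration:

import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a two-generation ("double barrel") cache with bounded size. */
class DoubleBarrelCache<K, V> {
  private final int maxPrimarySize; // stand-in for the approximate RAM budget
  private Map<K, V> primary = new HashMap<>();
  private Map<K, V> fallback = new HashMap<>();

  DoubleBarrelCache(int maxPrimarySize) {
    this.maxPrimarySize = maxPrimarySize;
  }

  V get(K key) {
    V value = primary.get(key);
    if (value != null) {
      return value;
    }
    value = fallback.get(key);
    if (value != null) {
      // Promote a recently seen entry so it survives the next generation swap.
      primary.put(key, value);
    }
    return value;
  }

  void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxPrimarySize) {
      // Retire the old generation; suffixes not seen recently are forgotten, which is
      // why the resulting FST is only approximately minimal under a RAM limit.
      fallback = primary;
      primary = new HashMap<>();
    }
  }
}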

help/tests.txt

Lines changed: 9 additions & 0 deletions
@@ -133,6 +133,15 @@ specifying the project and test task or a fully qualified task path. Example:

 gradlew -p lucene/core test -Ptests.verbose=true --tests "TestDemo"


+Larger heap size
+--------------------------
+
+By default tests run with a 512 MB max heap. But some tests (monster/nightly)
+need more heap. Use "-Dtests.heapsize" for this:
+
+gradlew -p lucene/core test --tests "Test2BFST" -Dtests.heapsize=32g
+
+
 Run GUI tests headlessly with Xvfb (Linux only)
 -----------------------------------------------
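As a hedged aside on the new help text above: monster/nightly tests are normally gated behind their own gradle properties, so if Test2BFST is annotated as a monster test the heap option would be combined with -Dtests.monster (an assumption here; if the test is not gated, that flag is simply unnecessary):

gradlew -p lucene/core test --tests "Test2BFST" -Dtests.monster=true -Dtests.heapsize=32g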

lucene/CHANGES.txt

Lines changed: 7 additions & 0 deletions
@@ -93,6 +93,13 @@ Improvements

 * GITHUB#11277, LUCENE-10241: Upgrade OpenNLP to 1.9.4. (Jeff Zemerick)

+* GITHUB#12542: FSTCompiler can now approximately limit how much RAM it uses to share
+  suffixes during FST construction using the suffixRAMLimitMB method. Larger values
+  result in a more minimal FST (more common suffixes are shared). Pass
+  Double.POSITIVE_INFINITY to use as much RAM as is needed to create a purely
+  minimal FST. Inspired by this Rust FST implementation:
+  https://blog.burntsushi.net/transducers (Mike McCandless)
+
 Optimizations
 ---------------------
 * GITHUB#12183: Make TermStates#build concurrent. (Shubham Chaudhary)
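To make the new option concrete, here is a small usage sketch based on the CHANGES entry above and on the FSTCompiler.Builder calls visible in the diffs below; it assumes the suffixRAMLimitMB setter lives on the Builder like the other options shown here, and the 32 MB budget is an arbitrary illustrative value:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.fst.ByteSequenceOutputs;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.FSTCompiler;

class SuffixLimitExample {
  static FSTCompiler<BytesRef> newCompiler() {
    ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
    return new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs)
        // Cap the suffix-sharing NodeHash at roughly 32 MB of RAM; pass
        // Double.POSITIVE_INFINITY instead to build a fully minimal FST.
        .suffixRAMLimitMB(32)
        .build();
  }
}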

lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene40/blocktree/Lucene40BlockTreeTermsWriter.java

Lines changed: 1 addition & 3 deletions
@@ -478,9 +478,7 @@ public void compileIndex(

     final ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
     final FSTCompiler<BytesRef> fstCompiler =
-        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs)
-            .shouldShareNonSingletonNodes(false)
-            .build();
+        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs).build();
     // if (DEBUG) {
     //   System.out.println("  compile index for prefix=" + prefix);
     // }

lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsBlockTreeTermsWriter.java

Lines changed: 1 addition & 3 deletions
@@ -395,9 +395,7 @@ public void compileIndex(
     }

     final FSTCompiler<Output> fstCompiler =
-        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, FST_OUTPUTS)
-            .shouldShareNonSingletonNodes(false)
-            .build();
+        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, FST_OUTPUTS).build();
     // if (DEBUG) {
     //   System.out.println("  compile index for prefix=" + prefix);
     // }

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java

Lines changed: 1 addition & 6 deletions
@@ -521,12 +521,7 @@ public void compileIndex(

     final ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
     final FSTCompiler<BytesRef> fstCompiler =
-        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs)
-            // Disable suffixes sharing for block tree index because suffixes are mostly dropped
-            // from the FST index and left in the term blocks.
-            .shouldShareSuffix(false)
-            .bytesPageBits(pageBits)
-            .build();
+        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs).bytesPageBits(pageBits).build();
     // if (DEBUG) {
     //   System.out.println("  compile index for prefix=" + prefix);
     // }

lucene/core/src/java/org/apache/lucene/util/fst/FST.java

Lines changed: 18 additions & 19 deletions
@@ -83,13 +83,16 @@ public enum INPUT_TYPE {

   static final int BIT_ARC_HAS_FINAL_OUTPUT = 1 << 5;

-  /** Value of the arc flags to declare a node with fixed length arcs designed for binary search. */
+  /**
+   * Value of the arc flags to declare a node with fixed length (sparse) arcs designed for binary
+   * search.
+   */
   // We use this as a marker because this one flag is illegal by itself.
   public static final byte ARCS_FOR_BINARY_SEARCH = BIT_ARC_HAS_FINAL_OUTPUT;

   /**
-   * Value of the arc flags to declare a node with fixed length arcs and bit table designed for
-   * direct addressing.
+   * Value of the arc flags to declare a node with fixed length dense arcs and bit table designed
+   * for direct addressing.
    */
   static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;

@@ -751,11 +754,9 @@ public Arc<T> readFirstTargetArc(Arc<T> follow, Arc<T> arc, BytesReader in) throws IOException {
   private void readFirstArcInfo(long nodeAddress, Arc<T> arc, final BytesReader in)
       throws IOException {
     in.setPosition(nodeAddress);
-    // System.out.println(" flags=" + arc.flags);

     byte flags = arc.nodeFlags = in.readByte();
     if (flags == ARCS_FOR_BINARY_SEARCH || flags == ARCS_FOR_DIRECT_ADDRESSING) {
-      // System.out.println(" fixed length arc");
       // Special arc which is actually a node header for fixed length arcs.
       arc.numArcs = in.readVInt();
       arc.bytesPerArc = in.readVInt();

@@ -766,8 +767,6 @@ private void readFirstArcInfo(long nodeAddress, Arc<T> arc, final BytesReader in
         arc.presenceIndex = -1;
       }
       arc.posArcsStart = in.getPosition();
-      // System.out.println(" bytesPer=" + arc.bytesPerArc + " numArcs=" + arc.numArcs + "
-      // arcsStart=" + pos);
     } else {
       arc.nextArc = nodeAddress;
       arc.bytesPerArc = 0;

@@ -830,27 +829,27 @@ int readNextArcLabel(Arc<T> arc, BytesReader in) throws IOException {
        }
      }
    } else {
-      if (arc.bytesPerArc() != 0) {
-        // System.out.println(" nextArc real array");
-        // Arcs have fixed length.
-        if (arc.nodeFlags() == ARCS_FOR_BINARY_SEARCH) {
+      switch (arc.nodeFlags()) {
+        case ARCS_FOR_BINARY_SEARCH:
           // Point to next arc, -1 to skip arc flags.
           in.setPosition(arc.posArcsStart() - (1 + arc.arcIdx()) * (long) arc.bytesPerArc() - 1);
-        } else {
-          assert arc.nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING;
+          break;
+        case ARCS_FOR_DIRECT_ADDRESSING:
           // Direct addressing node. The label is not stored but rather inferred
           // based on first label and arc index in the range.
           assert BitTable.assertIsValid(arc, in);
           assert BitTable.isBitSet(arc.arcIdx(), arc, in);
           int nextIndex = BitTable.nextBitSet(arc.arcIdx(), arc, in);
           assert nextIndex != -1;
           return arc.firstLabel() + nextIndex;
-        }
-      } else {
-        // Arcs have variable length.
-        // System.out.println(" nextArc real list");
-        // Position to next arc, -1 to skip flags.
-        in.setPosition(arc.nextArc() - 1);
+        default:
+          // Variable length arcs - linear search.
+          assert arc.bytesPerArc() == 0;
+          // Arcs have variable length.
+          // System.out.println(" nextArc real list");
+          // Position to next arc, -1 to skip flags.
+          in.setPosition(arc.nextArc() - 1);
+          break;
       }
     }
     return readLabel(in);
