Corruption read on term dictionaries in Lucene 9.9 #12895

@benwtrent

Description

It seems that #12699 has inadvertently broken reading of term dictionaries created in Lucene 9.8 or earlier.

To reproduce the bug, index wikibigall with LuceneUtil on Lucene 9.8 and force-merge.

Then attempt to read the created index using a wildcard query:

    Path path = Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index");
    try (FSDirectory dir = FSDirectory.open(path);
        DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader); 
      searcher.count(new WildcardQuery(new Term("body", "*fo*")));
    }

This results in a stack trace similar to the one below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
	at org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136)
	at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110)
	at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220)
	at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
	at org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
	at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
	at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
	at Corruption.main(Corruption.java:18)
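
The top frames show a variable-length integer (VInt) read running past the end of a 3-byte array. As an illustration only, the sketch below uses a hypothetical stand-in class (`ByteArrayInput`, not Lucene's `ByteArrayDataInput`) to show how a VInt decoder can overrun its buffer when the bytes it is handed were not written in the layout it expects. It assumes the usual 7-bits-per-byte encoding with the high bit as a continuation flag, and it makes no claim about the actual root cause in #12699.

```java
class VIntOverrunSketch {

  /** Hypothetical minimal byte-array reader, written only for this illustration. */
  static final class ByteArrayInput {
    private final byte[] bytes;
    private int pos;

    ByteArrayInput(byte[] bytes) {
      this.bytes = bytes;
    }

    byte readByte() {
      // No bounds check: reading past the end throws ArrayIndexOutOfBoundsException.
      return bytes[pos++];
    }

    /** Decodes a VInt: 7 payload bits per byte, high bit set means more bytes follow. */
    int readVInt() {
      int value = 0;
      for (int shift = 0; shift < 35; shift += 7) {
        byte b = readByte();
        value |= (b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
          return value;
        }
      }
      throw new IllegalStateException("VInt is longer than 5 bytes");
    }
  }

  /** Convenience helper: decode the first VInt found in the given bytes. */
  static int decodeFirstVInt(byte[] bytes) {
    return new ByteArrayInput(bytes).readVInt();
  }

  public static void main(String[] args) {
    // Well-formed input: 300 is encoded as 0xAC 0x02.
    System.out.println(decodeFirstVInt(new byte[] {(byte) 0xAC, 0x02}));

    // Misinterpreted input: every byte has the continuation bit set, so the
    // decoder keeps reading and steps past the end of the 3-byte array.
    try {
      decodeFirstVInt(new byte[] {(byte) 0x81, (byte) 0x82, (byte) 0x83});
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("overrun: " + e.getMessage());
    }
  }
}
```

The point of the sketch is that the decoder itself can be correct and still fail this way if the reader and writer disagree about the block layout, which would be consistent with the failure appearing only when 9.9 reads indices written by 9.8 or earlier.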

We are currently not sure whether this affects indices created with Lucene 9.9 and read via Lucene 9.9.

EDIT: This failure does NOT occur for indices created by 9.9 and read by 9.9.

NOTE: This also fails with just a prefix wildcard query. It seems that all multi-term queries could be affected.

Will provide more example stack traces in issue comments.

Version and environment details

Lucene 9.9 reading Lucene 9.8 indices.
