[BUG] org.apache.lucene.analysis.TokenStream.end() can slow down elemental #140

@tohidemyname

Describe the bug

Lucene fixed a performance bug: https://issues.apache.org/jira/browse/LUCENE-7419

The report says that TokenStream.end() is quite costly, and it is marked as a blocker. According to the report, it affects 5.5.5, 6.2, and 7.0. The reporter attributes the cost to the getAttribute() call inside TokenStream.end(). The buggy end() method is as follows:

    public void end() throws IOException {
        clearAttributes(); // LUCENE-3849: don't consume dirty atts
        PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
        if (posIncAtt != null) {
            posIncAtt.setPositionIncrement(0);
        }
    }

Elemental uses Lucene 4.10.4. I checked the Lucene 4.10.4 source, and its end() method is identical to the buggy code:

    public void end() throws IOException {
        clearAttributes(); // LUCENE-3849: don't consume dirty atts
        PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
        if (posIncAtt != null) {
            posIncAtt.setPositionIncrement(0);
        }
    }

As a result, this bug should also affect 4.10.4.
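To illustrate why a lookup inside end() adds up, here is a minimal sketch that does not use Lucene at all. ToySource, ToyPosIncAttr, LookupPerEnd and CachedEnd are hypothetical names I made up for this example; the sketch only counts how many registry lookups each variant performs. The cached variant is one possible way to avoid the repeated lookup and is not claimed to be the upstream patch for LUCENE-7419.

```java
import java.util.HashMap;
import java.util.Map;

public class EndLookupSketch {

    /** Hypothetical stand-in for PositionIncrementAttribute; not the Lucene class. */
    static final class ToyPosIncAttr {
        int positionIncrement = 1;
    }

    /** Hypothetical attribute registry that counts how often it is queried. */
    static class ToySource {
        private final Map<Class<?>, Object> attributes = new HashMap<>();
        int lookups = 0;

        <T> void addAttribute(Class<T> type, T instance) {
            attributes.put(type, instance);
        }

        <T> T getAttribute(Class<T> type) {
            lookups++; // every call pays for a map lookup
            return type.cast(attributes.get(type));
        }
    }

    /** Same shape as the end() quoted above: one lookup on every call. */
    static final class LookupPerEnd extends ToySource {
        void end() {
            ToyPosIncAttr att = getAttribute(ToyPosIncAttr.class);
            if (att != null) {
                att.positionIncrement = 0;
            }
        }
    }

    /** Illustrative alternative (an assumption, not Lucene's fix): resolve once, then reuse. */
    static final class CachedEnd extends ToySource {
        private ToyPosIncAttr cached;
        private boolean resolved;

        void end() {
            if (!resolved) {
                cached = getAttribute(ToyPosIncAttr.class);
                resolved = true;
            }
            if (cached != null) {
                cached.positionIncrement = 0;
            }
        }
    }

    /** Calls end() repeatedly on the lookup-per-call variant and returns the lookup count. */
    static int lookupsPerCall(int calls) {
        LookupPerEnd stream = new LookupPerEnd();
        stream.addAttribute(ToyPosIncAttr.class, new ToyPosIncAttr());
        for (int i = 0; i < calls; i++) {
            stream.end();
        }
        return stream.lookups;
    }

    /** Calls end() repeatedly on the cached variant and returns the lookup count. */
    static int lookupsCached(int calls) {
        CachedEnd stream = new CachedEnd();
        stream.addAttribute(ToyPosIncAttr.class, new ToyPosIncAttr());
        for (int i = 0; i < calls; i++) {
            stream.end();
        }
        return stream.lookups;
    }

    public static void main(String[] args) {
        System.out.println("lookup per end(): " + lookupsPerCall(1000)); // prints 1000
        System.out.println("cached:           " + lookupsCached(1000));  // prints 1
    }
}
```

In this toy model the first variant performs one lookup per end() call, while the second performs a single lookup for the lifetime of the object. How expensive the real getAttribute() is in 4.10.4 would need to be measured separately.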

To Reproduce

In the Lucene bug report (LUCENE-7419), Michael McCandless mentioned that this bug was found by Elasticsearch:

"This is the apparent source of the very unexpected slowdown here: elastic/elasticsearch#19867 (comment)"

He also explained how to reproduce it.

Elemental calls the buggy method at the following locations:

<--XMLToQuery.phraseQuery
<--XMLToQuery.nearQuery
<--XMLToQuery.getTerm
<--MarkableTokenFilter.incrementToken
<--MarkableTokenFilter.incrementToken
<--RangeIndexWorker.analyzeContent

The Lucene bug is fixed in 6.2.
