Merged
@@ -9,6 +9,8 @@

package org.elasticsearch.index.engine;

import com.carrotsearch.hppc.IntArrayList;
Contributor

Did you intend to use this namespace (seems odd since it comes from a testing tool)?

Member Author @martijnvg Feb 10, 2025

I think hppc may have originated in the randomized testing framework that both Elasticsearch and Lucene use today. However, it is currently a standalone high-performance primitive collections library: https://github.com/carrotsearch/hppc, which has other production usage in Elasticsearch.

Contributor

What about using the Lucene one: package org.apache.lucene.internal.hppc?

Member Author

The Lucene core library only forked a subset of the hppc primitive collections library, and IntArrayList is included. But the javadocs say it was forked from version 0.10.0, while Elasticsearch uses version 0.8.1 of that library. I see that most Elasticsearch usages of hppc use the dependency directly.
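As a rough illustration (not part of the PR) of the hppc API being discussed, assuming the com.carrotsearch.hppc dependency; Lucene's fork in org.apache.lucene.internal.hppc exposes a very similar IntArrayList:

import com.carrotsearch.hppc.IntArrayList;

// Collect primitive ints without boxing, then hand them off as a plain int[].
IntArrayList ids = new IntArrayList();
ids.add(3);
ids.add(7);
ids.add(42);
int[] asArray = ids.toArray(); // -> {3, 7, 42}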


import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.ScoreDoc;
@@ -83,6 +85,7 @@ public LuceneSyntheticSourceChangesSnapshot(
this.maxMemorySizeInBytes = maxMemorySizeInBytes > 0 ? maxMemorySizeInBytes : 1;
this.sourceLoader = mapperService.mappingLookup().newSourceLoader(null, SourceFieldMetrics.NOOP);
Set<String> storedFields = sourceLoader.requiredStoredFields();

this.storedFieldLoader = StoredFieldLoader.create(false, storedFields);
this.lastSeenSeqNo = fromSeqNo - 1;
}
@@ -191,8 +194,24 @@ private Translog.Operation[] loadDocuments(List<SearchRecord> documentRecords) t
maxDoc = leafReaderContext.reader().maxDoc();
} while (docRecord.docID() >= docBase + maxDoc);

leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, null);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), null);
// TODO: instead of building an array, let's just check whether doc ids are (semi) dense
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
Member

As far as I understand, we're not increasing the complexity of the method by iterating over documentRecords again here (we already iterate over documentRecords in the outer loop), because we only compute nextDocIds for one leaf reader. Can you confirm?

Member Author

Yes, we only compute the doc IDs for the current leaf reader. If a doc ID is higher than or equal to docBase + maxDoc, then the current document record belongs to the next leaf reader.
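A minimal sketch of that check (hypothetical helper, not part of the PR), mapping a global doc ID to a segment-local doc ID for the current leaf:

// Returns the segment-local doc ID, or -1 when the record belongs to a later leaf
// (i.e. docID >= docBase + maxDoc).
static int toSegmentDocId(int docID, int docBase, int maxDoc) {
    if (docID >= docBase + maxDoc) {
        return -1; // handled by the next leaf reader
    }
    return docID - docBase;
}
// Example: docBase = 100, maxDoc = 50 -> docID 120 maps to 20, docID 160 belongs to the next leaf.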

var record = documentRecords.get(j);
if (record.isTombstone()) {
continue;
}
int docID = record.docID();
if (docID >= docBase + maxDoc) {
break;
}
int segmentDocID = docID - docBase;
nextDocIds.add(segmentDocID);
}

int[] nextDocIdArray = nextDocIds.toArray();
leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, nextDocIdArray);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), nextDocIdArray);
Contributor

Another side effect of providing the array of document IDs is that some field loaders may choose to load their values eagerly. I don't see this as a problem, but I wanted to point out that we would lose this behavior if we implement the TODO above.

Member Author

Agreed. I will update the TODO to include that perspective.

Contributor

Note that some doc values loaders already apply this strategy when doc ids are provided and there is a single value per field:
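The snippet referenced above is not included here. As a rough, hypothetical sketch of that strategy (invented method name, not part of the PR): a single-valued numeric doc values loader that is given the sorted doc ID array up front could fetch all values in one forward pass and serve later lookups from memory.

// Hypothetical illustration only. NumericDocValues iterates forward, so a sorted
// doc ID array allows eagerly reading every value in a single pass.
static long[] eagerLoad(org.apache.lucene.index.NumericDocValues values, int[] sortedDocIds) throws java.io.IOException {
    long[] result = new long[sortedDocIds.length];
    for (int i = 0; i < sortedDocIds.length; i++) {
        if (values.advanceExact(sortedDocIds[i])) {
            result[i] = values.longValue();
        }
    }
    return result;
}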

setNextSourceMetadataReader(leafReaderContext);
}
int segmentDocID = docRecord.docID() - docBase;