Merged
@@ -9,6 +9,8 @@

package org.elasticsearch.index.engine;

import com.carrotsearch.hppc.IntArrayList;
Contributor

Did you intend to use this namespace (seems odd since it comes from a testing tool)?

Member Author
@martijnvg Feb 10, 2025

I think hppc may have originated from the randomized testing framework that both Elasticsearch and Lucene use today. However, it is currently a standalone high-performance primitive collections library: https://github.com/carrotsearch/hppc, which also has other production usage in Elasticsearch.

Contributor

What about using the Lucene one: package org.apache.lucene.internal.hppc

Member Author

The Lucene core library only forked a subset of the hppc primitive collections library, and IntArrayList is included. But the javadocs say it was forked from version 0.10.0, while Elasticsearch uses version 0.8.1 of that library. I see most Elasticsearch usages of hppc use the dependency directly.
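
For context, a minimal sketch of how the class in question is used (written against the hppc API; the Lucene fork under org.apache.lucene.internal.hppc exposes an equivalent IntArrayList, though as noted the fork is based on a newer version, so details may differ):

import com.carrotsearch.hppc.IntArrayList;

// Minimal illustration of the primitive int list used in this change: values are
// collected without boxing and handed off as an int[].
class IntArrayListExample {
    public static void main(String[] args) {
        IntArrayList docIds = new IntArrayList();
        docIds.add(3);
        docIds.add(7);
        docIds.add(11);
        int[] asArray = docIds.toArray();
        System.out.println(asArray.length); // prints 3
    }
}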


import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.ScoreDoc;
@@ -191,8 +193,28 @@ private Translog.Operation[] loadDocuments(List<SearchRecord> documentRecords) t
maxDoc = leafReaderContext.reader().maxDoc();
} while (docRecord.docID() >= docBase + maxDoc);

leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, null);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), null);
// TODO: instead of building an array, consider just checking whether doc ids are dense.
Member

There is some knowledge in the PR description and comments that deserves to be captured in the code as a comment, explaining why we always provide the doc IDs set.

// Note, field loaders would then lose the ability to optionally load values eagerly.
IntArrayList nextDocIds = new IntArrayList();
for (int j = i; j < documentRecords.size(); j++) {
Member

As far as I understand, we're not increasing the complexity of the method by iterating over documentRecords again here (as we already iterate over documentRecords in the outer loop), because we only compute nextDocIds for one leaf reader. Can you confirm?

Member Author

Yes, we only compute the doc IDs for the current leaf reader. If a doc ID is higher than or equal to docBase + maxDoc, then the current document record belongs to the next leaf reader.

var record = documentRecords.get(j);
if (record.isTombstone()) {
continue;
}
int docID = record.docID();
if (docID >= docBase + maxDoc) {
break;
}
int segmentDocID = docID - docBase;
nextDocIds.add(segmentDocID);
}

// This computed doc IDs array is used by the stored field loader as a heuristic to determine whether to use a sequential
// stored field reader (which bulk loads stored fields and avoids decompressing the same blocks multiple times). For the
// source loader, it is also used as a heuristic for bulk reading doc values (e.g. SingletonDocValuesLoader).
int[] nextDocIdArray = nextDocIds.toArray();
leafFieldLoader = storedFieldLoader.getLoader(leafReaderContext, nextDocIdArray);
leafSourceLoader = sourceLoader.leaf(leafReaderContext.reader(), nextDocIdArray);
Contributor

Another side effect of providing the array of document IDs is that some field loaders may choose to load their values eagerly. I don't see this as a problem, but I wanted to point out that we would lose this behavior if we implement the TODO above.

Member Author

Agreed. I will update the TODO to include that perspective.
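
For illustration only, the TODO's alternative could look roughly like the sketch below: instead of materializing the doc ID array, scan the records of the current leaf and decide whether the IDs are dense. The method name and density threshold are hypothetical, it reuses the same SearchRecord accessors as the code above, and as noted it would give up the eager-loading behavior that the materialized array enables.

// Hypothetical sketch of the TODO: decide whether the doc IDs of the current leaf
// are dense instead of building an int[]. The density threshold here is arbitrary.
static boolean nextDocIdsAreDense(List<SearchRecord> documentRecords, int from, int docBase, int maxDoc) {
    int first = -1;
    int last = -1;
    int count = 0;
    for (int j = from; j < documentRecords.size(); j++) {
        var record = documentRecords.get(j);
        if (record.isTombstone()) {
            continue;
        }
        int docID = record.docID();
        if (docID >= docBase + maxDoc) {
            break; // this record belongs to the next leaf reader
        }
        int segmentDocID = docID - docBase;
        if (first == -1) {
            first = segmentDocID;
        }
        last = segmentDocID;
        count++;
    }
    // Treat the IDs as dense when they cover at least half of the spanned range.
    return count > 0 && (last - first + 1) <= 2 * count;
}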

Contributor

Note that some doc values loaders already apply this strategy when doc ids are provided and there is a single value per field:
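
The specific loader referenced here is not shown. Purely as a hypothetical illustration of the idea, a single-valued numeric field could be bulk-read up front when the doc IDs for a leaf are known (assumes the IDs are in ascending segment order; this is not the actual Elasticsearch loader code):

import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;

// Hypothetical illustration: eagerly load a single-valued numeric doc values field
// for a known, ascending set of segment doc IDs, instead of advancing lazily per document.
class EagerDocValuesExample {
    static long[] loadAll(NumericDocValues values, int[] segmentDocIds) throws IOException {
        long[] loaded = new long[segmentDocIds.length];
        for (int i = 0; i < segmentDocIds.length; i++) {
            if (values.advanceExact(segmentDocIds[i])) {
                loaded[i] = values.longValue();
            }
        }
        return loaded;
    }
}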

setNextSourceMetadataReader(leafReaderContext);
}
int segmentDocID = docRecord.docID() - docBase;