Skip to content
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,8 @@ public class HnswGraphBuilder implements HnswBuilder {
protected final int M; // max number of connections on upper layers
private final double ml;

private final int[] bulkScoreNodes; // for bulk scoring
private final float[] bulkScores; // for bulk scoring
Comment on lines +72 to +73
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about thread-safety of these arrays (given we have concurrent merging), but from this comment it looks like instances of this class are not shared across threads, but rather multiple instances of this class (across different threads) can operate on a single HnswGraph?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaivalnp the scorer itself isn't threadsafe. I assumed that since we were using a scorer, we were OK.

I had the same threading concerns and looked up and it seems that each thread as a unique builder object instance (The worker of the thread) and they all work on the same graph.

private final SplittableRandom random;
protected final UpdateableRandomVectorScorer scorer;
protected final HnswGraphSearcher graphSearcher;
Expand Down Expand Up @@ -156,6 +158,10 @@ protected HnswGraphBuilder(
this.hnsw = hnsw;
this.hnswLock = hnswLock;
this.graphSearcher = graphSearcher;
// pick a number that keeps us from scoring TOO much for diversity checking
// but enough to take advantage of bulk scoring
this.bulkScoreNodes = new int[8];
this.bulkScores = new float[8];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe host 8 up as a private static final variable?

entryCandidates = new GraphBuilderKnnCollector(1);
beamCandidates = new GraphBuilderKnnCollector(beamWidth);
beamCandidates0 = new GraphBuilderKnnCollector(Math.min(beamWidth / 2, M * 3));
Expand Down Expand Up @@ -470,12 +476,17 @@ static void popToScratch(GraphBuilderKnnCollector candidates, NeighborArray scra
*/
private boolean diversityCheck(float score, NeighborArray neighbors, RandomVectorScorer scorer)
throws IOException {
int bulkCount = 0;
for (int i = 0; i < neighbors.size(); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't need to perform any operation per-node of this loop, can we reduce some iterations using something like:

private boolean diversityCheck(float score, NeighborArray neighbors, RandomVectorScorer scorer)
    throws IOException {
  final int chunk = Math.min(Math.ceilDiv(neighbors.size(), 2), bulkScoreNodes.length);
  for (int start = 0; start < neighbors.size(); start += chunk) {
    int length = Math.min(neighbors.size() - start, chunk);
    System.arraycopy(neighbors.nodes(), start, bulkScoreNodes, 0, length);
    if (scorer.bulkScore(bulkScoreNodes, bulkScores, length) >= score) {
      return false;
    }
  }
  return true;
}

float neighborSimilarity = scorer.score(neighbors.nodes()[i]);
if (neighborSimilarity >= score) {
return false;
bulkScoreNodes[bulkCount++] = neighbors.nodes()[i];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, doing 8 always, even if there are only 8 connections seems foolish. I will benchmark with Math.min((neighbors.nodes().length + 1)/2, bulkScoreNodes.length)).

if (bulkCount == bulkScoreNodes.length || i == neighbors.size() - 1) {
if (scorer.bulkScore(bulkScoreNodes, bulkScores, bulkCount) >= score) {
return false;
}
bulkCount = 0;
}
}
assert bulkCount == 0;
return true;
}

Expand Down
Loading