
Conversation

@stefanvodita
Contributor

In cases where we know there is an upper limit to the potential size of an array, we can use growInRange to avoid allocating beyond that limit.

We address such cases in DirectoryTaxonomyReader and NeighborArray.

Closes #12839
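A minimal sketch of the bounded-growth idea (simplified and self-contained; Lucene's actual ArrayUtil.growInRange uses oversize() for the growth factor, and the class name here is made up for illustration):

```java
import java.util.Arrays;

public class GrowInRangeSketch {
  // Grow with over-allocation (simple doubling here), but never beyond maxLength.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "requested minimum array length "
              + minLength
              + " is larger than requested maximum array length "
              + maxLength);
    }
    if (array.length >= minLength) {
      return array; // already large enough, nothing to do
    }
    // Over-allocate exponentially, then clamp to the upper bound.
    int potentialLength = Math.max(minLength, array.length * 2);
    return Arrays.copyOf(array, Math.min(potentialLength, maxLength));
  }

  public static void main(String[] args) {
    System.out.println(growInRange(new int[4], 6, 7).length);   // doubling gives 8, clamped to 7
    System.out.println(growInRange(new int[8], 5, 100).length); // already large enough: 8
  }
}
```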

+ minLength
+ " is larger than requested maximum array length "
+ maxLength);
}
Contributor

Should we validate this first, and only then check if the array is already large enough?

Contributor Author

Initially, I had placed the validation first. I changed it to cover cases where INITIAL_CAPACITY >= minSize > maxSize. In these cases, we've already allocated beyond maxSize, so we might as well use that memory. Do you think that ends up being more confusing than validating first?

Contributor

I think it would be strange to allow minSize to be greater than maxSize; that could introduce latent bugs.

Contributor Author

I thought about this some more and I applied your suggestion in the latest commit. I still think there's a case where checking first will be inconvenient, e.g. if we don't have information about maxLength where we initialize the array and we end up initializing to a capacity larger than maxLength. But this is hypothetical, maybe we don't have any cases like that in the code base. At this time, I agree the proposal you two made is better.

*/
-public class NeighborArray {
+public class NeighborArray implements Accountable {
   private static final int INITIAL_CAPACITY = 10;
Contributor

@msokolov @benwtrent Can you double check it's ok to grow this array on demand vs. pre-sizing like today? It looks like it could save significant amounts of memory?

Member

@zhaih as well, since there are new concurrency concerns on some of these things.

I can read through here to see if anything stands out to me in the NeighborArray.

Contributor

We used to grow on demand, then we found the arrays were always fully populated and switched to preallocating, but the diversity check changed and they no longer are! So as far as that goes, growing on demand seems good to me, although I agree we need to check carefully due to concurrent updates to this data structure. That could be a deal-breaker or require locking we might not want? I'll read.

Contributor

OK, I believe it's safe to grow on demand, because the concurrent impl is used only when merging, and in that case we add the initial set of nodes on one thread and then NeighborArray is only updated when adding reciprocal neighbors (in HnswGraphBuilder.addDiverseNeighbors), where we acquire a write lock that should prevent concurrent updates. Basically, we already handle synchronization around these arrays for the purpose of adding/removing entries, so we should be free to resize (and replace the array pointer) in those cases.
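The invariant described here (all structural changes, including the resize that swaps the array pointer, happen only under the write lock) can be sketched with a self-contained example. The class and field names are illustrative, not the actual HnswGraphBuilder/NeighborArray code:

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Readers and writers share a ReadWriteLock, so replacing the backing array
// during a resize can never be observed mid-swap by a concurrent reader.
class LockedNeighborsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int[] nodes = new int[4];
  private int size = 0;

  void add(int node) {
    lock.writeLock().lock();
    try {
      if (size == nodes.length) {
        // Safe: no reader is mid-read while we hold the write lock.
        nodes = Arrays.copyOf(nodes, nodes.length * 2);
      }
      nodes[size++] = node;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int size() {
    lock.readLock().lock();
    try {
      return size;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```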

Contributor

It's OK to grow because it's under lock, as @msokolov pointed out. However, I think we previously avoided growing because we didn't want to spend extra time on resizing and copying?
Could you run the knn benchmark to check that the performance is OK? (Maybe prefer the multithreaded merge version, to also check that the resizing is safe.)

return (long) node.length * (Integer.BYTES + Float.BYTES)
+ RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2L
+ RamUsageEstimator.NUM_BYTES_OBJECT_REF * 2L
+ Integer.BYTES * 5;
Contributor

Let's use the RamUsageEstimator utility functions instead of estimating RAM usage manually? (shallowSizeOfInstance and sizeOf specifically)

for (NeighborArray[] neighborArraysPerNode : graph) {
if (neighborArraysPerNode != null) {
for (NeighborArray neighborArrayPerNodeAndLevel : neighborArraysPerNode) {
total += neighborArrayPerNodeAndLevel.ramBytesUsed();
Contributor

Hmmm, this can make this method potentially a bit slow. It might be OK since normally nobody will call it on any critical code path, I guess, but it would still be better if we could have a faster estimate rather than the fully accurate accounting?


}

int potentialLength = oversize(minLength, Integer.BYTES);
if (potentialLength > maxLength) {
Contributor

I think growExact(array, Math.min(potentialLength, maxLength)) would be clearer?


/**
* Returns an array whose size is at least {@code minSize}, generally over-allocating
* exponentially
Contributor

@dungba88 dungba88 Nov 29, 2023

Would it make sense if we delegate the existing grow(int[], int) to the new one to avoid having double code path? Something like return growInRange(array, minSize, Integer.MAX_VALUE) would work I guess.
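The suggested delegation might look something like the following (a sketch against an assumed growInRange signature, not the committed ArrayUtil code):

```java
import java.util.Arrays;

public class DelegationSketch {
  // The existing unbounded grow(int[], int) becomes a thin wrapper around the
  // bounded variant, with Integer.MAX_VALUE serving as "no effective limit".
  static int[] grow(int[] array, int minSize) {
    return growInRange(array, minSize, Integer.MAX_VALUE);
  }

  // Simplified stand-in for the bounded growth method.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (array.length >= minLength) {
      return array;
    }
    int potentialLength = Math.max(minLength, array.length * 2);
    return Arrays.copyOf(array, Math.min(potentialLength, maxLength));
  }

  public static void main(String[] args) {
    System.out.println(grow(new int[3], 10).length); // 10
  }
}
```

This keeps a single growth code path, so any future change to the over-allocation policy applies to both entry points.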

@benwtrent
Member

I ran knnPerfTest in luceneutil against this PR with 100k Cohere vectors. Flushing was dependent on memory usage. (Do we check the RAM usage of the on-heap graph during indexing to determine when to flush?)

main branch:

recall  latency nDoc    fanout  maxConn beamWidth       index ms
0.912    0.77   100000  0       16      100             48602

This branch:

recall  latency nDoc    fanout  maxConn beamWidth       index ms
0.912    0.75   100000  0       16      100             165929

Indexing is about 3.5x slower with this change. I don't know if it's due to the OnHeapGraph memory estimation being slow or the node resizing. I am gonna run a profiler to see what's up.

Good news is that search latency and recall are unchanged. Forced merging time seems about the same as well :).

@benwtrent
Member

Yeah, ramBytesUsed changes here are adding significant overhead to indexing. Here is a cpu profile.

pr12844-768-100000-wall.jfr.zip

@stefanvodita
Contributor Author

Thank you for running the benchmarks @benwtrent, you were quicker than I was. I was worried the modified ramBytesUsed would be slow. Maybe we can come up with a smarter way to do the estimation.

@zhaih
Contributor

zhaih commented Nov 29, 2023

@benwtrent Thanks for running the benchmark. I looked at the profile and I think we call ramBytesUsed after every document is indexed to control the flush here.

@stefanvodita I can think of two options here:

  1. Just use maxSize as before for estimation; it's not too accurate, but at least it is a good upper bound.
  2. Carefully account for the extra memory used when we add new nodes to the graph and accumulate it in an AtomicLong, including the cost of the node itself as well as the cost of resizing the existing neighbors' NeighborArrays. I think this is doable (in both single-threaded and multithreaded situations) but a bit tricky.
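Option 2 might look roughly like this (purely an illustrative sketch; the class name, method names, and what exactly gets accounted are assumptions, and wiring the callbacks through the graph builder is the tricky part mentioned above):

```java
import java.util.concurrent.atomic.AtomicLong;

// Incremental RAM accounting: instead of walking every NeighborArray on each
// ramBytesUsed() call, accumulate deltas as allocations happen.
class RamAccountingSketch {
  private final AtomicLong ramBytesUsed = new AtomicLong();

  // Called when a neighbor array grows from oldCapacity to newCapacity entries;
  // each entry costs one int (node id) plus one float (score).
  void onArrayGrown(int oldCapacity, int newCapacity) {
    ramBytesUsed.addAndGet(
        (long) (newCapacity - oldCapacity) * (Integer.BYTES + Float.BYTES));
  }

  // O(1) per call, instead of O(nodes * levels) for a full graph walk.
  long ramBytesUsed() {
    return ramBytesUsed.get();
  }
}
```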

@stefanvodita
Contributor Author

Thanks for the suggestions @zhaih! I have to think about option 2 a bit more.

If we change ramBytesUsed back, performance recovers (Mostly? I'm not sure how noisy this benchmark is).

Benchmark config:

dim = 100
doc_vectors = '%s/data/enwiki-20120502-lines-1k-100d.vec' % constants.BASE_DIR
query_vectors = '%s/util/tasks/vector-task-minilm.vec' % constants.BASE_DIR

main:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2551    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43620   1.00    post-filter
0.512    0.27   200000  0       64      250     100     108235  1.00    post-filter

This PR:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     5028    1.00    post-filter
0.542    0.23   100000  0       64      250     100     315369  1.00    post-filter
0.512    0.28   200000  0       64      250     100     1578461 1.00    post-filter

This PR with the old memory estimation:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2497    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43886   1.00    post-filter
0.512    0.28   200000  0       64      250     100     117152  1.00    post-filter

@stefanvodita
Contributor Author

I thought some more about option 2. It does seem quite tricky. OnHeapHnswGraph only knows about NeighborArrays being created, but it doesn't know about nodes being added to the arrays - HnswGraphBuilder handles that. I don't know this code well, so I could be missing something.

I'm not sure if a more precise estimate is worth pursuing or not. I've pushed the version where we keep memory estimation like it was and I've added @dungba88's suggestions. I'll try to do a few more benchmark runs to see if there is a measurable slow-down and if we can reduce it by increasing INITIAL_CAPACITY.

@zhaih
Contributor

zhaih commented Nov 30, 2023 via email

@stefanvodita
Contributor Author

I did 5 benchmark runs for 4 configurations. To avoid making this comment way too large, I'll just report the averages across the 5 runs per configuration.
It looks like there are some genuine differences between the baseline and candidate, but I'm not sure the differences between the candidate configurations are significant. How do we trade off between the extra milliseconds and the memory savings?

baseline
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2502    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43624   1.00    post-filter
0.512    0.28   200000  0       64      250     100     111019  1.00    post-filter

candidate with INITIAL_CAPACITY == 10
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2498    1.00    post-filter
0.542    0.23   100000  0       64      250     100     45063   1.00    post-filter
0.512    0.28   200000  0       64      250     100     116683  1.00    post-filter

candidate with INITIAL_CAPACITY == 100
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2528    1.00    post-filter
0.542    0.23   100000  0       64      250     100     46055   1.00    post-filter
0.512    0.28   200000  0       64      250     100     115407  1.00    post-filter

candidate with INITIAL_CAPACITY == 1000
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2518    1.00    post-filter
0.542    0.23   100000  0       64      250     100     46617   1.00    post-filter
0.512    0.29   200000  0       64      250     100     118359  1.00    post-filter

@benwtrent
Member

@stefanvodita

How do we trade off between the extra miliseconds and the memory savings?

It would be good to know the actual memory savings. I don't know how to measure the trade-off without knowing what we are trading off for.

 public NeighborArray(int maxSize, boolean descOrder) {
-  node = new int[maxSize];
-  score = new float[maxSize];
+  node = new int[INITIAL_CAPACITY];
Contributor

@dungba88 dungba88 Dec 1, 2023

I was new to this class; just curious what happens if INITIAL_CAPACITY > maxSize. I thought maxSize is the maximum size that we ever need.

But looking at the previous code, it seems the array could still be expanded even after we had fully pre-allocated it with maxSize, from the line node = ArrayUtil.grow(node);. Is it just that it was initially grow-on-demand, then changed to full preallocation, without cleaning that up?

If maxSize is indeed the maximum size we ever need, maybe Math.min(INITIAL_CAPACITY, maxSize) would be better.

Contributor

This seems to come from the M parameter, which is the maximum number of connections (neighbors) a node can have: M on the upper levels and M*2 on level 0.

Member

We should never ever allow INITIAL_CAPACITY to be used to create these arrays when it is larger than maxSize. This would be a serious bug IMO.

Contributor

I think it's some leftover stuff from before, yes. Although it's also possible we have a bug in our maxSize initialization? I don't think we do though.

Contributor

I really don't understand why candidate with INITIAL_CAPACITY == 1000 would be slower. Is it GC-related? Have you been able to look at JFR at all?

Contributor

@dungba88 dungba88 Dec 1, 2023

Could it be due to the over-allocation, since every node now allocates a 1000-element array whether it needs it or not?

Contributor Author

what happen if INITIAL_CAPACITY > maxSize

Great point. I fixed this and tested with INITIAL_CAPACITY == 1000 again. In the profiler, it looks the same as running the baseline. I assume nodes tend not to have 1000 neighbors.

Member

@stefanvodita with a maxConn of 64, nodes will have at most 128 neighbors (maybe 129, since when we exceed the accounted size, we then remove one before storing).

Contributor Author

Oh, I see. Is there some recommended value or normal range for maxConn?

Contributor

I think it's just due to consuming more memory -> more GC cycles needed -> higher latency.

So I rechecked the code: when we insert a node, we first collect beamWidth candidates, and then try to diversely add those candidates to the NeighborArray. So I think:

  • in the case where beamWidth > maxSize, we can just init this with maxSize and be done, because in a larger graph the first fill will likely fill the NeighborArray completely, and there's no point in resizing it from any other init size.
  • in the case where beamWidth < maxSize, we can init this with beamWidth, such that the initial fill will likely leave the array nearly full?

@stefanvodita
Contributor Author

I did some memory profiling and it doesn't look promising. Let's take initial capacity 100 as an example. Total allocation is 3.41GB compared to 1.29GB on the baseline, and peak heap usage is 946MiB compared to 547MiB on the baseline. It's worse for smaller initial capacities, and larger capacities devolve into the baseline behavior.

I might try a few runs on a different dataset, but maybe these neighbor arrays tend to be mostly full in the general case.

Baseline: baseline.jfr.zip

Candidate: candidate-100.jfr.zip

@stefanvodita
Contributor Author

Unless anyone has other ideas, I’ll revert the changes to NeighborArray and only keep growInRange for DirectoryTaxonomyReader. Separately, I will go through other uses of the array growth API and see if growInRange is more appropriate.

 } else {
   indexesMissingFromCache =
-      ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 1);
+      ArrayUtil.growInRange(
Contributor

I think we could add a comment here saying that indexesMissingFromCache cannot grow beyond categoryPaths.

I was also wondering if an assert is needed here, but the check in growInRange should be enough.

@stefanvodita
Contributor Author

@zhaih - I also tried your idea about beamWidth. I passed beamWidth in to NeighborArray and then changed the initial capacity of the arrays there, like so:

- int arraySize = Math.min(INITIAL_CAPACITY, maxSize);
+ int arraySize = Math.min(maxSize, beamWidth);

In the profiler, it doesn't look very different from baseline. I didn't see any resizing. If I understood you correctly, you were expecting there to be occasional resizing of arrays when beamWidth < maxSize.

The latency isn't measurably better either. Results are averaged from 5 runs (baseline is here).

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2531    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43960   1.00    post-filter
0.512    0.28   200000  0       64      250     100     111647  1.00    post-filter

@stefanvodita
Contributor Author

I reverted the changes to NeighborArray and OnHeapHnswGraph, since it doesn't look like we're making an improvement there.

assert minLength >= 0
: "length must be positive (got " + minLength + "): likely integer overflow?";

if (minLength > maxLength) {
Contributor

I think this can be an assert instead, since this class is marked as internal.

Contributor Author

Let's contrast this with the assertion above. A negative minLength is probably not intended, but we can technically handle it, so we don't stop execution unless assertions are enabled. On the other hand, minLength > maxLength is not a case we can handle while obeying the contract of this method. I guess the contrast is unintended input vs. invalid input. That's why I prefer the current arrangement, but let me know if you feel strongly otherwise.

Contributor

That's interesting, I didn't realize that.

I just learnt recently from #12624 that assert is used on internal code paths to catch bugs, while throwing exceptions is used on code paths that users can directly control. In Lucene we always have assertions enabled, so an assert would surely throw in tests. This class is marked as internal and not to be used by users, so I thought it would be fine to just use assert here.

Anyhow, I'm fine with either. Maybe other people could have more thoughts here.

int node = nodesOnLevel.nextInt();
NeighborArray neighbors = hnsw.getNeighbors(level, node);
long maxNeighborsSize;
if (level == 0) {
Contributor

I think generally we would rather avoid having tests duplicate assumptions or logic from the production path? This seems to be a specific implementation decision that could change independently. I couldn't think of a better way though.

But I'm unsure about the need for this newly added code. It seems we only compute it in a single test, where we want a better estimate? The test seems to verify that our over-estimation cannot be more than 30% of the actual size. If we provide a better estimate, maybe we can lower the tolerance threshold?

Still, it seems strange that we would truly need this better estimate but only in a test.

Contributor Author

Sorry, this addition made sense with the changes to OnHeapHnswGraph. I forgot to remove it when I reverted those changes. Reverted now.

Contributor

@zhaih zhaih left a comment

LGTM

@zhaih
Contributor

zhaih commented Dec 14, 2023

@stefanvodita Could you move the change entry to 9.10? Then I can merge it

In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.
@stefanvodita
Contributor Author

Done, thank you @zhaih! I've opened #12941 to replace other uses of the unbounded growth API.

@zhaih zhaih merged commit b0ebb84 into apache:main Dec 14, 2023
zhaih pushed a commit that referenced this pull request Dec 14, 2023
In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.


Development

Successfully merging this pull request may close these issues.

Grow arrays up to a given limit to avoid overallocation where possible
