
Conversation

@stefanvodita
Contributor

In cases where we know there is an upper limit to the potential size of an array, we can use growInRange to avoid allocating beyond that limit.

We address such cases in DirectoryTaxonomyReader and NeighborArray.

Closes #12839
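A minimal sketch of the bounded-growth idea (simplified and self-contained; Lucene's actual ArrayUtil.growInRange uses oversize() for the growth factor, and the class name here is made up for illustration):

```java
import java.util.Arrays;

public class GrowInRangeSketch {
  // Grow with over-allocation (simple doubling here), but never beyond maxLength.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "requested minimum array length "
              + minLength
              + " is larger than requested maximum array length "
              + maxLength);
    }
    if (array.length >= minLength) {
      return array; // already large enough, nothing to do
    }
    // Over-allocate exponentially, then clamp to the upper bound.
    int potentialLength = Math.max(minLength, array.length * 2);
    return Arrays.copyOf(array, Math.min(potentialLength, maxLength));
  }

  public static void main(String[] args) {
    System.out.println(growInRange(new int[4], 6, 7).length);   // doubling gives 8, clamped to 7
    System.out.println(growInRange(new int[8], 5, 100).length); // already large enough: 8
  }
}
```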

+ minLength
+ " is larger than requested maximum array length "
+ maxLength);
}
Contributor

Should we validate this first, and only then check if the array is already large enough?

Contributor Author

Initially, I had placed the validation first. I changed it to cover cases where INITIAL_CAPACITY >= minSize > maxSize. In these cases, we've already allocated beyond maxSize, so we might as well use that memory. Do you think that ends up being more confusing than validating first?

Contributor

I think it would be strange to allow minSize to be greater than maxSize; that could introduce latent bugs.

Contributor Author

I thought about this some more and I applied your suggestion in the latest commit. I still think there's a case where checking first will be inconvenient, e.g. if we don't have information about maxLength where we initialize the array and we end up initializing to a capacity larger than maxLength. But this is hypothetical, maybe we don't have any cases like that in the code base. At this time, I agree the proposal you two made is better.

*/
-public class NeighborArray {
+public class NeighborArray implements Accountable {
   private static final int INITIAL_CAPACITY = 10;
Contributor

@msokolov @benwtrent Can you double check it's ok to grow this array on demand vs. pre-sizing like today? It looks like it could save significant amounts of memory?

Member

@zhaih as well, since there are new concurrency concerns on some of these things.

I can read through here to see if anything stands out to me in the NeighborArray.

Contributor

We used to grow on demand, then we found the arrays were always fully populated and switched to preallocating, but the diversity check changed and they no longer are! So as far as that goes, growing on demand seems good to me, although I agree we need to check carefully due to concurrent updates to this data structure. That could be a deal-breaker or require locking we might not want? I'll read.

Contributor

OK, I believe it's safe to grow on demand, because the concurrent impl is used only when merging, and in that case we add the initial set of nodes on one thread and then NeighborArray is only updated when adding reciprocal neighbors (in HnswGraphBuilder.addDiverseNeighbors), where we acquire a write lock that should prevent concurrent updates. Basically, we already handle synchronization around these arrays for the purpose of adding/removing entries, so we should be free to resize (and replace the array pointer) in those cases.
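The invariant described here (all structural changes, including the resize that swaps the array pointer, happen only under the write lock) can be sketched with a self-contained example. The class and field names are illustrative, not the actual HnswGraphBuilder/NeighborArray code:

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Readers and writers share a ReadWriteLock, so replacing the backing array
// during a resize can never be observed mid-swap by a concurrent reader.
class LockedNeighborsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int[] nodes = new int[4];
  private int size = 0;

  void add(int node) {
    lock.writeLock().lock();
    try {
      if (size == nodes.length) {
        // Safe: no reader is mid-read while we hold the write lock.
        nodes = Arrays.copyOf(nodes, nodes.length * 2);
      }
      nodes[size++] = node;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int size() {
    lock.readLock().lock();
    try {
      return size;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```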

Contributor

It's OK to grow because it's under lock, as @msokolov pointed out. However, I think we previously avoided growing because we didn't want to spend extra time on resizing and copying?
Could you run the knn benchmark to check that the performance is OK? (Maybe prefer the multithreaded merge version, to also check that the resizing is safe.)

return (long) node.length * (Integer.BYTES + Float.BYTES)
+ RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2L
+ RamUsageEstimator.NUM_BYTES_OBJECT_REF * 2L
+ Integer.BYTES * 5;
Contributor

Let's use the RamUsageEstimator utility functions instead of estimating RAM usage manually? (shallowSizeOfInstance and sizeOf specifically)

for (NeighborArray[] neighborArraysPerNode : graph) {
if (neighborArraysPerNode != null) {
for (NeighborArray neighborArrayPerNodeAndLevel : neighborArraysPerNode) {
total += neighborArrayPerNodeAndLevel.ramBytesUsed();
Contributor

Hmmm, this can make this method potentially a bit slow. It might be OK since normally nobody will call it on any critical code path, I guess, but it would still be better if we could have a faster estimate rather than the fully accurate accounting?


}

int potentialLength = oversize(minLength, Integer.BYTES);
if (potentialLength > maxLength) {
Contributor

I think growExact(array, Math.min(potentialLength, maxLength)) would be clearer?


/**
* Returns an array whose size is at least {@code minSize}, generally over-allocating
* exponentially
Contributor

@dungba88 dungba88 Nov 29, 2023

Would it make sense if we delegate the existing grow(int[], int) to the new one to avoid having double code path? Something like return growInRange(array, minSize, Integer.MAX_VALUE) would work I guess.
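The suggested delegation might look something like the following (a sketch against an assumed growInRange signature, not the committed ArrayUtil code):

```java
import java.util.Arrays;

public class DelegationSketch {
  // The existing unbounded grow(int[], int) becomes a thin wrapper around the
  // bounded variant, with Integer.MAX_VALUE serving as "no effective limit".
  static int[] grow(int[] array, int minSize) {
    return growInRange(array, minSize, Integer.MAX_VALUE);
  }

  // Simplified stand-in for the bounded growth method.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (array.length >= minLength) {
      return array;
    }
    int potentialLength = Math.max(minLength, array.length * 2);
    return Arrays.copyOf(array, Math.min(potentialLength, maxLength));
  }

  public static void main(String[] args) {
    System.out.println(grow(new int[3], 10).length); // 10
  }
}
```

This keeps a single growth code path, so any future change to the over-allocation policy applies to both entry points.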

@benwtrent
Member

I ran knnPerfTest in luceneutil against this PR with 100k Cohere vectors. Flushing was dependent on memory usage. (Do we check the RAM usage of the on-heap graph during indexing to determine when to flush?)

main branch:

recall  latency nDoc    fanout  maxConn beamWidth       index ms
0.912    0.77   100000  0       16      100             48602

This branch:

recall  latency nDoc    fanout  maxConn beamWidth       index ms
0.912    0.75   100000  0       16      100             165929

Indexing is about 3.5x slower with this change. I don't know if it's due to the OnHeapGraph memory estimation being slow or the node resizing. I am gonna run a profiler to see what's up.

Good news is that search latency and recall are unchanged. Forced merging time seems about the same as well :).

@benwtrent
Member

Yeah, ramBytesUsed changes here are adding significant overhead to indexing. Here is a cpu profile.

pr12844-768-100000-wall.jfr.zip

@stefanvodita
Contributor Author

Thank you for running the benchmarks @benwtrent, you were quicker than I was. I was worried the modified ramBytesUsed would be slow. Maybe we can come up with a smarter way to do the estimation.

@zhaih
Contributor

zhaih commented Nov 29, 2023

@benwtrent Thanks for running the benchmark. I looked at the profile and I think we call ramBytesUsed after every document is indexed to control the flush here.

@stefanvodita I can think of two options here:

  1. Just use maxSize as before for estimation; it's not too accurate, but at least it is a good upper bound.
  2. Carefully account for the extra memory used when we add new nodes to the graph and accumulate it in an AtomicLong, including the cost of the node itself as well as the cost of resizing the existing neighbors' NeighborArrays. I think this is doable (in both single-threaded and multithreaded situations) but a bit tricky.
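Option 2 might look roughly like this (purely an illustrative sketch; the class name, method names, and what exactly gets accounted are assumptions, and wiring the callbacks through the graph builder is the tricky part mentioned above):

```java
import java.util.concurrent.atomic.AtomicLong;

// Incremental RAM accounting: instead of walking every NeighborArray on each
// ramBytesUsed() call, accumulate deltas as allocations happen.
class RamAccountingSketch {
  private final AtomicLong ramBytesUsed = new AtomicLong();

  // Called when a neighbor array grows from oldCapacity to newCapacity entries;
  // each entry costs one int (node id) plus one float (score).
  void onArrayGrown(int oldCapacity, int newCapacity) {
    ramBytesUsed.addAndGet(
        (long) (newCapacity - oldCapacity) * (Integer.BYTES + Float.BYTES));
  }

  // O(1) per call, instead of O(nodes * levels) for a full graph walk.
  long ramBytesUsed() {
    return ramBytesUsed.get();
  }
}
```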

@stefanvodita
Contributor Author

Thanks for the suggestions @zhaih! I have to think about option 2 a bit more.

If we change ramBytesUsed back, performance recovers (Mostly? I'm not sure how noisy this benchmark is).

Benchmark config:

dim = 100
doc_vectors = '%s/data/enwiki-20120502-lines-1k-100d.vec' % constants.BASE_DIR
query_vectors = '%s/util/tasks/vector-task-minilm.vec' % constants.BASE_DIR

main:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2551    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43620   1.00    post-filter
0.512    0.27   200000  0       64      250     100     108235  1.00    post-filter

This PR:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     5028    1.00    post-filter
0.542    0.23   100000  0       64      250     100     315369  1.00    post-filter
0.512    0.28   200000  0       64      250     100     1578461 1.00    post-filter

This PR with the old memory estimation:

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2497    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43886   1.00    post-filter
0.512    0.28   200000  0       64      250     100     117152  1.00    post-filter

@stefanvodita
Contributor Author

I thought some more about option 2. It does seem quite tricky. OnHeapHnswGraph only knows about NeighborArrays being created, but it doesn't know about nodes being added to the arrays - HnswGraphBuilder handles that. I don't know this code well, so I could be missing something.

I'm not sure if a more precise estimate is worth pursuing or not. I've pushed the version where we keep memory estimation like it was and I've added @dungba88's suggestions. I'll try to do a few more benchmark runs to see if there is a measurable slow-down and if we can reduce it by increasing INITIAL_CAPACITY.

@zhaih
Contributor

zhaih commented Nov 30, 2023 via email

@stefanvodita
Contributor Author

I did 5 benchmark runs for 4 configurations. To avoid making this comment way too large, I'll just report the averages across the 5 runs per configuration.
It looks like there are some genuine differences between the baseline and candidate, but I'm not sure the differences between the candidate configurations are significant. How do we trade off between the extra milliseconds and the memory savings?

baseline
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2502    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43624   1.00    post-filter
0.512    0.28   200000  0       64      250     100     111019  1.00    post-filter

candidate with INITIAL_CAPACITY == 10
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2498    1.00    post-filter
0.542    0.23   100000  0       64      250     100     45063   1.00    post-filter
0.512    0.28   200000  0       64      250     100     116683  1.00    post-filter

candidate with INITIAL_CAPACITY == 100
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2528    1.00    post-filter
0.542    0.23   100000  0       64      250     100     46055   1.00    post-filter
0.512    0.28   200000  0       64      250     100     115407  1.00    post-filter

candidate with INITIAL_CAPACITY == 1000
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2518    1.00    post-filter
0.542    0.23   100000  0       64      250     100     46617   1.00    post-filter
0.512    0.29   200000  0       64      250     100     118359  1.00    post-filter

@benwtrent
Member

@stefanvodita

How do we trade off between the extra miliseconds and the memory savings?

It would be good to know the actual memory savings. I don't know how to measure the trade-off without knowing what we are trading off for.

 public NeighborArray(int maxSize, boolean descOrder) {
-  node = new int[maxSize];
-  score = new float[maxSize];
+  node = new int[INITIAL_CAPACITY];
Contributor

@dungba88 dungba88 Dec 1, 2023

I was new to this class; just curious what happens if INITIAL_CAPACITY > maxSize. I thought maxSize is the maximum size that we ever need.

But looking at the previous code, it seems the array could still be expanded even after we had fully pre-allocated it with maxSize, from the line node = ArrayUtil.grow(node);. Is it just that it was initially grow-on-demand, then changed to full preallocation, without cleaning that up?

If maxSize is indeed the maximum size we ever need, maybe Math.min(INITIAL_CAPACITY, maxSize) would be better.

Contributor

This seems to come from the M parameter, which is the maximum number of connections (neighbors) a node can have: M on the upper levels and M*2 on level 0.

Member

We should never ever allow INITIAL_CAPACITY to be used to create these arrays when it is larger than maxSize. This would be a serious bug IMO.

Contributor

I think it's some leftover stuff from before, yes. Although it's also possible we have a bug in our maxSize initialization? I don't think we do though.

Contributor

I really don't understand why candidate with INITIAL_CAPACITY == 1000 would be slower. Is it GC-related? Have you been able to look at JFR at all?

Contributor

@dungba88 dungba88 Dec 1, 2023

Could it be due to the over-allocation, since every node now allocates a 1000-element array whether it needs it or not?

Contributor Author

what happen if INITIAL_CAPACITY > maxSize

Great point. I fixed this and tested with INITIAL_CAPACITY == 1000 again. In the profiler, it looks the same as running the baseline. I assume nodes tend not to have 1000 neighbors.

Member

@stefanvodita with a maxConn of 64, nodes will have at most 128 neighbors (maybe 129, since when we exceed the accounted size, we then remove one before storing).

Contributor Author

Oh, I see. Is there some recommended value or normal range for maxConn?

Contributor

I think it's just due to consuming more memory -> more GC cycles needed -> higher latency.

So I rechecked the code: when we insert a node, we first collect beamWidth candidates, and then try to diversely add those candidates to the NeighborArray. So I think:

  • in the case where beamWidth > maxSize, we can just init this with maxSize and be done, because in a larger graph the first fill will likely fill the NeighborArray completely, and there's no point in resizing it from any other init size.
  • in the case where beamWidth < maxSize, we can init this with beamWidth, such that the initial fill will likely leave the array nearly full?

@stefanvodita
Contributor Author

I did some memory profiling and it doesn't look promising. Let's take initial capacity 100 as an example. Total allocation is 3.41GB compared to 1.29GB on the baseline, and peak heap usage is 946MiB compared to 547MiB on the baseline. It's worse for smaller initial capacities, and larger capacities devolve into the baseline behavior.

I might try a few runs on a different dataset, but maybe these neighbor arrays tend to be mostly full in the general case.

Baseline: baseline.jfr.zip

Candidate: candidate-100.jfr.zip

@stefanvodita
Contributor Author

Unless anyone has other ideas, I’ll revert the changes to NeighborArray and only keep growInRange for DirectoryTaxonomyReader. Separately, I will go through other uses of the array growth API and see if growInRange is more appropriate.

 } else {
   indexesMissingFromCache =
-      ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 1);
+      ArrayUtil.growInRange(
Contributor

I think we could add a comment here saying that indexesMissingFromCache cannot grow beyond categoryPaths.

I was also wondering if an assert is needed here, but the check in growInRange should be enough.

@stefanvodita
Contributor Author

@zhaih - I also tried your idea about beamWidth. I passed beamWidth in to NeighborArray and then changed the initial capacity of the arrays there, like so:

- int arraySize = Math.min(INITIAL_CAPACITY, maxSize);
+ int arraySize = Math.min(maxSize, beamWidth);

In the profiler, it doesn't look very different from baseline. I didn't see any resizing. If I understood you correctly, you were expecting there to be occasional resizing of arrays when beamWidth < maxSize.

The latency isn't measurably better either. Results are averaged from 5 runs (baseline is here).

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.720    0.16   10000   0       64      250     100     2531    1.00    post-filter
0.542    0.23   100000  0       64      250     100     43960   1.00    post-filter
0.512    0.28   200000  0       64      250     100     111647  1.00    post-filter

@stefanvodita
Contributor Author

I reverted the changes to NeighborArray and OnHeapHnswGraph, since it doesn't look like we're making an improvement there.

assert minLength >= 0
: "length must be positive (got " + minLength + "): likely integer overflow?";

if (minLength > maxLength) {
Contributor

I think this can be an assert instead, since this class is marked as internal.

Contributor Author

Let's contrast this with the assertion above. A negative minLength is probably not intended, but we can technically handle it, so we don't stop execution unless assertions are enabled. On the other hand, minLength > maxLength is not a case we can handle while obeying the contract of this method. I guess the contrast is unintended input vs. invalid input. That's why I prefer the current arrangement, but let me know if you feel strongly otherwise.

Contributor

That's interesting, I didn't realize that.

I just learnt recently from #12624 that assert is used on internal code paths to catch bugs, while throwing exceptions is used on code paths that users can directly control. In Lucene we always have assertions enabled, so an assert would surely throw in tests. This class is marked as internal and not to be used by users, so I thought it would be fine to just use assert here.

Anyhow, I'm fine with either. Maybe other people could have more thoughts here.

int node = nodesOnLevel.nextInt();
NeighborArray neighbors = hnsw.getNeighbors(level, node);
long maxNeighborsSize;
if (level == 0) {
Contributor

I think generally we would rather avoid having tests duplicate assumptions or logic from the production path? This seems to be a specific implementation decision that could change independently. I couldn't think of a better way though.

But I'm unsure about the need for this newly added code. It seems we only compute it in a single test, where we want a better estimate? The test seems to verify that our over-estimation cannot be more than 30% of the actual size. If we provide a better estimate, maybe we can lower the tolerance threshold?

Still, it seems strange that we would truly need this better estimate but only in a test.

Contributor Author

Sorry, this addition made sense with the changes to OnHeapHnswGraph. I forgot to remove it when I reverted those changes. Reverted now.

Contributor

@zhaih zhaih left a comment

LGTM

@zhaih
Contributor

zhaih commented Dec 14, 2023

@stefanvodita Could you move the change entry to 9.10? Then I can merge it

In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.
@stefanvodita
Contributor Author

Done, thank you @zhaih! I've opened #12941 to replace other uses of the unbounded growth API.

@zhaih zhaih merged commit b0ebb84 into apache:main Dec 14, 2023
zhaih pushed a commit that referenced this pull request Dec 14, 2023
In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.


Development

Successfully merging this pull request may close these issues.

Grow arrays up to a given limit to avoid overallocation where possible
