Restricts snapshot concurrency based on available heap memory #136952
base: main
Conversation
Limits the concurrency of smaller nodes when loading IndexMetadata objects into heap, to prevent nodes with small heaps from going OOME.
Closes elastic#131822
Closes: ES-12538
Hi @joshua-adams-1, I've created a changelog YAML for you.
I am extending this test to ensure that, depending on the available heap memory, we throttle the number of concurrent snapshot threads accordingly
 * Tests whether we adjust the maximum concurrency when deleting snapshots
 * according to the size of the heap memory
 */
public void testMaxIndexDeletionConcurrency() {
This test is different from the previous one. It is just a basic unit test ensuring that the function from heap memory to snapshot threads behaves as expected, but it doesn't test whether all the threads are properly utilised.
);
private volatile int maxHeapSizeForSnapshotDeletion;

public static final Setting<ByteSizeValue> HEAP_SIZE_SETTING = Setting.memorySizeSetting(
If there is an easier way to get the total heap memory then please let me know. I was following a similar approach to the MAX_HEAP_SIZE_FOR_SNAPSHOT_DELETION_SETTING setting above.
My concern is that, to modify HEAP_SIZE_SETTING inside the tests, I had to use the Setting.Property.Dynamic property, but I don't want other users/code changing this percentage to something stupidly small, at which point snapshotting takes too long and times out.
You can get the heap memory with Runtime.getRuntime().maxMemory() or JvmInfo.jvmInfo().getMem().getHeapMax(). Is that what you need?
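For illustration, both lookups in a minimal fragment (the variable names are mine):

// Plain JDK: maximum heap size in bytes
long maxHeapBytes = Runtime.getRuntime().maxMemory();
// Elasticsearch wrapper: the same value as a ByteSizeValue
ByteSizeValue maxHeap = JvmInfo.jvmInfo().getMem().getHeapMax();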
I did consider this, but I thought it too static, and it also conflicted with the approach I took in #133558.
My thought process is that by using the existing memorySizeSetting here to return 10% of the available heap memory for use when loading IndexMetadata objects, the actual percentage can be updated dynamically without requiring a code change, which is nice, and the ByteSizeValue handles all the maths in the background. I still have a concern about this setting being abused by someone setting the value really low (say 1%), which would force us to use only a single snapshot thread, but that concern isn't specific to the MAX_HEAP_SIZE_FOR_INDEX_METADATA_SETTING setting; it applies to any dynamically updatable setting.
My concern in:
"If there is an easier way to get the total heap memory then please let me know"
was because I was trying to get 100% of the heap via a setting, which seemed long-winded, but after this comment by David it seems the best approach. Do you agree?
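For reference, the kind of setting being described would look roughly like this; a sketch only, where the setting key and properties are illustrative rather than the final names used in the PR:

public static final Setting<ByteSizeValue> MAX_HEAP_SIZE_FOR_INDEX_METADATA_SETTING = Setting.memorySizeSetting(
    "repositories.blobstore.max_heap_for_index_metadata", // hypothetical setting key
    "10%",                                                 // default: 10% of the configured heap, returned as a ByteSizeValue
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);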
I think it's ok. I mean, there are two made-up fiddle factors in play here: the "max 10% of heap" is a guess, as is the "max 50MiB of heap per IMD instance", but then the concurrency number is just one divided by the other. I'm not sure that's the most user-friendly interface really, but my only other idea is to control the actual concurrency with a setting whose default is a function of JvmInfo.jvmInfo().getMem().getHeapMax(). One super-opaque setting or two slightly-less-opaque settings...
Prior to loading the IndexMetadata object into heap memory, can we check 1) how big the object we will be loading is, and 2) whether we have enough space to load it? Then, if we don't have space, we block until we do. It's not ideal, but better than OOMing?
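A rough sketch of that idea, assuming the blob length is known up front and used as a crude proxy for the deserialized size (all names here are hypothetical, not part of this PR):

import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical: reserve heap budget (in MiB) before parsing an IndexMetadata blob,
// blocking until enough of the budget is free rather than risking an OOME.
class IndexMetadataHeapBudget {
    private final Semaphore permitsInMiB;

    IndexMetadataHeapBudget(long budgetInBytes) {
        this.permitsInMiB = new Semaphore(Math.toIntExact(budgetInBytes >> 20));
    }

    <T> T withReservedHeap(long blobLengthInBytes, Supplier<T> parse) throws InterruptedException {
        int permits = Math.toIntExact((blobLengthInBytes >> 20) + 1); // round up to whole MiB
        permitsInMiB.acquire(permits); // block until enough budget is free
        try {
            return parse.get();
        } finally {
            permitsInMiB.release(permits);
        }
    }
}

The main caveat is that the serialized blob length can be much smaller than the object's heap footprint once deserialized.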
Yet another possibility would be to introduce a variant of org.elasticsearch.cluster.metadata.IndexMetadata.Builder#fromXContent() which skips over any mappings. It's the mappings that take up most of the heap space in practice, but we simply don't need them here.
XContentParser can be configured to filter out keys. For snapshots, we control the XContentParser creation in ChecksumBlobStoreFormat, where we can configure it to skip the mappings key. I tested it briefly and it seems to be possible with the following change.
--- a/server/src/main/java/org/elasticsearch/repositories/blobstore/ChecksumBlobStoreFormat.java
+++ b/server/src/main/java/org/elasticsearch/repositories/blobstore/ChecksumBlobStoreFormat.java
@@ -150,7 +151,8 @@ public final class ChecksumBlobStoreFormat<T> {
try (
XContentParser parser = XContentHelper.createParserNotCompressed(
XContentParserConfiguration.EMPTY.withRegistry(namedXContentRegistry)
- .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE),
+ .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE)
+ .withFiltering(null, Set.of("*.mappings"), false),
bytesReference,
XContentType.SMILE
)
@@ -161,7 +163,8 @@ public final class ChecksumBlobStoreFormat<T> {
try (
XContentParser parser = XContentHelper.createParserNotCompressed(
XContentParserConfiguration.EMPTY.withRegistry(namedXContentRegistry)
- .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE),
+ .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE)
+ .withFiltering(null, Set.of("*.mappings"), false),
bytesReference,
XContentType.SMILE
)
If something like this is feasible, I'd think we don't need the memory limit, which feels mostly like a guess.
Interesting, TIL, I didn't know there was a filter-on-parse option.
Can we just use the includeStrings parameter? Really we only need settings.index.number_of_shards, though I could imagine we might have to pull in a few more things to satisfy various invariants on IndexMetadata.
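If the include form mirrors the exclude form used in the diff above, it might look roughly like this; the exact include paths depend on how the blob is laid out and would need verifying, and whether IndexMetadata still parses with only these fields retained is exactly the open question here:

XContentParser parser = XContentHelper.createParserNotCompressed(
    XContentParserConfiguration.EMPTY.withRegistry(namedXContentRegistry)
        .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE)
        // keep only the settings object, dropping mappings and everything else
        .withFiltering(Set.of("*.settings"), null, false),
    bytesReference,
    XContentType.SMILE
);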
On reflection, maybe we don't even need to load this as a full IndexMetadata. We could define another MetadataStateFormat<...> which defines a different fromXContent that only reads settings.index.number_of_shards and skips everything else.
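As a rough illustration of that idea, not tied to MetadataStateFormat's actual API, a hand-rolled reader could pull out just the shard count and skip every other child; this sketch assumes the parser is positioned just before the index-metadata object, and the real blob nesting may differ:

// Sketch: read only settings.index.number_of_shards, skipping mappings, aliases, etc.
static int readNumberOfShards(XContentParser parser) throws IOException {
    int numberOfShards = -1;
    parser.nextToken(); // move onto the START_OBJECT of the index-metadata body
    while (parser.nextToken() == XContentParser.Token.FIELD_NAME) {
        String fieldName = parser.currentName();
        parser.nextToken(); // advance to the field's value
        if ("settings".equals(fieldName)) {
            numberOfShards = Settings.fromXContent(parser).getAsInt("index.number_of_shards", -1);
        } else {
            parser.skipChildren(); // never materialise mappings or other large children on heap
        }
    }
    return numberOfShards;
}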
I have pushed a WIP PR here if you guys wouldn't mind taking a look. If you agree with the approach, I would close this PR in favour of the above.
this.maxIndexDeletionConcurrency = Math.min(
    // Prevent smaller nodes from loading too many IndexMetadata objects in parallel
    // and going OOMe (ES-12538)
    (int) Math.pow(2, heapSizeInGb),
This was a heuristic, since I didn't know what we wanted to throttle the snapshot threads to.
For context for any reviewer: we currently have up to 10 snapshot threads on any size node, and in this instance a node with a 1GiB heap OOMed because of this. I used a simple Math.pow(2, heapSizeInGb) heuristic, but this does mean that a node must have >4GB heap to use all 10 snapshot threads. I suspect this is far too conservative an estimate, so suggestions are appreciated. Happy to use a simple step function if people think that would suffice, i.e. if heap is less than 2GB, use 4 threads, else use 10.
"we currently have up to 10 snapshot threads on any size node"
We do have fewer threads for smaller nodes, see here. The threshold is currently 750MB. Are we considering this to be too low and wanting to raise it to something like 1GB or more? Or is this change specific to snapshot deletion? For the latter, I'd probably just go with a simple heuristic, like using half of the snapshot threads (minimum 1) if the heap is smaller than 2GB?
This change is specific to index-metadata-loading during snapshot deletion, which is something we do today concurrently on all the available snapshot threads and which we've seen to need unusually large amounts of heap. Some IndexMetadata instances can be 50MiB or more when deserialized. Most of the work that the snapshot threads do has much lower heap footprint than this.
I don't think it makes sense to have the concurrency limit be an exponentially-increasing function of the heap size though, since the resource usage is a linear function of the concurrency limit. I'd suggest we take the observed 50MiB as a reasonable estimate of a big IndexMetadata blob and make sure we don't use more than, say, 10% of heap for this. So if the node has 1GiB of heap then allow 2 threads, 2GiB means 4 threads, etc.
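Concretely, that sizing works out to something like the following, where the 10% budget and the 50MiB-per-blob figure are the two guesses discussed above:

// Derive the deletion concurrency from the two guesses: ~10% of heap, ~50MiB per in-flight IndexMetadata blob.
long heapBytes = JvmInfo.jvmInfo().getMem().getHeapMax().getBytes();
long budgetBytes = heapBytes / 10;                       // "max 10% of heap"
long perTaskBytes = ByteSizeValue.ofMb(50).getBytes();   // "max 50MiB per IndexMetadata instance"
int maxIndexDeletionConcurrency = Math.max(1, Math.toIntExact(budgetBytes / perTaskBytes));
// 1GiB heap -> 2 threads, 2GiB -> 4 threads, 4GiB -> 8 threads (still capped by the SNAPSHOT pool size)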
Thanks both. Based on the suggestion above I have extended BlobStoreRepository with a setting that reads 10% of the available heap memory, in the same way I did for MAX_HEAP_SIZE_FOR_SNAPSHOT_DELETION_SETTING.
…a-adams-1/elasticsearch into blobstore-repo-concurrency
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
I left some clarification questions/comments.