Reduce AsyncQueue Size in SplitSources for Bucketed Tables #27181
Open
shelton408 wants to merge 1 commit into prestodb:master from
Conversation
Summary:
Looking for insight from anyone with expertise in this area.
Heavily bucketed tables have been causing heap memory issues for the coordinator, leading to OOM crashes.
Each bucket of a bucketed split source creates an AsyncQueue backed by an ArrayDeque of size MaxOutstandingSplits * 2 + 1, the same size as the entire AsyncQueue of a non-bucketed split source. For heavily bucketed tables, most of this reserved space is never touched.
An example that came up in our systems recently is a query against a table with 32768 buckets. The split source for that table alone created 32768 AsyncQueues, resulting in 571 megabytes of memory usage.
(Note: while the first image shows an AsyncQueue size of 232 bytes, that applies only to the 32k empty AsyncQueues; the other 32k use 17568 bytes each.)
{F1985791448}
Looking into the AsyncQueue, we can see that the memory is primarily consumed by queues that reserve the default 2001 slots but hold only a single element. This costs 17376 bytes per ArrayDeque per bucket, of which 16008 bytes go to the reserved backing array alone.
{F1985791302}
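The numbers above can be sanity-checked with quick arithmetic. The sketch below is back-of-the-envelope, not code from the patch: it assumes 8 bytes per object reference (uncompressed 64-bit references) and the default MaxOutstandingSplits of 1000 implied by the 2001-slot capacity.

```java
public class AsyncQueueFootprint
{
    // AsyncQueue sizes its backing ArrayDeque as maxOutstandingSplits * 2 + 1
    static int queueCapacity(int maxOutstandingSplits)
    {
        return maxOutstandingSplits * 2 + 1;
    }

    // Rough cost of the backing reference array, assuming 8 bytes per slot
    static long backingArrayBytes(int capacity)
    {
        return (long) capacity * 8;
    }

    public static void main(String[] args)
    {
        int capacity = queueCapacity(1000);
        System.out.println(capacity);                    // 2001 slots per bucket
        System.out.println(backingArrayBytes(capacity)); // 16008 bytes per bucket
        // Across 32768 buckets, the reserved arrays alone account for most
        // of the 571 MB observed in the heap dump
        System.out.println(backingArrayBytes(capacity) * 32768L / (1024 * 1024)); // 500 (MB)
    }
}
```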
This change reduces the default memory reserved by the ArrayDeque when bucketing from MaxOutstandingSplits * 2 to a dynamic value based on bucket count (capped on the lower end at 8 * 2 + 1). In the case above, the total cost per bucket drops from 17376 bytes to 1504 bytes (accounting for the 1 split in the array), a memory savings of over 90%.
A possible downside to this change is that if we exceed the array size, the ArrayDeque will copy its contents into a larger array, which can be expensive. However, from what I've seen, tables with large bucket counts do not generate many splits per bucket, and split volume generally scales linearly with bucket count, which is how I arrived at the current formula.
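The sizing logic described above could look roughly like the following. This is an illustrative sketch, not the actual patch; the method name and the constant are assumptions based on the formula quoted in the review comments.

```java
public class PerBucketQueueSizing
{
    // Illustrative lower bound; with it, the AsyncQueue reserves
    // 8 * 2 + 1 = 17 slots instead of the 2001 default.
    private static final int MIN_SPLITS_PER_BUCKET = 8;

    static int estimatedOutstandingSplitsPerBucket(int maxOutstandingSplits, int bucketCount)
    {
        return Math.max(MIN_SPLITS_PER_BUCKET, maxOutstandingSplits / bucketCount);
    }

    public static void main(String[] args)
    {
        // Heavily bucketed: 32768 buckets hits the lower bound
        System.out.println(estimatedOutstandingSplitsPerBucket(1000, 32768)); // 8
        // Lightly bucketed: keeps a proportionally larger per-bucket budget
        System.out.println(estimatedOutstandingSplitsPerBucket(1000, 4));     // 250
    }
}
```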
References:
/presto-trunk/presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitSource.java Line 229
/presto-trunk/presto-hive/src/main/java/com/facebook/presto/hive/util/AsyncQueue.java Line 59
Differential Revision: D93797024
Contributor
Reviewer's Guide
Adjusts Hive bucketed split scheduling to estimate a per-bucket outstanding-split limit based on bucket count, reducing AsyncQueue/ArrayDeque memory usage for heavily bucketed tables while preserving existing behavior for unbucketed or lightly bucketed tables.
Sequence diagram for updated bucketed Hive split scheduling:
sequenceDiagram
participant Client
participant HiveSplitManager
participant HiveSplitLoader
participant HiveSplitSource
Client->>HiveSplitManager: getSplits(splitSchedulingContext, table, session, layout, functionRegistry)
HiveSplitManager->>HiveSplitLoader: new HiveSplitLoader(...)
HiveSplitManager->>HiveSplitManager: computeSplitSource(splitSchedulingContext, table, session, hiveSplitLoader, splitScanRatio, bucketHandle)
alt grouped_scheduling
HiveSplitManager->>HiveSplitManager: bucketCount = bucketHandle.map(getReadBucketCount).orElse(1)
HiveSplitManager->>HiveSplitManager: estimatedOutstandingSplitsPerBucket = max(8, maxOutstandingSplits / bucketCount)
HiveSplitManager->>HiveSplitSource: bucketed(session, dbName, tableName, cacheQuotaRequirement, maxInitialSplitSize, estimatedOutstandingSplitsPerBucket, maxOutstandingSplitsSize, hiveSplitLoader, executor, splitSchedulingStrategy, schedulerUsesHostAddresses, partialAggPushdown)
else ungrouped_scheduling
HiveSplitManager->>HiveSplitSource: nonBucketed factory method (unchanged)
end
HiveSplitManager->>HiveSplitLoader: start(splitSource)
HiveSplitManager-->>Client: HiveSplitSource
Class diagram for updated HiveSplitManager.computeSplitSource behavior:
classDiagram
class HiveSplitManager {
- CacheQuotaRequirementProvider cacheQuotaRequirementProvider
- int maxOutstandingSplits
- DataSize maxOutstandingSplitsSize
+ ConnectorSplitSource getSplits(SplitSchedulingContext splitSchedulingContext, ConnectorSession session, Table table, ConnectorTableLayoutHandle layout)
- HiveSplitSource computeSplitSource(SplitSchedulingContext splitSchedulingContext, Table table, ConnectorSession session, HiveSplitLoader hiveSplitLoader, double splitScanRatio, Optional<HiveBucketHandle> bucketHandle)
}
class HiveSplitSource {
+ static HiveSplitSource bucketed(ConnectorSession session, String databaseName, String tableName, CacheQuotaRequirement cacheQuotaRequirement, DataSize maxInitialSplitSize, int maxOutstandingSplits, DataSize maxOutstandingSplitsSize, HiveSplitLoader hiveSplitLoader, Executor executor, SplitSchedulingStrategy splitSchedulingStrategy, boolean schedulerUsesHostAddresses, boolean partialAggregationsPushedDown)
}
class HiveBucketHandle {
+ int getReadBucketCount()
}
class SplitSchedulingContext {
+ SplitSchedulingStrategy getSplitSchedulingStrategy()
+ boolean schedulerUsesHostAddresses()
}
class CacheQuotaRequirementProvider {
+ CacheQuotaRequirement getCacheQuotaRequirement(String databaseName, String tableName)
}
class CacheQuotaRequirement
class HiveSplitLoader {
+ void start(HiveSplitSource hiveSplitSource)
}
HiveSplitManager --> HiveSplitSource : uses
HiveSplitManager --> HiveSplitLoader : creates
HiveSplitManager --> HiveBucketHandle : optional bucketHandle
HiveSplitManager --> CacheQuotaRequirementProvider : uses
CacheQuotaRequirementProvider --> CacheQuotaRequirement : creates
HiveSplitSource --> HiveSplitLoader : uses
HiveSplitManager --> SplitSchedulingContext : uses
HiveSplitManager --> CacheQuotaRequirement : uses
Contributor
Hey - I've left some high level feedback:
- The new `estimatedOutstandingSplitsPerBucket` logic changes behavior when `bucketCount > maxOutstandingSplits` by clamping up to 8 instead of using the (small) configured max per bucket; consider whether this is intentional and, if so, documenting or guarding against unexpectedly increasing per-bucket queue sizes in that case.
- The hard-coded lower bound of `8` in `Math.max(8, maxOutstandingSplits / bucketCount)` is a bit of a magic number; consider either tying it explicitly to the AsyncQueue sizing logic (e.g., via a named constant) or explaining the rationale so future maintainers understand why this value was chosen.
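One way to address the first point is to clamp the estimate from above as well, so a bucket's queue is never sized larger than the configured overall maximum. This is a hypothetical sketch of the guard the reviewer suggests, not code from this PR:

```java
public class GuardedSizing
{
    static int guardedEstimate(int maxOutstandingSplits, int bucketCount)
    {
        int perBucket = Math.max(8, maxOutstandingSplits / bucketCount);
        // Never size a single bucket's queue above the configured overall cap
        return Math.min(maxOutstandingSplits, perBucket);
    }

    public static void main(String[] args)
    {
        // With a small configured max (4) and many buckets, the unguarded
        // formula would clamp *up* to 8; the guard keeps it at 4.
        System.out.println(guardedEstimate(4, 100));      // 4
        // Unchanged for the common heavily-bucketed case
        System.out.println(guardedEstimate(1000, 32768)); // 8
    }
}
```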
Contributor
Please add a release note entry as described in the Release Notes Guidelines to pass the failing (though not required) CI check. Please edit the title of the PR to pass the required but failing CI check.