[Groupby Query Metrics] Add merge buffer tracking #18731
Conversation
```java
if (perQueryStats.getMergeBufferAcquisitionTimeNs() > 0) {
  mergeBufferQueries++;
  mergeBufferAcquisitionTimeNs += perQueryStats.getMergeBufferAcquisitionTimeNs();
  mergeBufferTotalUsage += perQueryStats.getMergeBufferTotalUsage();
```
@GWphua Instead of summing here, what do you think about taking the max? Then the metric emitted would be the max merge buffer usage of a single query in that emission period. This would be a good signal for operators on whether they need to tweak the mergeBuffer size.
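A minimal sketch of what that max-based aggregation could look like; the classes here are simplified stand-ins reusing the field names from the diff above, not the actual Druid sources:

```java
// Hypothetical stand-in for the per-query stats object referenced in the diff above.
interface PerQueryStats
{
  long getMergeBufferAcquisitionTimeNs();
  long getMergeBufferTotalUsage();
}

// Sketch: aggregate per-query merge buffer usage by taking the max over the
// emission period instead of summing it.
class MergeBufferStatsAccumulator
{
  private long mergeBufferQueries = 0;
  private long mergeBufferAcquisitionTimeNs = 0;
  private long maxMergeBufferUsage = 0;  // replaces the summed mergeBufferTotalUsage

  void accumulate(PerQueryStats perQueryStats)
  {
    if (perQueryStats.getMergeBufferAcquisitionTimeNs() > 0) {
      mergeBufferQueries++;
      mergeBufferAcquisitionTimeNs += perQueryStats.getMergeBufferAcquisitionTimeNs();
      // Max usage of any single query in this emission period, rather than the sum.
      maxMergeBufferUsage = Math.max(maxMergeBufferUsage, perQueryStats.getMergeBufferTotalUsage());
    }
  }
}
```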
I was thinking about this.
There are other metrics where taking MAX will also make sense --
spilledBytes --> How much storage would be good to configure?
dictionarySize --> How large can the merge dictionary size get?
I am considering adding max variants of these metrics (maxSpilledBytes, maxDictionarySize). What do you think?
Yeah agreed. I do think it makes sense to have it for those 3 metrics
Even for mergeBuffer/acquisitionTimeNs I think there's value in having the max, as it gives operators a signal on whether to increase numMergeBuffers
Thanks for adding those max metrics @GWphua! What do you think about adding

Hi @aho135, thanks for the review! I also find that it would be very helpful to emit metrics for each query, so we know which query takes up a lot of resources. In our version of Druid, we simply appended each of the
Alternatively, we can look into migrating the groupBy query metrics in
We can do more of this in a separate PR.

Sounds good @GWphua. I was thinking along very similar lines to emit these from
I have a first draft on this: aho135@9f82091

Hi @aho135, since the scope of adding
I have a draft for

Hi @gianm, I would appreciate a review/feedback on this PR. Thanks!
abhishekrb19 left a comment
@GWphua, thanks for the improved observability and @aho135 for the helpful max metric suggestions! Overall, the changes look good to me - I will take a closer look at the grouper changes soon.
Checkpointing my review on the GroupByStatsProvider, docs and some test suggestions. Please let me know what you think.
@abhishekrb19, thanks for the review! I have made changes according to your suggestions.
Thanks @GWphua! I tried to take a closer look at the grouper changes - please see my latest comments. To track actual merge buffer usage the way #17902 proposes, we may need some additional thought: using buffer.capacity() in the various places is going to reflect the configured buffer size, not the actual bytes used.
My suggestion would be to split all the new max metrics into a separate PR: mergeBuffer/maxAcquisitionTimeNs, groupBy/maxSpilledBytes, groupBy/maxMergeDictionarySize. Those look more straightforward and would still be useful for operators to track.
Then we can keep this PR and the linked issue open to track mergeBuffer/bytesUsed and mergeBuffer/maxBytesUsed. We'll also want to think about and add some test coverage for the grouper changes once the approach is finalized.
```java
throw new ISE("Grouper is closed");
}

groupers.forEach(Grouper::reset);
```
Is this change required? Given that the ConcurrentGrouper operates on SpillingGrouper instances, I suppose this is technically correct. But calling Grouper::reset as it was earlier should already ensure that the specific reset/close methods from SpillingGrouper are invoked? If so, could we revert this change here and below?
You are correct, only SpillingGrouper's method is called here.
I changed this to make my life easier during development:
- IntelliJ can jump to the SpillingGrouper method, instead of going to the Grouper interface.
- Future readers will be able to tell that the groupers object held by ConcurrentGrouper will be SpillingGroupers.
If the intent is to keep this PR limited to changes connected to the groupBy metrics, I can open a new PR for this. Let me know if you would prefer that.
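For reference, a minimal, self-contained illustration of the dispatch point being discussed, using simplified stand-in types (not Druid's actual Grouper/SpillingGrouper classes):

```java
import java.util.List;

// Stand-in "Grouper" with a default reset, so the override is easy to observe.
interface Grouper
{
  default void reset() { System.out.println("Grouper.reset"); }
}

// Stand-in "SpillingGrouper" overriding reset.
class SpillingGrouper implements Grouper
{
  @Override public void reset() { System.out.println("SpillingGrouper.reset"); }
}

class DispatchDemo
{
  public static void main(String[] args)
  {
    List<SpillingGrouper> groupers = List.of(new SpillingGrouper());
    groupers.forEach(Grouper::reset);          // prints SpillingGrouper.reset (dynamic dispatch)
    groupers.forEach(SpillingGrouper::reset);  // identical behavior, narrower static type
  }
}
```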
```diff
 );

-private final Grouper<KeyType> grouper;
+private final AbstractBufferHashGrouper<KeyType> grouper;
```
Same comment as above. I think it would be better to define these methods in the Grouper interface and leave this change as it was earlier. Mixing Grouper with impls otherwise seems confusing
For this, I would need to maintain the AbstractBufferHashGrouper reference. The reason also addresses why getMergeBufferUsedBytes is not placed in the Grouper interface.
This is because the groupBy metrics are only collated when Grouper#close is called. This means that we will not be interfacing with getMergeBufferUsedBytes at all, but letting close() retrieve getMergeBufferUsedBytes. This is why SpillingGrouper#getMergeBufferUsedBytes is private, following the example set by SpillingGrouper#getDictionarySizeEstimate.
The only time we need to interface with getMergeBufferUsedBytes is in AbstractBufferHashGrouper, which is the underlying grouper used by the SpillingGrouper.
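To make that flow concrete, here is a condensed sketch; the classes are simplified stand-ins (and recordMergeBufferUsedBytes is a hypothetical method name), not the actual Druid sources:

```java
// Stand-in for the per-query stats sink discussed in this PR.
interface PerQueryStats
{
  void recordMergeBufferUsedBytes(long bytes);
}

// Stand-in for the buffer hash groupers, which expose the used-bytes value.
abstract class AbstractBufferHashGrouperSketch
{
  abstract long getMergeBufferUsedBytes();

  void close() { /* release buffers */ }
}

class SpillingGrouperSketch
{
  private final AbstractBufferHashGrouperSketch grouper;
  private final PerQueryStats perQueryStats;

  SpillingGrouperSketch(AbstractBufferHashGrouperSketch grouper, PerQueryStats perQueryStats)
  {
    this.grouper = grouper;
    this.perQueryStats = perQueryStats;
  }

  // Private: no external caller asks for the value directly...
  private long getMergeBufferUsedBytes()
  {
    return grouper.getMergeBufferUsedBytes();
  }

  // ...it is only folded into the per-query stats when the grouper closes.
  public void close()
  {
    perQueryStats.recordMergeBufferUsedBytes(getMergeBufferUsedBytes());
    grouper.close();
  }
}
```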
```java
}
buffer.putInt(numElements * Integer.BYTES, val);
numElements++;
maxMergeBufferUsageBytes = Math.max(maxMergeBufferUsageBytes, numElements * Integer.BYTES);
```
Actually I think this variable and state tracking isn't needed in add() since we're tracking numElements already. We can just compute numElements * Integer.BYTES inline in getMaxMergeBufferUsageBytes().
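A minimal illustration of that suggestion, using a hypothetical stand-in class (OffsetListSketch) with the field names from the diff above:

```java
// Sketch: derive the value on demand from the element count instead of
// tracking a separate max field in add().
class OffsetListSketch
{
  private int numElements = 0;

  // Stand-in for appending an offset to the backing buffer.
  void add()
  {
    numElements++;
  }

  long getMaxMergeBufferUsageBytes()
  {
    return (long) numElements * Integer.BYTES;
  }
}
```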
I was thinking about how to capture the maximum usage, because we only request getMergeBufferUsage when SpillingGrouper#close is called. I do not know whether we are guaranteed to see the maximum usage at the point the grouper is closed.
I am worried about the case where we configure 1GiB for the merge buffer, the usage in the middle of the query goes to, say, 900MiB, but when the Grouper is closed the usage shows ~300MiB. The user is then encouraged to lower the merge buffer allocation to 500MiB, which would be problematic.
This hopefully also addresses your question about why reset does not change maxMergeBufferUsageBytes.
```java
protected void updateMaxTableBufferUsage()
{
  maxTableBufferUsage = Math.max(maxTableBufferUsage, tableBuffer.capacity());
```
Unless I'm missing something, the issue #17902 was created to actually get some visibility into actual merge buffer usage, but this metric would just tell us how much was actually configured instead?
tableBuffer.capacity() would just indicate the capacity of the buffers, so more or less what was configured via druid.processing.buffer.sizeBytes.
You are right that the current metric reports the allocation.
I relooked at the implementation and found that we can better estimate the ByteBufferHashTable usage. Since the ByteBufferHashTable is an open-addressing hash table, we can use the number of elements in the table (size) multiplied by the space taken up per bucket (bucketSizeWithHash) to estimate the usage in bytes.
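A minimal sketch of that estimate, assuming direct access to the size and bucketSizeWithHash fields mentioned above (the class is a simplified stand-in, not Druid's ByteBufferHashTable):

```java
// Stand-in for an open-addressing hash table backed by a ByteBuffer.
class OpenAddressingTableSketch
{
  private final int bucketSizeWithHash;  // bytes per bucket: hash + key + aggregator space
  private int size;                      // number of occupied buckets

  OpenAddressingTableSketch(int bucketSizeWithHash)
  {
    this.bucketSizeWithHash = bucketSizeWithHash;
  }

  // Stand-in for inserting a new bucket.
  void add()
  {
    size++;
  }

  // Estimated bytes actually in use, as opposed to tableBuffer.capacity(),
  // which only reflects the configured allocation.
  long getEstimatedUsedBytes()
  {
    return (long) size * bucketSizeWithHash;
  }
}
```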
Thanks for the update. Please see https://github.com/apache/druid/pull/18731/changes#r2706872449
Made the AlternatingByteBufferHashTable inherit the max metrics reporting from the superclass. Should now accurately report the usage :)
```java
long hashTableUsage = hashTable.getMaxTableBufferUsage();
long offSetListUsage = offsetList.getMaxMergeBufferUsageBytes();
return hashTableUsage + offSetListUsage;
```
Same comment, I think this would just more or less tell us configured size rather than actual buffer usage. (more or less because of offset list tracking)
…atsMonitorTest.java Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
Hi @abhishekrb19, I have revisited your comments and made the relevant changes/replies. Please take a look at the new approach to calculating the usage. Thanks!
```java
long currentBufferUsedBytes = 0;
for (ByteBuffer buffer : subHashTableBuffers) {
  currentBufferUsedBytes += buffer.capacity();
}
```
I think this effectively would just be tableArenaSize which would reflect the allocated configured size rather than actual used size?
Please update LimitedBufferHashGrouperTest, BufferHashGrouperTest and related grouper tests to validate the correctness of these implementations.
I just pulled in the latest patch locally, ran some group by queries, and noticed that the bytesUsed and maxBytesUsed were more or less what was configured via druid.processing.buffer.sizeBytes 🤔
I'll try to dig into this more, but in the meantime, I'd still recommend splitting the PR into two parts:
1. The max metrics.
2. The bytesUsed and maxBytesUsed metrics.
Part 2 seems a bit more involved.
Hello, I have added the tests for the groupers.
I did not get the same results as you, maybe because I used queries for a smaller dataset.
Here is how I ran my tests, querying with spill to disk enabled:
- Set druid.processing.buffer.sizeBytes = 1GB.
- Query on a dataset. (Let's say the result for this is 100MB.)
- Set druid.processing.buffer.sizeBytes to a much smaller value, ~5MB.
- Query on the same dataset, and watch the usage metrics cap at 5MB, with ~95MB spilled to disk.
Here's an example of how my max metrics look:

I do have to admit, some of the values are kind of "blocky": it will report ~28MB repeatedly for, say, 3 consecutive emissions, then report some other value. Maybe this is because similar queries are being sent during a short period of time, and perhaps the allocated space is the same for these similar queries. Hopefully, this will be fixed by your catch -- reporting the usage instead of the capacity. 😄
As suggested, moved 3 max metrics to #18934, leaving
Tests for buffer hash grouper
Fixes #17902
Huge thanks to @gianm for the implementation tip in the issue!
Description
Tracking merge buffer usage
- Merge buffer usage is tracked in AbstractBufferHashGrouper and its implementations.
- These groupers use a ByteBufferHashTable along with an offset tracker.
- The reported usage is the maximum buffer usage of the ByteBufferHashTable, plus the maximum offset size, calculated throughout the query's lifecycle.
- Incorporated a helpful suggestion by @aho135: since the size of the hash tables is ever-changing, it makes sense to take the maximum values across queries -- so operators can have a better understanding of how the size of merge buffers should be configured. Edit: max metrics provided in #18934.
Here's an example of the current SUM implementation vs. the MAX implementation. The latter tells us that we should probably configure merge buffer sizes to 2G for this case:

Release note
GroupByStatsMonitor now provides the metric "mergeBuffer/bytesUsed", and max metrics for merge buffer acquisition time, bytes used, spilled bytes, and merge dictionary size.
Key changed/added classes in this PR
GroupByStatsProvider
This PR has:
Possible further enhancements
While building this PR, I have come across some areas that we could further enhance in the future:
Nested Group-bys
The current metric is great, but will not report accurately for nested group-bys. (Do correct me on this if I'm mistaken though!)
As far as I know, nested groupBy limits the merge buffer usage count to 2, meaning the merge buffer will be re-used. IIUC, every ConcurrentGrouper (if concurrency is enabled) / SpillingGrouper (if concurrency is disabled) is created and closed multiple times, and hence a per-query metric will likely over-report the merge buffer usage.
Simplify Memory Management
Right now we need to configure the following for each queryable service:
It would be great if we could simplify the calculations down to simply configuring direct memory and managing a memory pool instead. This allows for more flexibility (unused memory allocated for merge buffers may be used by processing threads instead).