Add heap memory estimation for statistics #19599
datafusion/common/src/stats.rs
Outdated
pub fn heap_size(&self) -> usize {
    // column_statistics + num_rows + total_byte_size
    self.column_statistics.capacity() * size_of::<ColumnStatistics>()
        + size_of::<Precision<usize>>() * 2
Here Precision<usize> is an enum with no heap-allocated fields, so it is stored inline (on the stack) and should not be counted as heap memory.
I think these things are usually Arc'ed - so everything should be moved to the heap, right?
So size_of::<Precision<usize>>() * 2 should be removed?
I think so, if we want to follow the trait in arrow, which I think was the agreed next step according to #19599 (comment). Do you plan on pushing a commit to do so?
Yes, I am on it. PR coming up soon.
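To make the stack-versus-heap point above concrete, here is a minimal sketch. The Precision enum and the Stats struct are simplified stand-ins invented for illustration, not DataFusion's real definitions. It shows that an enum holding a usize owns no heap memory, and that the struct's inline size only becomes heap memory when the whole value sits behind an Arc.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Simplified stand-in for DataFusion's Precision<T> (assumption, not the real type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Hypothetical struct: two inline Precision fields and one heap-backed Vec.
#[allow(dead_code)]
struct Stats {
    num_rows: Precision<usize>,
    total_byte_size: Precision<usize>,
    columns: Vec<u64>,
}

impl Stats {
    // Heap bytes owned by this value: only the Vec's buffer.
    // The Precision<usize> fields are stored inline and contribute nothing.
    fn heap_size(&self) -> usize {
        self.columns.capacity() * size_of::<u64>()
    }
}

// When the value lives behind an Arc, its inline size is on the heap too.
// This ignores the Arc's own reference counters.
fn arc_footprint(stats: &Arc<Stats>) -> usize {
    size_of::<Stats>() + stats.heap_size()
}

fn main() {
    let stats = Arc::new(Stats {
        num_rows: Precision::Exact(10),
        total_byte_size: Precision::Absent,
        columns: Vec::with_capacity(4),
    });
    println!("inline size of Stats: {}", size_of::<Stats>());
    println!("heap owned by Stats:  {}", stats.heap_size());
    println!("footprint behind Arc: {}", arc_footprint(&stats));
}
```

Under this convention, adding size_of::<Precision<usize>>() * 2 inside heap_size would count the inline fields a second time whenever a container such as Arc already accounts for size_of::<Stats>().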
We need to be able to get the heap size of arrays to implement it for Statistics? What's the chain of fields that takes us there?
https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L376 as part of ScalarValue.
And these are part of ColumnStatistics https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L376
Bummer. Aren't there ways to get the size of an array in memory? E.g. Array::get_array_memory_size?
This may actually work. Thanks, I will try that.
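The chain of fields discussed above (Statistics, then ColumnStatistics, then Precision<ScalarValue>, then an array-backed ScalarValue variant) can be sketched as follows. Every type here is a simplified stand-in written for illustration. FakeArray and its memory_size method are hypothetical and only loosely play the role that an Arrow array and Array::get_array_memory_size would play in the real code.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Hypothetical stand-in for an Arrow array: one buffer of i64 values.
struct FakeArray {
    values: Vec<i64>,
}

impl FakeArray {
    // Stand-in for a "total memory of this array" query:
    // the struct itself plus the buffer it points to.
    fn memory_size(&self) -> usize {
        size_of::<Self>() + self.values.capacity() * size_of::<i64>()
    }
}

// Simplified ScalarValue: one inline variant, one array-backed variant.
#[allow(dead_code)]
enum ScalarValue {
    Int64(Option<i64>),
    List(Arc<FakeArray>),
}

impl ScalarValue {
    fn heap_size(&self) -> usize {
        match self {
            ScalarValue::Int64(_) => 0,
            // The Arc'ed array lives entirely on the heap.
            ScalarValue::List(array) => array.memory_size(),
        }
    }
}

// Simplified Precision<T>.
#[allow(dead_code)]
enum Precision<T> {
    Exact(T),
    Absent,
}

fn precision_heap_size(p: &Precision<ScalarValue>) -> usize {
    match p {
        Precision::Exact(v) => v.heap_size(),
        Precision::Absent => 0,
    }
}

struct ColumnStatistics {
    min_value: Precision<ScalarValue>,
    max_value: Precision<ScalarValue>,
}

impl ColumnStatistics {
    fn heap_size(&self) -> usize {
        precision_heap_size(&self.min_value) + precision_heap_size(&self.max_value)
    }
}

struct Statistics {
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    // The Vec's buffer plus whatever each element owns on the heap.
    fn heap_size(&self) -> usize {
        self.column_statistics.capacity() * size_of::<ColumnStatistics>()
            + self.column_statistics.iter().map(|c| c.heap_size()).sum::<usize>()
    }
}

fn example() -> Statistics {
    let array = Arc::new(FakeArray { values: vec![1, 2, 3] });
    Statistics {
        column_statistics: vec![ColumnStatistics {
            min_value: Precision::Exact(ScalarValue::List(array)),
            max_value: Precision::Exact(ScalarValue::Int64(Some(9))),
        }],
    }
}

fn main() {
    println!("{} heap bytes", example().heap_size());
}
```

In this sketch the only place where an array-level size query is needed is the List variant; every other level just forwards to its fields.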
ebef154 to 107cacc
This adds a heap_size method returning the amount of memory a statistics struct allocates on the heap.
17a4cd1 to 153d1ad
fn heap_size(&self) -> usize {
    self.num_rows.heap_size()
        + self.total_byte_size.heap_size()
        + self
num_rows.heap_size() and total_byte_size.heap_size() both return 0; they are included for consistency but could also be omitted.
Which issue does this PR close?
Relates to #19052 (comment)
Rationale for this change
This adds heap memory estimation to statistics.
What changes are included in this PR?
NA.
Are these changes tested?
Yes
Are there any user-facing changes?
Adds a new HeapSize trait and implementations for all relevant types used in memory estimation. The trait is taken from arrow-rs, where it is currently private, and is intended as a temporary solution until arrow-rs is updated.
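For readers unfamiliar with the pattern, a trait of this kind might look roughly like the sketch below. It is modeled on the description above, not copied from this PR or from arrow-rs, so the exact set of implementations and the accounting rules (for example, ignoring Arc reference counters) are assumptions.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Reports only the bytes a value owns on the heap, excluding its inline size.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

// Plain integers own nothing on the heap.
impl HeapSize for usize {
    fn heap_size(&self) -> usize {
        0
    }
}

// A String owns its byte buffer.
impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

// A Vec owns its buffer plus whatever each element owns.
impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        self.capacity() * size_of::<T>()
            + self.iter().map(|item| item.heap_size()).sum::<usize>()
    }
}

// An Option stores its payload inline, so only the payload's heap counts.
impl<T: HeapSize> HeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map_or(0, |inner| inner.heap_size())
    }
}

// An Arc places the whole T on the heap (reference counters ignored here).
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        size_of::<T>() + (**self).heap_size()
    }
}

fn main() {
    let names: Vec<String> = vec![String::from("a"), String::from("bc")];
    println!("Vec<String>: {} heap bytes", names.heap_size());
    println!("Arc<usize>:  {} heap bytes", Arc::new(7usize).heap_size());
}
```

With impls like these, a struct's heap_size is simply the sum of its fields' heap_size values, which is why zero-returning fields can be listed for consistency without changing the result.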