-
Notifications
You must be signed in to change notification settings - Fork 73
add new minmax function
#526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mdtanker for pushing this forward! I just left a few minor suggestions. Let me know what do you think!
It might be also nice to check if min_percentile <= max_percentile so we are ensure that min <= max regardless of the choices of percentile values.
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Since percentile calculations are slower than min/max, I only do them if non-default values (0,100) are provided. If only 1 non-default percentile provided, only calculate that percentile and use min/max for other value.
|
Here are some timings for future reference: import numpy as np
import verde as vd
a = np.random.uniform(size=(27, 100))
|
santisoler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, @mdtanker. Just left a few comments. Let me know what do you think!
| # get min value | ||
| if min_percentile == 0: | ||
| min_ = npmin([npmin(i) for i in arrays]) | ||
|
|
||
| # get max value | ||
| if max_percentile == 100: | ||
| max_ = npmax([npmax(i) for i in arrays]) | ||
|
|
||
| # calculate min, max or both percentiles | ||
| if min_percentile != 0 or max_percentile != 100: | ||
| # concatenate values of all arrays | ||
| combined_array = np.concatenate([a.ravel() for a in arrays]) | ||
|
|
||
| # if neither percentiles are defaults, calculate them together | ||
| if min_percentile != 0 and max_percentile != 100: | ||
| min_, max_ = nppercentile(combined_array, [min_percentile, max_percentile]) | ||
|
|
||
| # only calculate min percentile | ||
| elif min_percentile != 0: | ||
| min_ = nppercentile(combined_array, min_percentile) | ||
|
|
||
| # only calculate max percentile | ||
| if max_percentile != 100: | ||
| max_ = nppercentile(combined_array, max_percentile) | ||
|
|
||
| return min_, max_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could simplify this logic a bit. What about this:
| # get min value | |
| if min_percentile == 0: | |
| min_ = npmin([npmin(i) for i in arrays]) | |
| # get max value | |
| if max_percentile == 100: | |
| max_ = npmax([npmax(i) for i in arrays]) | |
| # calculate min, max or both percentiles | |
| if min_percentile != 0 or max_percentile != 100: | |
| # concatenate values of all arrays | |
| combined_array = np.concatenate([a.ravel() for a in arrays]) | |
| # if neither percentiles are defaults, calculate them together | |
| if min_percentile != 0 and max_percentile != 100: | |
| min_, max_ = nppercentile(combined_array, [min_percentile, max_percentile]) | |
| # only calculate min percentile | |
| elif min_percentile != 0: | |
| min_ = nppercentile(combined_array, min_percentile) | |
| # only calculate max percentile | |
| if max_percentile != 100: | |
| max_ = nppercentile(combined_array, max_percentile) | |
| return min_, max_ | |
| if min_percentile == 0 and max_percentile == 0: | |
| min_ = npmin([npmin(i) for i in arrays]) | |
| max_ = npmax([npmax(i) for i in arrays]) | |
| return min_, max_ | |
| # concatenate values of all arrays | |
| combined_array = np.concatenate([a.ravel() for a in arrays]) | |
| min_, max_ = nppercentile(combined_array, [min_percentile, max_percentile]) | |
| return min_, max_ |
If any of the min_percentile and max_percentile is different from the defaults, then we'll use nppercentile anyways. And a min_percentile=0 will always lead to the minimum value. Moreover, I suspect there's no significant computational hit in passing two percentiles, since I guess that Numpy is computing each while looping over the elements in the array once.
So, we would only use npmin and npmax if min_percentile==0 and max_percentile==100.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a bit of time savings from only calculating one percentile and using min/max for the other. But maybe the time savings is not worth the extra code? At least that's how I interpreted the benchmarking results from below. The first uses min and max, with no percentiles, the second uses 1 and 99 percentiles, and the third uses min and 99 percentiles. As you can see, using 1 percentile and 1 min/max calculation is a bit fast than using 2 percentile calculations.
What do you think?
%timeit vd.minmax(a, min_percentile=0, max_percentile=100, nan=True)
%timeit vd.minmax(a, min_percentile=1, max_percentile=99, nan=True)
%timeit vd.minmax(a, min_percentile=0, max_percentile=99, nan=True)
>>>50.5 μs ± 2.56 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>>289 μs ± 26.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>>179 μs ± 3.93 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I would have guessed that passing multiple percentiles would not impact the computation time that much. But nonetheless, those differences are not significant, so I think I would keep the code simpler. Do you agree?
Even for a large array with 100 million elements, the difference is not very noticeable by users:
import numpy as np
a = np.random.default_rng().uniform(size=100_000_000)
# benchmark np.min and np.percentile with a single value
%timeit np.min(a)
%timeit np.percentile(a, 90)38.1 ms ± 351 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
862 ms ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# benchmark percentile with two values
%timeit np.percentile(a, [0, 90])1.01 s ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Co-authored-by: Santiago Soler <santisoler@fastmail.com>
Adds an equivalanet function to
maxabsbut for calculating the min and max values of arrays, or optionally user-specified percentiles of the values.This is useful for
Relevant issues/PRs:
Implements the function requested in #525
Follows the percentiles approach from #524 and #523