-
Notifications
You must be signed in to change notification settings - Fork 112
Add move_quantile function #418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Implementation of moving quantile instead of moving median, which is a partial case of quantile. For now using the fixed constant in #define
Add quantile and has_quantile parameters to the template
Added all imports Fixed a bug when q=1. Call move_max for this case (on python layer) Added a lot of tests for move_median and move_quantile
Fix keyword argument "method"/"interpolation" for different numpy verisons (keyword was changed after 1.22.0) Copy over doc string for move_quantile from C layer to Python layer
This reverts commit c2a2ae3. move_min is significanlty faster than move_quantile with q = 0. So in case of q=0 apply move_min instead. Same for q=1 and move_max.
Instead of move_quantile substituting move_median completely, have both move_median and move_quantile implemented separately.
Remove the wrapper for move_quantile on python level which checked for q = 0 or 1. Now it's fully in C. Also check for q=0.5 as we checked it's 3-4% faster to call move_median
Add some more tests
Remove redundant import
Mostly get rid of macros
for versions comparison
This eliminate the need for macro in move_median.c mm_handle will just have an unused membet "quanitle" for the case of move_median.
move_median and move_quantile now have all the same functions except for the construction of mm/mq.
Update 1The implementation of move_median.c was refactored in 72677f8 to remove code repetition and usage of macro completely. Now move_median and move_quantile use the same functions for managing heaps, and only differ when they calculate the actual statistic. This makes the implementation of both mm and mq at the same time cleaner while keeping the performance of move_median unchanged. The diff in the source code is much smaller now. |
Update 2In 2c892db added a very simple python layer for move_quantile to support iterable q argument. Also argument |
|
Hi @rdbisme, I was wondering if someone could take a look or make a comment on this PR at their convenience. I know it's a large one, just want to understand what I could expect from this. Thanks :) |
|
Ehi @andrii-riazanov, thanks for your contribution. I'm currently alone managing this package, mostly focusing on keep it easily available and installable on supported Python versions. I hope someone else from the community can step in and help to review implementations and improvements of the actual business logic as for your PR. Otherwise, I'll try to find a bit of free time to actually give it a look, but it might take time. Anyway, if anyone is reading this, feel free to step in this discussion and provide feedback :) |
|
While I'm not able to help directly with the code, I'm very thankful and eager to try this out. Also:
|
This PR adds
move_quantileto the list of supported move functions.Why?
Quantiles (and moving quantiles) are often useful statistics to look at, and having a fast move version of quantile would be great.
How?
Moving/rolling quantile is implemented in almost exactly the same way as moving median: via two heaps (max-heap and min-heap). The only difference is in sizes of the heaps -- for
move_medianthey should have the same size (modulo parity nuances), while for themove_quantilesizes of the heaps should be rebalanced differently.The changes to transform
move_medianintomove_quantileare very minor, and were implemented in the first commit 524afbf (36++, 13--). This commit fully implementedmove_quantilewith fixed q=0.25 out ofmove_median.The initial approach was to substitute
move_medianwithmove_quantilecompletely. Then, onmove_mediancall, just callmove_quantile(q=0.5). This is implemented and tested in commits until de181da , where fully workingmove_quantile(andmove_medianviamove_quantile) was implemented.At this point, new
move_medianbench was compared to oldmove_medianbench. It was observed that the newmove_medianbecame slower by 1-3%. Even though the changes were minor, apparently new arithmetic operations introduced were enough to cause this overhead. For a performance-oriented package with decrease in speed is not justifiable.It was decided to implement
move_quantileparallel tomove_median. This causes a lot of code repetition, but this needed to be done to not sacrificymove_medianperformance (and also to avoid abusing macros) cd49b4f . A lot of the functions inmove_median.cwere almost duplicated, hence a large diff. At this commit, bothmove_quantileandmove_medianwere fully implemented and almost fully tested.When
move_quantileis called withq=0., insteadmove_minis called, which is much faster. Similarly withq=1.andmove_max, and withq=0.5andmove_median.Only interpolation method "midpoint" was implemented for now.
Other changes
parse_argsinmove_template.cwas heavily refactored for better clarityTechnicalities
np.percentilegives unreasonable results when array containsnp.infnumpy/numpy#21932, BUG: inf in quantile has undefined behaviour (and possibly different for -inf vs +inf) numpy/numpy#21091 . In particular,np.nanquantile(q=0.5)doesn't give the same result asnp.nanmedianon such data, because of how arithmetic operation work on np.infs. Ourmove_quantilebehaves as expected and in agreement withmove_medianwhen q=0.5. To test properly (and have a numpy slow version ofmove_quantile), we notice thatnp.nanmedianbehaviour can be achieved if one takes(np.nanquantile(a, q=0.5, method="lower") + np.nanquantile(a, q=0.5, method="higher")) / 2. This is what we use for slow function if there are np.inf's in the data. The fact that this andnp.nanmedianreturn the same is tested inmove_test.py. This issue is also discussed in there in comments (which I used pretty liberally)a, the usualnp.nanquantileis called inbn.slow.move_quantile, so benching is "fair", since we don't consider infinite values during benching.Tests
move_quantile. With constantREPEAT_FULL_QUANTILEset to 1 intest_move_quantile_with_infs_and_nans, the test considers 200k instances, and takes ~7 mins to run. It was tested with more repetitions and larger range of parameter, the current values are set so that the Github Actions tests run reasonable time.Benches
bn.move_quantileis significantly faster thanbn.slow.move_quantile:The increase in speed was tested and confirmed separately (outside of bn.bench) for sanity check. q = 0.25 is used for all benches with
move_quantile.np.nanquantileis.bn.bench(functions=["move_quantile"])runs for about 20 minutes:Further changes
Several things that can be improved with
move_quantilegoing further:parse_argfunction made it much easier to pass additional arguments to functions inmove. Changing behavior ofmq_get_quantileshould not be a problem as wellnp.quantilesupports a list (iterable) of quantiles to compute. Can also add it here, quite easy to do if implement it at the first step on python level.qa required argument formove_quantile(as it should be), but was met with some complications and left it as is. If will create a python wrapper to parse the iterableqinput anyway, can add non-keywordqto that python layer.Wrap-up
Thanks for considering, and sorry for a large diff. 50% of that is duplicating code in move_median.c, and another 20% is new tests. You can see in de181da how few changes were actually made for
move_quantileto work, but this approach just unfortunately slowed downmove_medianby a bit.