
Conversation

@beingamanforever

As pointed out in the community discussion, median(x) has a hidden sorting cost; this is a possible fix.

Summary

This change replaces the full sort used in median() for dense inputs with order-statistic selection via nth_element, reducing average complexity from O(n log n) to O(n). Sparse inputs retain the original sort-based code path to preserve historical behavior and output sparsity.


Algorithmic change

Previous implementation:

  • sort(x) followed by indexing middle element(s)

New implementation:

  • nth_element(x, k) for odd length
  • nth_element(x, [k, k+1]) for even length

This computes only the required order statistics without fully sorting.
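The odd/even selection above can be sketched in C++ (an illustrative sketch, not the actual patch; `median_select` is a made-up name). For the even case, a single `nth_element` call suffices: after it places the upper-middle element, the lower-middle one is simply the maximum of the left partition.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative selection-based median.  One nth_element call handles both
// parities: for even n, everything left of position k is <= x[k], so the
// lower middle value is the max of that left partition.
double median_select (std::vector<double> x)
{
  const std::size_t n = x.size ();
  const std::size_t k = n / 2;
  std::nth_element (x.begin (), x.begin () + k, x.end ());
  if (n % 2 == 1)
    return x[k];
  // Even length: the (k-1)-th order statistic is the max of x[0..k-1].
  double lo = *std::max_element (x.begin (), x.begin () + k);
  return lo + (x[k] - lo) / 2;
}
```

Both `nth_element` calls run in O(n) on average, so no O(n log n) full sort is performed.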


NaN handling

NaN semantics are preserved explicitly:

  • "omitnan": NaNs are filtered before selection.
  • "includenan": slices containing NaNs return NaN.

This matches the behavior previously provided implicitly by sort().
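The two policies can be made explicit in a C++ sketch (function name and structure are illustrative, not Octave's internal API; assumes a non-empty result after filtering):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative sketch: resolve the NaN policy first, then select.
double median_with_nan_policy (std::vector<double> x, bool omitnan)
{
  auto is_nan = [] (double v) { return std::isnan (v); };

  if (std::any_of (x.begin (), x.end (), is_nan))
    {
      if (! omitnan)   // "includenan": a NaN poisons the whole slice
        return std::numeric_limits<double>::quiet_NaN ();
      // "omitnan": filter NaNs out, then select on what remains
      x.erase (std::remove_if (x.begin (), x.end (), is_nan), x.end ());
    }

  const std::size_t k = x.size () / 2;
  std::nth_element (x.begin (), x.begin () + k, x.end ());
  if (x.size () % 2 == 1)
    return x[k];
  double lo = *std::max_element (x.begin (), x.begin () + k);
  return lo + (x[k] - lo) / 2;
}
```

The `any_of` scan is O(n), so the explicit check does not change the overall complexity.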


Sparse behavior

Sparse inputs continue to use the original sort-based code path to preserve:

  • sparsity of outputs,
  • historical indexing behavior,
  • and memory characteristics.

Dense inputs use the new nth_element path.


I went a step further and ran a benchmark to check whether the change really delivers the promised speedup; the results are below. I am also attaching the benchmarking script.

Performance

Benchmarks on large dense inputs (Octave 9.x, Apple clang):

| Case | Size | Before (s) | After (s) | Speedup |
|---|---|---|---|---|
| Vector | 10,000 | 0.001555 | 0.000634 | 2.5× |
| Vector | 100,000 | 0.008014 | 0.001303 | 6.2× |
| Vector | 500,000 | 0.044739 | 0.005207 | 8.6× |
| Vector | 1,000,000 | 0.095003 | 0.012283 | 7.7× |
| Matrix (n×100) | 10,000 | 0.060745 | 0.015934 | 3.8× |
| Matrix (n×100) | 100,000 | 0.754142 | 0.126997 | 5.9× |
| Matrix (n×100) | 500,000 | 4.648894 | 0.590293 | 7.9× |
| Matrix (n×100) | 1,000,000 | 9.967157 | 1.471443 | 6.8× |
| omitnan (n×100) | 10,000 | 0.071202 | 0.015734 | 4.5× |
| omitnan (n×100) | 100,000 | 0.717203 | 0.117018 | 6.1× |
| omitnan (n×100) | 500,000 | 4.470758 | 0.560758 | 8.0× |
| omitnan (n×100) | 1,000,000 | 9.405867 | 1.238683 | 7.6× |

Speedup grows with input size, consistent with removal of the log n factor.
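A minimal single-size probe for reproducing this comparison is sketched below (this is not the attached benchmark script; there is no warmup or repetition, and `time_median_paths` is a made-up name). It also cross-checks that both paths agree before timing them, which any fair benchmark should do.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Rough timing probe: returns {sort_microsec, nth_element_microsec} for one
// random vector of odd length n, or {-1, -1} if the two paths disagree.
std::pair<long, long> time_median_paths (std::size_t n)
{
  std::mt19937 gen (42);
  std::uniform_real_distribution<double> dist (0.0, 1.0);
  std::vector<double> x (n);
  for (double& v : x)
    v = dist (gen);

  using clock = std::chrono::steady_clock;

  std::vector<double> a = x;
  auto t0 = clock::now ();
  std::sort (a.begin (), a.end ());
  double m_sort = a[n / 2];          // odd n assumed for brevity
  auto t1 = clock::now ();

  std::vector<double> b = x;
  std::nth_element (b.begin (), b.begin () + n / 2, b.end ());
  double m_nth = b[n / 2];
  auto t2 = clock::now ();

  if (m_sort != m_nth)               // sanity: both paths must agree
    return {-1, -1};

  auto us = [] (clock::duration d)
    { return (long) std::chrono::duration_cast<std::chrono::microseconds> (d).count (); };
  return {us (t1 - t0), us (t2 - t1)};
}
```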


(Plot: bench_median_comparison)

The benchmarking script can be found here.
The CSV results from which the plots were generated can be found here.


Notes

  • No semantic or API changes.
  • Only the internal selection algorithm is modified.
  • All existing tests pass.
  • All existing BISTs pass unchanged.
  • As mentioned in the community discussion, NaN values are handled properly.
  • If this gets merged, I can move on to improving the KDTree implementation and its build time, since median was a bottleneck there, as raised in this issue.

@beingamanforever
Author

beingamanforever commented Jan 22, 2026

Just to stay in sync with the issue, please refer to the community page too!

Member

@mmuetzel mmuetzel left a comment


Thank you for your contribution.

I only had a very cursory look at the patch and didn't run any tests yet.

For now, only a minor style request. And one comment that needs clarification.

@beingamanforever
Author

I wrote a central helper function mid_two_vals for the even/odd median calculation, since this logic is used in multiple places in the code and was leading to duplication. It also handles all the special cases (Inf, integer overflow, etc.) in one place.

I changed the comments as per your suggestion too. All test cases pass.

@mmuetzel
Member

Why did you remove the BISTs that you added previously? Are they no longer passing?

```
function m = mid_two_vals (m1, m2, is_int)
  if (is_int)
    samesign = sign (m1) == sign (m2);
    m = samesign .* (m1 + (m2 - m1) / 2) + ! samesign .* ((m1 + m2) / 2);
```
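The overflow-avoidance trick in the excerpt above transcribes to C++ as follows (an illustrative sketch; `mid_two_vals_c` is a made-up name, and note that Octave rounds integer division while C++ truncates, so results can differ by 1 for odd gaps — the overflow argument is the point here):

```cpp
#include <cstdint>

// Same-sign/opposite-sign split of the integer midpoint.
int32_t mid_two_vals_c (int32_t m1, int32_t m2)
{
  bool samesign = (m1 < 0) == (m2 < 0);
  if (samesign)
    return m1 + (m2 - m1) / 2;   // |m2 - m1| fits in int32, so no overflow
  else
    return (m1 + m2) / 2;        // opposite signs: the sum cannot overflow
}
```

For example, with two large positive inputs the naive `(m1 + m2) / 2` would overflow a 32-bit integer, while the difference form stays in range.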
Member

@mmuetzel mmuetzel Jan 27, 2026


I haven't timed that, but there would probably be a couple fewer arithmetic instructions for something like this:

```
m(samesign) = m1(samesign) + (m2(samesign) - m1(samesign)) / 2;
m(! samesign) = (m1(! samesign) + m2(! samesign)) / 2;
```

Maybe that is outweighed by the indexing operations, but it seems worth testing (given that the motivation of this change is performance).

@beingamanforever
Author

@mmuetzel They were passing, but since those BISTs were effectively testing impossible inputs rather than real behaviour, I dropped them.

Let me think of a way to benchmark the hot paths with extreme precision; in my experience, at very fine granularity the benchmarking methodology itself becomes important (warmup steps, blocking background processes, testing on multiple different architectures, etc.). I will add the BISTs back along with this commit and the results (there is no such thing as too many BISTs!).

@mmuetzel
Member

mmuetzel commented Jan 27, 2026

If we had those BISTs, you would have caught earlier that something was wrong with the intermediate implementation you had. So, having exactly those kinds of tests is useful. Could you please add them again?

@beingamanforever
Author

done!

@beingamanforever beingamanforever force-pushed the median-nth-element-refactor branch from b992a65 to bc2087d Compare January 27, 2026 19:37
@beingamanforever beingamanforever force-pushed the median-nth-element-refactor branch from bc2087d to 7d220c1 Compare January 27, 2026 19:38
@beingamanforever
Author

beingamanforever commented Jan 27, 2026

@mmuetzel I ran a simple microbenchmark comparing the current mask-multiplication form with a masked-indexing variant. On my machine (N = 1e7), the mask-multiplication version was consistently faster (~2x). This is likely because it avoids gather/scatter and keeps contiguous memory access, whereas the masked-indexing version introduces extra memory traffic.
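A C++ analogue of the two forms being compared (illustrative only; `blend_mask_multiply` is a made-up name, and Octave's vectorized execution differs from a hand-written loop) — the mask-multiply form evaluates both branches for every element and blends them arithmetically, keeping the loop branch-free and the memory accesses contiguous:

```cpp
#include <cstddef>
#include <vector>

// Branch-free blend: m[i] = a[i] where mask[i], else b[i].
std::vector<double> blend_mask_multiply (const std::vector<bool>& mask,
                                         const std::vector<double>& a,
                                         const std::vector<double>& b)
{
  std::vector<double> m (a.size ());
  for (std::size_t i = 0; i < a.size (); i++)
    m[i] = mask[i] * a[i] + (! mask[i]) * b[i];   // bool promotes to 0/1
  return m;
}
```

The masked-indexing variant would instead gather the elements where the mask holds, operate on that subset, and scatter the results back, which is the extra memory traffic described above.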

