performance of median and iqr compared to python libraries #3462

@lampretl

I'd like to compute, efficiently and in parallel, the median (q_0.5) and the IQR (q_0.75 - q_0.25) of each column in a dataframe. Let's compare the three most used libraries:

pandas:

import numpy as np, pandas as pd, scipy
n,m=10**8,10;   df = pd.DataFrame(np.random.rand(n,m))
%time df.median(axis=0)
%time df.quantile(0.5)
%time df.quantile(0.75)-df.quantile(0.25)
%time scipy.stats.iqr(df,axis=0)
CPU times: user 23.4 s, sys: 921 ms, total: 24.4 s
Wall time: 24.4 s
CPU times: user 20.3 s, sys: 830 ms, total: 21.1 s
Wall time: 21.2 s
CPU times: user 39.9 s, sys: 1.71 s, total: 41.6 s
Wall time: 41.6 s
CPU times: user 25.6 s, sys: 5.28 s, total: 30.9 s
Wall time: 31 s
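
To pin down what is being timed, here is a minimal sketch of the two statistics on a tiny input, assuming NumPy's default linear interpolation. The helper name `column_median_iqr` is hypothetical and only for illustration; it is not part of any of the libraries benchmarked here, and nothing is claimed about its speed on the 10**8-row case.

```python
import numpy as np

def column_median_iqr(a):
    """Return (median, IQR) for each column of a 2-D array.

    Hypothetical helper for illustration; uses linear interpolation,
    which is NumPy's default quantile method.
    """
    a = np.asarray(a, dtype=float)
    # Ask for all three quantiles in one call; the result has shape (3, n_columns).
    q25, q50, q75 = np.quantile(a, [0.25, 0.5, 0.75], axis=0)
    return q50, q75 - q25

small = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0]])
med, iqr = column_median_iqr(small)
print(med.tolist())  # [3.0, 30.0]
print(iqr.tolist())  # [2.0, 20.0]
```

For the first column, q_0.25 = 2 and q_0.75 = 4, so the IQR is 2, matching the definition above.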

polars:

import numpy as np, polars as pl
n,m=10**8,10;   df = pl.DataFrame(np.random.rand(n,m), schema=[f"x{k}" for k in range(m)])
%time df.median()
%time df.quantile(0.75,interpolation='linear')
%time df.quantile(0.75,interpolation='linear') - df.quantile(0.25,interpolation='linear')
CPU times: user 21.4 s, sys: 3.51 s, total: 24.9 s
Wall time: 2.95 s
CPU times: user 19.2 s, sys: 3.86 s, total: 23.1 s
Wall time: 2.95 s
CPU times: user 43.8 s, sys: 11.4 s, total: 55.2 s
Wall time: 6.44 s

DataFrames.jl + Julia:

using DataFrames, Statistics, StatsBase   # median is defined in Statistics, iqr in StatsBase
n,m=10^8,10;   df = DataFrame(rand(n,m), :auto);
function f1(df::DataFrame) ::Vector{Float64}  return map(median, eachcol(df)) end
function f2(df::DataFrame) ::Vector{Float64}  return map(iqr, eachcol(df)) end
function f3(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = median(df[:,j]) end; return res end
function f4(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = iqr(df[:,j]) end; return res end
@time f1(df);
@time f2(df);
@time f3(df);
@time f4(df);
14.686185 seconds (53 allocations: 14.901 GiB, 4.56% gc time)
86.758428 seconds (53 allocations: 7.451 GiB, 0.36% gc time)
8.259288 seconds (146 allocations: 22.352 GiB, 9.15% gc time)
50.395623 seconds (144 allocations: 14.901 GiB, 0.47% gc time)

Is there a better, more efficient way to compute medians and IQRs in Julia?
