Description
I'd like to compute the median (q_0.5) and the IQR (q_0.75 - q_0.25) of each column of a dataframe efficiently and in parallel. Below I compare three widely used libraries on the same task.
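For reference, here is a minimal sketch of the two statistics on a tiny table, using only the Python standard library. The `columns` dict and the `median_and_iqr` helper are hypothetical names made up for this illustration. `method="inclusive"` interpolates linearly between data points, which I believe matches the default linear interpolation in NumPy and pandas.

```python
from statistics import quantiles

# Hypothetical toy table: column name -> values.
columns = {
    "x0": [1, 2, 3, 4, 5],
    "x1": [10, 20, 30, 40],
}

def median_and_iqr(values):
    """Return (median, IQR), interpolating linearly between data points."""
    # n=4 yields the three quartile cut points q_0.25, q_0.5, q_0.75.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return q50, q75 - q25

summary = {name: median_and_iqr(vals) for name, vals in columns.items()}
print(summary)
```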
pandas:
import numpy as np, pandas as pd, scipy.stats
n,m=10**8,10; df = pd.DataFrame(np.random.rand(n,m))
%time df.median(axis=0)
%time df.quantile(0.5)
%time df.quantile(0.75)-df.quantile(0.25)
%time scipy.stats.iqr(df,axis=0)
CPU times: user 23.4 s, sys: 921 ms, total: 24.4 s
Wall time: 24.4 s
CPU times: user 20.3 s, sys: 830 ms, total: 21.1 s
Wall time: 21.2 s
CPU times: user 39.9 s, sys: 1.71 s, total: 41.6 s
Wall time: 41.6 s
CPU times: user 25.6 s, sys: 5.28 s, total: 30.9 s
Wall time: 31 s
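The IQR line above calls `quantile` twice, and its timing is roughly twice that of a single quantile. One possible alternative is to request all three quantiles in a single call. The sketch below does this with `np.quantile` on a small stand-in array (the 1000 x 4 shape and the seed are arbitrary choices for illustration). I have not benchmarked it at the 10**8 x 10 scale, so whether it is actually faster there is an open question.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 4))  # small stand-in for the 10**8 x 10 array

# One call returns all three quantiles per column: shape (3, n_columns).
q25, q50, q75 = np.quantile(a, [0.25, 0.5, 0.75], axis=0)

median = q50      # per-column median
iqr = q75 - q25   # per-column interquartile range
```

pandas accepts a list in the same way, e.g. `df.quantile([0.25, 0.5, 0.75])`, which returns one row per requested quantile.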
polars:
import numpy as np, polars as pl
n,m=10**8,10; df = pl.DataFrame(np.random.rand(n,m), schema=[f"x{k}" for k in range(m)])
%time df.median()
%time df.quantile(0.75,interpolation='linear')
%time df.quantile(0.75,interpolation='linear') - df.quantile(0.25,interpolation='linear')
CPU times: user 21.4 s, sys: 3.51 s, total: 24.9 s
Wall time: 2.95 s
CPU times: user 19.2 s, sys: 3.86 s, total: 23.1 s
Wall time: 2.95 s
CPU times: user 43.8 s, sys: 11.4 s, total: 55.2 s
Wall time: 6.44 s
DataFrames.jl + Julia:
using DataFrames, StatsBase
n,m=10^8,10; df = DataFrame(rand(n,m), :auto);
function f1(df::DataFrame) ::Vector{Float64} return map(median, eachcol(df)) end
function f2(df::DataFrame) ::Vector{Float64} return map(iqr, eachcol(df)) end
function f3(df::DataFrame) ::Vector{Float64} m=size(df,2); res=fill(NaN,m); Threads.@threads for j=1:m res[j] = median(df[:,j]) end; return res end
function f4(df::DataFrame) ::Vector{Float64} m=size(df,2); res=fill(NaN,m); Threads.@threads for j=1:m res[j] = iqr(df[:,j]) end; return res end
@time f1(df);
@time f2(df);
@time f3(df);
@time f4(df);
14.686185 seconds (53 allocations: 14.901 GiB, 4.56% gc time)
86.758428 seconds (53 allocations: 7.451 GiB, 0.36% gc time)
8.259288 seconds (146 allocations: 22.352 GiB, 9.15% gc time)
50.395623 seconds (144 allocations: 14.901 GiB, 0.47% gc time)
Is there a better, more efficient way to compute medians and IQRs in Julia?