As originally discussed in #2637
There appears to be a significant performance degradation when filtering large datasets.
For example:
using DataFrames, BenchmarkTools  # BenchmarkTools provides @btime
df1 = DataFrame(rand(Int(2e6), 200));  # 2_000_000 rows x 200 columns, auto-named :x1..:x200

function filter_df_1(df)
    df_f = filter(:x1 => x -> (x >= 0.1 && x <= 0.9), df)
    return df_f
end
@btime filter_df_1($df1);
15.972 s (617 allocations: 2.40 GiB)
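For comparison, the same rows can be selected with a plain boolean mask. This is a minimal sketch (assuming the same df1 as above and BenchmarkTools in scope) to check whether the slowdown is specific to filter or inherent to copying a large fraction of a 2e6 x 200 table:

# Select the same rows via a boolean mask instead of filter.
function subset_df_1(df)
    mask = (df.x1 .>= 0.1) .& (df.x1 .<= 0.9)
    return df[mask, :]
end
@btime subset_df_1($df1);

If this path is substantially faster than filter for the same result size, the overhead would seem to lie in filter's code path rather than in allocating the output.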
System resources/RAM seemed under control while the above was running.
However, the size of the resulting DataFrame appears to matter greatly for performance. With a more restrictive filtering condition, the speed is fine:
function filter_df_1(df)
    df_f = filter(:x1 => x -> (x >= 0.85 && x <= 0.9), df)
    return df_f
end
@btime filter_df_1($df1);
168.308 ms (617 allocations: 154.34 MiB)
Does this behavior have more to do with how Julia handles memory/GC, or with the actual DataFrames implementation? Comparative tests with R's data.table show no such latency when performing the same filtering.
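To separate the cost of evaluating the predicate from the cost of materializing the selected rows, one could time the two steps independently (again a sketch, assuming df1 and BenchmarkTools are in scope):

# Cost of the predicate alone (building the row mask):
@btime ($df1.x1 .>= 0.1) .& ($df1.x1 .<= 0.9);
# Cost of copying the selected rows alone, given a precomputed mask:
mask = (df1.x1 .>= 0.1) .& (df1.x1 .<= 0.9);
@btime $df1[$mask, :];

If the copy step dominates and scales with the number of selected rows, the latency is mostly in materializing the multi-GiB result; if filter remains far slower than the sum of these two steps, the gap points at filter itself.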
DataFrames v0.22.5
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 7 4800HS with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver2)