Performance issue on filter #2663

@jeremiedb

As originally discussed in #2637

There appears to be a significant performance degradation when filtering on large datasets.

For example:

using DataFrames
using BenchmarkTools  # provides @btime
df1 = DataFrame(rand(Int(2e6), 200));
function filter_df_1(df)
    df_f = filter(:x1 => x -> (x >= 0.1 && x <= 0.9), df)
    return df_f
end
@btime filter_df_1($df1);
 15.972 s (617 allocations: 2.40 GiB)
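For comparison, here is an equivalent mask-based subsetting sketch (filter_df_mask is an illustrative name, not one of the original benchmarks); it produces the same result without going through filter and can help show whether the cost lies in the filter machinery or in materializing the rows:

function filter_df_mask(df)
    mask = (df.x1 .>= 0.1) .& (df.x1 .<= 0.9)  # boolean mask over the filter column
    return df[mask, :]                          # copy the matching rows
end
@btime filter_df_mask($df1);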

Computer resources/RAM seemed under control while running filter_df_1.

However, the size of the resulting DataFrame appears to have a major impact on performance. With a more restrictive filtering condition, the speed is fine:

function filter_df_1(df)
    df_f = filter(:x1 => x -> (x >= 0.85 && x <= 0.9), df)
    return df_f
end
@btime filter_df_1($df1);
  168.308 ms (617 allocations: 154.34 MiB)
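For scale (arithmetic added here, not in the original report): with uniformly distributed rand data, the first condition keeps about 80% of the 2e6 rows and the second about 5%. Allocations scale roughly in proportion to the result size (2.40 GiB vs 154.34 MiB, about 16x), but the runtime gap is about 95x, which is the anomaly. A quick sanity check on the result sizes:

count(x -> 0.1 <= x <= 0.9, df1.x1)   # ≈ 1_600_000 rows kept by the first condition
count(x -> 0.85 <= x <= 0.9, df1.x1)  # ≈ 100_000 rows kept by the second condition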

Would such behavior have more to do with how Julia handles memory/GC, or with the actual DataFrames implementation? Comparative tests with R's data.table show no such latency issue when performing the same filtering.
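One way to separate the two hypotheses (a diagnostic sketch, assuming df1 and filter_df_1 as defined above): @time reports the fraction of time spent in garbage collection, and filtering the raw column vector alone gives a floor on the cost of materializing the matches without any DataFrames machinery:

@time filter_df_1(df1);                    # the reported "% gc time" shows how much is garbage collection
v = df1.x1;
@btime $v[($v .>= 0.1) .& ($v .<= 0.9)];   # raw-vector baseline, no DataFrames involved

If the raw-vector baseline is fast and @time shows little GC, the overhead would point at the DataFrames implementation rather than at Julia's memory management.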

DataFrames v0.22.5

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 4800HS with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver2)
