@drizk1 drizk1 commented Apr 3, 2025

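I was experimenting with the `@summarize` macro to try to improve its benchmark, since it is the last macro that does not match DataFrames.jl speed.
This change roughly halves the benchmark time (median 14.4 ms to 6.8 ms) and also reduces memory use (101.43 MiB to 38.90 MiB) and allocations (529 to 399). It is not quite as fast as DataFrames.jl, but it is an improvement without any breaking changes.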

# Setup adapted from a Bogumił Kamiński blog post: https://bkamins.github.io/julialang/2022/05/27/strings.html
using Random, BenchmarkTools, DataFrames, TidierData, PooledArrays, CategoricalArrays, InlineStrings

Random.seed!(1234);
df = transform!(DataFrame(str=[randstring() for _ in 10:10^6]),
                       :str .=>
                       [inlinestrings, ByRow(Symbol),
                        PooledArray, CategoricalArray] .=>
                       [:istr, :sym, :pstr, :cstr]);

df.A = rand(1:10, nrow(df));
df.B = rand(1:10, nrow(df));
df.C = rand(1:10, nrow(df));
categories = ["Category1", "Category2", "Category3", "Category4"];
df.CatVar = categorical(rand(categories, nrow(df)));
strings = ["String1", "String2", "String3", "String4"];
df.StrVar = rand(strings, nrow(df));
morecats = ["Category1", "Category2", "Category3", "Category4","Category5", "Category6", "Category7", "Category8"];
df.More_Cats = categorical(rand(morecats, nrow(df)));
twelve_cats = ["Category1", "Category2", "Category3", "Category4","Category5", "Category6", "Category7", "Category8","Category9", "Category10", "Category11", "Category12"];
df.Cats_12 = categorical(rand(twelve_cats, nrow(df)));

Original summarize (before this PR)

@benchmark  @summarize(@group_by(df, CatVar), A_mean = mean(A)) 
BenchmarkTools.Trial: 247 samples with 1 evaluation per sample.
 Range (min … max):  11.454 ms … 148.554 ms  ┊ GC (min … max):  0.00% … 90.16%
 Time  (median):     14.436 ms               ┊ GC (median):     9.01%
 Time  (mean ± σ):   20.259 ms ±  21.745 ms  ┊ GC (mean ± σ):  34.06% ± 20.50%

   █                                                            
  ▇█▆▅▅▃▃▃▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▂
  11.5 ms         Histogram: frequency by time          117 ms <

 Memory estimate: 101.43 MiB, allocs estimate: 529.

DataFrames.jl reference

@benchmark df |> (df -> groupby(df, :CatVar)) |> (df -> combine(df,:A => mean)) 
BenchmarkTools.Trial: 1865 samples with 1 evaluation per sample.
 Range (min … max):  2.229 ms …  10.298 ms  ┊ GC (min … max): 0.00% … 76.43%
 Time  (median):     2.482 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.681 ms ± 515.947 μs  ┊ GC (mean ± σ):  8.88% ± 11.80%

     ▃█▆▆▃▁                                                    
  ▁▄███████▇▃▂▂▂▄▃▄▄▄▃▂▂▃▂▂▂▁▁▂▁▁▁▁▂▂▁▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  2.23 ms         Histogram: frequency by time        4.25 ms <

 Memory estimate: 7.64 MiB, allocs estimate: 286.
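
Because the timings are compared head to head, it can help to confirm that the two pipelines compute the same thing. The following is a minimal sketch, not part of the PR itself: the tiny `small` data frame and the names `tidier` and `reference` are made up for illustration, and the output column is renamed to `A_mean` on the DataFrames.jl side so the two results line up.

```julia
using DataFrames, TidierData, Statistics

# Hypothetical toy data; the real benchmark uses the ~10^6-row `df` from the setup.
small = DataFrame(CatVar = ["a", "b", "a", "b"], A = [1, 2, 3, 4])

# TidierData pipeline, same shape as the benchmarked call.
tidier = @summarize(@group_by(small, CatVar), A_mean = mean(A))

# DataFrames.jl pipeline, with the result column named to match.
reference = combine(groupby(small, :CatVar), :A => mean => :A_mean)

# Sort by the grouping key so row order cannot affect the comparison.
same_means = sort(tidier, :CatVar).A_mean ≈ sort(reference, :CatVar).A_mean
println("group means agree: ", same_means)
```

Running the same comparison before and after the change is one way to back up the "no breaking changes" claim on a given dataset.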

Updated summarize (this PR)

@benchmark  @summarize(@group_by(df, CatVar), A_mean = mean(A)) 
BenchmarkTools.Trial: 735 samples with 1 evaluation per sample.
 Range (min … max):  5.469 ms …  13.391 ms  ┊ GC (min … max):  4.81% … 52.58%
 Time  (median):     6.799 ms               ┊ GC (median):     9.00%
 Time  (mean ± σ):   6.802 ms ± 686.940 μs  ┊ GC (mean ± σ):  11.96% ±  5.91%

                     ▁▇▄▆▇▅ ▃▅ ▂ ▄█▅▆▃▄▃▆                      
  ▅▅▂▃▇▄▅▅▃▃▃▅▅▆▅▃▃▁▅██████▇██▇██████████▆▅▅▇▅▅▄▆▃▂▂▁▁▂▂▃▁▂▂▁ ▄
  5.47 ms         Histogram: frequency by time        8.22 ms <

 Memory estimate: 38.90 MiB, allocs estimate: 399.
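
For a quick read of the three reports, the ratios can be worked out directly from the reported medians and memory estimates. This is only arithmetic on the figures printed above; the variable names are chosen for illustration.

```julia
# Figures copied from the three benchmark reports above.
median_ms  = (original = 14.436, updated = 6.799, dataframes = 2.482)
memory_mib = (original = 101.43, updated = 38.90, dataframes = 7.64)

# How much faster the updated macro is than the original (median time).
speedup = median_ms.original / median_ms.updated

# How much less memory the updated macro is estimated to allocate.
memory_reduction = memory_mib.original / memory_mib.updated

# How far the updated macro still is from the DataFrames.jl reference.
remaining_gap = median_ms.updated / median_ms.dataframes

println("median speedup over original: ", round(speedup, digits = 2), "x")
println("memory reduction over original: ", round(memory_reduction, digits = 2), "x")
println("median time relative to DataFrames.jl: ", round(remaining_gap, digits = 2), "x")
```

On these numbers the updated macro is a little over 2x faster than the original at the median, while still taking between 2x and 3x the median time of the DataFrames.jl reference.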
