Description
Splitting this out from #4346 so that discussion of this topic alone can happen here.
I would like to propose that forderv default to retGrp=TRUE, which means secondary indices would carry the group attributes as well. As a result indices will be somewhat heavier, but this opens more possibilities for avoiding expensive re-computation. One of many examples:
# TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!
See also #2947.
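For context, here is a minimal sketch of what retGrp=TRUE adds to the result of forderv. It relies on the internal, unexported data.table:::forderv, so the attribute names and exact return value may differ between data.table versions. The toy vector x is made up for illustration.

```r
library(data.table)
forderv = data.table:::forderv  # internal function, not part of the public API

x = c(3L, 1L, 2L, 1L, 3L)       # toy input, made up for illustration

o = forderv(x, sort=TRUE, retGrp=FALSE)  # order only
p = forderv(x, sort=TRUE, retGrp=TRUE)   # order plus group information

ord = as.integer(p)             # drop attributes: the ordering permutation itself
starts = attr(p, "starts")      # positions in the ordered data where each group begins
grp_values = x[ord][starts]     # one value per group, without scanning x again
print(grp_values)
```

With the group starts kept next to the order, operations such as unique or grouping can in principle reuse them instead of recomputing.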
I made a small benchmark.
tl;dr
The differences in the timings below are significant. My conclusion is that we should not make this the default, but rather keep the group information whenever the user has already computed it, for example when calling unique. In that case there is no extra performance cost, and the information does not have to be computed again. It could also be computed when calling setindex.
Each comment in the script below describes one of the factors that was varied.
library(data.table)
set.seed(108)
forderv = data.table:::forderv
N = 1e8
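## th: number of threads (run one of the two)
setDTthreads(40L)
setDTthreads(1L)
## n unique: number of unique values (run one of the two)
DT = data.table(V1 = sample(N, N, FALSE))   # all rows unique
DT = data.table(V1 = sample(1:2, N, TRUE))  # only 2 groups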
## fun: order vs order+groups
system.time(o <- forderv(DT, by="V1", sort=TRUE, retGrp=FALSE))
system.time(p <- forderv(DT, by="V1", sort=TRUE, retGrp=TRUE))
I got the following timings:
d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")
cube(d, by=c("th","unqn"), j=sprintf("%.2f%%", mean(sec[fun=="o"]/sec[fun=="og"])*100))
# th unqn V1
#1: 40 1e+08 48.38%
#2: 40 2e+00 96.44%
#3: 1 1e+08 87.05%
#4: 1 2e+00 98.70%
#5: 40 NA 72.41%
#6: 1 NA 92.87%
#7: NA 1e+08 67.72%
#8: NA 2e+00 97.57%
#9: NA NA 82.64%
On average, finding the order without groups takes 82.6% of the time that order+groups takes.
The number of unique values (number of groups) matters: the ratio is 97.6% with 2 groups vs 67.7% with all rows unique. So with only 2 groups the difference is not significant, but with all unique rows, order alone takes on average 67.7% of the time of order+groups.
The number of threads matters as well: the ratio is 72.4% with 40 threads vs 92.9% with 1 thread.
With 40 threads and all unique rows, computing order+groups takes about twice as long as computing just the order (1.759s vs 0.851s). With 1 thread and all unique rows it is only about 15% slower (5.630s vs 4.901s).
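The same timings can also be read as a slowdown factor (order+groups relative to order only) for each scenario, which may be easier to interpret than the averaged percentages. This is only a sketch of the arithmetic on the numbers reported above; the column names slowdown and pct_extra are made up here.

```r
library(data.table)

# timings copied from the benchmark results above
d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")

# seconds for order+groups divided by seconds for order only, per scenario
slow = d[, .(slowdown = sec[fun=="og"] / sec[fun=="o"]), by=.(th, unqn)]
# extra time for computing groups, in percent of the order-only time
slow[, pct_extra := round((slowdown - 1) * 100, 1)]
print(slow)
```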
Regarding memory, the number of threads is not a factor.
With all unique rows the result takes twice as much memory, because the group starts are as long as the order itself, while with 2 groups it takes almost no extra memory.
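A quick way to see where the extra memory comes from is to compare the length of the order vector with the length of the "starts" attribute in both scenarios. This is a sketch only: it uses the internal data.table:::forderv, and N is assumed to be far smaller than the 1e8 used in the benchmark so that it runs quickly.

```r
library(data.table)
forderv = data.table:::forderv  # internal function; behaviour may vary by version
set.seed(108)
N = 1e5  # assumed small size for a quick check, not the benchmark size

DT_unique = data.table(V1 = sample(N, N, FALSE))  # all rows unique
DT_two = data.table(V1 = sample(1:2, N, TRUE))    # only 2 groups

p_unique = forderv(DT_unique, by="V1", sort=TRUE, retGrp=TRUE)
p_two = forderv(DT_two, by="V1", sort=TRUE, retGrp=TRUE)

# one group start per group: N of them when all rows are unique, 2 with two groups
n_starts_unique = length(attr(p_unique, "starts"))
n_starts_two = length(attr(p_two, "starts"))
print(c(order=length(p_unique), starts=n_starts_unique))
print(c(order=length(p_two), starts=n_starts_two))
```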