Keyed tables are already known to be sorted, so finding unique values is much easier than in the general case.
Compare:
NN = 1e8
set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(sample(1e5, NN, TRUE), key = 'V1')
system.time(unique(DT$V1))
# user system elapsed
# 1.354 0.415 1.798
system.time(DT[ , unique(V1)])
# user system elapsed
# 1.266 0.414 1.681
system.time(DT[ , TRUE, keyby = V1])
# user system elapsed
# 0.375 0.000 0.375
It seems to me that the second call to unique (i.e. the one within []) should be able to match, or beat, the final timing.
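To illustrate why sortedness helps: in a sorted vector all duplicates are adjacent, so one pass comparing each element to its predecessor is enough and no hashing is needed. A minimal base-R sketch, where `unique_sorted` is a hypothetical helper (not part of data.table) that assumes its input is already sorted and free of `NA`:

```r
# Hypothetical helper, not part of data.table.
# Assumes `x` is already sorted and contains no NA values.
unique_sorted = function(x) {
  n = length(x)
  if (n <= 1L) return(x)
  # keep the first element, then every element that differs from its predecessor
  x[c(TRUE, x[-1L] != x[-n])]
}

x = sort(sample(100L, 1e4, TRUE))
# for sorted input, the result agrees with base unique()
stopifnot(identical(unique_sorted(x), unique(x)))
```

This only shows the idea at the R level; an internal implementation would presumably do the same scan in C.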
If we were willing to, say, add a dt_primary_key class to the primary key column, we could also get this speed in the first approach by writing a unique.dt_primary_key method. I'm not sure how well this extends to multi-column keys, though (S4?).
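A rough sketch of what the S3 route might look like for a single key column. Everything here is hypothetical: `as_primary_key` is an illustrative constructor, data.table does not currently tag key columns with any class, and the sketch ignores `NA` values and the multi-column case:

```r
# Hypothetical constructor: tag a sorted vector with the proposed class.
as_primary_key = function(x) {
  stopifnot(!is.unsorted(x))
  structure(x, class = c('dt_primary_key', class(x)))
}

# S3 method for the base generic unique(x, incomparables = FALSE, ...).
# Relies on the class guaranteeing sortedness, so adjacent comparison suffices.
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  v = unclass(x)  # drop the class to work on the plain vector
  n = length(v)
  if (n <= 1L) return(v)
  v[c(TRUE, v[-1L] != v[-n])]
}

k = as_primary_key(sort(sample(50L, 1000, TRUE)))
# dispatches to unique.dt_primary_key and agrees with the default method
stopifnot(identical(unique(k), unique(unclass(k))))
```

Whether such a class would survive the usual column operations (subsetting, assignment by reference, re-keying) is a separate question that this sketch does not address.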