unique can be optimized on keyed data.tables #2947

@MichaelChirico

Description

Keyed tables are already known to be sorted, so finding unique values is much easier than it is in the general case.
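
On sorted input, a value is unique whenever it differs from its predecessor, so a single pass suffices instead of hashing. A minimal illustration in plain R (ignoring NA handling; this is not data.table's internal code):

# single pass over sorted input: keep elements that differ from their predecessor
sorted_unique = function(x) {
  if (length(x) <= 1L) return(x)
  x[c(TRUE, x[-1L] != x[-length(x)])]
}
sorted_unique(c(1L, 1L, 2L, 5L, 5L, 5L))
# [1] 1 2 5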

Compare:

library(data.table)

NN = 1e8

set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(sample(1e5, NN, TRUE), key = 'V1')

system.time(unique(DT$V1))
#    user  system elapsed 
#   1.354   0.415   1.798 

system.time(DT[ , unique(V1)])
#    user  system elapsed 
#   1.266   0.414   1.681 

system.time(DT[ , TRUE, keyby = V1])
#    user  system elapsed 
#   0.375   0.000   0.375 

It seems to me the second call to unique (i.e. within []) should be able to match (or exceed) that final timing.
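
As a rough user-level illustration of the equivalence (not the proposed internal optimization), the keyed grouping already yields exactly the unique key values:

# naming the j column avoids a clash with the grouping column
identical(DT[, unique(V1)], DT[, .(x = TRUE), keyby = V1]$V1)
# [1] TRUE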

If we were willing to do something like adding a dt_primary_key class to the primary key, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how extensible that is to multiple keys (S4?).
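
For illustration, a minimal sketch of that idea, assuming a hypothetical dt_primary_key class attached to a single key column (the class and method name are not part of data.table):

# hypothetical S3 method: the key column is known sorted, so adjacent
# comparison suffices. This only covers a single-column key; a composite
# key has no single vector to attach the class to, hence the question
# about multiple keys.
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  if (length(x) <= 1L) return(x)
  x[c(TRUE, x[-1L] != x[-length(x)])]
}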
