unique can be optimized on keyed data.tables #2947

@MichaelChirico

Description

Keyed tables are already known to be sorted, so finding unique values is much easier than it is in the general case.
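
On sorted input, a value is unique whenever it differs from its predecessor, so a single pass suffices instead of hashing. A minimal illustration in plain R (ignoring NA handling; this is not data.table's internal code):

# single pass over sorted input: keep elements that differ from their predecessor
sorted_unique = function(x) {
  if (length(x) <= 1L) return(x)
  x[c(TRUE, x[-1L] != x[-length(x)])]
}
sorted_unique(c(1L, 1L, 2L, 5L, 5L, 5L))
# [1] 1 2 5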

Compare:

library(data.table)

NN = 1e8

set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(sample(1e5, NN, TRUE), key = 'V1')

system.time(unique(DT$V1))
#    user  system elapsed 
#   1.354   0.415   1.798 

system.time(DT[ , unique(V1)])
#    user  system elapsed 
#   1.266   0.414   1.681 

system.time(DT[ , TRUE, keyby = V1])
#    user  system elapsed 
#   0.375   0.000   0.375 

It seems to me the second call to unique (i.e. within []) should be able to match (or exceed) that final timing.
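
As a rough user-level illustration of the equivalence (not the proposed internal optimization), the keyed grouping already yields exactly the unique key values:

# naming the j column avoids a clash with the grouping column
identical(DT[, unique(V1)], DT[, .(x = TRUE), keyby = V1]$V1)
# [1] TRUE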

If we were willing to do something like adding a dt_primary_key class to the primary key, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how extensible that is to multiple keys (S4?).
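
For illustration, a minimal sketch of that idea, assuming a hypothetical dt_primary_key class attached to a single key column (the class and method name are not part of data.table):

# hypothetical S3 method: the key column is known sorted, so adjacent
# comparison suffices. This only covers a single-column key; a composite
# key has no single vector to attach the class to, hence the question
# about multiple keys.
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  if (length(x) <= 1L) return(x)
  x[c(TRUE, x[-1L] != x[-length(x)])]
}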
