-
Notifications
You must be signed in to change notification settings - Fork 1k
Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+576
−90
Merged
Changes from all commits
Commits
Show all changes
62 commits
Select commit
Hold shift + click to select a range
b0efcf5
lazy forder
jangorecki aaead65
fix tests
jangorecki b0858df
fix tests
jangorecki 9df2668
respect use.index option
jangorecki 62982ed
bmerge timings
jangorecki eec4971
codecov
jangorecki 9affbfa
helper function
jangorecki b36136a
reduce diff to master
jangorecki 3442b78
rename fix
jangorecki dfe883e
setindex writes groups (retGrp=TRUE)
jangorecki efb3319
calc order, not groups
jangorecki 68b378d
expect to reach optimization
jangorecki 0bd0e1e
skip opt for list
jangorecki 71f5aeb
override retGrp=F by retGrp=T is legit use
jangorecki de916aa
more backward compatiblility, no retGrp from getindex
jangorecki 0a3da1d
more verbose messages during opts in forderLazy
jangorecki 068c1a9
recycle order 1/-1 argument in one place
jangorecki 326695d
precise verbose messages
jangorecki 8be2d53
recycle arg once _by_ arg is known
jangorecki 3383bbf
copy paste typo fix
jangorecki cc7a2b6
forder writing index disabled
jangorecki 514913a
fix tests
jangorecki ca2823d
code coverage
jangorecki 4e9bbb3
minor update for safer use of internal option
jangorecki 4d0dec3
fix bad name in unit test
jangorecki 1606046
retGrp=F requires downgrade idx and it seems to be costly
jangorecki a9b01ff
NA stats from forder
jangorecki ded1e1a
keyOpt fix, and existing tests
jangorecki 53cce98
fixes for na.last in key and setting idx
jangorecki 1e92366
filling tests for na.last=T and possible fixes
jangorecki 0f95e6f
more stats, any non ascii utf8
jangorecki bbe1d03
better naming of new stats attributes
jangorecki 91437de
add extra escape to escape IS_ASCII checks
jangorecki 5239c3f
Merge branch 'master' into lazy-forder
jangorecki 0f6b15a
Merge branch 'master' into lazy-forder
mattdowle 14e9168
Merge branch 'master' into lazy-forder
jangorecki 8bf5b5c
Merge branch 'master' into lazy-forder
MichaelChirico 0f15775
update test number after merge
MichaelChirico e2710b8
Merge branch 'master' into lazy-forder
MichaelChirico c8f5b7e
apply minor review feedback
MichaelChirico 92a97d1
More minor review feedback
MichaelChirico c2ae34a
use options= to set options
MichaelChirico 4184fe9
more feedback
MichaelChirico 0588d4d
Rename forderLazy->forderMaybePresorted
MichaelChirico 8b3a80a
UNPROTECT() more aggressively
MichaelChirico 497a08f
maybe_reset_index() helper
MichaelChirico b948345
Merge remote-tracking branch 'origin/master' into lazy-forder
MichaelChirico 3e898e9
Strict prototyping (-Wstrict-prototypes)
MichaelChirico d8e0ea7
Merge remote-tracking branch 'origin/lazy-forder' into lazy-forder
MichaelChirico d8adf3c
fix sloppy refactor for maybe_reset_index()
MichaelChirico 1d398aa
Fix implicit reliance on datatable.optimize
MichaelChirico ac19b83
Fix elsewhere, and encapsulate the logic inside a test()
MichaelChirico 9675f06
style
MichaelChirico e8b9dcd
spurious whitespace change
MichaelChirico a48ff5e
NEWS entry for lazy forder
jangorecki 59f5f21
tidy up NEWS
MichaelChirico 32c6305
PROTECT() on key attribute
MichaelChirico 050e048
Merge branch 'master' into lazy-forder
MichaelChirico 02ff718
rename arg/option 'lazy' -> 'reuseSorting'
MichaelChirico 320f496
MaybeSorted->ReuseSorting
MichaelChirico 799b4a9
other lazy= usage
MichaelChirico ffe431f
Merge branch 'master' into lazy-forder
MichaelChirico File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -50,23 +50,9 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU | |
| miss = !(cols %chin% colnames(x)) | ||
| if (any(miss)) stopf("some columns are not in the data.table: %s", brackify(cols[miss])) | ||
|
|
||
| ## determine, whether key is already present: | ||
| if (identical(key(x),cols)) { | ||
| if (!physical) { | ||
| ## create index as integer() because already sorted by those columns | ||
| if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer()) | ||
| setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), integer()) | ||
| } | ||
| return(invisible(x)) | ||
| } else if(identical(head(key(x), length(cols)), cols)){ | ||
| if (!physical) { | ||
| ## create index as integer() because already sorted by those columns | ||
| if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer()) | ||
| setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), integer()) | ||
| } else { | ||
| ## key is present but x has a longer key. No sorting needed, only attribute is changed to shorter key. | ||
| setattr(x,"sorted",cols) | ||
| } | ||
| if (physical && identical(head(key(x), length(cols)), cols)){ ## for !physical we need to compute groups as well #4387 | ||
| ## key is present but x has a longer key. No sorting needed, only attribute is changed to shorter key. | ||
| setattr(x,"sorted",cols) | ||
| return(invisible(x)) | ||
| } | ||
|
|
||
|
|
@@ -77,26 +63,20 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU | |
| } | ||
| if (!is.character(cols) || length(cols)<1L) stopf("Internal error. 'cols' should be character at this point in setkey; please report.") # nocov | ||
|
|
||
| newkey = paste(cols, collapse="__") | ||
| if (!any(indices(x) == newkey)) { | ||
| if (verbose) { | ||
| tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R | ||
| # suppress needed for tests 644 and 645 in verbose mode | ||
| catf("forder took %.03f sec\n", tt["user.self"]+tt["sys.self"]) | ||
| } else { | ||
| o = forderv(x, cols, sort=TRUE, retGrp=FALSE) | ||
| } | ||
| if (verbose) { | ||
| # we now also retGrp=TRUE #4387 for !physical | ||
| tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=!physical, reuseSorting=TRUE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R | ||
| # suppress needed for tests 644 and 645 in verbose mode | ||
| catf("forder took %.03f sec\n", tt["user.self"]+tt["sys.self"]) | ||
| } else { | ||
| if (verbose) catf("setkey on columns %s using existing index '%s'\n", brackify(cols), newkey) | ||
| o = getindex(x, newkey) | ||
| o = forderv(x, cols, sort=TRUE, retGrp=!physical, reuseSorting=TRUE) | ||
| } | ||
| if (!physical) { | ||
| if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer()) | ||
| setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), o) | ||
| if (!physical) { # index COULD BE saved from C forderReuseSorting already, but disabled for now | ||
| maybe_reset_index(x, o, cols) | ||
| return(invisible(x)) | ||
| } | ||
| setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. | ||
| if (length(o)) { | ||
| setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. Only when order changes. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there is a related comment next to test 1419.62 but we should not drop that index, as long as order does not change then index is still valid, and it can even be useful now, because it can store groups info. |
||
| if (verbose) { last.started.at = proc.time() } | ||
| .Call(Creorder,x,o) | ||
| if (verbose) { catf("reorder took %s\n", timetaken(last.started.at)); flush.console() } | ||
|
|
@@ -124,7 +104,7 @@ getindex = function(x, name) { | |
| if (!is.null(ans) && (!is.integer(ans) || (length(ans)!=nrow(x) && length(ans)!=0L))) { | ||
| stopf("Internal error: index '%s' exists but is invalid", name) # nocov | ||
| } | ||
| ans | ||
| c(ans) ## drop starts and maxgrpn attributes | ||
| } | ||
|
|
||
| haskey = function(x) !is.null(key(x)) | ||
|
|
@@ -160,19 +140,24 @@ is.sorted = function(x, by=NULL) { | |
| # Return value of TRUE/FALSE is relied on in [.data.table quite a bit on vectors. Simple. Stick with that (rather than -1/0/+1) | ||
| } | ||
|
|
||
| maybe_reset_index = function(x, idx, cols) { | ||
| if (isTRUE(getOption("datatable.forder.auto.index"))) return(invisible()) | ||
| if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer()) | ||
| setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), idx) | ||
| invisible(x) | ||
| } | ||
|
|
||
| ORDERING_TYPES = c('logical', 'integer', 'double', 'complex', 'character') | ||
| forderv = function(x, by=seq_along(x), retGrp=FALSE, sort=TRUE, order=1L, na.last=FALSE) | ||
| { | ||
| forderv = function(x, by=seq_along(x), retGrp=FALSE, retStats=retGrp, sort=TRUE, order=1L, na.last=FALSE, reuseSorting=getOption("datatable.reuse.sorting", NA)) { | ||
| if (is.atomic(x) || is.null(x)) { # including forderv(NULL) which returns error consistent with base::order(NULL), | ||
| if (!missing(by) && !is.null(by)) stopf("x is a single vector, non-NULL 'by' doesn't make sense") | ||
| by = NULL | ||
| } else { | ||
| if (!length(x)) return(integer(0L)) # e.g. forderv(data.table(NULL)) and forderv(list()) return integer(0L)) | ||
| by = colnamesInt(x, by, check_dups=FALSE) | ||
| if (length(order) == 1L) order = rep(order, length(by)) | ||
MichaelChirico marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| } | ||
| order = as.integer(order) # length and contents of order being +1/-1 is checked at C level | ||
| .Call(Cforder, x, by, retGrp, sort, order, na.last) # returns integer() if already sorted, regardless of sort=TRUE|FALSE | ||
| .Call(CforderReuseSorting, x, by, retGrp, retStats, sort, order, na.last, reuseSorting) # returns integer() if already sorted, regardless of sort=TRUE|FALSE | ||
| } | ||
|
|
||
| forder = function(..., na.last=TRUE, decreasing=FALSE) | ||
|
|
@@ -209,7 +194,7 @@ forder = function(..., na.last=TRUE, decreasing=FALSE) | |
| data = eval(sub, parent.frame(), parent.frame()) | ||
| } | ||
| stopifnot(isTRUEorFALSE(decreasing)) | ||
| o = forderv(data, seq_along(data), sort=TRUE, retGrp=FALSE, order= if (decreasing) -asc else asc, na.last) | ||
| o = forderv(data, seq_along(data), retGrp=FALSE, retStats=FALSE, sort=TRUE, order=if (decreasing) -asc else asc, na.last=na.last) | ||
| if (!length(o) && length(data)>=1L) o = seq_along(data[[1L]]) else o | ||
| o | ||
| } | ||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.