delete rows by reference #7536
base: master
Changes from all commits
9a278cd
6d2182a
b9e76eb
703456a
6f4867c
67d6505
4888272
8d5e869
4936003
e026e8d
1309d52
57a4bdc
c95d199
5ec8ff6
dd58303
53ce564
104691a
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,10 +9,12 @@ | |
| \alias{.EACHI} | ||
| \alias{.NGRP} | ||
| \alias{.NATURAL} | ||
| \alias{.ROW} | ||
| \title{ Special symbols } | ||
| \description{ | ||
| \code{.SD}, \code{.BY}, \code{.N}, \code{.I}, \code{.GRP}, and \code{.NGRP} are \emph{read-only} symbols for use in \code{j}. \code{.N} can be used in \code{i} as well. \code{.I} can be used in \code{by} as well. See the vignettes, Details and Examples here and in \code{\link{data.table}}. | ||
| \code{.EACHI} is a symbol passed to \code{by}; i.e. \code{by=.EACHI}. \code{.NATURAL} is a symbol passed to \code{on}; i.e. \code{on=.NATURAL}. | ||
| \code{.ROW} is a symbol used with \code{:= NULL} to delete rows by reference; i.e. \code{DT[i, .ROW := NULL]} deletes the rows selected by \code{i}. | ||
|
Member
maybe just point to |
||
| } | ||
| \details{ | ||
| The bindings of these variables are locked and attempting to assign to them will generate an error. If you wish to manipulate \code{.SD} before returning it, take a \code{\link{copy}(.SD)} first (see FAQ 4.5). Using \code{:=} in the \code{j} of \code{.SD} is reserved for future use as a (tortuously) flexible way to update \code{DT} by reference by group (even when groups are not contiguous in an ad hoc by). | ||
|
|
@@ -32,6 +34,8 @@ | |
|
|
||
| \code{.NATURAL} is defined as \code{NULL} but its value is not used. Its usage is \code{on=.NATURAL} (alternative of \code{X[on=Y]}) which joins two tables on their common column names, performing a natural join; see \code{\link{data.table}}'s \code{on} argument for more details. | ||
|
|
||
| \code{.ROW} is a symbol that can only be used with \code{:= NULL} to delete rows by reference. When you use \code{DT[i, .ROW := NULL]}, the rows matching the \code{i} expression are removed from \code{DT} in-place. This is an efficient way to delete rows without copying the entire data.table. The \code{i} argument is required and \code{by}/\code{keyby} are not supported. After deletion, any keys and indices on \code{DT} are cleared. See \code{\link{:=}} for more on reference semantics. | ||
|
|
||
| Note that \code{.N} in \code{i} is computed up-front, while that in \code{j} applies \emph{after filtering in \code{i}}. That means that even absent grouping, \code{.N} in \code{i} can be different from \code{.N} in \code{j}. See Examples. | ||
|
|
||
| Note also that you should consider these symbols read-only and of limited scope -- internal data.table code might manipulate them in unexpected ways, and as such their bindings are locked. There are subtle ways to wind up with the wrong object, especially when attempting to copy their values outside a grouping context. See examples; when in doubt, \code{copy()} is your friend. | ||
|
|
@@ -72,5 +76,12 @@ DT[, .(min(.SD[,-1])), by=.I] | |
| # Do not expect this to correctly append the value of .BY in each group; copy(.BY) will work. | ||
| by_tracker = list() | ||
| DT[, { append(by_tracker, .BY); sum(v) }, by=x] | ||
|
|
||
| # .ROW to delete rows by reference | ||
| DT = data.table(a=1:5, b=letters[1:5]) | ||
| DT[c(2,4), .ROW := NULL] | ||
| DT | ||
| DT[a>2, .ROW := NULL] | ||
| DT | ||
| } | ||
| \keyword{ data } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,6 +2,7 @@ | |
| \alias{truelength} | ||
| \alias{setalloccol} | ||
| \alias{alloc.col} | ||
| \alias{setallocrow} | ||
| \title{ Over-allocation access } | ||
| \description{ | ||
| These functions are experimental and somewhat advanced. By \emph{experimental} we mean their names might change and perhaps the syntax, argument names and types. So if you write a lot of code using them, you have been warned! They should work and be stable, though, so please report problems with them. \code{alloc.col} is just an alias to \code{setalloccol}. We recommend to use \code{setalloccol} (though \code{alloc.col} will continue to be supported) because the \code{set*} prefix in \code{setalloccol} makes it clear that its input argument is modified in-place. | ||
|
|
@@ -14,11 +15,14 @@ setalloccol(DT, | |
| alloc.col(DT, | ||
| n = getOption("datatable.alloccol"), # default: 1024L | ||
| verbose = getOption("datatable.verbose")) # default: FALSE | ||
| setallocrow(DT, n = 0L) | ||
| } | ||
| \arguments{ | ||
| \item{x}{ Any type of vector, including \code{data.table} which is a \code{list} vector of column pointers. } | ||
| \item{DT}{ A \code{data.table}. } | ||
| \item{n}{ The number of spare column pointer slots to ensure are available. If \code{DT} is a 1,000 column \code{data.table} with 24 spare slots remaining, \code{n=1024L} means grow the 24 spare slots to be 1024. \code{truelength(DT)} will then be 2024 in this example. } | ||
| \item{n}{ For \code{setalloccol} and \code{alloc.col}: the number of spare column pointer slots to ensure are available. If \code{DT} is a 1,000 column \code{data.table} with 24 spare slots remaining, \code{n=1024L} means grow the 24 spare slots to be 1024. \code{truelength(DT)} will then be 2024 in this example. | ||
|
|
||
| For \code{setallocrow}: the number of rows to over-allocate. If \code{n > 0}, allocates capacity for current rows plus \code{n} additional rows. If \code{n == 0} (default), shrinks columns to exact current size to free excess memory. } | ||
| \item{verbose}{ Output status and information. } | ||
| } | ||
| \details{ | ||
|
|
@@ -34,6 +38,12 @@ alloc.col(DT, | |
| (perhaps in your .Rprofile); e.g., \code{options(datatable.alloccol=10000L)}. | ||
|
|
||
| Please note: over-allocation of the column pointer vector is not for efficiency \emph{per se}; it is so that \code{:=} can add columns by reference without a shallow copy. | ||
|
|
||
| \code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert) by reference and manages row capacity. (Note that 'insert' by reference is not yet implemented.) | ||
| Before deleting or inserting rows by reference, columns must be resizable. | ||
| \code{setallocrow} ensures all columns are in the appropriate state by converting ALTREP columns to materialized form and reallocating | ||
|
Member
I think this needs a bit of clarification -- when I read it, my thought was "oh, we can't do OTOH, I think (?) we strive to expand all
Member
Author
We don't really need Once we introduce fast insert, using it to allocate more seems the more convincing option. |
||
| columns to have the target capacity. When \code{n > 0}, columns are over-allocated with extra capacity for future row additions. | ||
| When \code{n == 0}, columns are shrunk to exact size to free unused memory. This operation modifies \code{DT} by reference. | ||
| } | ||
| \value{ | ||
| \code{truelength(x)} returns the length of the vector allocated in memory. \code{length(x)} of those items are in use. Currently, it is just the list vector of column | ||
|
|
@@ -43,6 +53,8 @@ alloc.col(DT, | |
|
|
||
| \code{setalloccol} \emph{reallocates} \code{DT} by reference. This may be useful for efficiency if you know you are about to add a lot of columns in a loop. | ||
| It also returns the new \code{DT}, for convenience in compound queries. | ||
|
|
||
| \code{setallocrow} modifies \code{DT} by reference to ensure all columns are resizable. | ||
| } | ||
| \seealso{ \code{\link{copy}} } | ||
| \examples{ | ||
|
|
||
aside for future consideration -- another use case would be
DT[<optional i>, by=<grp>, having=<condition>, .ROW := NULL]
also, be sure to point to ?setallocrow as the functional equivalent
Oh, I misunderstood