vignettes/datatable-intro.Rmd (4 additions, 4 deletions)
@@ -101,7 +101,7 @@ You can also convert existing objects to a `data.table` using `setDT()` (for `da
 getOption("datatable.print.nrows")
 ```

-* `data.table` doesn't set or use *row names*, ever. We will see why in the *"Keys and fast binary search based subset"* vignette.
+* `data.table` doesn't set or use *row names*, ever. We will see why in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette.

 ### b) General form - in what way is a `data.table` *enhanced*? {#enhanced-1b}
@@ -479,7 +479,7 @@ ans
 **Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`.

-We'll learn more about `keys` in the `vignette("datatable-keys-fast-subset", package="data.table")`; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.
+We'll learn more about `keys` in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.

 ### c) Chaining
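For readers checking the `keyby` claim in this hunk, here is a minimal sketch on a made-up table (the names `DT`, `g` and `x` are illustrative and not taken from the vignette). It compares `by` with `keyby` and inspects the `sorted` attribute the text mentions.

```r
library(data.table)

# Toy data, assumed for illustration only
DT <- data.table(g = c("b", "a", "b", "a"), x = 1:4)

ans_by    <- DT[, .(total = sum(x)), by = g]     # groups in order of first appearance
ans_keyby <- DT[, .(total = sum(x)), keyby = g]  # groups sorted, key set on `g`

key(ans_by)                # no key is set by plain `by`
key(ans_keyby)             # the grouping column becomes the key
attr(ans_keyby, "sorted")  # the attribute that stores the key
```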
@@ -659,7 +659,7 @@ We have seen so far that,
 * We can also sort a `data.table` using `order()`, which internally uses data.table's fast order for better performance.

-We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the `vignette("datatable-keys-fast-subset", package="data.table")` and the `vignette("datatable-joins", package="data.table")`.
+We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the vignettes [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) and [`vignette("datatable-joins", package="data.table")`](datatable-joins.html).

 #### Using `j`:
@@ -693,7 +693,7 @@ We can do much more in `i` by keying a `data.table`, which allows for blazing fa
 As long as `j` returns a `list`, each element of the list will become a column in the resulting `data.table`.

-We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the next vignette (`vignette("datatable-reference-semantics", package="data.table")`).
+We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the [next vignette (`vignette("datatable-reference-semantics", package="data.table")`)](datatable-reference-semantics.html).
vignettes/datatable-keys-fast-subset.Rmd (7 additions, 7 deletions)
@@ -24,13 +24,13 @@ knitr::opts_chunk$set(
 .old.th = setDTthreads(1)
 ```

-This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` and the `vignette("datatable-reference-semantics", package="data.table")` first.
+This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the vignettes [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) and [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) first.

 ***

 ## Data {#data}

-We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
+We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

 ```{r echo = FALSE}
 options(width = 100L)
@@ -58,7 +58,7 @@ In this vignette, we will
 ### a) What is a *key*?

-In the `vignette("datatable-intro", package="data.table")`, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.
+In the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.

 But first, let's start by looking at *data.frames*. All *data.frames* have a row names attribute. Consider the *data.frame* `DF` below.
@@ -143,7 +143,7 @@ head(flights)
 * Alternatively you can pass a character vector of column names to the function `setkeyv()`. This is particularly useful while designing functions to pass columns to set key on as function arguments.

-* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the `vignette("datatable-reference-semantics", package="data.table")`, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.
+* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) vignette, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.

 * The *data.table* is now reordered (or sorted) by the column we provided - `origin`. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the *data.table*, and is therefore very memory efficient.
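The by-reference behaviour of `setkey()` and `setkeyv()` described in this hunk can be tried on a small stand-in table. This is only a sketch: `toy` and its values are invented, and the real vignette uses the `flights` data instead.

```r
library(data.table)

# Invented stand-in for the flights data
toy <- data.table(
  origin = c("LGA", "JFK", "EWR", "JFK"),
  dest   = c("TPA", "LAX", "TPA", "MIA")
)

setkey(toy, origin)              # no assignment needed: `toy` is reordered in place
key_after_setkey <- key(toy)

key_cols <- c("origin", "dest")  # column names held in a character vector
setkeyv(toy, key_cols)           # handy when the names arrive as function arguments
key_after_setkeyv <- key(toy)
```

After the first call `toy` is sorted by `origin`, and after the second by `origin` then `dest`, without `toy` ever being reassigned.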
 * The *row indices* corresponding to `origin == "LGA"` and `dest == "TPA"` are obtained using *key based subset*.

-* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in `vignette("datatable-intro", package="data.table")`.
+* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

 * We could have returned the result by using `with = FALSE` as well.
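A small sketch of the key based subset combined with `j` that these bullets describe, again on invented data (`flights_toy` is hypothetical, not the vignette's `flights`). It shows the vector form, the `.()` form, and the `with = FALSE` form.

```r
library(data.table)

# Hypothetical miniature of the flights data
flights_toy <- data.table(
  origin    = c("LGA", "LGA", "JFK", "EWR"),
  dest      = c("TPA", "TPA", "LAX", "TPA"),
  arr_delay = c(1L, -5L, 10L, 3L)
)
setkey(flights_toy, origin, dest)

# Rows are located through the key, then `j` picks the column
as_vector  <- flights_toy[.("LGA", "TPA"), arr_delay]
as_table   <- flights_toy[.("LGA", "TPA"), .(arr_delay)]
same_table <- flights_toy[.("LGA", "TPA"), "arr_delay", with = FALSE]
```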
 ### d) *sub-assign* by reference using `:=` in `j`

-We have seen this example already in the `vignette("datatable-reference-semantics", package="data.table")`. Let's take a look at all the `hours` available in the `flights` *data.table*:
+We have seen this example already in the [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) vignette. Let's take a look at all the `hours` available in the `flights` *data.table*:

 ```{r}
 # get all 'hours' in flights
@@ -498,7 +498,7 @@ In this vignette, we have learnt another method to subset rows in `i` by keying
 * combine key based subsets with `j` and `by`. Note that the `j` and `by` operations are exactly the same as before.

-Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not be always desirable to set key and physically reorder the *data.table*. In the next `vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")`, we will address this using a *new* feature -- *secondary indexes*.
+Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not be always desirable to set key and physically reorder the *data.table*. In the [next vignette (`vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")`)](datatable-secondary-indices-and-auto-indexing.html), we will address this using a *new* feature -- *secondary indexes*.
vignettes/datatable-reference-semantics.Rmd (7 additions, 7 deletions)
@@ -23,13 +23,13 @@ knitr::opts_chunk$set(
 collapse = TRUE)
 .old.th = setDTthreads(1)
 ```
-This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` first.
+This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette first.

 ***

 ## Data {#data}

-We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
+We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

 ```{r echo = FALSE}
 options(width = 100L)
@@ -169,7 +169,7 @@ We see that there are totally `25` unique values in the data. Both *0* and *24*
 flights[hour == 24L, hour := 0L]
 ```

-* We can use `i` along with `:=` in `j` the very same way as we have already seen in the `vignette("datatable-intro", package="data.table")`.
+* We can use `i` along with `:=` in `j` the very same way as we have already seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

 * Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`.
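The sub-assignment in this hunk can be reproduced on a tiny table. The sketch below uses an invented `hours_toy` table rather than `flights`, and only shows that rows matching `i` are updated in place.

```r
library(data.table)

# Invented data: two rows carry the value 24 that should become 0
hours_toy <- data.table(hour = c(0L, 12L, 24L, 24L), id = 1:4)

# `i` selects the rows, `:=` in `j` overwrites `hour` for those rows only
hours_toy[hour == 24L, hour := 0L]

updated_hours <- hours_toy$hour
```

Rows whose `hour` was not 24 are left untouched, and no assignment back to `hours_toy` is needed.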
@@ -234,7 +234,7 @@ head(flights)
 * We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*.

-* We could have also provided `by` with a *character vector* as we saw in the `vignette("datatable-intro", package="data.table")`, e.g., `by = c("origin", "dest")`.
+* We could have also provided `by` with a *character vector* as we saw in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette, e.g., `by = c("origin", "dest")`.

 #
@@ -253,7 +253,7 @@ head(flights)
 * Note that since we allow assignment by reference without quoting column names when there is only one column as explained in [Section 2c](#delete-convenience), we can not do `out_cols := lapply(.SD, max)`. That would result in adding one new column named `out_cols`. Instead we should do either `c(out_cols)` or simply `(out_cols)`. Wrapping the variable name with `(` is enough to differentiate between the two cases.

-* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the `vignette("datatable-intro", package="data.table")`. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.
+* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.

 #
 Before moving on to the next section, let's clean up the newly created columns `speed`, `max_speed`, `max_dep_delay` and `max_arr_delay`.
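As a rough illustration of the `LHS := RHS` form discussed in this hunk, the following sketch adds two per-group maxima and then removes them again. The table `delays_toy` and its values are made up; only the column names `dep_delay`, `arr_delay`, `max_dep_delay` and `max_arr_delay` follow the vignette text.

```r
library(data.table)

# Made-up data with two groups
delays_toy <- data.table(
  month     = c(1L, 1L, 2L, 2L),
  dep_delay = c(3L, 10L, -2L, 7L),
  arr_delay = c(1L, 20L, 5L, 4L)
)

in_cols  <- c("dep_delay", "arr_delay")
out_cols <- c("max_dep_delay", "max_arr_delay")

# (out_cols) is wrapped in parentheses so it is read as a vector of names,
# not as a single new column literally called `out_cols`
delays_toy[, (out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]

max_dep <- delays_toy$max_dep_delay  # keep the values for inspection
max_arr <- delays_toy$max_arr_delay

# Clean up the helper columns by reference
delays_toy[, (out_cols) := NULL]
```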
@@ -369,7 +369,7 @@ However we could improve this functionality further by *shallow* copying instead
 * It is used to *add/update/delete* columns by reference.

-* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the `vignette("datatable-intro", package="data.table")`. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.
+* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.

 * We can use `:=` for its side effect or use `copy()` to not modify the original object while updating by reference.
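The last bullet's point about `copy()` can be checked with a short sketch. The object names below (`base_dt`, `safe_copy`, `alias_dt`) are invented for illustration.

```r
library(data.table)

base_dt <- data.table(x = 1:3)

# copy() gives an independent object, so := leaves `base_dt` alone
safe_copy <- copy(base_dt)
safe_copy[, y := x * 2L]
names_after_copy <- names(base_dt)

# Plain assignment only creates a second name for the same object,
# so := through the alias also changes `base_dt`
alias_dt <- base_dt
alias_dt[, z := 0L]
names_after_alias <- names(base_dt)
```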
@@ -379,6 +379,6 @@ setDTthreads(.old.th)
 #

-So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the next vignette `vignette("datatable-keys-fast-subset", package="data.table")` to perform *blazing fast subsets* by *keying data.tables*.
+So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the [next vignette (`vignette("datatable-keys-fast-subset", package="data.table")`)](datatable-keys-fast-subset.html) to perform *blazing fast subsets* by *keying data.tables*.
-1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See `vignette("datatable-reference-semantics", package="data.table")` for more.
+1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) for more.
 2. The LHS, `names(.SD)`, indicates which columns we are updating - in this case we update the entire `.SD`.
 3. The RHS, `lapply()`, loops through each column of the `.SD` and converts the column to a factor.
 4. We use the `.SDcols` to only select columns that have pattern of `teamID`.
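The four numbered points above describe a single call, which might look like the sketch below. The `teams_toy` table is invented, and the `names(.SD) :=` form is assumed to be available in the installed data.table release (older releases may need the column names spelled out on the left-hand side).

```r
library(data.table)

# Invented table: two columns match the pattern "teamID", one does not
teams_toy <- data.table(
  teamID   = c("BOS", "NYA"),
  teamIDBR = c("BOS", "NYY"),
  wins     = c(93L, 95L)
)

# LHS: every column in .SD; RHS: each of those columns converted to a factor;
# .SDcols restricts .SD to columns whose names match "teamID"
teams_toy[, names(.SD) := lapply(.SD, factor), .SDcols = patterns("teamID")]

col_classes <- sapply(teams_toy, class)
```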
vignettes/datatable-secondary-indices-and-auto-indexing.Rmd (6 additions, 6 deletions)
@@ -24,13 +24,13 @@ knitr::opts_chunk$set(
 .old.th = setDTthreads(1)
 ```

-This vignette assumes that the reader is familiar with data.table's `[i, j, by]` syntax, and how to perform fast key based subsets. If you're not familiar with these concepts, please read the *"Introduction to data.table"*, *"Reference semantics"* and *"Keys and fast binary search based subset"* vignettes first.
+This vignette assumes that the reader is familiar with data.table's `[i, j, by]` syntax, and how to perform fast key based subsets. If you're not familiar with these concepts, please read the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html), [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html), and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignettes first.

 ***

 ## Data {#data}

-We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
+We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

 ```{r echo = FALSE}
 options(width = 100L)
@@ -193,7 +193,7 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
 ### b) Select in `j`

-All the operations we will discuss below are no different to the ones we already saw in the `vignette("datatable-keys-fast-subset", package="data.table")`. Except we'll be using the `on` argument instead of setting keys.
+All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.

 #### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
 ### e) *sub-assign* by reference using `:=` in `j`

-We have seen this example already in the `vignette("datatable-reference-semantics", package="data.table")` and the `vignette("datatable-keys-fast-subset", package="data.table")`. Let's take a look at all the `hours` available in the `flights` *data.table*:
+We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:

 ```{r}
 # get all 'hours' in flights
@@ -253,7 +253,7 @@ head(ans)
 ### g) The *mult* argument

-The other arguments including `mult` work exactly the same way as we saw in the `vignette("datatable-keys-fast-subset", package="data.table")`. The default value for `mult` is "all". We can choose, instead only the "first" or "last" matching rows should be returned.
+The other arguments including `mult` work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can choose, instead only the "first" or "last" matching rows should be returned.

 #### -- Subset only the first matching row where `dest` matches *"BOS"* and *"DAY"*
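A sketch of the `mult` argument used together with `on`, on an invented table (`dest_toy` and its `flight` column are hypothetical). No key is set here; the `on` argument names the column to match on.

```r
library(data.table)

# Invented data: "BOS" and "DAY" each appear twice
dest_toy <- data.table(
  dest   = c("BOS", "DAY", "BOS", "DAY", "LAX"),
  flight = 1:5
)

all_matches <- dest_toy[c("BOS", "DAY"), on = "dest"]                  # default mult = "all"
first_match <- dest_toy[c("BOS", "DAY"), on = "dest", mult = "first"]  # first row per value
last_match  <- dest_toy[c("BOS", "DAY"), on = "dest", mult = "last"]   # last row per value
```

Because `on` is used instead of `setkey()`, `dest_toy` keeps its original row order and has no key afterwards.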
 In recent version we extended auto indexing to expressions involving more than one column (combined with `&` operator). In the future, we plan to extend binary search to work with more binary operators like `<`, `<=`, `>` and `>=`.

-We will discuss fast *subsets* using keys and secondary indices to *joins* in the next vignette, `vignette("datatable-joins", package="data.table")`.
+We will discuss fast *subsets* using keys and secondary indices to *joins* in the [next vignette (`vignette("datatable-joins", package="data.table")`)](datatable-joins.html).