Commit f5a1e09

Merge pull request #6184 from Rdatatable/subset
Enhance Documentation Clarity on Subsetting Behavior with Non-Existing Elements
2 parents e7b7e47 + e89d0ad commit f5a1e09

1 file changed: +44 -1 lines changed

vignettes/datatable-intro.Rmd

Lines changed: 44 additions & 1 deletion
@@ -250,6 +250,49 @@ The function `length()` requires an input argument. We just need to compute the

This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it.

### g) Handle non-existing elements in `i`

#### -- What happens when querying for non-existing elements?

When querying a `data.table` for elements that do not exist, the behavior differs based on the method used.

```r
setkeyv(flights, "origin")
```

* **Key-based subsetting: `dt["d"]`**

This performs a right join on the key column `x`, returning a single row that holds `d` in `x` and `NA` in every other column. `setkeyv` sorts the table by the specified key columns and marks it as sorted, which lets `data.table` use binary search for efficient subsetting.

```r
flights["XYZ"]
# Returns:
# origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum ...
# 1: XYZ NA NA NA NA NA NA NA NA NA NA NA NA ...
```

* **Logical subsetting: `dt[x == "d"]`**

This performs a standard subset operation that does not find any matching rows and thus returns an empty `data.table`.

```r
flights[origin == "XYZ"]
# Returns:
# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
```

* **Exact match using `nomatch=NULL`**

To keep only the rows that actually match, with no `NA` row for non-existing elements, use `nomatch=NULL`:

```r
flights["XYZ", nomatch=NULL]
# Returns:
# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
```

Understanding these behaviors can help prevent confusion when dealing with non-existing elements in your data.
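The three behaviors described above can be reproduced on a small table. The following is a minimal sketch, not the vignette's own example: `dt` and the result names (`by_key`, `by_filter`, `by_key_strict`) are hypothetical, and it assumes a `data.table` version that accepts `nomatch=NULL`.

```r
library(data.table)

# Hypothetical toy table, used only for illustration (not the flights data)
dt <- data.table(x = c("c", "a", "b"), v = 1:3)
setkeyv(dt, "x")   # sorts dt by x and records "x" as the key

key(dt)            # "x"
dt$x               # "a" "b" "c": the rows are now physically sorted

by_key        <- dt["d"]                  # join on the key: one row, v is NA
by_filter     <- dt[x == "d"]             # logical subset: zero rows
by_key_strict <- dt["d", nomatch = NULL]  # join, unmatched rows dropped: zero rows

nrow(by_key)         # 1
is.na(by_key$v)      # TRUE
nrow(by_filter)      # 0
nrow(by_key_strict)  # 0
```

Checking `nrow()` on the result is a simple way to tell a genuine match from the `NA` row that a keyed lookup returns by default.
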
#### Special symbol `.N`: {#special-N}
`.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
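
As a rough illustration of how `.N` behaves with and without grouping, here is a small sketch on a hypothetical table `dt` (the column names `g` and `v` are made up for this example and are not part of the flights data).

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(g = c("a", "a", "b"), v = 1:3)

n_all    <- dt[, .N]          # no grouping: number of rows in dt
n_subset <- dt[v > 1L, .N]    # number of rows left after subsetting in i
n_by_g   <- dt[, .N, by = g]  # one count per group, in a column named N

n_all      # 3
n_subset   # 2
n_by_g$N   # 2 1
```
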
@@ -269,7 +312,7 @@ ans
We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.
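
The equivalence of the two forms can be checked on a small table. This is only a sketch: the four-row `dt` below is a hypothetical stand-in for `flights`, and it shows that the results agree, not how much faster `.N` is.

```r
library(data.table)

# Hypothetical stand-in for the flights data
dt <- data.table(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  month  = c(6L, 7L, 6L, 6L)
)

# Builds the full subset first, then counts its rows
via_nrow <- nrow(dt[origin == "JFK" & month == 6L])

# Counts the matching rows without returning the subset
via_N <- dt[origin == "JFK" & month == 6L, .N]

via_N                        # 2
identical(via_nrow, via_N)   # TRUE
```
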

-### g) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
+### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}

If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).
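
A brief sketch of what this looks like in practice, assuming `data.table` v1.9.8 or later; the table `dt` and its columns are hypothetical and exist only for this comparison.

```r
library(data.table)

# Hypothetical toy table and its data.frame counterpart
dt <- data.table(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))
df <- as.data.frame(dt)

# Column names written out explicitly: the same syntax works for both
dt_cols <- dt[, c("a", "b")]
df_cols <- df[, c("a", "b")]

names(dt_cols)                              # "a" "b"
identical(names(dt_cols), names(df_cols))   # TRUE
```
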