Commit da2437e

committed
updated version
1 parent e46f338 commit da2437e

1 file changed

+59
-18
lines changed

vignettes/datatable-joins.Rmd

Lines changed: 59 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -702,55 +702,96 @@ Products[!"popcorn",
702702

703703
The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
704704

705-
#### Let's update our `Products` table with the latest price from `ProductPriceHistory`:
705+
Let's update our `Products` table with the latest price from `ProductPriceHistory`:
706+
706707
```{r Simple_One_to_One_Update}
707708
Products[ProductPriceHistory, on = .(id = product_id), price := i.price]
708709
```
709710
- The `price` column in `Products` is updated using the `price` column from `ProductPriceHistory`.
710711
- The `on = .(id = product_id)` ensures that updates happen based on matching IDs.
711712
- This method modifies `Products` in place, avoiding unnecessary copies.
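A minimal sketch of this in-place update, using small hypothetical tables (the vignette's real `Products` and `ProductPriceHistory` are defined outside this hunk, so the values below are assumptions). `address()` is used only to check that the table is not copied:

```r
library(data.table)

# Hypothetical toy data, not the vignette's actual tables
Products <- data.table(
  id    = 1:3,
  name  = c("banana", "carrot", "popcorn"),
  price = c(0.50, 0.30, 1.20)
)
ProductPriceHistory <- data.table(
  product_id = c(1L, 2L),
  price      = c(0.60, 0.35)
)

# Remember where Products lives in memory before the update
addr_before <- address(Products)

# Update join: for each matching id, overwrite price with the value from i
Products[ProductPriceHistory, on = .(id = product_id), price := i.price]

Products$price                             # 0.60 0.35 1.20
identical(addr_before, address(Products))  # TRUE: same object, no copy
```

The product with no match in the toy history (`id = 3`) keeps its original price, because `:=` in an update join only touches the matched rows.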
712713

713-
#### If we need to get the latest price and date (instead of all matches), we can still use := efficiently:
714+
Grouped Updates with `.EACHI`
715+
716+
If we need to get the latest price and date (instead of all matches), we can use grouped updates efficiently:
717+
714718
```{r Updating_with_the_Latest_Record}
715719
Products[ProductPriceHistory,
716720
on = .(id = product_id),
717721
`:=`(price = last(i.price), last_updated = last(i.date)),
718722
by = .EACHI]
719723
```
720-
- `last(i.price)` ensures that only the latest price is selected.
721-
- `last_updated` column is added to track the last update date.
722-
- `by = .EACHI` ensures that `last(i.price)` is applied separately for each product."
724+
Grouped behavior (`by = .EACHI`):
725+
- The grouping (`by = .EACHI`) ensures that updates are performed separately for each product (`id`).
726+
- Within each group, only the last record (`last(i.price)` and `last(i.date)`) is selected for updating.
727+
- This differs from the simple one-to-one update above, which assumes at most one matching record per product.
723728

724-
#### Understanding last() vs. tail()
729+
Behavior of `last()`:
730+
- The function `last()` returns the last element of a vector or column within each group.
731+
- It does not skip `NA` values.
732+
```{r}
733+
data.table::last(c(1, NA)) # Returns NA
734+
dt <- data.table(group = c(1, 1, 2, 2), value = c(10, NA, 20, NA))
735+
dt[, .(last_value = last(value)), by = group]
736+
# group last_value
737+
# 1: 1 NA
738+
# 2: 2 NA
739+
```
740+
Difference from Simple Join:
741+
- A simple join (`on`) updates rows based on matching IDs without considering grouping or ordering.
742+
- Grouped updates allow operations like selecting the "latest" record within each group using `.EACHI`.
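As an alternative sketch (not the approach used in this diff), the "latest" record can be made explicit by reducing the history to one row per product before joining. The toy tables and the intermediate name `latest` are hypothetical, and the sketch assumes `ProductPriceHistory` has a `date` column, as the `last(i.date)` call suggests:

```r
library(data.table)

# Hypothetical toy data, not the vignette's actual tables
Products <- data.table(
  id    = 1:3,
  name  = c("banana", "carrot", "popcorn"),
  price = c(0.50, 0.30, 1.20)
)
ProductPriceHistory <- data.table(
  product_id = c(1L, 1L, 2L),
  date       = as.IDate(c("2024-01-01", "2024-02-01", "2024-01-15")),
  price      = c(0.55, 0.60, 0.35)
)

# Step 1: sort by date, then keep the last price and date per product
latest <- ProductPriceHistory[order(date),
                              .(price = last(price), date = last(date)),
                              by = product_id]

# Step 2: plain one-to-one update join against the reduced table
Products[latest, on = .(id = product_id),
         `:=`(price = i.price, last_updated = i.date)]

Products$price  # 0.60 0.35 1.20
```

Sorting explicitly means the result does not depend on the row order of the history table, and the product with no history (`id = 3`) gets `NA` in `last_updated`.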
743+
744+
**Right Join**
745+
To update the right table by reference without copying (similar to SQL right join workflows), use `.SD` and `.SDcols`. This approach avoids modifying the left table directly while dynamically selecting columns.
725746

726-
- The key difference between `last()` and `tail()` is:
727-
- `last(x):` Returns the last element of x, including NA if it's the last element.
728-
- `tail(x, 1):` Also returns the last element but works more consistently with different object types.
729-
- For lists, `last(list)` returns the last element, while `tail(list, 1)` returns a list of length 1 containing the last element.
747+
```{r}
748+
# Get all columns from Products except the ID column
749+
product_cols <- setdiff(names(Products), "id")
750+
751+
# Update ProductPriceHistory with product details from Products
752+
ProductPriceHistory[, (product_cols) := Products[.SD, on = .(id = product_id), .SD, .SDcols = product_cols]]
753+
```
754+
- The dynamic selection of columns (`.SDcols`) ensures flexibility when column names are not known upfront.
755+
- The right table (`ProductPriceHistory`) is updated in place using columns from the left table (`Products`) without creating unnecessary copies.
756+
- This method is memory-efficient and avoids modifying the left table directly.
757+
758+
Understanding last() vs. tail()
759+
760+
last(x):
761+
- Returns the last element of a `vector`, `list`, or `data.table` column directly.
762+
- Dispatches to `xts::last()` if xts is loaded and the object inherits from xts.
763+
- Includes `NA` if it is the last element.
764+
- Optimized for use within `data.table` operations.
765+
766+
tail(x, 1):
767+
- Returns the last element of a `vector` or `data.table` column.
768+
- For lists, it returns a `list` containing the last element instead of the element directly.
769+
- Handles negative values (n) correctly to exclude elements from the end.
730770

731771
```{r Example_Behavior}
732772
# Test 1: Simple vector with NA at the end
733773
x <- c(1, 2, 3, NA)
734774
last(x) # Returns NA
735775
tail(x, 1) # Returns NA
736776
737-
# Test 2: data.table grouping behavior
777+
# Test 2: Grouping behavior in data.table
738778
dt <- data.table(group = c(1,1,2,2), value = c(10, NA, 20, NA))
739-
dt[, .(last_value = last(value)), by = group] # last() does not skip NA
740-
dt[, .(tail_value = tail(value, 1)), by = group] # tail() behaves similarly
779+
dt[, .(last_value = last(value)), by = group] # Returns NA
780+
dt[, .(tail_value = tail(value, 1)), by = group] # Returns NA
741781
742782
# Test 3: Working with lists
743783
l <- list(a = 1, b = 2, c = 3)
744784
last(l) # Returns 3
745-
tail(l, 1) # Returns a list of length 1
785+
tail(l, 1) # Returns a list containing the last element (`list(c = 3)`)
746786
747787
# Test 4: Empty vector behavior
748788
z <- numeric(0)
749-
length(last(z)) # Returns 0
750-
length(tail(z, 1)) # Returns 0
789+
length(last(z)) # Returns 0
790+
length(tail(z, 1)) # Returns 0
751791
```
752792

753-
#### When we need to update Products with multiple columns from ProductPriceHistory
793+
When we need to update `Products` with multiple columns from `ProductPriceHistory`
794+
754795
```{r Efficient_Right_Join_Update }
755796
cols <- setdiff(names(ProductPriceHistory), 'product_id')
756797
Products[ProductPriceHistory,
@@ -759,7 +800,7 @@ Products[ProductPriceHistory,
759800
```
760801
- Efficiently updates multiple columns in `Products` from `ProductPriceHistory`.
761802
- `mget(cols)` retrieves multiple matching columns dynamically.
762-
- This method is faster and more memory-efficient than Products <- `ProductPriceHistory[Products, on=...]`.
803+
- This method avoids creating a copy of the data, making it more memory-efficient for large datasets.
763804
- Note: `:=` updates `Products` in place, but does not modify `ProductPriceHistory`.
764805
- Unlike traditional RIGHT JOIN, `data.table` does not allow i (right table) to be updated directly.
765806