Commit a6e4be1

committed
reduced the size
1 parent 8ff9957 commit a6e4be1

File tree

1 file changed

+24
-86
lines changed

vignettes/datatable-joins.Rmd

Lines changed: 24 additions & 86 deletions
@@ -700,109 +700,47 @@ Products[!"popcorn",
700700

701701
### 7.2. Updating by reference
702702

703-
The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
703+
Use `:=` to modify columns **by reference** (no copy) during joins. General syntax: `x[i, on=, (cols) := val]`.
704704

705-
Let's update our `Products` table with the latest price from `ProductPriceHistory`:
705+
- Simple One-to-One Update
706+
Update `Products` with prices from `ProductPriceHistory`:
706707

707-
```{r Simple_One_to_One_Update}
708-
Products[ProductPriceHistory, on = .(id = product_id), price := i.price]
708+
```{r}
709+
Products[ProductPriceHistory,
710+
on = .(id = product_id),
711+
price := i.price]
709712
```
710-
- The `price` column in `Products` is updated using the `price` column from `ProductPriceHistory`.
711-
- The `on = .(id = product_id)` ensures that updates happen based on matching IDs.
712-
- This method modifies `Products` in place, avoiding unnecessary copies.
713-
714-
Grouped Updates with `.EACHI`
715-
716-
If we need to get the latest price and date (instead of all matches), we can use grouped updates efficiently:
713+
- `i.price` refers to the `price` column of `i` (`ProductPriceHistory`).
714+
- Modifies `Products` in place.
717715

716+
- Grouped Updates with `.EACHI`
717+
Get last price/date for each product:
718718
```{r Updating_with_the_Latest_Record}
719719
Products[ProductPriceHistory,
720720
on = .(id = product_id),
721721
`:=`(price = last(i.price), last_updated = last(i.date)),
722722
by = .EACHI]
723723
```
724-
Grouped Behavior `(by = .EACHI)`:
725-
- The grouping `(by = .EACHI)` ensures that updates are performed separately for each product (id).
726-
- Within each group, only the last record (`last(i.price)` and `last(i.date)`) is selected for updating.
727-
- This is different from a simple one-to-one match, where only the first matching record is used.
724+
- `by = .EACHI` evaluates `j` once for each row of `i` (here `ProductPriceHistory`), not once per `Products` row.
725+
- `last()` returns the last value, even if that value is `NA`:
728726

729-
Behavior of `last()`:
730-
- The function `last()` returns the last element of a vector or column within each group.
731-
- It does not skip `NA` values.
732727
```{r}
733-
data.table::last(c(1, NA)) # Returns NA
734-
dt <- data.table(group = c(1, 1, 2, 2), value = c(10, NA, 20, NA))
735-
dt[, .(last_value = last(value)), by = group]
736-
# group last_value
737-
# 1: 1 NA
738-
# 2: 2 NA
728+
data.table::last(c(1, NA)) # NA
739729
```
740-
Difference from Simple Join:
741-
- A simple join `(on)` updates rows based on matching IDs without considering grouping or ordering.
742-
- Grouped updates allow operations like selecting the "latest" record within each group using `.EACHI`.
743-
744-
**Right Join**
745-
To update the right table by reference without copying (similar to SQL right join workflows), use `.SD` and `.SDcols`. This approach avoids modifying the left table directly while dynamically selecting columns.
730+
- Efficient Right Join Update
731+
Add product details to `ProductPriceHistory` without copying:
746732

747733
```{r}
748-
# Get all columns from Products except the ID column
749-
product_cols <- setdiff(names(Products), "id")
750-
751-
# Update ProductPriceHistory with product details from Products
752-
ProductPriceHistory[, (product_cols) := Products[.SD, on = .(id = product_id), .SD, .SDcols = product_cols]]
734+
cols <- setdiff(names(Products), "id")
735+
ProductPriceHistory[, (cols) :=
736+
Products[.SD, on = .(id = product_id), .SD, .SDcols = cols]]
753737
```
754-
- The dynamic selection of columns `(.SDcols)` ensures flexibility when column names are not known upfront.
755-
- The right table `(ProductPriceHistory)` is updated in place using columns from the left table `(Products)` without creating unnecessary copies.
756-
- This method is memory-efficient and avoids modifying the left table directly.
757-
758-
Understanding last() vs. tail()
759-
760-
last(x):
761-
- Returns the last element of a `vector`, `list`, or `data.table` column directly.
762-
- Dispatches to `xts::last()` if xts is loaded and the object inherits from xts.
763-
- Includes `NA` if it is the last element.
764-
- Optimized for use within `data.table` operations.
765-
766-
tail(x, 1):
767-
- Returns the last element of a `vector` or `data.table` column.
768-
- For lists, it returns a `list` containing the last element instead of the element directly.
769-
- Handles negative values (n) correctly to exclude elements from the end.
770-
771-
```{r Example_Behavior}
772-
# Test 1: Simple vector with NA at the end
773-
x <- c(1, 2, 3, NA)
774-
last(x) # Returns NA
775-
tail(x, 1) # Returns NA
738+
- `.SD` refers to `ProductPriceHistory` during the join.
739+
- Updates `ProductPriceHistory` by reference.
776740

777-
# Test 2: Grouping behavior in data.table
778-
dt <- data.table(group = c(1,1,2,2), value = c(10, NA, 20, NA))
779-
dt[, .(last_value = last(value)), by = group] # Returns NA
780-
dt[, .(tail_value = tail(value, 1)), by = group] # Returns NA
781-
782-
# Test 3: Working with lists
783-
l <- list(a = 1, b = 2, c = 3)
784-
last(l) # Returns 3
785-
tail(l, 1) # Returns a list containing the last element (`list(c = 3)`)
786-
787-
# Test 4: Empty vector behavior
788-
z <- numeric(0)
789-
length(last(z)) # Returns length of 0
790-
length(tail(z, 1)) # Returns length of 0
791-
```
792-
793-
When we need to update `Products` with multiple columns from `ProductPriceHistory`:
794-
795-
```{r Efficient_Right_Join_Update }
796-
cols <- setdiff(names(ProductPriceHistory), 'product_id')
797-
Products[ProductPriceHistory,
798-
on = .(id = product_id),
799-
(cols) := mget(paste0("i.", cols))]
800-
```
801-
- Efficiently updates multiple columns in `Products` from `ProductPriceHistory`.
802-
- `mget(paste0("i.", cols))` retrieves the matching `i.`-prefixed columns dynamically.
803-
- This method avoids creating a copy of the data, making it more memory-efficient for large datasets.
804-
- Note: `:=` updates `Products` in place, but does not modify `ProductPriceHistory`.
805-
- Unlike traditional RIGHT JOIN, `data.table` does not allow i (right table) to be updated directly.
741+
- `last(x)` vs `tail(x, 1)`: both return the last element, but for a list `tail()` returns a one-element list rather than the element itself.
742+
- `:=` always modifies `x`, never `i`. For right joins, update `i` directly via `i[, ... := x[.SD]]`.
743+
- .EACHI is crucial for per-row operations; simple joins use first match.
806744

807745
***
808746
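
To try the update-join behavior described in this diff outside the vignette, here is a minimal sketch. The two toy tables are invented for illustration: only the column names follow the vignette, and the values are made up. The last step uses the plain swapped-role idiom `ProductPriceHistory[Products, ...]` as a stand-in for the `.SD`-based form shown in the diff, so treat it as one possible approach rather than the vignette's method.

```r
library(data.table)

# Hypothetical toy data; values are made up, column names mirror the vignette.
Products <- data.table(
  id    = 1:3,
  name  = c("banana", "carrots", "popcorn"),
  price = c(0.63, 0.89, 2.99)
)
ProductPriceHistory <- data.table(
  product_id = c(1L, 1L, 2L),
  date       = as.Date(c("2024-01-01", "2024-01-08", "2024-01-01")),
  price      = c(0.59, 0.65, 0.79)
)

# Update join with by = .EACHI: j runs once per row of i (ProductPriceHistory),
# so for product 1 the second history row overwrites the first.
# Only x (Products) is modified; i is left untouched.
Products[ProductPriceHistory,
         on = .(id = product_id),
         `:=`(price = last(i.price), last_updated = last(i.date)),
         by = .EACHI]

# Right-join-style update, sketched by swapping roles: ProductPriceHistory is
# now x, so it is the table modified by reference.
# `product_name` is a hypothetical new column name chosen for this example.
ProductPriceHistory[Products,
                    on = .(product_id = id),
                    product_name := i.name]

print(Products)
print(ProductPriceHistory)
```

Product 3 has no price history in this toy data, so its `price` stays as it was and its `last_updated` is `NA`.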
