Commit 3b2812b

Various tweaks to joins vignette (#6688)
* various tweaks to joins rmd file
* few more tweaks
* one more
* another
* bit more concise wording
* another tweak
* add code styles to logical ops
* re-work non-equi join sec
* codifying data.table words
* whoops, i do not know how this made it's way into the repo. Deleting
* updating tab
* o to or
* updating advantage
* minor refinements

---------

Co-authored-by: Michael Chirico <[email protected]>
1 parent e4b0bbb commit 3b2812b

1 file changed

+35
-35
lines changed

vignettes/datatable-joins.Rmd

Lines changed: 35 additions & 35 deletions
@@ -126,12 +126,12 @@ The next diagram shows a description for each basic argument. In the following s
 x[i, on, nomatch]
 | | | |
 | | | \__ If NULL only returns rows linked in x and i tables
-| | \____ a character vector o list defining match logict
+| | \____ a character vector or list defining match logic
 | \_____ primary data.table, list or data.frame
 \____ secondary data.table
 ```

-> Please keep in mind that the standard argument order in data.table is `dt[i, j, by]`. For join operations, it is recommended to pass the `on` and `nomatch` arguments by name to avoid using `j` and `by` when they are not needed.
+> Please keep in mind that the standard argument order in `data.table` is `dt[i, j, by]`. For join operations, it is recommended to pass the `on` and `nomatch` arguments by name to avoid using `j` and `by` when they are not needed.

 ## 3. Equi joins

@@ -160,8 +160,8 @@ Products[ProductReceived,
 As many things have changed, let's explain the new characteristics in the following groups:

 - **Column level**
-- The *first group* of columns in the new data.table comes from the `x` table.
-- The *second group* of columns in the new data.table comes from the `i` table.
+- The *first group* of columns in the new `data.table` comes from the `x` table.
+- The *second group* of columns in the new `data.table` comes from the `i` table.
 - If the join operation presents a present any **name conflict** (both table have same column name) the ***prefix*** `i.` is added to column names from the **right-hand table** (table on `i` position).

 - **Row level**
@@ -183,7 +183,7 @@ Products[ProductReceived,
 on = list(id = product_id)]
 ```

-- Wrapping the related columns in the data.table `list` alias `.`.
+- Wrapping the related columns in the `data.table` `list` alias `.`.

 ```{r, eval=FALSE}
 Products[ProductReceived,
@@ -249,7 +249,7 @@ Products[
 ```


-##### Summarizing with on in data.table
+##### Summarizing with `on` in `data.table`

 We can also use this alternative to return aggregated results based columns present in the `x` table.

@@ -302,18 +302,18 @@ ProductReceived[Products,
 nomatch = NULL]
 ```

-Despite both tables have the same information, they present some relevant differences:
+Despite both tables having the same information, there are some relevant differences:

-- They present different order for their columns
-- They have some name differences on their columns names:
-- The `id` column of first table has the same information as the `product_id` in the second table.
-- The `i.id` column of first table has the same information as the `id` in the second table.
+- They present different column ordering.
+- They have column name differences:
+- The `id` column in the first table has the same information as the `product_id` in the second table.
+- The `i.id` column in the first table has the same information as the `id` in the second table.

 ### 3.3. Not join

 This method **keeps only the rows that don't match with any row of a second table**.

-To apply this technique we just need to negate (`!`) the table located on the `i` argument.
+To apply this technique we can negate (`!`) the table located on the `i` argument.

 ```{r}
 Products[!ProductReceived,
@@ -331,7 +331,7 @@ In this case, the operation returns the row with `product_id = 6,` as it is not

 ### 3.4. Semi join

-This method extract **keeps only the rows that match with any row in a second table** without combining the column of the tables.
+This method extracts **only the rows that match any row in a second table**, without combining the columns of the tables.

 It's very similar to subset as join, but as in this time we are passing a complete table to the `i` we need to ensure that:

@@ -391,7 +391,7 @@ Here some important considerations:

 - **Row level**
 - All rows from in the `i` table were kept as we never received any banana but row is still part of the results.
-- The row related to `product_id = 6` is no part of the results any more as it is not present in the `Products` table.
+- The row related to `product_id = 6` is not part of the results any more as it is not present in the `Products` table.


 #### 3.5.1. Joining after chain operations
@@ -510,7 +510,7 @@ Use this method if you need to combine columns from 2 tables based on one or mor

 As we saw in the previous section, any of the prior operations can keep the missing `product_id = 6` and the **soda** (`product_id = 4`) as part of the results.

-To save this problem, we can use the `merge` function even thought it is lower than using the native `data.table`'s joining syntax.
+To save this problem, we can use the `merge` function even though it is lower than using the native `data.table`'s joining syntax.

 ```{r}
 merge(x = Products,
@@ -524,24 +524,24 @@ merge(x = Products,

 ## 4. Non-equi join

-A non-equi join is a type of join where the condition for matching rows is not based on equality, but on other comparison operators like <, >, <=, or >=. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:
+A non-equi join is a type of join where the condition for matching rows is based on comparison operators other than equality, such as `<`, `>`, `<=`, or `>=`. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:

-- Finding the nearest match
-- Comparing ranges of values between tables
+- Finding the nearest match.
+- Comparing ranges of values between tables.

-It's a great alternative if after applying a right of inner join:
+It is a great alternative when, after applying a right or inner join, you:

-- You want to decrease the number of returned rows based on comparing numeric columns of different table.
-- You don't need to keep the columns from table `x`*(secondary data.table)* in the final table.
+- Want to reduce the number of returned rows based on comparisons of numeric columns between tables.
+- Do not need to retain the columns from table x *(the secondary `data.table`)* in the final result.

-To illustrate how this work, let's center over attention on how are the sales and receives for product 2.
+To illustrate how this works, let's focus on the sales and receives for product 2.

 ```{r}
 ProductSalesProd2 = ProductSales[product_id == 2L]
 ProductReceivedProd2 = ProductReceived[product_id == 2L]
 ```

-If want to know, for example, if can find any receive that took place before a sales date, we can apply the next code.
+If want to know, for example, you can find any receive that took place before a sales date, we can apply the following.

 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
@@ -552,16 +552,16 @@ ProductReceivedProd2[ProductSalesProd2,

 What does happen if we just apply the same logic on the list passed to `on`?

-- As this opperation it's still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
+- As this operation is still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.

-- The date related `ProductReceivedProd2` was omited from this new table.
+- The date related `ProductReceivedProd2` was omitted from this new table.

 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
 on = list(product_id, date < date)]
 ```

-Now, after applying the join, we can limit the results only show the cases that meet all joining criteria.
+Now, after applying the join, we can limit the results only showing the cases that meet all joining criteria.

 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
@@ -574,7 +574,7 @@ ProductReceivedProd2[ProductSalesProd2,

 Rolling joins are particularly useful in time-series data analysis. They allow you to **match rows based on the nearest value** in a sorted column, typically a date or time column.

-This is valuable when you need to align data from different sources **that may not have exactly matching timestamps**, or when you want to carry forward the most recent value.
+This is valuable when you need to align data from different sources **that may not have exact matching timestamps**, or when you want to carry forward the most recent value.

 For example, in financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don't occur at the exact same times.

@@ -594,7 +594,7 @@ ProductPriceHistory = data.table(
 ProductPriceHistory
 ```

-Now, we can perform a right join giving a different prices for each product based on the sale date.
+Now, we can perform a right join giving a different price for each product based on the sale date.

 ```{r}
 ProductPriceHistory[ProductSales,
@@ -613,13 +613,13 @@ ProductPriceHistory[ProductSales,
 j = .(product_id, date, count, price)]
 ```

-## 7. Taking advange of joining speed
+## 7. Taking advantage of joining speed

 ### 7.1. Subsets as joins

-As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. Actually, that process is faster than passing a Boolean expression to the `i` argument.
+As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. This process is faster than passing a Boolean expression to the `i` argument.

-To filter the `x` table at speed we don't to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
+To filter the `x` table at speed we don't need to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.

 For example, to filter dates where the market received 100 units of bananas (`product_id = 1`) or popcorn (`product_id = 3`) we can use the following:

@@ -628,7 +628,7 @@ ProductReceived[list(c(1L, 3L), 100L),
 on = c("product_id", "count")]
 ```

-As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always to add the argument `nomatch = NULL`.
+As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always add the argument `nomatch = NULL`.

 ```{r}
 ProductReceived[list(c(1L, 3L), 100L),
@@ -644,7 +644,7 @@ ProductReceived[!list(c(1L, 3L), 100L),
 on = c("product_id", "count")]
 ```

-If you just want to filter a value for a single **character column**, you can omit calling the `list()` function pass the value to been filtered in the `i` argument.
+If you just want to filter a value for a single **character column**, you can omit calling the `list()` function and pass the value to be filtered in the `i` argument.

 ```{r}
 Products[c("banana","popcorn"),
@@ -660,7 +660,7 @@ Products[!"popcorn",

 ### 7.2. Updating by reference

-The `:=` operator in data.table is used for updating or adding columns by reference. This means it modifies the original data.table without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a data.table, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
+The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.

 Let's update our `Products` table with the latest price from `ProductPriceHistory`:

@@ -674,7 +674,7 @@ copy(Products)[ProductPriceHistory,

 In this operation:

-- The function `copy` prevent that `:=` changes by reference the `Products` table.s
+- The function copy creates a ***deep*** copy of the `Products` table, preventing modifications made by `:=` from changing the original table by reference.
 - We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
 - We update the `price` column with the latest price from `ProductPriceHistory`.
 - We add a new `last_updated` column to track when the price was last changed.
