* various tweaks to joins rmd file
* few more tweaks
* one more
* another
* bit more concise wording
* another tweak
* add code styles to logical ops
* re-work non-equi join sec
* codifying data.table words
* whoops, i do not know how this made it's way into the repo. Deleting
* updating tab
* o to or
* updating advantage
* minor refinements
---------
Co-authored-by: Michael Chirico <[email protected]>
vignettes/datatable-joins.Rmd
35 additions & 35 deletions
@@ -126,12 +126,12 @@ The next diagram shows a description for each basic argument. In the following s
126
126
x[i, on, nomatch]
127
127
| | | |
128
128
| | | \__ If NULL only returns rows linked in x and i tables
129
-
| | \____ a character vector o list defining match logict
129
+
| | \____ a character vector or list defining match logic
130
130
| \_____ primary data.table, list or data.frame
131
131
\____ secondary data.table
132
132
```
133
133
134
-
> Please keep in mind that the standard argument order in data.table is `dt[i, j, by]`. For join operations, it is recommended to pass the `on` and `nomatch` arguments by name to avoid using `j` and `by` when they are not needed.
134
+
> Please keep in mind that the standard argument order in `data.table` is `dt[i, j, by]`. For join operations, it is recommended to pass the `on` and `nomatch` arguments by name to avoid using `j` and `by` when they are not needed.
135
135
136
136
## 3. Equi joins
137
137
@@ -160,8 +160,8 @@ Products[ProductReceived,
160
160
As many things have changed, let's explain the new characteristics in the following groups:
161
161
162
162
- **Column level**
163
-
- The *first group* of columns in the new data.table comes from the `x` table.
164
-
- The *second group* of columns in the new data.table comes from the `i` table.
163
+
- The *first group* of columns in the new `data.table` comes from the `x` table.
164
+
- The *second group* of columns in the new `data.table` comes from the `i` table.
165
165
- If the join operation presents any **name conflict** (both tables have the same column name), the ***prefix*** `i.` is added to column names from the **right-hand table** (table in the `i` position).
166
166
167
167
- **Row level**
@@ -183,7 +183,7 @@ Products[ProductReceived,
183
183
on = list(id = product_id)]
184
184
```
185
185
186
-
- Wrapping the related columns in the data.table `list`alias `.`.
186
+
- Wrapping the related columns in the `data.table` `list` alias `.`.
187
187
188
188
```{r, eval=FALSE}
189
189
Products[ProductReceived,
@@ -249,7 +249,7 @@ Products[
249
249
```
250
250
251
251
252
-
##### Summarizing with on in data.table
252
+
##### Summarizing with `on` in `data.table`
253
253
254
254
We can also use this alternative to return aggregated results based on columns present in the `x` table.
255
255
@@ -302,18 +302,18 @@ ProductReceived[Products,
302
302
nomatch = NULL]
303
303
```
304
304
305
-
Despite both tables have the same information, they present some relevant differences:
305
+
Despite both tables having the same information, there are some relevant differences:
306
306
307
-
- They present different order for their columns
308
-
- They have some name differences on their columns names:
309
-
- The `id` column of first table has the same information as the `product_id` in the second table.
310
-
- The `i.id` column of first table has the same information as the `id` in the second table.
307
+
- They present different column ordering.
308
+
- They have column name differences:
309
+
- The `id` column in the first table has the same information as the `product_id` in the second table.
310
+
- The `i.id` column in the first table has the same information as the `id` in the second table.
311
311
312
312
### 3.3. Not join
313
313
314
314
This method **keeps only the rows that don't match with any row of a second table**.
315
315
316
-
To apply this technique we just need to negate (`!`) the table located on the `i` argument.
316
+
To apply this technique we can negate (`!`) the table located on the `i` argument.
317
317
318
318
```{r}
319
319
Products[!ProductReceived,
@@ -331,7 +331,7 @@ In this case, the operation returns the row with `product_id = 6,` as it is not
331
331
332
332
### 3.4. Semi join
333
333
334
-
This method extract**keeps only the rows that match with any row in a second table** without combining the column of the tables.
334
+
This method extracts **only the rows that match any row in a second table**, without combining the columns of the tables.
335
335
336
336
It's very similar to subset as join, but as this time we are passing a complete table to `i`, we need to ensure that:
337
337
@@ -391,7 +391,7 @@ Here some important considerations:
391
391
392
392
- **Row level**
393
393
- All rows from the `i` table were kept: we never received any banana, but that row is still part of the results.
394
-
- The row related to `product_id = 6` is no part of the results any more as it is not present in the `Products` table.
394
+
- The row related to `product_id = 6` is not part of the results any more as it is not present in the `Products` table.
395
395
396
396
397
397
#### 3.5.1. Joining after chain operations
@@ -510,7 +510,7 @@ Use this method if you need to combine columns from 2 tables based on one or mor
510
510
511
511
As we saw in the previous section, any of the prior operations can keep the missing `product_id = 6` and the **soda** (`product_id = 4`) as part of the results.
512
512
513
-
To save this problem, we can use the `merge` function even thought it is lower than using the native `data.table`'s joining syntax.
513
+
To solve this problem, we can use the `merge` function even though it is slower than using the native `data.table`'s joining syntax.
514
514
515
515
```{r}
516
516
merge(x = Products,
@@ -524,24 +524,24 @@ merge(x = Products,
524
524
525
525
## 4. Non-equi join
526
526
527
-
A non-equi join is a type of join where the condition for matching rows is not based on equality, but on other comparison operators like <, >, <=, or >=. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:
527
+
A non-equi join is a type of join where the condition for matching rows is based on comparison operators other than equality, such as `<`, `>`, `<=`, or `>=`. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:
528
528
529
-
- Finding the nearest match
530
-
- Comparing ranges of values between tables
529
+
- Finding the nearest match.
530
+
- Comparing ranges of values between tables.
531
531
532
-
It's a great alternative if after applying a right of inner join:
532
+
It is a great alternative when, after applying a right or inner join, you:
533
533
534
-
-You want to decrease the number of returned rows based on comparing numeric columns of different table.
535
-
-You don't need to keep the columns from table `x`*(secondary data.table)* in the final table.
534
+
- Want to reduce the number of returned rows based on comparisons of numeric columns between tables.
535
+
- Do not need to retain the columns from table `x` *(the secondary `data.table`)* in the final result.
536
536
537
-
To illustrate how this work, let's center over attention on how are the sales and receives for product 2.
537
+
To illustrate how this works, let's focus on the sales and receives for product 2.
What does happen if we just apply the same logic on the list passed to `on`?
554
554
555
-
- As this opperation it's still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
555
+
- As this operation is still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
556
556
557
-
- The date related `ProductReceivedProd2` was omited from this new table.
557
+
- The date related to `ProductReceivedProd2` was omitted from this new table.
558
558
559
559
```{r}
560
560
ProductReceivedProd2[ProductSalesProd2,
561
561
on = list(product_id, date < date)]
562
562
```
563
563
564
-
Now, after applying the join, we can limit the results only show the cases that meet all joining criteria.
564
+
Now, after applying the join, we can limit the results to only the cases that meet all joining criteria.
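As a rough illustration of that last step, the sketch below applies a non-equi join with `nomatch = NULL` so that only rows meeting every condition are returned. The tables `Received` and `Sales` and their values are hypothetical stand-ins, not the vignette's `ProductReceivedProd2` and `ProductSalesProd2` data.

```r
library(data.table)

# Hypothetical toy data for a single product (not the vignette's tables)
Received <- data.table(product_id = 2L,
                       date = as.Date(c("2025-01-01", "2025-01-10")),
                       count = c(50L, 30L))
Sales <- data.table(product_id = 2L,
                    date = as.Date(c("2025-01-05", "2025-01-01")),
                    count = c(20L, 5L))

# Match each sale (i) with receipts (x) of the same product that
# happened strictly before the sale date; nomatch = NULL drops
# sales that have no such receipt.
SalesWithPriorReceipt <- Received[Sales,
                                  on = .(product_id, date < date),
                                  nomatch = NULL]
SalesWithPriorReceipt
```

Here the sale dated 2025-01-01 has no earlier receipt, so it is dropped, while the sale dated 2025-01-05 is paired with the receipt from 2025-01-01.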
Rolling joins are particularly useful in time-series data analysis. They allow you to **match rows based on the nearest value** in a sorted column, typically a date or time column.
576
576
577
-
This is valuable when you need to align data from different sources **that may not have exactly matching timestamps**, or when you want to carry forward the most recent value.
577
+
This is valuable when you need to align data from different sources **that may not have exact matching timestamps**, or when you want to carry forward the most recent value.
578
578
579
579
For example, in financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don't occur at the exact same times.
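The stock price scenario can be sketched as follows. This is a minimal example under assumed data: the tables `Prices` and `Trades`, their columns, and their values are hypothetical and are not part of the vignette.

```r
library(data.table)

# Hypothetical price updates and transactions (not from the vignette)
Prices <- data.table(
  time = as.POSIXct(c("2024-01-02 09:30:00",
                      "2024-01-02 09:35:00",
                      "2024-01-02 09:40:00"), tz = "UTC"),
  price = c(100, 101, 99)
)
Trades <- data.table(
  time = as.POSIXct(c("2024-01-02 09:33:10",
                      "2024-01-02 09:41:05"), tz = "UTC"),
  shares = c(10L, 5L)
)

# roll = TRUE carries the last observation forward: each trade is
# matched with the most recent price at or before its timestamp.
TradesPriced <- Prices[Trades, on = "time", roll = TRUE]
TradesPriced
```

The first trade falls between two price updates and picks up the 09:30 price, and the second trade picks up the 09:40 price.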
As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. Actually, that process is faster than passing a Boolean expression to the `i` argument.
620
+
As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. This process is faster than passing a Boolean expression to the `i` argument.
621
621
622
-
To filter the `x` table at speed we don't to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
622
+
To filter the `x` table at speed we don't need to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
623
623
624
624
For example, to filter dates where the market received 100 units of bananas (`product_id = 1`) or popcorn (`product_id = 3`) we can use the following:
As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always to add the argument `nomatch = NULL`.
631
+
Since, in the end, we are filtering based on a join operation, the code returned a **row that was not present in the original table**. To avoid that behavior, it is recommended to always add the argument `nomatch = NULL`.
If you just want to filter a value for a single **character column**, you can omit calling the `list()` function pass the value to been filtered in the `i` argument.
647
+
If you just want to filter a value for a single **character column**, you can omit calling the `list()` function and pass the value to be filtered in the `i` argument.
648
648
649
649
```{r}
650
650
Products[c("banana","popcorn"),
@@ -660,7 +660,7 @@ Products[!"popcorn",
660
660
661
661
### 7.2. Updating by reference
662
662
663
-
The `:=` operator in data.table is used for updating or adding columns by reference. This means it modifies the original data.table without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a data.table, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
663
+
The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
664
664
665
665
Let's update our `Products` table with the latest price from `ProductPriceHistory`:
- The function `copy` prevent that `:=`changes by reference the `Products` table.s
677
+
- The function `copy` creates a ***deep*** copy of the `Products` table, preventing modifications made by `:=` from changing the original table by reference.
678
678
- We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
679
679
- We update the `price` column with the latest price from `ProductPriceHistory`.
680
680
- We add a new `last_updated` column to track when the price was last changed.
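The steps in this list can be sketched as below. The data are hypothetical toy values that only reuse the table and column names mentioned in the text, and the helper table `LatestPrice` is an assumption introduced for illustration, so this is not necessarily how the vignette itself performs the update.

```r
library(data.table)

# Hypothetical toy data; only the names mirror the text
Products <- data.table(id = 1:3,
                       name = c("banana", "carrots", "popcorn"),
                       price = c(0.63, 0.89, 2.99))
ProductPriceHistory <- data.table(
  product_id = c(1L, 1L, 2L),
  date = as.Date(c("2024-01-01", "2024-02-01", "2024-01-15")),
  price = c(0.59, 0.63, 0.92)
)

# Hypothetical helper: keep the most recent price row per product
LatestPrice <- ProductPriceHistory[order(date), .SD[.N], by = product_id]

# copy() makes a deep copy, so := changes the copy and leaves
# Products untouched; the i. prefix refers to columns of LatestPrice.
ProductsUpdated <- copy(Products)[LatestPrice,
                                  on = .(id = product_id),
                                  j = `:=`(price = i.price,
                                           last_updated = i.date)]
ProductsUpdated
```

Rows of `Products` without any price history keep their original `price` and get `NA` in `last_updated`.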