added examples

venom1204 · venom1204 · commit 4a416fce377c · 2025-03-15T13:44:52.000+05:30
diff --git a/vignettes/datatable-joins.Rmd b/vignettes/datatable-joins.Rmd
@@ -194,6 +194,40 @@ Products[ProductReceived,
          on = .(id = product_id)]
 ```
 
+#### 3.1.2. Using Named Lists for Explicit Joins
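+In `data.table`, the `on` argument accepts either a character vector of column names or a list written with `.()` (shorthand for `list()`). Naming the list elements as `x_col = i_col` states explicitly which columns are matched, which reduces ambiguity when joining on multiple columns or on columns whose names differ between the tables.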
+
+- Using a character vector of column names:
+```{r}
+dt1 <- data.table(id = c(1, 2, 3), value = c("A", "B", "C"))
+dt2 <- data.table(id = c(2, 3, 4), info = c("X", "Y", "Z"))
+dt1[dt2, on = "id"]
+```
+- Using a named list for explicit joins:
+```{r}
+dt1[dt2, on = .(id = id)]
+```
+- Here, `.()` is shorthand for `list()`, and naming the columns explicitly (`id = id`, with the `dt1` column on the left and the `dt2` column on the right) makes the join easier to understand.
+
+##### Named Lists for Multiple Column Joins
+When joining on multiple columns, a named list spells out each pair of matched columns and makes the query more readable:
+```{r}
+dt1 <- data.table(id = c(1, 2, 3), key1 = c("A", "B", "C"), value = c(10, 20, 30))
+dt2 <- data.table(id = c(2, 3, 4), key1 = c("B", "C", "D"), info = c("X", "Y", "Z"))
+
+# Character vector approach (relies on identical column names in both tables)
+dt1[dt2, on = c("id", "key1")]
+
+# Named list approach (explicit and clear)
+dt1[dt2, on = .(id = id, key1 = key1)]
+```
+This ensures that column names are explicitly matched, which is especially useful when working with complex datasets. Named lists are most helpful when:
+
+- There is potential ambiguity in column names.
+- You are joining on multiple columns.
+- You want to make your joins self-documenting and more readable.
+
+
 #### 3.1.2. Alternatives to define the `on` argument
 
 In all the prior example we have pass the column names we want to match to the `on` argument but `data.table` also have alternatives to that syntax.
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -191,33 +191,63 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
 
 * Since the time to compute the secondary index is quite small, we don't have to use `setindex()`, unless, once again, the task involves repeated subsetting on the same column.
 
-### b) Select in `j`
+### b) Using named list elements in `i`
+When subsetting with the `on` argument, values in `i` are typically passed as unnamed elements. Naming the elements explicitly improves readability, especially when subsetting on multiple columns.
+
+- Example: Standard subsetting using unnamed elements
+```{r}
+flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
+```
+While this syntax is concise, it may not be immediately clear which value corresponds to which key in `on`.
+
+- Subsetting using named elements in `i`
+```{r}
+flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
+```
+Here, naming the elements explicitly `(origin = "LGA", dest = "TPA")` makes it clear which variable each value corresponds to. This improves code maintainability, especially in complex queries.
+
+- Passing multiple values with unnamed elements
+When an element of `i` holds several values, it is harder to see which column each element refers to:
+```{r}
+flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
+```
+- The same query with named elements
+```{r}
+flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
+```
+- Impact of named elements on matching order
+Names in `i` are used for matching only when `on` is specified; in that case each named element is matched to the `on` column of the same name. If `on` is omitted, data.table matches the values in `i` to the key columns by position, regardless of the names used.
+
+- When to use named list elements in `i`
+Name the elements of `i` when subsetting on multiple columns in `on`, as it makes clear which value belongs to which column.
+
+### c) Select in `j`
 
 All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.
 
 #### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
 
 ```{r}
-flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
+flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")]
 ```
 
-### c) Chaining
+### d) Chaining
 
 #### -- On the result obtained above, use chaining to order the column in decreasing order.
 
 ```{r}
-flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
+flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
 ```
 
-### d) Compute or *do* in `j`
+### e) Compute or *do* in `j`
 
 #### -- Find the maximum arrival delay corresponding to `origin = "LGA"` and `dest = "TPA"`.
 
 ```{r}
-flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
+flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
 ```
 
-### e) *sub-assign* by reference using `:=` in `j`
+### f) *sub-assign* by reference using `:=` in `j`
 
 We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:
 
@@ -240,7 +270,7 @@ flights[, sort(unique(hour))]
 
 * This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of `hour`, we had to `setkey()` on it, which inevitably reorders the entire data.table. With `on`, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.
 
-### f) Aggregation using `by`
+### g) Aggregation using `by`
 
 #### -- Get the maximum departure delay for each `month` corresponding to `origin = "JFK"`. Order the result by `month`
 
@@ -251,7 +281,7 @@ head(ans)
 
 * We would have had to set the `key` back to `origin, dest` again, if we did not use `on` which internally builds secondary indices on the fly.
 
-### g) The *mult* argument
+### h) The *mult* argument
 
 The other arguments including `mult` work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can choose, instead only the "first" or "last" matching rows should be returned.
 
@@ -264,17 +294,17 @@ flights[c("BOS", "DAY"), on = "dest", mult = "first"]
 #### -- Subset only the last matching row where `origin` matches *"LGA", "JFK", "EWR"* and `dest` matches *"XNA"*
 
 ```{r}
-flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
+flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), on = c("origin", "dest"), mult = "last"]
 ```
 
-### h) The *nomatch* argument
+### i) The *nomatch* argument
 
 We can choose if queries that do not match should return `NA` or be skipped altogether using the `nomatch` argument.
 
 #### -- From the previous example, subset all rows only if there's a match
 
 ```{r}
-flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
+flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
 ```
 
 * There are no flights connecting "JFK" and "XNA". Therefore, that row is skipped in the result.