-<!-- Suppose we wanted to find the maximum value for all the numeric columns in the `tidy_lang` data set. -->
-
-<!-- We could apply `summarize` in the same way that we did above to find the maximum values: -->
-
-<!-- ```{r} -->
-
-<!-- lang_summary_max <- summarize(tidy_lang, -->
-
-<!-- most_most_at_home = max(most_at_home), -->
-
-<!-- most_most_at_work = max(most_at_work)) -->
-
-<!-- lang_summary_max -->
-
-<!-- ``` -->
-
-<!-- The approach above is a valid way to do this, but if we had many numeric columns in our data set then this method would take a lot of time since we would have to explicitly write out the name of each column! A faster and less error-prone way to apply function(s) to columns that satisfy a certain condition is to use the `summarize_if` function. The first argument is the data set we want to summarize (`tidy_lang`). The second argument is the required condition, here if a particular column is numeric then the function will be applied. The third argument is the function we want to summarize with, here `max`. Therefore we write: -->
-
-<!-- ```{r 02-summarize-if} -->
-
-<!-- summarize_if(tidy_lang, -->
-
-<!-- is.numeric, -->
-
-<!-- max) -->
-
-<!-- ``` -->
-
-<!-- Notice that we get the same output as we did above! From the table, we see that the most commonly spoken -->
-
-<!-- primary language at home is spoken by X people and the most commonly spoken language at work is spoken by X people. -->
-
 ### Calculating group summary statistics:

 A common pairing with `summarize` is `group_by`. Pairing these functions
 together can let you summarize values for subgroups within a data set. For
-example, here, we can use `group_by` to group the regions and then calculate the
-minimum and maximum number of Canadians reporting the language as the primary
-language at home for each of the groups.
+example, here, we can use `group_by` to group the regions of the `tidy_lang` dataframe
+and then calculate the minimum and maximum number of Canadians
+reporting the language as the primary language at home for each of the groups.
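
The `group_by` plus `summarize` pairing described in this paragraph can be sketched as follows. This is a minimal illustration rather than the chapter's own code: `toy_lang` is a hypothetical stand-in for `tidy_lang`, and its region names, language names, and counts are invented.

```r
library(dplyr)

# Hypothetical stand-in for `tidy_lang`; the regions, languages, and counts
# below are invented for illustration only.
toy_lang <- tibble(
  region = c("Region X", "Region X", "Region Y", "Region Y"),
  language = c("Language A", "Language B", "Language A", "Language B"),
  most_at_home = c(50, 10, 30, 5)
)

# Group by region, then compute one minimum and one maximum per group.
toy_summary <- toy_lang |>
  group_by(region) |>
  summarize(
    min_most_at_home = min(most_at_home),
    max_most_at_home = max(most_at_home)
  )

toy_summary
```

Because the data are grouped first, `summarize` returns one row per region instead of a single row for the whole data frame.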

 The `group_by` function takes at least two arguments. The first is the data
 frame that will be grouped, and the second and onwards are columns to use in the
@@ -1205,7 +1173,7 @@ lang_summary_by_region
 ```

 Notice that `group_by` on its own doesn't change the way the data looks. In the output below
-the data set looks the same, and it doesn't *appear* to be grouped by `region`.
+the grouped data set looks the same, and it doesn't *appear* to be grouped by `region`.
 Instead, `group_by` simply changes how other functions work with the data, as we saw with `summarize` above.

 ```{r}
@@ -1387,42 +1355,40 @@ iteration. Additionally, their use is not limited to columns of a data frame;
 `map_*` functions can be used to apply functions to elements of a vector or
 list, and even to lists of data frames, or nested data frames.
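
As a rough sketch of that claim, the snippet below applies `map_dbl` to the elements of a plain vector and `map_int` to a list of data frames. The object names (`roots`, `toy_frames`, `row_counts`) and the data are hypothetical and do not come from the chapter.

```r
library(purrr)

# Apply a function to each element of a plain numeric vector.
roots <- map_dbl(c(1, 4, 9), sqrt)

# Apply a function to each data frame in a list (toy data, invented here).
toy_frames <- list(
  small = data.frame(x = 1:2),
  large = data.frame(x = 1:5)
)
row_counts <- map_int(toy_frames, nrow)

roots
row_counts
```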

-## Iterating over rows in a data frame with `rowwise()`
+## Apply functions across columns within one row with `rowwise`

-
-What if you want to apply a function across rows instead of columns?
+What if you want to apply a function across columns but within one row?
 For instance, suppose we want to know the maximum value between `mother_tongue`,
-`most_at_home`, `most_at_work` and `lang_known` for each language in Vancouver.
+`most_at_home`, `most_at_work` and `lang_known` for each language in the `region_lang` data set?
 In other words, we want to apply the `max` function row-wise. We will use the aptly
-named function `rowwise` to accomplish this task. First, we `filter` the data for
-only the languages in Vancouver. We also `select` specific columns simply
-so we can see all the columns in the data frame output
-but note that this step is not strictly necessary.
+named function `rowwise` in combination with `mutate` to accomplish this task.
+>**Note:** Before we apply `rowwise` we will `select` only the count columns
+so we can see all the columns in the dataframe's output easily in the book.

-```{r vancouver_filter}
-vancouver_lang <- region_lang |>
-filter(region == "Vancouver") |>
-select(region, language:lang_known)
-vancouver_lang
-```
-Similar to `group_by`, `rowwise` doesn't do anything when it is called by itself,
-however, we can apply `rowwise` in combination with other functions to change how
-these other functions operate on the data. We will use `rowwise` and `mutate`
-to find the maximum count for each language in the data set.
+Similar to `group_by`, `rowwise` doesn't do anything when it is called by itself,
+however, we can apply `rowwise` in combination with other functions to change how
+these other functions operate on the data.
 Notice if we used `mutate` without `rowwise`, we would have computed the maximum
-value across *all* rows rather than the maximum value for *each* row. Therefore in the output below
-`r format(vancouver_lang |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) |> slice(1) |> pull(maximum), scientific = FALSE, big.mark = ",")` is reported as the maximum value in every single row since it is
+value across *all* rows rather than the maximum value for *each* row.
+Therefore in the output below the same maximum value is reported
+in every single row since it is
 the maximum value among *all* the rows, so this code is not doing what we want.
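
The difference between `mutate` with and without `rowwise` can be sketched on a toy data frame. The column names follow the text, but `toy_counts` and its values are invented for illustration and are not taken from `region_lang`.

```r
library(dplyr)

# Hypothetical count columns; the values are invented for illustration.
toy_counts <- tibble(
  mother_tongue = c(10, 200, 3),
  most_at_home = c(40, 20, 1),
  most_at_work = c(5, 300, 2),
  lang_known = c(30, 100, 9)
)

# Without rowwise: max() sees every value in all four columns at once,
# so each row receives the same overall maximum.
overall <- toy_counts |>
  mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known)))

# With rowwise: each row is handled on its own, so max() only sees
# the four values belonging to that row.
per_row <- toy_counts |>
  rowwise() |>
  mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known)))

overall$maximum
per_row$maximum
```

In this sketch `overall$maximum` repeats one value for every row, while `per_row$maximum` differs by row, which is the behaviour the section is after.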