@@ -311,7 +311,7 @@ Figure \@ref(fig:02-wide-to-long).
knitr::include_graphics("img/pivot_functions/pivot_functions.001.jpeg")
```

- We can achieve this effect in R using the `pivot_longer` function.
+ We can achieve this effect in R using the `pivot_longer` function from the `tidyverse` package.
The `pivot_longer` function combines columns,
and is usually used during tidying data
when we need to make the data frame longer and narrower.
@@ -329,7 +329,7 @@ lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide
```

- What is wrong with our untidy format above?
+ What is wrong with the untidy format above?
The table on the left in Figure \@ref(fig:img-pivot-longer-with-table)
represents the data in the "wide" (messy) format.
From a data analysis perspective, this format is not ideal because the values of
@@ -356,8 +356,8 @@ to get the maximum value.
knitr::include_graphics("img/pivot_functions/pivot_functions.003.jpeg")
```

- Figure \@ref(fig:img-pivot-longer) details what arguments we need to specify to
- use the `tidyverse` function, `pivot_longer`, to accomplish this data transformation.
+ Figure \@ref(fig:img-pivot-longer) details the arguments that we need to specify
+ in the `pivot_longer` function to accomplish this data transformation.

(ref:img-pivot-longer) Syntax for the `pivot_longer` function.

@@ -447,7 +447,7 @@ In this example, each observation is a language in a region.
However, each observation is split across multiple rows:
one where the count for `most_at_home` is recorded,
and the other where the count for `most_at_work` is recorded.
- Suppose our analysis goal with this data set was to
+ Suppose the goal with this data was to
visualize the relationship between the number of
Canadians reporting their primary language at home and work.
Doing that would be difficult with this data in its current form,
@@ -461,8 +461,9 @@ will be tidied using the `pivot_wider` function.
knitr::include_graphics("img/pivot_functions/pivot_functions.004.jpeg")
```

- Figure \@ref(fig:img-pivot-wider) details what we need to specify
- to use the `pivot_wider` function.
+ Figure \@ref(fig:img-pivot-wider) details the arguments that we need to specify
+ in the `pivot_wider` function.
+

(ref:img-pivot-wider) Syntax for the `pivot_wider` function.

@@ -492,8 +493,8 @@ that this data is a tidy data set.
3. Each value is a single cell (i.e., its row, column position in the data
frame is not shared with another value).

- You might notice that we have the same number of columns in our tidy data set as
- we did in our messy one. Therefore `pivot_wider` didn't really "widen" our data,
+ You might notice that we have the same number of columns in the tidy data set as
+ we did in the messy one. Therefore `pivot_wider` didn't really "widen" the data,
as the name suggests. This is just because the original `type` column only had
two categories in it. If it had more than two, `pivot_wider` would have created
more columns, and we would see the data set "widen."
@@ -565,7 +566,7 @@ Is this data set now tidy? If we recall the three criteria for tidy data:
We can see that this data now satisfies all three criteria, making it easier to
analyze. But we aren't done yet! Notice in the table above that the word
`<chr>` appears beneath each of the column names. The word under the column name
- indicates the data type of each column. Here all of our variables are
+ indicates the data type of each column. Here all of the variables are
"character" data types. Recall, character data types are letter(s) or digits(s)
surrounded by quotes. In the previous example in Section \@ref(pivot-wider), the
`most_at_home` and `most_at_work` variables were `<dbl>` (double)&mdash; you can
@@ -600,7 +601,7 @@ indicating they are integer data types (i.e., numbers)!

## Using `select` to extract a range of columns

- Now that our `tidy_lang` data is indeed *tidy*, we can start manipulating it \index{select!helpers}
+ Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it \index{select!helpers}
using the powerful suite of functions from the `tidyverse`.
For the first example, recall the `select` function from Chapter \@ref(intro),
which lets us create a subset of columns from a data frame.
@@ -679,7 +680,7 @@ to compare the values of the `category` column
with the value `"Official languages"`.
With these arguments, `filter` returns a data frame with all the columns
of the input data frame
- but only the rows we asked for in our logical filter statement, i.e.,
+ but only the rows we asked for in the logical statement, i.e.,
those where the `category` column holds the value `"Official languages"`.
We name this data frame `official_langs`.

@@ -728,8 +729,8 @@ filter(official_langs, region == "Montréal" & language == "French")

### Extracting rows satisfying at least one condition using `|`

- Suppose we were interested in the rows for only the Albertan cities
- in our `official_langs` data set (Edmonton and Calgary).
+ Suppose we were interested in only those rows corresponding to cities in Alberta
+ in the `official_langs` data set (Edmonton and Calgary).
We can't use `,` as we did above because `region`
cannot be both Edmonton *and* Calgary simultaneously.
Instead, we can use the vertical pipe (`|`) logical operator,
@@ -925,11 +926,11 @@ for our five cities of focus in this chapter.
To accomplish this, we will need to do two tasks
beforehand:

- 1. Create a vector containing the population values for our cities.
+ 1. Create a vector containing the population values for the cities.
2. Filter the `official_langs` data frame
so that we only keep the rows where the language is English.

- To create a vector containing the population values for our cities
+ To create a vector containing the population values for the five cities
(Toronto, Montréal, Vancouver, Calgary, Edmonton),
we will use the `c` function (recall that `c` stands for "concatenate"):

@@ -977,10 +978,10 @@ Failing to do this would have resulted in the incorrect math being performed.
<!--
#### Creating a visualization with tidy data {-}

- Now that we have cleaned and wrangled our data, we can make visualizations or do
- statistical analyses to answer questions about our data! Let's suppose we want to
+ Now that we have cleaned and wrangled the data, we can make visualizations or do
+ statistical analyses to answer questions about it! Let's suppose we want to
answer the question "what proportion of people in each city speak English
- as their primary language at home in these five cities?" Since our data is
+ as their primary language at home in these five cities?" Since the data is
cleaned already, in a few short lines of code, we can use `ggplot` to create a
data visualization to answer this question! Here we create a bar plot to represent the proportions for
each region and color the proportions by language.
@@ -1086,7 +1087,7 @@ output <- data |>

### Using `|>` to combine `filter` and `select`

- Let's work with our tidy `tidy_lang` data set from Section \@ref(separate),
+ Let's work with the tidy `tidy_lang` data set from Section \@ref(separate),
which contains the number of Canadians reporting their primary language at home
and work for five major cities
(Toronto, Montréal, Vancouver, Calgary, and Edmonton):
@@ -1125,7 +1126,7 @@ van_data_selected <- tidy_lang |>
van_data_selected
```

- But wait...Why do our `select` and `filter` function calls
+ But wait...Why do the `select` and `filter` function calls
look different in these two examples?
Remember: when you use the pipe,
the output of the first function is automatically provided
@@ -1273,8 +1274,8 @@ region_lang_na[["most_at_home"]][1] <- NA
region_lang_na
```

- Now if we apply our `summarize` function as above,
- we see that no longer get the minimum and maximum returned,
+ Now if we apply the `summarize` function as above,
+ we see that we no longer get the minimum and maximum returned,
but just an `NA` instead!

```{r}
@@ -1409,7 +1410,7 @@ region_lang |>
> `purrr` is part of the tidyverse, once we call `library(tidyverse)` we
> do not need to load the `purrr` package separately.

- Our output looks a bit weird... we passed in a data frame, but our output
+ The output looks a bit weird... we passed in a data frame, but the output
doesn't look like a data frame. As it so happens, it is *not* a data frame, but
rather a plain list:

@@ -1547,7 +1548,7 @@ region_lang |>
```

Now we apply `rowwise` before `mutate`, to tell R that we would like
- our mutate function to be applied across, and within, a row,
+ the mutate function to be applied across, and within, a row,
as opposed to being applied on a column
(which is the default behavior of `mutate`):

@@ -1561,7 +1562,7 @@ region_lang |>
lang_known)))
```

- We see that we get an additional column added to our data frame,
+ We see that we get an additional column added to the data frame,
named `maximum`, which is the maximum value between `mother_tongue`,
`most_at_home`, `most_at_work` and `lang_known` for each language
and region.