@@ -19,14 +19,14 @@ application, providing more practice working through a whole case study.
## Chapter learning objectives

- By the end of the chapter, readers will be able to:
-
- - define the term "tidy data"
- - discuss the advantages of storing data in a tidy data format
- - define what vectors, lists, and data frames are in R, and describe how they relate to
- each other
- - describe the common types of data in R and their uses
- - recall and use the following functions for their
+ By the end of the chapter, readers will be able to do the following:
+
+ - Define the term "tidy data".
+ - Discuss the advantages of storing data in a tidy data format.
+ - Define what vectors, lists, and data frames are in R, and describe how they relate to
+ each other.
+ - Describe the common types of data in R and their uses.
+ - Recall and use the following functions for their
intended data wrangling tasks:
- `across`
- `c`
@@ -41,7 +41,7 @@ By the end of the chapter, readers will be able to:
- `rowwise`
- `separate`
- `summarize`
- - recall and use the following operators for their
+ - Recall and use the following operators for their
intended data wrangling tasks:
- `==`
- `%in%`
@@ -75,8 +75,8 @@ that is designed to store observations, variables, and their values.
Most commonly, each column in a data frame corresponds to a variable,
and each row corresponds to an observation. For example, Figure
\@ref(fig:02-obs) displays a data set of city populations. Here, the variables
- are "region, year, population;" each of these are properties that can be
- collected or measured. The first observation is "Toronto, 2016, 2235145;"
+ are "region, year, population"; each of these is a property that can be
+ collected or measured. The first observation is "Toronto, 2016, 2235145";
these are the values that the three variables take for the first entity in the
data set. There are 13 entities in the data set in total, corresponding to the
13 rows in Figure \@ref(fig:02-obs).
@@ -420,7 +420,7 @@ before the maximum can be computed.
In comparison, if the data were tidy,
all we would have to do is compute the maximum value for the commuter column.
To reshape this untidy data set to a tidy (and in this case, wider) format,
- we need to create a column called "population", "commuters", and "incorporated."
+ we need to create columns called "population", "commuters", and "incorporated".
This is illustrated in the right table of Figure \@ref(fig:long-to-wide).

```{r long-to-wide, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Going from long to wide data.", fig.retina = 2, out.width = "100%"}
@@ -486,8 +486,8 @@ The data above is now tidy! We can go through the three criteria again to check
that this data is a tidy data set.

1. All the statistical variables are their own columns in the data frame (i.e.,
- `most_at_home`, and `most_at_work`) have been separated into their own
- columns in the data frame.
+ `most_at_home` and `most_at_work` have been separated into their own
+ columns in the data frame).
2. Each observation (i.e., each language in a region) is in a single row.
3. Each value is a single cell (i.e., its row, column position in the data
frame is not shared with another value).
@@ -567,7 +567,7 @@ analyze. But we aren't done yet! Notice in the table above that the word
`<chr>` appears beneath each of the column names. The word under the column name
indicates the data type of each column. Here, all of our variables are
"character" data types. Recall, character data types are letter(s) or digit(s)
- surrounded by quotes. In the previous example in section \@ref(pivot-wider), the
+ surrounded by quotes. In the previous example in Section \@ref(pivot-wider), the
`most_at_home` and `most_at_work` variables were `<dbl>` (double)&mdash;you can
verify this by looking at the tables in the previous sections&mdash;which is a type
of numeric data. This change is due to the delimiter (`/`) when we read in this
@@ -773,6 +773,8 @@ five_cities <- filter(region_data,
five_cities
```

+ \newpage
+
> **Note:** What's the difference between `==` and `%in%`? Suppose we have two
> vectors, `vectorA` and `vectorB`. If you type `vectorA == vectorB` into R it
> will compare the vectors element by element. R checks if the first element of
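The distinction the note describes can be sketched with a pair of toy vectors (invented here for illustration; they are not objects from the chapter's data):

```r
# Two toy character vectors (hypothetical, not from the chapter's data)
vectorA <- c("Toronto", "Montréal", "Vancouver")
vectorB <- c("Montréal", "Toronto", "Vancouver")

# `==` compares position by position: element 1 vs element 1, and so on
vectorA == vectorB
#> [1] FALSE FALSE  TRUE

# `%in%` asks, for each element of vectorA, whether it occurs
# *anywhere* in vectorB, regardless of position
vectorA %in% vectorB
#> [1] TRUE TRUE TRUE
```

Because `%in%` ignores position, it is the safer choice when checking membership against a set of allowed values.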
@@ -795,20 +797,20 @@ census_popn <- 35151728
most_french <- 2669195
```

- We saw in section \@ref(filter-and) that
+ We saw in Section \@ref(filter-and) that
`r format(most_french, scientific = FALSE, big.mark = ",")` people reported
speaking French in Montréal as their primary language at home.
If we are interested in finding the official languages in regions
with higher numbers of people who speak it as their primary language at home
- compared to French in Montréal then we can use `filter` to obtain rows
+ compared to French in Montréal, then we can use `filter` to obtain rows
where the value of `most_at_home` is greater than
`r format(most_french, scientific = FALSE, big.mark = ",")`.

```{r}
filter(official_langs, most_at_home > 2669195)
```

- `filter` returns a data frame with only one row indicating that when
+ `filter` returns a data frame with only one row, indicating that when
considering the official languages,
only English in Toronto is reported by more people
as their primary language at home
@@ -818,7 +820,7 @@ than French in Montréal according to the 2016 Canadian census.

### Using `mutate` to modify columns
In Section \@ref(separate),
- when we first read in the `"region_lang_top5_cities_messy.csv"` data
+ when we first read in the `"region_lang_top5_cities_messy.csv"` data,
all of the variables were "character" data types. \index{mutate}
During the tidying process,
we used the `convert` argument from the `separate` function
@@ -905,7 +907,7 @@ number, we need context. In particular, how many people were in Toronto when
this data was collected? From the 2016 Canadian census profile, the population
of Toronto was reported to be
`r format(toronto_popn, scientific = FALSE, big.mark = ",")` people.
- The number of people who report that English as their primary language at home
+ The number of people who report that English is their primary language at home
is much more meaningful when we report it in this context.
We can even go a step further and transform this count to a relative frequency
or proportion.
@@ -923,9 +925,9 @@ for our five cities of focus in this chapter.
To accomplish this, we will need to do two tasks
beforehand:

- 1. create a vector containing the population values for our cities
- 2. filter the `official_langs` data frame
- so that we only keep the rows where the language is English
+ 1. Create a vector containing the population values for our cities.
+ 2. Filter the `official_langs` data frame
+ so that we only keep the rows where the language is English.

To create a vector containing the population values for our cities
(Toronto, Montréal, Vancouver, Calgary, Edmonton),
@@ -963,7 +965,7 @@ same order as the cities were listed in the `english_langs` data frame.
This is because R will perform the division computation we did by dividing
each element of the `most_at_home` column by each element of the
`city_pops` vector, matching them up by position.
- Failing to do this would have resulted in the incorrect math to be performed.
+ Failing to do this would have resulted in the incorrect math being performed.

> **Note:** In more advanced data wrangling,
> one might solve this problem in a less error-prone way through using
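The positional division described above can be sketched as follows; the data frame and numbers here are toy stand-ins invented for illustration, not the chapter's census data:

```r
library(dplyr)

# Toy stand-in for the chapter's english_langs data frame; values are invented
english_langs <- tibble(
  region       = c("CityA", "CityB"),
  most_at_home = c(300, 40)
)

# Populations must be listed in the SAME row order as the data frame,
# because R matches the column and the vector element by element
city_pops <- c(1000, 200)

english_langs |>
  mutate(most_at_home_prop = most_at_home / city_pops)
#> most_at_home_prop is 0.3 for CityA and 0.2 for CityB
```

If `city_pops` were listed in a different order than the rows, the division would still run without error but silently produce wrong proportions, which is why the row order matters.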
@@ -1010,7 +1012,7 @@ frame. The basic ways of doing this can become quickly unreadable if there are
many steps. For example, suppose we need to perform three operations on a data
frame called `data`: \index{pipe}\index{aaapipesymb@\vert{}>|see{pipe}}

- 1) add a new column `new_col` that is double another `old_col`
+ 1) add a new column `new_col` that is double another `old_col`,
2) filter for rows where another column, `other_col`, is more than 5, and
3) select only the new column `new_col` for those rows.
@@ -1060,7 +1062,7 @@ output <- data |>

> **Note:** You might also have noticed that we split the function calls across
> lines after the pipe, similar to when we did this earlier in the chapter
- > for long function calls. Again this is allowed and recommended, especially when
+ > for long function calls. Again, this is allowed and recommended, especially when
> the piped function calls create a long line of code. Doing this makes
> your code more readable. When you do this, it is important to end each line
> with the pipe operator `|>` to tell R that your code is continuing onto the
@@ -1074,9 +1076,9 @@ output <- data |>
> (which in turn imports the `magrittr` R package).
> There are some other differences between `%>%` and `|>` related to
> more advanced R uses, such as sharing and distributing code as R packages;
- > however these are beyond the scope of this textbook.
+ > however, these are beyond the scope of this textbook.
> We have this note in the book to make the reader aware that `%>%` exists
- > as it still commonly used in data analysis code and in many data science
+ > as it is still commonly used in data analysis code and in many data science
> books and other resources.
> In most cases these two pipes are interchangeable and either can be used.
@@ -1112,7 +1114,7 @@ van_data_selected

Although this is valid code, there is a more readable approach we could take by
using the pipe, `|>`. With the pipe, we do not need to create an intermediate
- object to store the output from `filter`. Instead we can directly send the
+ object to store the output from `filter`. Instead, we can directly send the
output of `filter` to the input of `select`:

```{r}
@@ -1131,12 +1133,12 @@ as the first argument for the function that comes after it.
Therefore, you do not specify the first argument in that function call.
In the code above,
the first line is just the `tidy_lang` data frame with a pipe.
- The pipe passes the left hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
+ The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
so in the `filter` function you only see the second argument (and beyond).
Then again after `filter` there is a pipe, which passes the result of the `filter` step
to the first argument of the `select` function.
As you can see, both of these approaches&mdash;with and without pipes&mdash;give us the same output, but the second
- approach is more clear and readable.
+ approach is clearer and more readable.

### Using `|>` with more than two functions
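The two approaches compared above can be sketched side by side with a toy data frame (the rows here are invented stand-ins for the chapter's `tidy_lang`):

```r
library(dplyr)

# Toy stand-in for the chapter's tidy_lang data frame
tidy_lang <- tibble(
  region       = c("Vancouver", "Toronto"),
  language     = c("English", "English"),
  most_at_home = c(100, 200)
)

# Without the pipe: an intermediate object is needed
van_data <- filter(tidy_lang, region == "Vancouver")
van_data_selected <- select(van_data, language, most_at_home)

# With the pipe: filter's output flows straight into select's first argument
piped <- tidy_lang |>
  filter(region == "Vancouver") |>
  select(language, most_at_home)
```

Both produce the same one-row data frame; the piped version avoids naming a throwaway intermediate object.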
@@ -1152,8 +1154,7 @@ from smallest to largest.
As we saw in Chapter \@ref(intro),
we can use the `tidyverse` `arrange` function \index{arrange}
to order the rows in the data frame by the values of one or more columns.
- Here we pass the column name `most_at_home` to arrange
- to order the data frame rows by the values in that column, in ascending order.
+ Here we pass the column name `most_at_home` to `arrange` to order the data frame rows by the values in that column, in ascending order.

```{r}
large_region_lang <- filter(tidy_lang, most_at_home > 10000) |>
@@ -1237,13 +1238,13 @@ lang_summary <- summarize(region_lang,
max_most_at_home = max(most_at_home))
```

- From this we see that there are some languages in the data set the no one speaks
+ From this we see that there are some languages in the data set that no one speaks
as their primary language at home. We also see that the most commonly spoken
primary language at home is spoken by
`r format(lang_summary$max_most_at_home[1], scientific = FALSE, big.mark = ",")`
people.

- #### Calculating summary statistics when there are `NA`s {-}
+ ### Calculating summary statistics when there are `NA`s

In data frames in R, the value `NA` is often used to denote missing data.
Many of the base R statistical summary functions
@@ -1272,7 +1273,7 @@ region_lang_na[["most_at_home"]][1] <- NA
region_lang_na
```

- Now if we apply our summarize function as above,
+ Now if we apply our `summarize` function as above,
we see that we no longer get the minimum and maximum returned,
but just an `NA` instead!
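The behavior described here is easy to see with a toy vector:

```r
# Base R summary functions propagate missing values...
values <- c(5, 10, NA)
max(values)
#> [1] NA

# ...unless we tell them to drop NAs before computing
max(values, na.rm = TRUE)
#> [1] 10
```

The same `na.rm = TRUE` argument works for `min`, `mean`, `sum`, and the other base R summary functions mentioned above.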
@@ -1299,7 +1300,7 @@ For example, we can use `group_by` to group the regions of the `tidy_lang` data
reporting the language as the primary language at home
for each of the regions in the data set.

- (ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The grey, blue and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.
+ (ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.

```{r summarize-groupby, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:summarize-groupby)", fig.retina = 2, out.width = "100%"}
image_read("img/summarize/summarize.002.jpeg") |>
@@ -1320,7 +1321,7 @@ group_by(region_lang, region) |>
```

Notice that `group_by` on its own doesn't change the way the data looks.
- In the output below the grouped data set looks the same,
+ In the output below, the grouped data set looks the same,
and it doesn't *appear* to be grouped by `region`.
Instead, `group_by` simply changes how other functions work with the data,
as we saw with `summarize` above.
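A minimal sketch of this pairing, using an invented data frame standing in for the chapter's `tidy_lang`:

```r
library(dplyr)

# Toy grouped-summary example; the values are made up for illustration
toy <- tibble(
  region       = c("A", "A", "B"),
  most_at_home = c(10, 30, 5)
)

# group_by alone only tags the rows; summarize then runs once per group
toy |>
  group_by(region) |>
  summarize(max_most_at_home = max(most_at_home))
#> region A -> 30, region B -> 5
```

The result has one row per group, plus a column holding the grouping variable's value, just as the cartoon above depicts.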
@@ -1366,7 +1367,7 @@ region_lang |>
summarize(across(mother_tongue:lang_known, max))
```

- > **Note:** Similarly to when we use base R statistical summary functions
+ > **Note:** Similar to when we use base R statistical summary functions
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize` alone,
> the use of the `summarize` + `across` functions paired
> with base R statistical summary functions
@@ -1392,7 +1393,8 @@ Let's again find the maximum value of each column of the
`region_lang` data frame, but using `map` with the `max` function this time.
`map` takes two arguments:
an object (a vector, data frame or list) that you want to apply the function to,
- and the function that you would like to apply to each column.
+ and the function that you would like to apply to each column.
+
Note that `map` does not have an argument
to specify *which* columns to apply the function to.
Therefore, we will do this before calling `map` using the `select` function.
@@ -1422,7 +1424,7 @@ So what do we do? Should we convert this to a data frame? We could, but a
simpler alternative is to just use a different `map` function. There
are quite a few to choose from; they all work similarly, but
their name reflects the type of output you want from the mapping operation.
- Table \@ref(tab:map-table) lists the commonly- used `map` functions as well
+ Table \@ref(tab:map-table) lists the commonly used `map` functions as well
as their output type. \index{map!map\_\* functions}

Table: (#tab:map-table) The `map` functions in R.
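The difference in output types can be sketched with a toy list, assuming the `purrr` package is loaded:

```r
library(purrr)

nums <- list(a = c(1, 5), b = c(2, 3))

# map always returns a list, no matter what the applied function returns
map(nums, max)
#> $a: 5, $b: 3  (a list)

# map_dbl returns a plain named double vector instead
map_dbl(nums, max)
#> a = 5, b = 3  (a double vector)
```

Picking the `map_*` variant that matches the type you want (e.g., `map_chr`, `map_lgl`, `map_dfr`) saves a conversion step afterwards.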
@@ -1446,14 +1448,16 @@ region_lang |>
map_dfr(max)
```

- > **Note:** Similarly to when we use base R statistical summary functions
+ \newpage
+
+ > **Note:** Similar to when we use base R statistical summary functions
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
> `map` functions paired with base R statistical summary functions
> also return `NA` values when we apply them to columns that
> contain `NA` values. \index{missing data}
>
> To avoid this, again we need to add the argument `na.rm = TRUE`.
- > When we use this with `map` we do this by adding a `,`
+ > When we use this with `map`, we do this by adding a `,`
> and then `na.rm = TRUE` after specifying the function, as illustrated below:
>
> ```{r}
@@ -1545,7 +1549,7 @@ region_lang |>

Now we apply `rowwise` before `mutate`, to tell R that we would like
our `mutate` function to be applied across, and within, a row,
as opposed to being applied on a column
- (which is the default behaviour of `mutate`):
+ (which is the default behavior of `mutate`):

```{r}
region_lang |>
@@ -1629,7 +1633,7 @@ found in Chapter \@ref(move-to-your-own-machine).
To learn more about these functions and meet a few more useful
functions, we recommend you check out [this
chapter](http://stat545.com/block010_dplyr-end-single-table.html#where-were-we)
- of the Data wrangling, exploration, and analysis with R book.
+ of the data wrangling, exploration, and analysis with R book.
- The [`dplyr` page on the tidyverse website](https://dplyr.tidyverse.org/) is
another resource to learn more about the functions in this
chapter, the full set of arguments you can use, and other related functions.
@@ -1638,7 +1642,7 @@ found in Chapter \@ref(move-to-your-own-machine).
- Check out the [tidyselect
page](https://tidyselect.r-lib.org/reference/select_helpers.html) for a
comprehensive list of `select` helpers.
- - [R for Data Science](https://r4ds.had.co.nz/) has a few chapters related to
+ - [*R for Data Science*](https://r4ds.had.co.nz/) has a few chapters related to
data wrangling that go into more depth than this book. For example, the
[tidy data](https://r4ds.had.co.nz/tidy-data.html) chapter covers tidy data,
`pivot_longer`/`pivot_wider` and `separate`, but also covers missing values