@@ -136,11 +136,11 @@ Table: (#tab:datatype-table) Basic data types in R
| factor | fct | used to represent data with a limited number of values (usually categories) | a `color` variable with levels `red`, `green` and `orange` |

\index{data types}
- \index{character}\index{chr|see{character}}
- \index{integer}\index{int|see{integer}}
- \index{double}\index{dbl|see{double}}
- \index{logical}\index{lgl|see{logical}}
- \index{factor}\index{fct|see{factor}}
+ \index{data types!character (chr)}\index{chr|see{character}}
+ \index{data types!integer (int)}\index{int|see{integer}}
+ \index{data types!double (dbl)}\index{dbl|see{double}}
+ \index{data types!logical (lgl)}\index{lgl|see{logical}}
+ \index{data types!factor (fct)}\index{fct|see{factor}}
It is important in R to make sure you represent your data with the correct type.
Many of the `tidyverse` functions we use in this book treat
the various data types differently. You should use integers and double types
@@ -216,6 +216,7 @@ Vectors, data frames and lists are basic types of *data structure* in R, which
are core to most data analyses. We summarize them in Table
\@ref(tab:datastructure-table). There are several other data structures in the R programming
language (*e.g.,* matrices), but these are beyond the scope of this book.
+ \index{data structures!vector}\index{data structures!list}\index{data structures!data frame}

Table: (#tab:datastructure-table) Basic data structures in R

@@ -669,11 +670,12 @@ the second is a *logical statement* to use when filtering the rows.
This section will highlight more advanced usage of the `filter` function.
In particular, this section provides an in-depth treatment of the variety of logical statements
one can use in the `filter` function to select subsets of rows.
+ \index{logical statement|see{logical operator}}

### Extracting rows that have a certain value with `==`
Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
official languages of Canada (English and French).
- We can `filter` for these rows by using the *equivalency operator* (`==`)
+ We can `filter` for these rows by using the *equivalency operator* (`==`) \index{logical operator!equivalency}
to compare the values of the `category` column
with the value `"Official languages"`.
With these arguments, `filter` returns a data frame with all the columns
@@ -690,7 +692,7 @@ official_langs
### Extracting rows that do not have a certain value with `!=`

What if we want all the other language categories in the data set *except* for
- those in the `"Official languages"` category? We can accomplish this with the `!=`
+ those in the `"Official languages"` category? We can accomplish this with the `!=` \index{logical operator!inequivalency}
operator, which means "not equal to". So if we want to find all the rows
where the `category` does *not* equal `"Official languages"` we write the code
below.
@@ -709,7 +711,7 @@ We can do this with the comma symbol (`,`), which in the case of `filter`
is interpreted by R as "and".
We write the code as shown below to filter the `official_langs` data frame
to subset the rows where `region == "Montréal"`
- *and* the `language == "French"`.
+ *and* the `language == "French"`. \index{logical operator!and}

```{r}
filter(official_langs, region == "Montréal", language == "French")
@@ -735,7 +737,7 @@ Instead, we can use the vertical pipe (`|`) logical operator,
which gives us the cases where one condition *or*
another condition *or* both are satisfied.
In the code below, we ask R to return the rows
- where the `region` columns are equal to "Calgary" *or* "Edmonton".
+ where the `region` columns are equal to "Calgary" *or* "Edmonton". \index{logical operator!or}

```{r}
filter(official_langs, region == "Calgary" | region == "Edmonton")
@@ -760,7 +762,7 @@ region_data

To get the population of the five cities
we can filter the data set using the `%in%` operator.
- The `%in%` operator is used to see if an element belongs to a vector.
+ The `%in%` operator is used to see if an element belongs to a vector. \index{logical operator!containment}
Here we are filtering for rows where the value in the `region` column
matches any of the five cities we are interested in: Toronto, Montréal,
Vancouver, Calgary, and Edmonton.
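
As a minimal sketch of what `%in%` computes, we can apply it to a plain character vector instead of a data frame column. The vectors `region` and `five_cities` below are hypothetical stand-ins made up for illustration; they are not the book's data. The sketch also contrasts `%in%` with `==`, which compares the two vectors position by position rather than testing membership.

```r
# Hypothetical region names, for illustration only (not the book's data).
region <- c("Toronto", "Halifax", "Calgary", "Winnipeg", "Edmonton")
five_cities <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton")

# %in% tests each element of `region` for membership in `five_cities`,
# giving one TRUE/FALSE value per element of `region`.
in_five <- region %in% five_cities
print(in_five)

# Keeping only the elements where the result is TRUE; a membership
# condition inside filter() keeps rows on the same basis.
print(region[in_five])

# By contrast, == compares the vectors position by position,
# so "Calgary" (3rd in `region`, 4th in `five_cities`) is not matched.
positional <- region == five_cities
print(positional)
```

Because `%in%` ignores position, the order of the cities in `five_cities` does not affect which elements are kept.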
@@ -804,7 +806,8 @@ where the value of `most_at_home` is greater than
`r format(most_french, scientific = FALSE, big.mark = ",")`.
We use the `>` symbol to look for values *above* a threshold, and the `<` symbol
to look for values *below* a threshold. The `>=` and `<=` symbols similarly look
- for *equal to or above* a threshold and *equal to or below* a threshold.
+ for *equal to or above* a threshold and *equal to or below* a
+ threshold. \index{logical operator!greater than}\index{logical operator!less than}
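
To see how the four comparison operators differ when a value sits exactly at the threshold, here is a small sketch on a plain numeric vector. The names `counts` and `threshold` and their values are assumptions chosen only for illustration.

```r
# Hypothetical values and threshold, for illustration only.
counts <- c(90, 100, 110)
threshold <- 100

above       <- counts > threshold   # strictly above the threshold
below       <- counts < threshold   # strictly below the threshold
at_or_above <- counts >= threshold  # equal to or above the threshold
at_or_below <- counts <= threshold  # equal to or below the threshold

print(above)
print(below)
print(at_or_above)
print(at_or_below)
```

Only `>=` and `<=` return `TRUE` for the middle value, which is equal to the threshold.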

```{r}
filter(official_langs, most_at_home > 2669195)
@@ -973,38 +976,6 @@ Failing to do this would have resulted in the incorrect math being performed.
> We link to resources that discuss this in the additional
> resources at the end of this chapter.

-
- <!--
- #### Creating a visualization with tidy data {-}
-
- Now that we have cleaned and wrangled the data, we can make visualizations or do
- statistical analyses to answer questions about it! Let's suppose we want to
- answer the question "what proportion of people in each city speak English
- as their primary language at home in these five cities?" Since the data is
- cleaned already, in a few short lines of code, we can use `ggplot` to create a
- data visualization to answer this question! Here we create a bar plot to represent the proportions for
- each region and color the proportions by language.
-
- > Don't worry too much about the code to make this plot for now. We will cover
- > visualizations in detail in Chapter \@ref(viz).
-
- ```{r 02-plot, out.width = "100%", fig.cap = "Bar plot of proportions of Canadians reporting English as the most often spoken language at home."}
- ggplot(english_langs,
-   aes(
-     x = region,
-     y = most_at_home_proportion
-   )
- ) +
-   geom_bar(stat = "identity") +
-   xlab("Region") +
-   ylab("Proportion of Canadians that speak English most often at home")
- ```
-
- From this visualization, we can see that in Calgary, Edmonton, Toronto, and
- Vancouver, English was reported as the most common primary language used at
- home. However, in Montréal, this does not seem to be the case!
- -->
-
## Combining functions using the pipe operator, `|>`

In R, we often have to call multiple functions in a sequence to process a data
@@ -1425,7 +1396,7 @@ simpler alternative is to just use a different `map` function. There
are quite a few to choose from, they all work similarly, but
their name reflects the type of output you want from the mapping operation.
Table \@ref(tab:map-table) lists the commonly used `map` functions as well
- as their output type. \index{map!map\_\* functions}
+ as their output type. \index{map!map functions}

Table: (#tab:map-table) The `map` functions in R.
