index wrangling

trevorcampbell · trevorcampbell · commit 3a602f8ed6b7 · 2023-11-16T11:04:17.000-08:00
diff --git a/source/wrangling.Rmd b/source/wrangling.Rmd
@@ -136,11 +136,11 @@ Table: (#tab:datatype-table) Basic data types in R
 | factor | fct | used to represent data with a limited number of values (usually categories) | a `color` variable with levels `red`, `green` and `orange` |
 
 \index{data types}
-\index{character}\index{chr|see{character}}
-\index{integer}\index{int|see{integer}}
-\index{double}\index{dbl|see{double}}
-\index{logical}\index{lgl|see{logical}}
-\index{factor}\index{fct|see{factor}}
+\index{data types!character (chr)}\index{chr|see{character}}
+\index{data types!integer (int)}\index{int|see{integer}}
+\index{data types!double (dbl)}\index{dbl|see{double}}
+\index{data types!logical (lgl)}\index{lgl|see{logical}}
+\index{data types!factor (fct)}\index{fct|see{factor}}
 It is important in R to make sure you represent your data with the correct type.
 Many of the `tidyverse` functions we use in this book treat
 the various data types differently. You should use integers and double types
@@ -216,6 +216,7 @@ Vectors, data frames and lists are basic types of *data structure* in R, which
 are core to most data analyses. We summarize them in Table
 \@ref(tab:datastructure-table). There are several other data structures in the R programming
 language (*e.g.,* matrices), but these are beyond the scope of this book.
+\index{data structures!vector}\index{data structures!list}\index{data structures!data frame}
 
 Table: (#tab:datastructure-table) Basic data structures in R
 
@@ -669,11 +670,12 @@ the second is a *logical statement* to use when filtering the rows.
 This section will highlight more advanced usage of the `filter` function.
 In particular, this section provides an in-depth treatment of the variety of logical statements
 one can use in the `filter` function to select subsets of rows.
+\index{logical statement|see{logical operator}}
 
 ### Extracting rows that have a certain value with `==`
 Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
 official languages of Canada (English and French).
-We can `filter` for these rows by using the *equivalency operator* (`==`)
+We can `filter` for these rows by using the *equivalency operator* (`==`) \index{logical operator!equivalency}
 to compare the values of the `category` column
 with the value `"Official languages"`.
 With these arguments, `filter` returns a data frame with all the columns
@@ -690,7 +692,7 @@ official_langs
 ### Extracting rows that do not have a certain value with `!=`
 
 What if we want all the other language categories in the data set *except* for
-those in the `"Official languages"` category? We can accomplish this with the `!=`
+those in the `"Official languages"` category? We can accomplish this with the `!=` \index{logical operator!inequivalency}
 operator, which means "not equal to". So if we want to find all the rows
 where the `category` does *not* equal `"Official languages"` we write the code
 below.
@@ -709,7 +711,7 @@ We can do this with the comma symbol (`,`), which in the case of `filter`
 is interpreted by R as "and".
 We write the code as shown below to filter the `official_langs` data frame
 to subset the rows where `region == "Montréal"`
-*and* the `language == "French"`.
+*and* the `language == "French"`. \index{logical operator!and}
 
 ``` {r}
 filter(official_langs, region == "Montréal", language == "French")
@@ -735,7 +737,7 @@ Instead, we can use the vertical pipe (`|`) logical operator,
 which gives us the cases where one condition *or*
 another condition *or* both are satisfied.
 In the code below, we ask R to return the rows
-where the `region` columns are equal to "Calgary" *or* "Edmonton".
+where the `region` columns are equal to "Calgary" *or* "Edmonton". \index{logical operator!or}
 
 ``` {r}
 filter(official_langs, region == "Calgary" | region == "Edmonton")
@@ -760,7 +762,7 @@ region_data
 
 To get the population of the five cities
 we can filter the data set using the `%in%` operator.
-The `%in%` operator is used to see if an element belongs to a vector.
+The `%in%` operator is used to see if an element belongs to a vector. \index{logical operator!containment}
 Here we are filtering for rows where the value in the `region` column
 matches any of the five cities we are interested in: Toronto, Montréal,
 Vancouver, Calgary, and Edmonton.
@@ -804,7 +806,8 @@ where the value of `most_at_home` is greater than
 `r format(most_french, scientific = FALSE, big.mark = ",")`.
 We use the `>` symbol to look for values *above* a threshold, and the `<` symbol
 to look for values *below* a threshold. The `>=` and `<=` symbols similarly look
-for *equal to or above* a threshold and *equal to or below* a threshold.
+for *equal to or above* a threshold and *equal to or below* a
+threshold. \index{logical operator!greater than}\index{logical operator!less than}
 
 ``` {r}
 filter(official_langs, most_at_home > 2669195)
@@ -973,38 +976,6 @@ Failing to do this would have resulted in the incorrect math being performed.
 > We link to resources that discuss this in the additional
 > resources at the end of this chapter.
 
-
-<!--
-#### Creating a visualization with tidy data {-}
-
-Now that we have cleaned and wrangled the data, we can make visualizations or do
-statistical analyses to answer questions about it! Let's suppose we want to
-answer the question "what proportion of people in each city speak English
-as their primary language at home in these five cities?" Since the data is
-cleaned already, in a few short lines of code, we can use `ggplot` to create a
-data visualization to answer this question! Here we create a bar plot to represent the proportions for
-each region and color the proportions by language.
-
-> Don't worry too much about the code to make this plot for now. We will cover
-> visualizations in detail in Chapter \@ref(viz).
-
-```{r 02-plot, out.width = "100%", fig.cap = "Bar plot of proportions of Canadians reporting English as the most often spoken language at home."}
-ggplot(english_langs,
-  aes(
-    x = region,
-    y = most_at_home_proportion
-  )
- ) +
-  geom_bar(stat = "identity") +
-  xlab("Region") +
-  ylab("Proportion of Canadians that speak English most often at home")
-```
-
-From this visualization, we can see that in Calgary, Edmonton, Toronto, and
-Vancouver, English was reported as the most common primary language used at
-home.  However, in Montréal, this does not seem to be the case!
--->
-
 ## Combining functions using the pipe operator, `|>`
 
 In R, we often have to call multiple functions in a sequence to process a data
@@ -1425,7 +1396,7 @@ simpler alternative is to just use a different `map` function. There
 are quite a few to choose from, they all work similarly, but
 their name reflects the type of output you want from the mapping operation.
 Table \@ref(tab:map-table) lists the commonly used `map` functions as well
-as their output type. \index{map!map\_\* functions}
+as their output type. \index{map!map functions}
 
 Table: (#tab:map-table) The `map` functions in R.