Commit 3a602f8

index wrangling
1 parent e4b5de2 commit 3a602f8

1 file changed, 15 additions and 44 deletions

source/wrangling.Rmd

@@ -136,11 +136,11 @@ Table: (#tab:datatype-table) Basic data types in R
 | factor | fct | used to represent data with a limited number of values (usually categories) | a `color` variable with levels `red`, `green` and `orange` |
 
 \index{data types}
-\index{character}\index{chr|see{character}}
-\index{integer}\index{int|see{integer}}
-\index{double}\index{dbl|see{double}}
-\index{logical}\index{lgl|see{logical}}
-\index{factor}\index{fct|see{factor}}
+\index{data types!character (chr)}\index{chr|see{character}}
+\index{data types!integer (int)}\index{int|see{integer}}
+\index{data types!double (dbl)}\index{dbl|see{double}}
+\index{data types!logical (lgl)}\index{lgl|see{logical}}
+\index{data types!factor (fct)}\index{fct|see{factor}}
 It is important in R to make sure you represent your data with the correct type.
 Many of the `tidyverse` functions we use in this book treat
 the various data types differently. You should use integers and double types
@@ -216,6 +216,7 @@ Vectors, data frames and lists are basic types of *data structure* in R, which
 are core to most data analyses. We summarize them in Table
 \@ref(tab:datastructure-table). There are several other data structures in the R programming
 language (*e.g.,* matrices), but these are beyond the scope of this book.
+\index{data structures!vector}\index{data structures!list}\index{data structures!data frame}
 
 Table: (#tab:datastructure-table) Basic data structures in R
 
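
The three data structures indexed in this hunk can be illustrated with a minimal sketch in base R. The object names and values are made up for illustration and do not come from the book's data.

```r
# A vector holds one or more values of a single type.
regions <- c("Toronto", "Montréal", "Vancouver")

# A list can hold elements of different types.
# (Hypothetical values, for illustration only.)
city <- list(name = "Toronto", rank = 1L, coastal = FALSE)

# A data frame is a collection of equal-length vectors, one per column.
cities <- data.frame(region = regions, coastal = c(FALSE, FALSE, TRUE))

is.vector(regions)
is.list(city)
is.data.frame(cities)
```

Each of the three `is.*` checks confirms the kind of structure that was built on the lines above it.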
@@ -669,11 +670,12 @@ the second is a *logical statement* to use when filtering the rows.
 This section will highlight more advanced usage of the `filter` function.
 In particular, this section provides an in-depth treatment of the variety of logical statements
 one can use in the `filter` function to select subsets of rows.
+\index{logical statement|see{logical operator}}
 
 ### Extracting rows that have a certain value with `==`
 Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
 official languages of Canada (English and French).
-We can `filter` for these rows by using the *equivalency operator* (`==`)
+We can `filter` for these rows by using the *equivalency operator* (`==`) \index{logical operator!equivalency}
 to compare the values of the `category` column
 with the value `"Official languages"`.
 With these arguments, `filter` returns a data frame with all the columns
@@ -690,7 +692,7 @@ official_langs
 ### Extracting rows that do not have a certain value with `!=`
 
 What if we want all the other language categories in the data set *except* for
-those in the `"Official languages"` category? We can accomplish this with the `!=`
+those in the `"Official languages"` category? We can accomplish this with the `!=` \index{logical operator!inequivalency}
 operator, which means "not equal to". So if we want to find all the rows
 where the `category` does *not* equal `"Official languages"` we write the code
 below.
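
As a small self-contained sketch of how `==` and `!=` behave inside `filter`, the example here assumes the `dplyr` package is installed. It uses a hypothetical toy data frame (`toy_lang`) rather than the book's `tidy_lang` data, so the rows and values are illustrative only.

```r
library(dplyr)

# Hypothetical toy data; not the book's `tidy_lang` data set.
toy_lang <- tibble(
  category = c("Official languages", "Official languages", "Other"),
  language = c("English", "French", "Some other language")
)

# Keep rows whose category equals "Official languages".
official <- filter(toy_lang, category == "Official languages")

# Keep rows whose category does NOT equal "Official languages".
not_official <- filter(toy_lang, category != "Official languages")
```

The two results partition the toy data: every row lands in exactly one of `official` or `not_official`.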
@@ -709,7 +711,7 @@ We can do this with the comma symbol (`,`), which in the case of `filter`
 is interpreted by R as "and".
 We write the code as shown below to filter the `official_langs` data frame
 to subset the rows where `region == "Montréal"`
-*and* the `language == "French"`.
+*and* the `language == "French"`. \index{logical operator!and}
 
 ``` {r}
 filter(official_langs, region == "Montréal", language == "French")
@@ -735,7 +737,7 @@ Instead, we can use the vertical pipe (`|`) logical operator,
 which gives us the cases where one condition *or*
 another condition *or* both are satisfied.
 In the code below, we ask R to return the rows
-where the `region` columns are equal to "Calgary" *or* "Edmonton".
+where the `region` columns are equal to "Calgary" *or* "Edmonton". \index{logical operator!or}
 
 ``` {r}
 filter(official_langs, region == "Calgary" | region == "Edmonton")
@@ -760,7 +762,7 @@ region_data
 
 To get the population of the five cities
 we can filter the data set using the `%in%` operator.
-The `%in%` operator is used to see if an element belongs to a vector.
+The `%in%` operator is used to see if an element belongs to a vector. \index{logical operator!containment}
 Here we are filtering for rows where the value in the `region` column
 matches any of the five cities we are interested in: Toronto, Montréal,
 Vancouver, Calgary, and Edmonton.
@@ -804,7 +806,8 @@ where the value of `most_at_home` is greater than
 `r format(most_french, scientific = FALSE, big.mark = ",")`.
 We use the `>` symbol to look for values *above* a threshold, and the `<` symbol
 to look for values *below* a threshold. The `>=` and `<=` symbols similarly look
-for *equal to or above* a threshold and *equal to or below* a threshold.
+for *equal to or above* a threshold and *equal to or below* a
+threshold. \index{logical operator!greater than}\index{logical operator!less than}
 
 ``` {r}
 filter(official_langs, most_at_home > 2669195)
@@ -973,38 +976,6 @@ Failing to do this would have resulted in the incorrect math being performed.
 > We link to resources that discuss this in the additional
 > resources at the end of this chapter.
 
-
-<!--
-#### Creating a visualization with tidy data {-}
-
-Now that we have cleaned and wrangled the data, we can make visualizations or do
-statistical analyses to answer questions about it! Let's suppose we want to
-answer the question "what proportion of people in each city speak English
-as their primary language at home in these five cities?" Since the data is
-cleaned already, in a few short lines of code, we can use `ggplot` to create a
-data visualization to answer this question! Here we create a bar plot to represent the proportions for
-each region and color the proportions by language.
-
-> Don't worry too much about the code to make this plot for now. We will cover
-> visualizations in detail in Chapter \@ref(viz).
-
-```{r 02-plot, out.width = "100%", fig.cap = "Bar plot of proportions of Canadians reporting English as the most often spoken language at home."}
-ggplot(english_langs,
-aes(
-x = region,
-y = most_at_home_proportion
-)
-) +
-geom_bar(stat = "identity") +
-xlab("Region") +
-ylab("Proportion of Canadians that speak English most often at home")
-```
-
-From this visualization, we can see that in Calgary, Edmonton, Toronto, and
-Vancouver, English was reported as the most common primary language used at
-home. However, in Montréal, this does not seem to be the case!
--->
-
 ## Combining functions using the pipe operator, `|>`
 
 In R, we often have to call multiple functions in a sequence to process a data
@@ -1425,7 +1396,7 @@ simpler alternative is to just use a different `map` function. There
 are quite a few to choose from, they all work similarly, but
 their name reflects the type of output you want from the mapping operation.
 Table \@ref(tab:map-table) lists the commonly used `map` functions as well
-as their output type. \index{map!map\_\* functions}
+as their output type. \index{map!map functions}
 
 Table: (#tab:map-table) The `map` functions in R.
 