---
title: 'Module 5: Basic Data Frame Operations'
author: "Jasmine Hughes"
date: "9/12/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```
# Overview
> It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)

Thus, before you can even get to any sort of sophisticated analysis or plotting, you'll generally first need to:

1. ***Manipulate*** data frames, e.g., by filtering, summarizing, and conducting calculations across groups.
2. ***Tidy*** data into the appropriate format.
There are two competing schools of thought within the R community.
* We should stick to the base R functions to do manipulating and tidying; `tidyverse` uses syntax that's unlike base R and is superfluous.
* We should start teaching students to manipulate data using `tidyverse` tools because they are straightforward to use, more readable than base R, and speed up the tidying process.
I'll introduce you to the `tidyverse` tools. If your R tasks are data analysis, graphing, modeling and statistics, you can accomplish most of what you need using the tidyverse. Tidyverse functions also tend to produce helpful error messages.
My own view: I use both base R and tidyverse, but I prefer tidyverse if I need to do more than ~3 operations on a dataset. The tidyverse is a major reason why R is so popular for data analysis, and many people who mainly work in Python will sometimes switch to R just for data cleaning and graphing.
> The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
# Data frame Manipulation using `dplyr`
The [`dplyr`](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf) package provides a number of very useful functions for manipulating data frames. Code written with `dplyr` is often easier to read than code written in "base R".
We're going to learn how to subset data frames: how do we *filter* for the rows we want? How do we *select* the columns we want?
```{r}
# If you haven't yet installed dplyr, install it now!
# install.packages('dplyr')
library(dplyr)
# Let's also load the gapminder data set:
gap <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = FALSE)
head(gap)
```
## `dplyr::select`
Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the `select()` function to keep only the columns corresponding to variables we select.
```{r}
year_country_gdp_dplyr <- select(gap, year, country, gdpPercap)
head(year_country_gdp_dplyr)
```
```{r}
knitr::include_graphics("img/dplyr-fig1.png", dpi = 400)
```
If we open up `year_country_gdp_dplyr`, we'll see that it only contains the `year`, `country` and `gdpPercap` columns.
This is equivalent to the base R subsetting syntax that you may have seen before:
```{r}
year_country_gdp_base <- gap[,c("year", "country", "gdpPercap")]
head(year_country_gdp_base)
```
We can even check that these two data frames are equivalent:
```{r}
# checking equivalence: TRUE indicates that the two objects match (within numerical tolerance)
all.equal(year_country_gdp_dplyr, year_country_gdp_base)
```
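As a side note on the check above, here is a minimal sketch of how `all.equal()` differs from the stricter `identical()`. The values `a` and `b` are made up for illustration and have nothing to do with the gapminder data.

```{r}
# Two numbers that differ only by floating-point rounding
a <- 0.1 + 0.2
b <- 0.3

# identical() demands an exact match, so rounding differences count
identical(a, b)          # FALSE

# all.equal() allows a small numerical tolerance. When objects differ it
# returns a description rather than FALSE, so wrap it in isTRUE() in tests
isTRUE(all.equal(a, b))  # TRUE
```

For comparing data frames built in two different ways, `all.equal()` is usually the more forgiving choice.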
Let's take a look at the help documentation for select:
```{r}
?select
```
The arguments for select are:
* `.data`: a data frame. This is always the first argument for all tidyverse data frame manipulation functions.
* `...`: a placeholder for any number of other arguments that can optionally be passed on. `select` expects the (unquoted) names of the columns you would like to keep or extract.
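To make the `...` argument more concrete, here is a small sketch of three common ways to name columns. The data frame `toy` and its column names are invented for illustration and are not part of the gapminder data.

```{r}
library(dplyr)

# Toy data frame invented for illustration (not the gapminder data)
toy <- data.frame(
  id        = 1:3,
  height_cm = c(160, 170, 180),
  height_in = c(63, 67, 71),
  name      = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep columns by (unquoted) name, in the order given
by_name <- select(toy, name, id)

# Drop a column by prefixing its name with a minus sign
no_name <- select(toy, -name)

# Keep every column whose name starts with "height"
heights <- select(toy, starts_with("height"))

names(by_name)
names(no_name)
names(heights)
```

In each case the first argument is the data frame, and everything after it describes which columns to keep.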
## `dplyr::filter`
Now let's say we're only interested in African countries. We can use `filter` to select only the rows where `continent` is `Africa`.
```{r}
gap_africa <- filter(gap, continent == "Africa")
head(gap_africa, 15)
```
`filter` works on *rows* while `select` works on *columns*.
`filter` expects a logical expression, and will *keep* all rows where that logical expression evaluates as `TRUE`.
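The following sketch shows what "a logical expression" means in practice. It assumes a made-up data frame `toy` (not gapminder), and it also shows that rows where the condition is `NA` are dropped along with the `FALSE` ones.

```{r}
library(dplyr)

# Toy data frame invented for illustration
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  pop     = c(10, NA, 30, 40),
  stringsAsFactors = FALSE
)

# The condition is just a logical vector with one value per row
toy$pop > 20
# FALSE NA TRUE TRUE

# filter() keeps rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped
big <- filter(toy, pop > 20)
big$country
# "C" "D"

# Conditions can be combined with & (and) or | (or)
big_not_d <- filter(toy, pop > 20 & country != "D")
big_not_d$country
# "C"
```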
```{r}
nrow(gap) # whole
nrow(year_country_gdp_dplyr) # select
nrow(gap_africa) # filter
ncol(gap) # whole
ncol(year_country_gdp_dplyr) # select
ncol(gap_africa) # filter
```
# Combining dplyr operations
What if we wanted only the country, year and gdp of African countries?
```{r}
gap_africa <- filter(gap, continent == "Africa")
africa_country_year_gdp <- select(gap_africa, country, year, gdpPercap)
head(africa_country_year_gdp)
```
That's not bad, but if we needed to do many different types of data manipulations on one data frame, it could get confusing to keep track of all the different "intermediate" data frames.
# Piping with `dplyr`
```{r}
knitr::include_graphics("img/magrittr_hex.png", dpi = 400)
```
Above, we used what's called "normal" grammar, but the strengths of `dplyr` lie in combining several functions using *pipes*.
Pipes take the input on the left side of the `%>%` symbol and pass it in as the first argument to the function on the right side. This is why `dplyr` data manipulation functions take `.data` as their first argument!
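As an analogy (pseudocode, not runnable R), making a cup of coffee could be written as a pipeline: `coffee_beans %>% harvest() %>% roast() %>% grind() %>% add("hot water") %>% filter() %>% drink()`.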
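Here is a minimal sketch of the "first argument" rule, using a made-up data frame `toy` rather than the gapminder data. It compares a direct call, a nested call, and the piped equivalent.

```{r}
library(dplyr)

# Toy data frame invented for illustration
toy <- data.frame(
  x = c(3, 1, 2),
  y = c("c", "a", "b"),
  stringsAsFactors = FALSE
)

# These two calls do the same thing: the pipe supplies `toy`
# as the first argument of filter()
direct <- filter(toy, x > 1)
piped  <- toy %>% filter(x > 1)
identical(direct, piped)

# A nested call has to be read from the inside out...
nested <- select(filter(toy, x > 1), y)

# ...whereas the piped version reads top to bottom, one step per line
chained <- toy %>%
  filter(x > 1) %>%
  select(y)
identical(nested, chained)
```

The piped form avoids both the nesting and the named intermediate data frames.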
Since the pipe grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.
```{r}
africa_country_year_gdp_piped <- gap %>%
  filter(continent == "Africa") %>%
  select(country, year, gdpPercap)
head(africa_country_year_gdp_piped, 15)
```
First, we summon the gapminder data frame and pass it on to the next step using the pipe symbol `%>%`.
The second step is the `filter()` function. Then we take that output and send it to `select()`.
In this case, we don't specify which data object we use in the call to `filter()` or to `select()`, since we've piped it in.
```{r}
identical(africa_country_year_gdp_piped, africa_country_year_gdp)
```
**Aside**: There is a good chance you have encountered pipes before in the shell/terminal. In R, the pipe symbol is `%>%`, while in the shell it is `|`. But the concept is the same!
# `dplyr::arrange`
Often you may want to re-order data to quickly see what the smallest or largest values of a column are. You can do this interactively using `View()`. `dplyr` also provides the `arrange()` function for doing this programmatically.
For example, let's reorder the rows of `gap` by population:
```{r}
gap %>%
  arrange(pop) %>%
  head()
```
By default, `arrange` orders from lowest to highest. We can use `desc` to order it in reverse.
```{r}
gap %>%
  arrange(desc(pop)) %>%
  head()
```
`arrange` also allows specification of multiple columns:
```{r}
gap %>%
  arrange(year, desc(pop)) %>%
  head()
```
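To see how multiple sort columns interact, here is a small sketch with a made-up data frame `toy` (not gapminder). The first column sets the primary order, and later columns only break ties within it.

```{r}
library(dplyr)

# Toy data frame invented for illustration
toy <- data.frame(
  year = c(2000, 1990, 2000, 1990),
  pop  = c(5, 7, 9, 3)
)

# Sort by year (ascending); within each year, sort by pop (descending)
sorted <- toy %>%
  arrange(year, desc(pop))

sorted$year  # 1990 1990 2000 2000
sorted$pop   # 7 3 9 5
```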
# Breakout
### `dplyr`
1. Use `dplyr` to create a data frame containing data from before 1975 with every column except `continent`.
```{r}
gap %>%
  select(-continent) %>%
  filter(year < 1975)
# Note: select(gap, -continent) on its own drops the column but keeps every year
```
2. How many countries had a life expectancy greater than 60 in the year 1982?
```{r}
le1982 <- gap %>%
  filter(year == 1982 & lifeExp > 60)
nrow(le1982)
```
3. Use `dplyr` and `grepl()` to filter for countries with the letter "z" in their name.
```{r}
?grepl
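# Option 1: match a lower-case or an upper-case "z" explicitly
gap %>%
  filter(grepl("z", country) | grepl("Z", country))
# Option 2: convert the names to lower case first, then match "z"
gap %>%
  filter(grepl("z", tolower(country)))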
```
4. Use `dplyr` to sort the gap data set in reverse alphabetical order by country.
```{r}
gap %>%
  arrange(desc(country))
```