Commit 4f9ed6f

Merge pull request #52 from UBC-DSCI/data-reproducibility
Data reproducibility
2 parents f3f6942 + e5948e7 commit 4f9ed6f

79 files changed: +5616 -2985 lines changed

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -5,4 +5,8 @@
 **.DS_Store
 *.sw*
 _bookdown_files
-.rstudio/*
+<<<<<<< HEAD
+**.ipynb_checkpoints
+=======
+.rstudio/*
+>>>>>>> dev

01-reading.Rmd

Lines changed: 2 additions & 2 deletions
@@ -425,8 +425,8 @@ so that we can share it with others or use it for another step in the analysis.
 The default arguments for this file are to use a comma (`,`) as the delimiter and include column names. Below we demonstrate creating a new version of the US state-level
 property, income, population and voting data from 2015 and 2016 that does not contain the territory of Puerto Rico, and then writing this to a `.csv` file:

-```{r}
-state_data <- filter(us_data, state != "PR")
+```
+state_data <- filter(us_data, state != "Puerto Rico")
 write_csv(state_data, "data/us_states_only.csv")
 ```

02-wrangling.Rmd

Lines changed: 17 additions & 14 deletions
@@ -106,7 +106,9 @@ Data is often stored in a wider, not tidy, format because this format is often m

 ```{r 02-tidyverse, warning=FALSE, message=FALSE}
 library(tidyverse)
-hist_vote_wide <- read_csv("data/historical_vote_wide.csv")
+hist_vote_wide <- read_csv("data/us_vote.csv")
+hist_vote_wide <- select(hist_vote_wide, election_year, winner, runnerup)
+hist_vote_wide <- tail(hist_vote_wide, 10)
 hist_vote_wide
 ```

@@ -262,12 +264,12 @@ us_data
 ```

 Suppose we want to create a subset of the data with only the values for median income and median property value for the state of
-California ("CA"). To do this, we can use the functions `filter` and `select`. First we use `filter` to create a data frame called `ca_prop_data` that
+California. To do this, we can use the functions `filter` and `select`. First we use `filter` to create a data frame called `ca_prop_data` that
 contains only values for the state of California. We then use `select` on this data frame to keep only the median income and
 median property value variables:

 ```{r}
-ca_prop_data <- filter(us_data, state == "CA")
+ca_prop_data <- filter(us_data, state == "California")
 ca_inc_prop <- select(ca_prop_data, med_income, med_prop_val)
 ca_inc_prop
 ```
@@ -277,7 +279,8 @@ we do not need to create an intermediate object to store the output from `filter
 output of `filter` to the input of `select`:

 ```{r}
-ca_inc_prop <- filter(us_data, state == "CA") %>% select(med_income, med_prop_val)
+ca_inc_prop <- filter(us_data, state == "California") %>%
+select(med_income, med_prop_val)
 ca_inc_prop
 ```

@@ -295,15 +298,15 @@ example, we can pipe together three functions to order the states by commute tim
 is less than 1 million people:

 ```{r}
-small_state_commutes <- filter(us_data, population < 1000000) %>%
-select(state, mean_commute_minutes) %>%
-arrange(mean_commute_minutes)
+small_state_commutes <- filter(us_data, pop < 1000000) %>%
+select(state, avg_commute) %>%
+arrange(avg_commute)
 small_state_commutes
 ```

 > **Note:**: `arrange` is a function that takes the name of a data frame and one or more column(s), and returns a
 > data frame where the rows are ordered by those columns in ascending order. Here we used only one column for sorting
-> (`mean_commute_minutes`), but more than one can also be used. To do this, list additional columns separated by commas.
+> (`avg_commute`), but more than one can also be used. To do this, list additional columns separated by commas.
 > The order they are listed in indicates the order in which they will be used for sorting. This is much like how an English
 > dictionary sorts words: first by the first letter, then by the second letter, and so on.
 >
@@ -323,9 +326,9 @@ and mean commute time for all US states:

 ```{r}
 us_commute_time_summary <- summarize(us_data,
-min_mean_commute = min(mean_commute_minutes),
-max_mean_commute = max(mean_commute_minutes),
-mean_mean_commute = mean(mean_commute_minutes))
+min_mean_commute = min(avg_commute),
+max_mean_commute = max(avg_commute),
+mean_mean_commute = mean(avg_commute))
 us_commute_time_summary
 ```

@@ -341,9 +344,9 @@ columns separated by commas.

 ```{r}
 us_commute_time_summary_by_party <- group_by(us_data, party) %>%
-summarize(min_mean_commute = min(mean_commute_minutes),
-max_mean_commute = max(mean_commute_minutes),
-mean_mean_commute = mean(mean_commute_minutes))
+summarize(min_mean_commute = min(avg_commute),
+max_mean_commute = max(avg_commute),
+mean_mean_commute = mean(avg_commute))
 us_commute_time_summary_by_party
 ```

03-viz.Rmd

Lines changed: 11 additions & 9 deletions
@@ -112,24 +112,26 @@ options(warn=-1)


 ### The Mauna Loa CO2 data set
-This data set contains the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii
-from the years 1959-1997. **Question:** Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?
+The [Mauna Loa CO2 data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html), curated by [Dr. Pieter Tans, NOAA/GML](https://www.esrl.noaa.gov/gmd/staff/Pieter.Tans/) and [Dr. Ralph Keeling, Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/)
+records the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii from 1959 onwards. **Question:** Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?
 ```{r 03-data-co2, warning=FALSE, message=FALSE}
 # mauna loa carbon dioxide data
-co2_df <- read_csv("data/maunaloa.csv")
+co2_df <- read_csv("data/mauna_loa.csv") %>%
+filter(ppm > 0, date_decimal < 2000)
 head(co2_df)
 ```

 Since we are investigating a relationship between two variables (CO2 concentration and date), a scatter plot is a good place to start. Scatter plots
-show the data as individual points with `x` (horizonal axis) and `y` (vertical axis) coordinates. Here, we will use the date as the `x` coordinate
+show the data as individual points with `x` (horizonal axis) and `y` (vertical axis) coordinates. Here, we will use the decimal
+date as the `x` coordinate
 and CO2 concentration as the `y` coordinate. When using the `ggplot2` library, we create the plot object with the `ggplot` function; there are
 a few basic aspects of a plot that we need to specify:

 - the *data*: the name of the dataframe object that we would like to visualize
 - here, we specify the `co2_df` dataframe
 - the *aesthetic mapping*: tells `ggplot` how the columns in the dataframe map to properties of the visualization
 - to create an aesthetic mapping, we use the `aes` function
-- here, we set the plot `x` axis to the `date` variable, and the plot `y` axis to the `concentration` variable
+- here, we set the plot `x` axis to the `date_decimal` variable, and the plot `y` axis to the `ppm` variable
 - the *geometric object*: specifies how the mapped data should be displayed
 - to create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects)
 - here, we use the `geom_point` function to visualize our data as a scatterplot
@@ -138,7 +140,7 @@ There are many other possible arguments we could pass to the aesthetic mapping a
 the purposes of quickly testing things out to see what they look like, though, we can just go with the default settings:

 ```{r 03-data-co2-scatter, warning=FALSE, message=FALSE}
-co2_scatter <- ggplot(co2_df, aes(x = date, y = concentration)) +
+co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
 geom_point()
 co2_scatter
 ```
@@ -150,7 +152,7 @@ that the data are ordered by their `x` coordinate, and connect the sequence of `
 the default arguments:

 ```{r 03-data-co2-line, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) +
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
 geom_line()
 co2_line
 ```
@@ -165,7 +167,7 @@ visual noise to remove. But there are a few things we must do to improve clarity
 In order to add axis labels we use the `xlab` and `ylab` functions. To change the font size we use the `theme` function with the `text` argument:

 ```{r 03-data-co2-line-2, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) +
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
 geom_line() +
 xlab('Year') +
 ylab('Atmospheric CO2 (ppm)') +
@@ -181,7 +183,7 @@ We can transform the axis by passing the `trans` argument, and set limits by pas
 will use the `scale_x_continuous` function with the `limits` argument to zoom in on just five years of data (say, 1990-1995):

 ```{r 03-data-co2-line-3, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) +
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
 geom_line() +
 xlab('Year') +
 ylab('Atmospheric CO2 (ppm)') +
