UBC-DSCI
diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 1 deletion
diff --git a/01-reading.Rmd b/01-reading.Rmd
Lines changed: 2 additions & 2 deletions
diff --git a/02-wrangling.Rmd b/02-wrangling.Rmd
Lines changed: 17 additions & 14 deletions
diff --git a/03-viz.Rmd b/03-viz.Rmd
Lines changed: 11 additions & 9 deletions
@@ -5,4 +5,8 @@
 **.DS_Store
 *.sw*
 _bookdown_files
-.rstudio/*
+<<<<<<< HEAD
+**.ipynb_checkpoints
+=======
+.rstudio/*
+>>>>>>> dev
@@ -425,8 +425,8 @@ so that we can share it with others or use it for another step in the analysis.
 By default, `write_csv` uses a comma (`,`) as the delimiter and includes column names. Below we demonstrate creating a new version of the US state-level 
 property, income, population and voting data from 2015 and 2016 that does not contain the territory of Puerto Rico, and then writing this to a `.csv` file:
 
-```{r}
-state_data <- filter(us_data, state != "PR")
+```
+state_data <- filter(us_data, state != "Puerto Rico")
 write_csv(state_data, "data/us_states_only.csv")
 ```
 
 
@@ -106,7 +106,9 @@ Data is often stored in a wider, not tidy, format because this format is often m
 
 ```{r 02-tidyverse, warning=FALSE, message=FALSE}
 library(tidyverse)
-hist_vote_wide <- read_csv("data/historical_vote_wide.csv")
+hist_vote_wide <- read_csv("data/us_vote.csv") 
+hist_vote_wide <- select(hist_vote_wide, election_year, winner, runnerup)
+hist_vote_wide <- tail(hist_vote_wide, 10)
 hist_vote_wide
 ```
 
@@ -262,12 +264,12 @@ us_data
 ```
 
 Suppose we want to create a subset of the data with only the values for median income and median property value for the state of 
-California ("CA"). To do this, we can use the functions `filter` and `select`. First we use `filter` to create a data frame called `ca_prop_data` that 
+California. To do this, we can use the functions `filter` and `select`. First we use `filter` to create a data frame called `ca_prop_data` that 
 contains only values for the state of California. We then use `select` on this data frame to keep only the median income and 
 median property value variables:
 
 ```{r}
-ca_prop_data <- filter(us_data, state == "CA")
+ca_prop_data <- filter(us_data, state == "California")
 ca_inc_prop <- select(ca_prop_data, med_income, med_prop_val)
 ca_inc_prop
 ```
@@ -277,7 +279,8 @@ we do not need to create an intermediate object to store the output from `filter
 output of `filter` to the input of `select`:
 
 ```{r}
-ca_inc_prop <- filter(us_data, state == "CA") %>% select(med_income, med_prop_val)
+ca_inc_prop <- filter(us_data, state == "California") %>% 
+                    select(med_income, med_prop_val)
 ca_inc_prop
 ```
 
@@ -295,15 +298,15 @@ example, we can pipe together three functions to order the states by commute tim
 is less than 1 million people:
 
 ```{r}
-small_state_commutes <- filter(us_data, population < 1000000) %>% 
-  select(state, mean_commute_minutes) %>% 
-  arrange(mean_commute_minutes)
+small_state_commutes <- filter(us_data, pop < 1000000) %>% 
+  select(state, avg_commute) %>% 
+  arrange(avg_commute)
 small_state_commutes
 ```
 
 > **Note:** `arrange` is a function that takes the name of a data frame and one or more column(s), and returns a 
 > data frame where the rows are ordered by those columns in ascending order. Here we used only one column for sorting 
-> (`mean_commute_minutes`), but more than one can also be used. To do this, list additional columns separated by commas. 
+> (`avg_commute`), but more than one can also be used. To do this, list additional columns separated by commas. 
 > The order they are listed in indicates the order in which they will be used for sorting. This is much like how an English
 > dictionary sorts words: first by the first letter, then by the second letter, and so on.
 >
@@ -323,9 +326,9 @@ and mean commute time for all US states:
 
 ```{r}
 us_commute_time_summary <- summarize(us_data, 
-                                  min_mean_commute = min(mean_commute_minutes),
-                                  max_mean_commute = max(mean_commute_minutes),
-                                  mean_mean_commute = mean(mean_commute_minutes))
+                                  min_mean_commute = min(avg_commute),
+                                  max_mean_commute = max(avg_commute),
+                                  mean_mean_commute = mean(avg_commute))
 us_commute_time_summary
 ```
 
@@ -341,9 +344,9 @@ columns separated by commas.
 
 ```{r}
 us_commute_time_summary_by_party <- group_by(us_data, party) %>% 
-  summarize(min_mean_commute = min(mean_commute_minutes),
-            max_mean_commute = max(mean_commute_minutes),
-            mean_mean_commute = mean(mean_commute_minutes))
+  summarize(min_mean_commute = min(avg_commute),
+            max_mean_commute = max(avg_commute),
+            mean_mean_commute = mean(avg_commute))
 us_commute_time_summary_by_party
 ```
 
 
@@ -112,24 +112,26 @@ options(warn=-1)
 
 
 ### The Mauna Loa CO2 data set 
- This data set contains the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii 
-from the years 1959-1997. **Question:** Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?
+ The [Mauna Loa CO2 data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html), curated by [Dr. Pieter Tans, NOAA/GML](https://www.esrl.noaa.gov/gmd/staff/Pieter.Tans/) and [Dr. Ralph Keeling, Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/),
+records the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii from 1959 onwards. **Question:** Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?
 ```{r 03-data-co2, warning=FALSE, message=FALSE}
 # mauna loa carbon dioxide data 
-co2_df <- read_csv("data/maunaloa.csv")
+co2_df <- read_csv("data/mauna_loa.csv") %>%
+		filter(ppm > 0, date_decimal < 2000)
 head(co2_df)
 ```
 
 Since we are investigating a relationship between two variables (CO2 concentration and date), a scatter plot is a good place to start. Scatter plots 
-show the data as individual points with `x` (horizonal axis) and `y` (vertical axis) coordinates. Here, we will use the date as the `x` coordinate 
+show the data as individual points with `x` (horizontal axis) and `y` (vertical axis) coordinates. Here, we will use the decimal
+ date as the `x` coordinate 
 and CO2 concentration as the `y` coordinate. When using the `ggplot2` library, we create the plot object with the `ggplot` function; there are 
 a few basic aspects of a plot that we need to specify:
 
 - the *data*: the name of the dataframe object that we would like to visualize 
     - here, we specify the `co2_df` dataframe
 - the *aesthetic mapping*: tells `ggplot` how the columns in the dataframe map to properties of the visualization
     - to create an aesthetic mapping, we use the `aes` function
-    - here, we set the plot `x` axis to the `date` variable, and the plot `y` axis to the `concentration` variable
+    - here, we set the plot `x` axis to the `date_decimal` variable, and the plot `y` axis to the `ppm` variable
 - the *geometric object*: specifies how the mapped data should be displayed
     - to create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects)
     - here, we use the `geom_point` function to visualize our data as a scatterplot
@@ -138,7 +140,7 @@ There are many other possible arguments we could pass to the aesthetic mapping a
 the purposes of quickly testing things out to see what they look like, though, we can just go with the default settings:
 
 ```{r 03-data-co2-scatter, warning=FALSE, message=FALSE}
-co2_scatter <- ggplot(co2_df, aes(x = date, y = concentration)) + 
+co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
 		geom_point() 
 co2_scatter
 ```
@@ -150,7 +152,7 @@ that the data are ordered by their `x` coordinate, and connect the sequence of `
 the default arguments: 
 
 ```{r 03-data-co2-line, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) + 
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
 		geom_line() 
 co2_line
 ```
@@ -165,7 +167,7 @@ visual noise to remove. But there are a few things we must do to improve clarity
 In order to add axis labels we use the `xlab` and `ylab` functions. To change the font size we use the `theme` function with the `text` argument:
 
 ```{r 03-data-co2-line-2, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) + 
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
                    geom_line() +
                    xlab('Year') +
                    ylab('Atmospheric CO2 (ppm)') + 
@@ -181,7 +183,7 @@ We can transform the axis by passing the `trans` argument, and set limits by pas
 will use the `scale_x_continuous` function with the `limits` argument to zoom in on just five years of data (say, 1990-1995):
 
 ```{r 03-data-co2-line-3, warning=FALSE, message=FALSE}
-co2_line <- ggplot(co2_df, aes(x = date, y = concentration)) + 
+co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
                    geom_line() +
                    xlab('Year') +
                    ylab('Atmospheric CO2 (ppm)') +