diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..48cbb4b
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,11 @@
+.html
+.Rproj.user
+.Rhistory
+.RData
+*.Rproj
+.DS_Store
+news.html
+README.html
+*key.Rmd
+*key.html
+
diff --git a/.gitignore.orig b/.gitignore.orig
new file mode 100644
index 0000000..1d59b3b
--- /dev/null
+++ b/.gitignore.orig
@@ -0,0 +1,12 @@
+.Rproj.user
+.Rhistory
+.RData
+*.Rproj
+.DS_Store
+<<<<<<< HEAD
+*.html
+*.orig
+=======
+news.html
+README.html
+>>>>>>> mydplyr/master
diff --git a/01_intro_to_r/intro_to_r.Rmd b/01_intro_to_r/intro_to_r.Rmd
new file mode 100644
index 0000000..a33fd31
--- /dev/null
+++ b/01_intro_to_r/intro_to_r.Rmd
@@ -0,0 +1,450 @@
+---
+title: "Introduction to R and RStudio"
+output:
+ html_document:
+ css: ../lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+
+## The RStudio Interface
+
+The goal of this lab is to introduce you to R and RStudio, which you'll be using
+throughout the course both to learn the statistical concepts discussed in the
+course and to analyze real data and come to informed conclusions. To clarify
+which is which: R is the name of the programming language itself and RStudio
+is a convenient interface.
+
+As the labs progress, you are encouraged to explore beyond what the labs dictate;
+a willingness to experiment will make you a much better programmer. Before we
+get to that stage, however, you need to build some basic fluency in R. Today we
+begin with the fundamental building blocks of R and RStudio: the interface,
+reading in data, and basic commands.
+
+Go ahead and launch RStudio. You should see a window that looks like the image
+shown below.
+
+
+
+
+
+The panel on the lower left is where the action happens. It's called the *console*.
+Every time you launch RStudio, it will have the same text at the top of the
+console telling you the version of R that you're running. Below that information
+is the *prompt*. As its name suggests, this prompt is really a request: a
+request for a command. Initially, interacting with R is all about typing commands
+and interpreting the output. These commands and their syntax have evolved over
+decades (literally) and now provide what many users feel is a fairly natural way
+to access data and organize, describe, and invoke statistical computations.
+
+The panel in the upper right contains your *workspace* as well as a history of
+the commands that you've previously entered.
+
+Any plots that you generate will show up in the panel in the lower right corner.
+This is also where you can browse your files, access help, manage packages, etc.
+
+### R Packages
+
+R is an open-source programming language, meaning that users can contribute
+packages that make our lives easier, and we can use them for free. For this lab,
+and many others in the future, we will use the following R packages:
+
+- `dplyr`: for data wrangling
+- `ggplot2`: for data visualization
+- `oilabs`: for data and custom functions with the OpenIntro labs
+
+If these packages are not already available in your R environment,
+install them by typing the following three lines of code into
+the console of your RStudio session, pressing the enter/return key after each one.
+Note that you can check to see which packages (and which versions) are installed by
+inspecting the *Packages* tab in the lower right panel of RStudio.
+
+```{r install-packages, message = FALSE, eval=FALSE}
+install.packages("dplyr")
+install.packages("ggplot2")
+install.packages("oilabs")
+```
+
+You may need to select a server from which to download; any of them will work.
+Next, you need to load these packages in your working environment. We do this with
+the `library` function. Run the following three lines in your console.
+
+```{r load-packages, message = FALSE, eval=TRUE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+Note that you only need to *install* packages once, but you need to *load*
+them each time you relaunch RStudio.
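+If you are ever unsure whether a package is already installed, one common pattern
+(shown here only as a sketch, not something this lab requires) is to install a
+package only when it is missing and then load it:
+
+```{r install-if-missing}
+# install oilabs only if it is not already available, then load it
+if (!requireNamespace("oilabs", quietly = TRUE)) {
+  install.packages("oilabs")
+}
+library(oilabs)
+```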
+
+
+### Creating a reproducible lab report
+
+We will be using R Markdown to create reproducible lab reports. See the
+following videos describing why and how:
+
+[**Why use R Markdown for Lab Reports?**](https://youtu.be/lNWVQ2oxNho)
+
+
+[**Using R Markdown for Lab Reports in RStudio**](https://youtu.be/o0h-eVABe9M)
+
+
+Going forward, refrain from typing your code directly into the console. Instead,
+type any code (your final answer, or anything you're just trying out) in the
+R Markdown file and run the chunk, either by clicking the Run button on the chunk
+(the green sideways triangle) or by highlighting the code and clicking Run in the
+top right corner of the R Markdown editor. If at any point you need to start over,
+you can Run All Chunks Above the chunk you're working in by clicking on the down
+arrow in the code chunk.
+
+## Dr. Arbuthnot's Baptism Records
+
+To get you started, run the following command to load the data.
+
+```{r load-abrbuthnot-data, eval=TRUE}
+data(arbuthnot)
+```
+
+You can do this by
+
+- clicking on the green arrow at the top right of the code chunk in the R Markdown (Rmd)
+file, or
+- putting your cursor on this line and hitting the **Run** button in the upper right
+corner of the pane, or
+- hitting `Ctrl-Shift-Enter`, or
+- typing the code in the console.
+
+This command instructs R to load some data:
+the Arbuthnot baptism counts for boys and girls. You should see that the
+workspace area in the upper righthand corner of the RStudio window now lists a
+data set called `arbuthnot` that has 82 observations on 3 variables. As you
+interact with R, you will create a series of objects. Sometimes you load them as
+we have done here, and sometimes you create them yourself as the byproduct of a
+computation or some analysis you have performed.
+
+The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century
+physician, writer, and mathematician. He was interested in the ratio of newborn
+boys to newborn girls, so he gathered the baptism records for children born in
+London for every year from 1629 to 1710. We can view the data by
+typing its name into the console.
+
+```{r view-data}
+arbuthnot
+```
+
+However, printing the whole dataset in the console is not that useful.
+One advantage of RStudio is that it comes with a built-in data viewer. Click on
+the name `arbuthnot` in the *Environment* pane (upper right window) that lists
+the objects in your workspace. This will bring up an alternative display of the
+data set in the *Data Viewer* (upper left window). You can close the data viewer
+by clicking on the `x` in the upper lefthand corner.
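+If you prefer typing to clicking, the same viewer can also be opened from the
+console with the built-in `View()` function (a small aside, not a required step):
+
+```{r view-data-viewer}
+# opens arbuthnot in the data viewer, just like clicking its name
+View(arbuthnot)
+```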
+
+What you should see are four columns of numbers, each row representing a
+different year: the first entry in each row is simply the row number (an index
+we can use to access the data from individual years if we want), the second is
+the year, and the third and fourth are the numbers of boys and girls baptized
+that year, respectively. Use the scrollbar on the right side of the data viewer
+window to examine the complete data set.
+
+Note that the row numbers in the first column are not part of Arbuthnot's data.
+R adds them as part of its printout to help you make visual comparisons. You can
+think of them as the index that you see on the left side of a spreadsheet. In
+fact, the comparison to a spreadsheet will generally be helpful. R has stored
+Arbuthnot's data in a kind of spreadsheet or table called a *data frame*.
+
+You can see the dimensions of this data frame as well as the names of the variables and the first few observations by typing:
+
+```{r glimpse-data}
+glimpse(arbuthnot)
+```
+
+This command should output the following
+
+```{r glimpse-data-result, echo=FALSE, eval=TRUE}
+glimpse(arbuthnot)
+```
+
+We can see that there are 82 observations and 3 variables in this dataset. The variable names are `year`, `boys`, and `girls`. At this point, you might notice
+that many of the commands in R look a lot like functions from math class; that
+is, invoking R commands means supplying a function with some number of arguments.
+The `glimpse` command, for example, took a single argument, the name of a data frame.
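+Many other functions follow the same pattern of taking a data frame as their only
+argument. For example (a quick aside rather than a required step), `dim()` returns
+the dimensions and `names()` returns the variable names:
+
+```{r dim-and-names}
+dim(arbuthnot)    # number of rows and columns
+names(arbuthnot)  # the variable names: year, boys, girls
+```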
+
+## Some Exploration
+
+Let's start to examine the data a little more closely. We can access the data in
+a single column of a data frame separately using a command like
+
+```{r view-boys}
+arbuthnot$boys
+```
+
+This command will only show the number of boys baptized each year. The dollar
+sign basically says "go to the data frame that comes before me, and find the
+variable that comes after me".
+
+1. What command would you use to extract just the counts of girls baptized? Try
+ it!
+
+Notice that the way R has printed these data is different. When we looked at the
+complete data frame, we saw 82 rows, one on each line of the display. These data
+are no longer structured in a table with other variables, so they are displayed
+one right after another. Objects that print out in this way are called *vectors*;
+they represent a set of numbers. R has added numbers in [brackets] along the left
+side of the printout to indicate locations within the vector. For example, 5218
+follows [1], indicating that 5218 is the first entry in the vector. And if [43]
+starts a line, then that would mean the first number on that line would represent
+the 43rd entry in the vector.
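+You can also use those positions yourself to pull out individual entries with
+square brackets. As a quick sketch (not a required step):
+
+```{r vector-positions}
+arbuthnot$boys[1]   # the first entry of the vector
+arbuthnot$boys[43]  # the 43rd entry
+```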
+
+
+### Data visualization
+
+R has some powerful functions for making graphics. We can create a simple plot
+of the number of girls baptized per year with the command
+
+```{r plot-girls-vs-year}
+qplot(x = year, y = girls, data = arbuthnot)
+```
+
+The `qplot()` function (meaning "quick plot") considers the type of data you have
+provided it and makes the decision to visualize it with a scatterplot. The plot
+should appear under the *Plots* tab of the lower right panel of RStudio. Notice
+that the command above again looks like a function, this time with three arguments
+separated by commas. The first two arguments in the `qplot()` function specify
+the variables for the x-axis and the y-axis and the third provides the name of the
+data set where they can be found. If we wanted to connect the data points with
+lines, we could add a fourth argument to specify the geometry that we'd like.
+
+```{r plot-girls-vs-year-line}
+qplot(x = year, y = girls, data = arbuthnot, geom = "line")
+```
+
+You might wonder how you are supposed to know that it was possible to add that
+fourth argument. Thankfully, R documents all of its functions extensively. To
+read what a function does and learn the arguments that are available to you,
+just type in a question mark followed by the name of the function that you're
+interested in. Try the following.
+
+```{r plot-help, tidy = FALSE}
+?qplot
+```
+
+Notice that the help file replaces the plot in the lower right panel. You can
+toggle between plots and help files using the tabs at the top of that panel.
+
+2. Is there an apparent trend in the number of girls baptized over the years?
+How would you describe it? (To ensure that your lab report is comprehensive,
+be sure to include the code needed to make the plot as well as your written
+interpretation.)
+
+### R as a big calculator
+
+Now, suppose we want to plot the total number of baptisms. To compute this, we
+could use the fact that R is really just a big calculator. We can type in
+mathematical expressions like
+
+```{r calc-total-bapt-numbers}
+5218 + 4683
+```
+
+to see the total number of baptisms in 1629. We could repeat this once for each
+year, but there is a faster way. If we add the vector for baptisms for boys to
+that of girls, R will compute all sums simultaneously.
+
+```{r calc-total-bapt-vars}
+arbuthnot$boys + arbuthnot$girls
+```
+
+What you will see are 82 numbers (in that packed display, because we aren’t
+looking at a data frame here), each one representing the sum we’re after. Take a
+look at a few of them and verify that they are right.
+
+### Adding a new variable to the data frame
+
+We'll be using this new vector to generate some plots, so we'll want to save it
+as a permanent column in our data frame.
+
+```{r calc-total-bapt-vars-save}
+arbuthnot <- arbuthnot %>%
+ mutate(total = boys + girls)
+```
+
+The `%>%` operator is called the **piping**
+operator. It takes the output of the previous expression and pipes it into
+the first argument of the function in the following one.
+To continue our analogy with mathematical functions, `x %>% f(y)` is
+equivalent to `f(x, y)`.
+
+
+**A note on piping:** We can read these two lines of code as follows:
+
+*"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function.
+Mutate the `arbuthnot` data set by creating a new variable called `total` that is the sum of the variables
+called `boys` and `girls`. Then assign the resulting dataset to the object
+called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one
+containing the new variable."*
+
+This is equivalent to going through each row and adding up the `boys`
+and `girls` counts for that year and recording that value in a new column called
+`total`.
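+As a small illustration of the `x %>% f(y)` and `f(x, y)` equivalence described
+above (not an additional step in the analysis), the two lines below compute the
+same thing; only the first is written with the pipe:
+
+```{r pipe-equivalence}
+arbuthnot %>% mutate(total = boys + girls)
+mutate(arbuthnot, total = boys + girls)
+```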
+
+
+
+**Where is the new variable?** When you make changes to variables in your dataset,
+click on the name of the dataset again to update it in the data viewer.
+
+
+You'll see that there is now a new column called `total` that has been tacked on
+to the data frame. The special symbol `<-` performs an *assignment*, taking the
+output of one line of code and saving it into an object in your workspace. In
+this case, you already have an object called `arbuthnot`, so this command updates
+that data set with the new mutated column.
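+Assignment works the same way for any object, not just data frames. As a minimal
+example (again, not a required step, and the object name is just for illustration),
+you could store a single number and print it by typing its name:
+
+```{r assign-example}
+total_1629 <- 5218 + 4683  # save the 1629 total in an object named total_1629
+total_1629                 # typing the name prints the stored value
+```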
+
+We can make a plot of the total number of baptisms per year with the command
+
+```{r plot-total-vs-year}
+qplot(x = year, y = total, data = arbuthnot, geom = "line")
+```
+
+Similarly to how we computed the total number of births, we can compute the ratio
+of the number of boys to the number of girls baptized in 1629 with
+
+```{r calc-prop-boys-to-girls-numbers}
+5218 / 4683
+```
+
+or we can act on the complete columns with the expression
+
+```{r calc-prop-boys-to-girls-vars}
+arbuthnot <- arbuthnot %>%
+ mutate(boy_to_girl_ratio = boys / girls)
+```
+
+We can also compute the proportion of newborns that are boys in 1629 with
+
+```{r calc-prop-boys-numbers}
+5218 / (5218 + 4683)
+```
+
+or we can compute this proportion for all years simultaneously and append it to the dataset:
+
+```{r calc-prop-boys-vars}
+arbuthnot <- arbuthnot %>%
+ mutate(boy_ratio = boys / total)
+```
+
+Note that we are using the new `total` variable we created earlier in our calculations.
+
+3. Now, generate a plot of the proportion of boys born over time. What do you see?
+
+
+**Tip:** If you use the up and down arrow keys, you can scroll through your
+previous commands, your so-called command history. You can also access it
+by clicking on the history tab in the upper right panel. This will save
+you a lot of typing in the future.
+
+
+Finally, in addition to simple mathematical operators like subtraction and
+division, you can ask R to make comparisons like greater than, `>`, less than,
+`<`, and equality, `==`. For example, we can ask if boys outnumber girls in each
+year with the expression
+
+```{r boys-more-than-girls}
+arbuthnot <- arbuthnot %>%
+ mutate(more_boys = boys > girls)
+```
+
+This command adds a new variable to the `arbuthnot` data frame containing the values
+of either `TRUE` if that year had more boys than girls, or `FALSE` if that year
+did not (the answer may surprise you). This variable contains a different kind of
+data than we have encountered so far. All other columns in the `arbuthnot` data
+frame have values that are numerical (the year, the number of boys and girls). Here,
+we've asked R to create *logical* data, data where the values are either `TRUE`
+or `FALSE`. In general, data analysis will involve many different kinds of data
+types, and one reason for using R is that it is able to represent and compute
+with many of them.
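+If you want to check the type of the new column yourself (a quick sketch, not
+required for the lab), you can ask R directly:
+
+```{r check-logical}
+class(arbuthnot$more_boys)  # should report "logical"
+```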
+
+* * *
+
+## More Practice
+
+In the previous few pages, you recreated some of the displays and preliminary
+analysis of Arbuthnot's baptism data. Your assignment involves repeating these
+steps, but for present day birth records in the United States. Load the
+present day data with the following command.
+
+```{r load-present-data}
+data(present)
+```
+
+The data are stored in a data frame called `present`.
+
+4. What years are included in this data set? What are the dimensions of the
+ data frame? What are the variable (column) names?
+
+5. How do these counts compare to Arbuthnot's? Are they of a similar magnitude?
+
+6. Make a plot that displays the proportion of boys born over time. What do you see?
+ Does Arbuthnot's observation about boys being born in greater proportion than girls
+ hold up in the U.S.? Include the plot in your response. *Hint:* You should be
+ able to reuse your code from Ex 3 above, just replace the dataframe name.
+
+7. In what year did we see the greatest total number of births in the U.S.? *Hint:*
+ First calculate the totals and save them as a new variable. Then, sort your
+ dataset in descending order based on the total column. You can do this
+ interactively in the data viewer by clicking on the arrows next to the
+ variable names. To include the sorted result in your report you will need
+ to use two new functions: `arrange` (for sorting) and `desc` (for sorting in
+ descending order). Sample code is provided below.
+
+```{r eval=FALSE}
+present %>%
+ arrange(desc(total))
+```
+
+These data come from reports by the Centers for Disease Control. You can learn more about them
+by bringing up the help file using the command `?present`.
+
+
+
+This is a product of OpenIntro that is released under a
+[Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0) license.
+This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel
+from a lab written by Mark Hansen of UCLA Statistics.
+
+
+* * *
+
+## Resources for learning R and working in RStudio
+
+That was a short introduction to R and RStudio, but we will provide you with more
+functions and a more complete sense of the language as the course progresses.
+
+In this course we will be using R packages called `dplyr` for data wrangling
+and `ggplot2` for data visualization. If you are googling for R code, make sure
+to also include these package names in your search query. For example, instead
+of googling "scatterplot in R", google "scatterplot in R with ggplot2".
+
+These cheatsheets may come in handy throughout the semester:
+
+- [RMarkdown cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
+- [Data wrangling cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
+- [Data visualization cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf)
+
+Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown
+[here](https://ismayc.github.io/rbasics-book). It includes examples, recorded as GIFs,
+showing how to work with R Markdown files in RStudio.
+
+Note that some of the code on these cheatsheets may be too advanced for this course;
+however, the majority of it will become useful throughout the semester.
diff --git a/intro_to_r/more/arbuthnot-readme.txt b/01_intro_to_r/more/arbuthnot-readme.txt
similarity index 100%
rename from intro_to_r/more/arbuthnot-readme.txt
rename to 01_intro_to_r/more/arbuthnot-readme.txt
diff --git a/intro_to_r/more/arbuthnot.r b/01_intro_to_r/more/arbuthnot.r
similarity index 100%
rename from intro_to_r/more/arbuthnot.r
rename to 01_intro_to_r/more/arbuthnot.r
diff --git a/intro_to_r/more/present-readme.txt b/01_intro_to_r/more/present-readme.txt
similarity index 100%
rename from intro_to_r/more/present-readme.txt
rename to 01_intro_to_r/more/present-readme.txt
diff --git a/intro_to_r/more/present-reference.pdf b/01_intro_to_r/more/present-reference.pdf
similarity index 100%
rename from intro_to_r/more/present-reference.pdf
rename to 01_intro_to_r/more/present-reference.pdf
diff --git a/intro_to_r/more/present.R b/01_intro_to_r/more/present.R
similarity index 100%
rename from intro_to_r/more/present.R
rename to 01_intro_to_r/more/present.R
diff --git a/01_intro_to_r/more/r-interface-2016.png b/01_intro_to_r/more/r-interface-2016.png
new file mode 100644
index 0000000..73f51c2
Binary files /dev/null and b/01_intro_to_r/more/r-interface-2016.png differ
diff --git a/01_intro_to_r/more/rmd-from-template.png b/01_intro_to_r/more/rmd-from-template.png
new file mode 100644
index 0000000..8abb51f
Binary files /dev/null and b/01_intro_to_r/more/rmd-from-template.png differ
diff --git a/02_intro_to_data/intro_to_data.Rmd b/02_intro_to_data/intro_to_data.Rmd
new file mode 100644
index 0000000..d2fe430
--- /dev/null
+++ b/02_intro_to_data/intro_to_data.Rmd
@@ -0,0 +1,369 @@
+---
+title: "Introduction to data"
+output:
+ html_document:
+ theme: cerulean
+ highlight: pygments
+ css: ../lab.css
+ toc: true
+ toc_float: true
+---
+
+```{r global-options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+Some define statistics as the field that focuses on turning information into
+knowledge. The first step in that process is to summarize and describe the raw
+information -- the data. In this lab we explore flights, specifically a random
+sample of domestic flights that departed from the three major
+New York City airports in 2013. We will generate simple graphical and numerical
+summaries of data on these flights and explore delay times. As this is a large
+data set, along the way you'll also learn the indispensable skills of data
+processing and subsetting.
+
+
+## Getting started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+Remember that we will be using R Markdown to create reproducible lab reports.
+See the following video describing how to get started with creating these
+reports for this lab, and all future labs:
+
+[**Basic R Markdown with an OpenIntro Lab**](https://www.youtube.com/watch?v=Pdc368lS2hk)
+
+
+### The data
+
+The [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/)
+(BTS) is a statistical agency that is a part of the Research and Innovative
+Technology Administration (RITA). As its name implies, BTS collects and makes
+available transportation data, such as the flights data we will be working with
+in this lab.
+
+We begin by loading the `nycflights` data frame. Type the following in your console
+to load the data:
+
+```{r load-data, eval = TRUE}
+data(nycflights)
+```
+
+The data set `nycflights` that shows up in your workspace is a *data matrix*,
+with each row representing an *observation* and each column representing a
+*variable*. R calls this data format a **data frame**, which is a term that will
+be used throughout the labs. For this data set, each *observation* is a single flight.
+
+To view the names of the variables, type the command
+
+```{r names}
+names(nycflights)
+```
+
+This returns the names of the variables in this data frame. The **codebook**
+(description of the variables) can be accessed by pulling up the help file:
+
+```{r}
+?nycflights
+```
+
+One of the variables refers to the carrier (i.e. airline) of the flight, which
+is coded according to the following system.
+
+- `carrier`: Two letter carrier abbreviation.
+ + `9E`: Endeavor Air Inc.
+ + `AA`: American Airlines Inc.
+ + `AS`: Alaska Airlines Inc.
+ + `B6`: JetBlue Airways
+ + `DL`: Delta Air Lines Inc.
+ + `EV`: ExpressJet Airlines Inc.
+ + `F9`: Frontier Airlines Inc.
+ + `FL`: AirTran Airways Corporation
+ + `HA`: Hawaiian Airlines Inc.
+ + `MQ`: Envoy Air
+ + `OO`: SkyWest Airlines Inc.
+ + `UA`: United Air Lines Inc.
+ + `US`: US Airways Inc.
+ + `VX`: Virgin America
+ + `WN`: Southwest Airlines Co.
+ + `YV`: Mesa Airlines Inc.
+
+
+A very useful function for taking a quick peek at your data frame and viewing
+its dimensions and data types is `str`, which stands for **str**ucture.
+
+```{r str}
+str(nycflights)
+```
+
+The `nycflights` data frame is a massive trove of information. Let's think about
+some questions we might want to answer with these data:
+
+- How delayed were flights that were headed to Los Angeles?
+- How do departure delays vary over months?
+- Which of the three major NYC airports has a better on time percentage for
+departing flights?
+
+
+## Analysis
+
+### Lab report
+To record your analysis in a reproducible format, you can adapt the general Lab
+Report template from the `oilabs` package. Watch the video above to learn how.
+
+### Departure delays
+
+Let's start by examining the distribution of departure delays of all flights with a
+histogram.
+
+```{r hist-dep-delay}
+qplot(x = dep_delay, data = nycflights, geom = "histogram")
+```
+
+This function says to plot the `dep_delay` variable from the `nycflights` data
+frame on the x-axis. It also defines a `geom` (short for geometric object),
+which describes the type of plot you will produce.
+
+Histograms are generally a very good way to see the shape of a single
+distribution of numerical data, but that shape can change depending on how the
+data is split between the different bins. You can easily define the binwidth you
+want to use:
+
+```{r hist-dep-delay-bins}
+qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 15)
+qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 150)
+```
+
+1. Look carefully at these three histograms. How do they compare? Are features
+revealed in one that are obscured in another?
+
+If we want to focus only on departure delays of flights headed to Los Angeles,
+we need to first `filter` the data for flights with that destination (`dest == "LAX"`)
+and then make a histogram of the departure delays of only those flights.
+
+```{r lax-flights-hist}
+lax_flights <- nycflights %>%
+ filter(dest == "LAX")
+qplot(x = dep_delay, data = lax_flights, geom = "histogram")
+```
+
+Let's decipher these two commands (OK, so it might look like three lines, but
+the first two physical lines of code are actually part of the same command. It's
+common to add a break to a new line after `%>%` to help readability).
+
+- Command 1: Take the `nycflights` data frame, `filter` for flights headed to LAX, and
+save the result as a new data frame called `lax_flights`.
+ + `==` means "if it's equal to".
+ + `LAX` is in quotation marks since it is a character string.
+- Command 2: Basically the same `qplot` call from earlier for making a histogram,
+except that it uses the smaller data frame for flights headed to LAX instead of all
+flights.
+
+
+**Logical operators:** Filtering for certain observations (e.g. flights from a
+particular airport) is often of interest in data frames where we might want to
+examine observations with certain characteristics separately from the rest of
+the data. To do so we use the `filter` function and a series of
+**logical operators**. The most commonly used logical operators for data
+analysis are as follows:
+
+- `==` means "equal to"
+- `!=` means "not equal to"
+- `>` or `<` means "greater than" or "less than"
+- `>=` or `<=` means "greater than or equal to" or "less than or equal to"
+
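+Any of these operators can be used inside `filter`. For example (only a sketch,
+not part of the lab's required steps), the chunk below would keep just the flights
+delayed by an hour or more at departure:
+
+```{r filter-delays-example}
+nycflights %>%
+  filter(dep_delay >= 60)
+```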
+
+We can also obtain numerical summaries for the LAX flights:
+
+```{r lax-flights-summ}
+lax_flights %>%
+ summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
+```
+
+Note that in the `summarise` function we created a list of three different
+numerical summaries that we were interested in. The names of these elements are
+user-defined, like `mean_dd`, `median_dd`, and `n`, and you could customize these
+names as you like (just don't use spaces in your names). Calculating these summary
+statistics also requires that you know the function calls. Note that `n()` reports
+the sample size.
+
+
+**Summary statistics:** Some useful function calls for summary statistics for a
+single numerical variable are as follows:
+
+- `mean`
+- `median`
+- `sd`
+- `var`
+- `IQR`
+- `min`
+- `max`
+
+Note that each of these functions takes a single vector as an argument and
+returns a single value.
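+These functions also work outside of `summarise`, applied directly to a column
+extracted with `$` (again, just a sketch):
+
+```{r summary-on-vector}
+mean(lax_flights$dep_delay)  # same mean as in the summarise call above
+IQR(lax_flights$dep_delay)   # interquartile range of the LAX departure delays
+```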
+
+
+We can also filter based on multiple criteria. Suppose we are interested in
+flights headed to San Francisco (SFO) in February:
+
+```{r}
+sfo_feb_flights <- nycflights %>%
+ filter(dest == "SFO", month == 2)
+```
+
+Note that we can separate the conditions using commas if we want flights that
+are both headed to SFO **and** in February. If we are interested in either
+flights headed to SFO **or** in February we can use the `|` instead of the comma.
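+To illustrate the difference (this is only a sketch of the **or** case, with a
+made-up object name, not part of the exercise), the chunk below keeps flights that
+are either headed to SFO or departed in February:
+
+```{r filter-or-example}
+sfo_or_feb_flights <- nycflights %>%
+  filter(dest == "SFO" | month == 2)
+```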
+
+1. Create a new data frame that includes flights headed to SFO in February,
+ and save this data frame as `sfo_feb_flights`. How many flights
+ meet these criteria?
+
+1. Describe the distribution of the **arrival** delays of these flights using a
+ histogram and appropriate summary statistics. **Hint:** The summary
+ statistics you use should depend on the shape of the distribution.
+
+Another useful technique is quickly calculating summary
+statistics for various groups in your data frame. For example, we can modify the
+above command using the `group_by` function to get the same summary stats for
+each origin airport:
+
+```{r summary-custom-list-origin}
+sfo_feb_flights %>%
+ group_by(origin) %>%
+ summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
+```
+
+Here, we first grouped the data by `origin`, and then calculated the summary
+statistics.
+
+1. Calculate the median and interquartile range for `arr_delay`s of flights
+ in the `sfo_feb_flights` data frame, grouped by carrier. Which carrier
+ has the most variable arrival delays?
+
+### Departure delays over months
+
+Which month would you expect to have the highest average delay departing from an
+NYC airport?
+
+Let's think about how we would answer this question:
+
+- First, calculate monthly averages for departure delays. With the new language
+we are learning, we need to
+ + `group_by` months, then
+ + `summarise` mean departure delays.
+- Then, we need to `arrange` these average delays in `desc`ending order
+
+```{r mean-dep-delay-months}
+nycflights %>%
+ group_by(month) %>%
+ summarise(mean_dd = mean(dep_delay)) %>%
+ arrange(desc(mean_dd))
+```
+
+1. Suppose you really dislike departure delays, and you want to schedule
+ your travel in a month that minimizes your potential departure delay leaving
+ NYC. One option is to choose the month with the lowest mean departure delay.
+ Another option is to choose the month with the lowest median departure delay.
+ What are the pros and cons of these two choices?
+
+
+
+### On time departure rate for NYC airports
+
+Suppose you will be flying out of NYC and want to know which of the
+three major NYC airports has the best on time departure rate of departing flights.
+Suppose also that for you a flight that is delayed for less than 5 minutes is
+basically "on time". You consider any flight delayed for 5 minutes of more to be
+"delayed".
+
+In order to determine which airport has the best on time departure rate,
+we need to
+
+- first classify each flight as "on time" or "delayed",
+- then group flights by origin airport,
+- then calculate on time departure rates for each origin airport,
+- and finally arrange the airports in descending order for on time departure
+percentage.
+
+Let's start with classifying each flight as "on time" or "delayed" by
+creating a new variable with the `mutate` function.
+
+```{r dep-type}
+nycflights <- nycflights %>%
+ mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
+```
+
+The first argument in the `mutate` function is the name of the new variable
+we want to create, in this case `dep_type`. Then, if `dep_delay < 5`, we classify
+the flight as `"on time"`; otherwise, i.e. if the flight is delayed for 5 or more
+minutes, we classify it as `"delayed"`.
+
+Note that we are also overwriting the `nycflights` data frame with the new
+version of this data frame that includes the new `dep_type` variable.
+
+We can handle all the remaining steps in one code chunk:
+
+```{r}
+nycflights %>%
+ group_by(origin) %>%
+ summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
+ arrange(desc(ot_dep_rate))
+```
+
+1. If you were selecting an airport simply based on on time departure
+ percentage, which NYC airport would you choose to fly out of?
+
+We can also visualize the distribution of on time departure rate across
+the three airports using a segmented bar plot.
+
+```{r}
+qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar")
+```
+
+* * *
+
+## More Practice
+
+1. Mutate the data frame so that it includes a new variable that contains the
+ average speed, `avg_speed` traveled by the plane for each flight (in mph).
+ **Hint:** Average speed can be calculated as distance divided by
+ number of hours of travel, and note that `air_time` is given in minutes.
+
+1. Make a scatterplot of `avg_speed` vs. `distance`. Describe the relationship
+ between average speed and distance.
+ **Hint:** Use `geom = "point"`.
+
+1. Replicate the following plot. **Hint:** The data frame plotted only
+ contains flights from American Airlines, Delta Airlines, and United
+ Airlines, and the points are `color`ed by `carrier`. Once you replicate
+ the plot, determine (roughly) what the cutoff point is for departure
+ delays where you can still expect to get to your destination on time.
+
+```{r echo=FALSE, eval=TRUE, fig.width=7, fig.height=4}
+dl_aa_ua <- nycflights %>%
+ filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
+qplot(x = dep_delay, y = arr_delay, data = dl_aa_ua, color = carrier)
+```
\ No newline at end of file
diff --git a/02_intro_to_data/intro_to_data.html b/02_intro_to_data/intro_to_data.html
new file mode 100644
index 0000000..1ec9381
--- /dev/null
+++ b/02_intro_to_data/intro_to_data.html
@@ -0,0 +1,476 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to data
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Introduction to data
+
+
+
+
+
Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airport in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
+
+
Getting started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+
Creating a reproducible lab report
+
Remember that we will be using R Markdown to create reproducible lab reports. See the following video describing how to get started with creating these reports for this lab, and all future labs:
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes available transportation data, such as the flights data we will be working with in this lab.
+
We begin by loading the nycflights data frame. Type the following in your console to load the data:
+
data(nycflights)
+
The data set nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs. For this data set, each observation is a single flight.
+
To view the names of the variables, type the command
+
names(nycflights)
+
This returns the names of the variables in this data frame. The codebook (description of the variables) can be accessed by pulling up the help file:
+
?nycflights
+
One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.
+
+
carrier: Two letter carrier abbreviation.
+
+
9E: Endeavor Air Inc.
+
AA: American Airlines Inc.
+
AS: Alaska Airlines Inc.
+
B6: JetBlue Airways
+
DL: Delta Air Lines Inc.
+
EV: ExpressJet Airlines Inc.
+
F9: Frontier Airlines Inc.
+
FL: AirTran Airways Corporation
+
HA: Hawaiian Airlines Inc.
+
MQ: Envoy Air
+
OO: SkyWest Airlines Inc.
+
UA: United Air Lines Inc.
+
US: US Airways Inc.
+
VX: Virgin America
+
WN: Southwest Airlines Co.
+
YV: Mesa Airlines Inc.
+
+
+
A very useful function for taking a quick peek at your data frame and viewing its dimensions and data types is str, which stands for structure.
+
str(nycflights)
+
The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:
+
+
How delayed were flights that were headed to Los Angeles?
+
How do departure delays vary over months?
+
Which of the three major NYC airports has a better on time percentage for departing flights?
+
+
+
+
+
Analysis
+
+
Lab report
+
To record your analysis in a reproducible format, you can adapt the general Lab Report template from the oilabs package. Watch the video above to learn how.
+
+
+
Departure delays
+
Let’s start by examing the distribution of departure delays of all flights with a histogram.
+
qplot(x = dep_delay, data = nycflights, geom ="histogram")
+
This function says to plot the dep_delay variable from the nycflights data frame on the x-axis. It also defines a geom (short for geometric object), which describes the type of plot you will produce.
+
Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:
+
qplot(x = dep_delay, data = nycflights, geom ="histogram", binwidth =15)
+qplot(x = dep_delay, data = nycflights, geom ="histogram", binwidth =150)
+
+
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
+
+
If we want to focus only on departure delays of flights headed to Los Angeles, we need to first filter the data for flights with that destination (dest == "LAX") and then make a histogram of the departure delays of only those flights.
Let’s decipher these two commands (OK, so it might look like three lines, but the first two physical lines of code are actually part of the same command. It’s common to add a break to a new line after %>% to help readability).
+
+
Command 1: Take the nycflights data frame, filter for flights headed to LAX, and save the result as a new data frame called lax_flights.
+
+
== means “if it’s equal to”.
+
LAX is in quotation marks since it is a character string.
+
+
Command 2: Basically the same qplot call from earlier for making a histogram, except that it uses the smaller data frame for flights headed to LAX instead of all flights.
+
+
+
Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so we use the filter function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:
+
+
== means “equal to”
+
!= means “not equal to”
+
> or < means “greater than” or “less than”
+
>= or <= means “greater than or equal to” or “less than or equal to”
+
+
+
We can also obtain numerical summaries for these flights:
+
lax_flights %>%
+summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
+
Note that in the summarise function we created a list of three different numerical summaries that we were interested in. The names of these elements are user defined, like mean_dd, median_dd, and n, and you can customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the relevant function calls. Note that n() reports the sample size.
+
+
Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:
+
+
mean
+
median
+
sd
+
var
+
IQR
+
min
+
max
+
+
Note that each of these functions takes a single vector as an argument and returns a single value.
+
+
We can also filter based on multiple criteria. Suppose we are interested in flights headed to San Francisco (SFO) in February:
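+
A sketch of what this filter could look like:
+
nycflights %>%
+  filter(dest == "SFO", month == 2)
+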
Note that we can separate the conditions using commas if we want flights that are both headed to SFO and in February. If we are interested in either flights headed to SFO or in February we can use the | instead of the comma.
+
+
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
+
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
+
+
Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can modify the above command using the group_by function to get the same summary stats for each origin airport:
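+
A sketch of such a grouped summary (the summary names are illustrative):
+
sfo_feb_flights %>%
+  group_by(origin) %>%
+  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
+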
Here, we first grouped the data by origin, and then calculated the summary statistics.
+
+
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
+
+
+
+
Departure delays over months
+
Which month would you expect to have the highest average delay departing from an NYC airport?
+
Let’s think about how we would answer this question:
+
+
First, calculate monthly averages for departure delays. With the new language we are learning, we need to
+
+
group_by months, then
+
summarise mean departure delays.
+
+
Then, we need to arrange these average delays in descending order, as sketched below.
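+
One way to write this chain (mean_dd is an illustrative name):
+
nycflights %>%
+  group_by(month) %>%
+  summarise(mean_dd = mean(dep_delay)) %>%
+  arrange(desc(mean_dd))
+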
Suppose you really dislike departure delays, and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
+
+
+
+
+
On time departure rate for NYC airports
+
Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Suppose also that for you, a flight that is delayed for less than 5 minutes is basically “on time”. You consider any flight delayed for 5 minutes or more to be “delayed”.
+
In order to determine which airport has the best on time departure rate, we need to
+
+
first classify each flight as “on time” or “delayed”,
+
then group flights by origin airport,
+
then calculate on time departure rates for each origin airport,
+
and finally arrange the airports in descending order for on time departure percentage.
+
+
Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable with the mutate function.
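+
A sketch of what this mutate step could look like, consistent with the description that follows:
+
nycflights <- nycflights %>%
+  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
+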
The first argument in the mutate function is the name of the new variable we want to create, in this case dep_type. Then, if dep_delay < 5, we classify the flight as "on time"; if not, i.e. if the flight is delayed for 5 or more minutes, we classify it as "delayed".
+
Note that we are also overwriting the nycflights data frame with the new version of this data frame that includes the new dep_type variable.
+
We can handle all the remaining steps in one code chunk:
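+
A sketch of what that chunk could look like (ot_dep_rate is an illustrative name):
+
nycflights %>%
+  group_by(origin) %>%
+  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
+  arrange(desc(ot_dep_rate))
+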
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
+
+
We can also visualize the distribution of on time departure rate across the three airports using a segmented bar plot.
+
qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar")
+
+
+
+
+
More Practice
+
+
Mutate the data frame so that it includes a new variable, avg_speed, that contains the average speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
+
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom = "point".
+
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
+
diff --git a/normal_distribution/more/Body.csv b/03_normal_distribution/more/Body.csv
similarity index 100%
rename from normal_distribution/more/Body.csv
rename to 03_normal_distribution/more/Body.csv
diff --git a/normal_distribution/more/bdims.RData b/03_normal_distribution/more/bdims.RData
similarity index 100%
rename from normal_distribution/more/bdims.RData
rename to 03_normal_distribution/more/bdims.RData
diff --git a/normal_distribution/more/description-of-bdims.txt b/03_normal_distribution/more/description-of-bdims.txt
similarity index 100%
rename from normal_distribution/more/description-of-bdims.txt
rename to 03_normal_distribution/more/description-of-bdims.txt
diff --git a/normal_distribution/more/histQQmatch.pdf b/03_normal_distribution/more/histQQmatch.pdf
similarity index 100%
rename from normal_distribution/more/histQQmatch.pdf
rename to 03_normal_distribution/more/histQQmatch.pdf
diff --git a/03_normal_distribution/more/histQQmatch.png b/03_normal_distribution/more/histQQmatch.png
new file mode 100644
index 0000000..65b7718
Binary files /dev/null and b/03_normal_distribution/more/histQQmatch.png differ
diff --git a/03_normal_distribution/more/histQQmatchgg.png b/03_normal_distribution/more/histQQmatchgg.png
new file mode 100644
index 0000000..a28b3fd
Binary files /dev/null and b/03_normal_distribution/more/histQQmatchgg.png differ
diff --git a/normal_distribution/more/qqnormsim.R b/03_normal_distribution/more/qqnormsim.R
similarity index 100%
rename from normal_distribution/more/qqnormsim.R
rename to 03_normal_distribution/more/qqnormsim.R
diff --git a/normal_distribution/normal_distribution.Rmd b/03_normal_distribution/normal_distribution.Rmd
similarity index 53%
rename from normal_distribution/normal_distribution.Rmd
rename to 03_normal_distribution/normal_distribution.Rmd
index e772911..46d95bb 100644
--- a/normal_distribution/normal_distribution.Rmd
+++ b/03_normal_distribution/normal_distribution.Rmd
@@ -2,11 +2,17 @@
title: "The normal distribution"
output:
html_document:
- theme: cerulean
- highlight: pygments
css: ../lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
---
+```{r echo = FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+```
+
In this lab we'll investigate the probability distribution that is most central
to statistics: the normal distribution. If we are confident that our data are
nearly normal, that opens the door to many powerful statistical methods. Here
@@ -17,36 +23,41 @@ learn how to generate random numbers from a normal distribution.
This week we'll be working with measurements of body dimensions. This data set
contains measurements from 247 men and 260 women, most of whom were considered
-healthy young adults.
-
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
-load("bdims.RData")
+healthy young adults. Let's take a quick peek at the first few rows of the data.
+
+```{r load-data}
+library(mosaic)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+data(bdims)
+head(bdims)
```
-Let's take a quick peek at the first few rows of the data.
+You'll see that for every observation we have 25 measurements, many of which are
+either diameters or girths. You can learn about what the variable names mean by
+bringing up the help page.
-```{r head-data, eval=FALSE}
-head(bdims)
+```{r help-bdims}
+?bdims
```
-You'll see that for every observation we have 25 measurements, many of which are
-either diameters or girths. A key to the variable names can be found at
-[http://www.openintro.org/stat/data/bdims.php](http://www.openintro.org/stat/data/bdims.php),
-but we'll be focusing on just three columns to get started: weight in kg (`wgt`),
-height in cm (`hgt`), and `sex` (`1` indicates male, `0` indicates female).
+We'll be focusing on just three columns to get started: weight in kg (`wgt`),
+height in cm (`hgt`), and `sex` (`m` indicates male, `f` indicates female).
Since males and females tend to have different body dimensions, it will be
useful to create two additional data sets: one with only men and another with
only women.
-```{r male-female, eval=FALSE}
-mdims <- subset(bdims, sex == 1)
-fdims <- subset(bdims, sex == 0)
+```{r male-female}
+mdims <- bdims %>%
+ filter(sex == "m")
+fdims <- bdims %>%
+ filter(sex == "f")
```
-1. Make a histogram of men's heights and a histogram of women's heights. How
- would you compare the various aspects of the two distributions?
+1. Make a plot (or plots) to visualize the distributions of men's and women's heights.
+ How do their centers, shapes, and spreads compare?
## The normal distribution
@@ -60,43 +71,38 @@ This normal curve should have the same mean and standard deviation as the data.
We'll be working with women's heights, so let's store them as a separate object
and then calculate some statistics that will be referenced later.
-```{r female-hgt-mean-sd, eval=FALSE}
-fhgtmean <- mean(fdims$hgt)
-fhgtsd <- sd(fdims$hgt)
+
+```{r female-hgt-mean-sd}
+fhgtmean <- mean(~hgt, data = fdims)
+fhgtsd <- sd(~hgt, data = fdims)
```
Next we make a density histogram to use as the backdrop and use the `lines`
function to overlay a normal probability curve. The difference between a
frequency histogram and a density histogram is that while in a frequency
histogram the *heights* of the bars add up to the total number of observations,
-in a density histogram the *areas* of the bars add up to 1. The area of each bar
+in a density histogram the *areas* of the bars add up to 1. The area of each bar
can be calculated as simply the height *times* the width of the bar. Using a
-density histogram allows us to properly overlay a normal distribution curve over
-the histogram since the curve is a normal probability density function.
-Frequency and density histograms both display the same exact shape; they only
-differ in their y-axis. You can verify this by comparing the frequency histogram
-you constructed earlier and the density histogram created by the commands below.
-
-```{r hist-height, eval=FALSE}
-hist(fdims$hgt, probability = TRUE)
-x <- 140:190
-y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
-lines(x = x, y = y, col = "blue")
+density histogram allows us to properly overlay a normal distribution curve over
+the histogram since the curve is a normal probability density function that also
+has area under the curve of 1. Frequency and density histograms both display the
+same exact shape; they only differ in their y-axis. You can verify this by
+comparing the frequency histogram you constructed earlier and the density
+histogram created by the commands below.
+
+```{r hist-height}
+qplot(x = hgt, data = fdims, geom = "blank") +
+ geom_histogram(aes(y = ..density..)) +
+ stat_function(fun = dnorm, args = c(mean = fhgtmean, sd = fhgtsd), col = "tomato")
```
-After plotting the density histogram with the first command, we create the x-
-and y-coordinates for the normal curve. We chose the `x` range as 140 to 190 in
-order to span the entire range of `fheight`. To create `y`, we use `dnorm` to
-calculate the density of each of those x-values in a distribution that is normal
-with mean `fhgtmean` and standard deviation `fhgtsd`. The final command draws a
-curve on the existing plot (the density histogram) by connecting each of the
-points specified by `x` and `y`. The argument `col` simply sets the color for
+After initializing a blank plot with the first command, the `ggplot2` package
+allows us to add additional layers. The first layer is a density histogram. The
+second layer is a statistical function -- the density of the normal curve, `dnorm`.
+We specify that we want the curve to have the same mean and standard deviation
+as the column of female heights. The argument `col` simply sets the color for
the line to be drawn. If we left it out, the line would be drawn in black.
-The top of the curve is cut off because the limits of the x- and y-axes are set
-to best fit the histogram. To adjust the y-axis you can add a third argument to
-the histogram function: `ylim = c(0, 0.06)`.
-
2. Based on this plot, does it appear that the data follow a nearly normal
distribution?
@@ -109,47 +115,54 @@ close the histogram is to the curve. An alternative approach involves
constructing a normal probability plot, also called a normal Q-Q plot for
"quantile-quantile".
-```{r qq, eval=FALSE}
-qqnorm(fdims$hgt)
-qqline(fdims$hgt)
+```{r qq}
+qplot(sample = hgt, data = fdims, geom = "qq")
```
-A data set that is nearly normal will result in a probability plot where the
-points closely follow the line. Any deviations from normality leads to
-deviations of these points from the line. The plot for female heights shows
-points that tend to follow the line but with some errant points towards the
-tails. We're left with the same problem that we encountered with the histogram
-above: how close is close enough?
+The x-axis values correspond to the quantiles of a theoretically normal curve
+with mean 0 and standard deviation 1 (i.e., the standard normal distribution). The
+y-axis values correspond to the quantiles of the original unstandardized sample
+data. However, even if we were to standardize the sample data values, the Q-Q
+plot would look identical. A data set that is nearly normal will result in a
+probability plot where the points closely follow a diagonal line. Any deviations from
+normality lead to deviations of these points from that line.
+
+The plot for female heights shows points that tend to follow the line but with
+some errant points towards the tails. We're left with the same problem that we
+encountered with the histogram above: how close is close enough?
A useful way to address this question is to rephrase it as: what do probability
plots look like for data that I *know* came from a normal distribution? We can
answer this by simulating data from a normal distribution using `rnorm`.
-```{r sim-norm, eval=FALSE}
-sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
+```{r sim-norm}
+sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd)
```
The first argument indicates how many numbers you'd like to generate, which we
specify to be the same number of heights in the `fdims` data set using the
-`length` function. The last two arguments determine the mean and standard
+`nrow()` function. The last two arguments determine the mean and standard
deviation of the normal distribution from which the simulated sample will be
generated. We can take a look at the shape of our simulated data set, `sim_norm`,
as well as its normal probability plot.
3. Make a normal probability plot of `sim_norm`. Do all of the points fall on
the line? How does this plot compare to the probability plot for the real
- data?
+ data? (Since `sim_norm` is not a dataframe, it can be put directly into the
+ `sample` argument and the `data` argument can be dropped.)
Even better than comparing the original plot to a single plot generated from a
normal distribution is to compare it to many more plots using the following
-function. It may be helpful to click the zoom button in the plot window.
+function. It shows the Q-Q plot corresponding to the original data in the top
+left corner, and the Q-Q plots of 8 different data sets simulated from a normal distribution. It may be
+helpful to click the zoom button in the plot window.
-```{r qqnormsim, eval=FALSE}
-qqnormsim(fdims$hgt)
+```{r qqnormsim}
+qqnormsim(sample = hgt, data = fdims)
```
-4. Does the normal probability plot for `fdims$hgt` look similar to the plots
- created for the simulated data? That is, do plots provide evidence that the
+4. Does the normal probability plot for female heights look similar to the plots
+ created for the simulated data? That is, do the plots provide evidence that the
female heights are nearly normal?
5. Using the same technique, determine whether or not female weights appear to
@@ -172,13 +185,13 @@ exercise.)
If we assume that female heights are normally distributed (a very close
approximation is also okay), we can find this probability by calculating a Z
score and consulting a Z table (also called a normal probability table). In R,
-this is done in one step with the function `pnorm`.
+this is done in one step with the function `pnorm()`.
-```{r pnorm, eval=FALSE}
+```{r pnorm}
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
```
-Note that the function `pnorm` gives the area under the normal curve below a
+Note that the function `pnorm()` gives the area under the normal curve below a
given value, `q`, with a given mean and standard deviation. Since we're
interested in the probability that someone is taller than 182 cm, we have to
take one minus that probability.
@@ -188,8 +201,10 @@ probability. If we want to calculate the probability empirically, we simply
need to determine how many observations fall above 182 then divide this number
by the total sample size.
-```{r probability, eval=FALSE}
-sum(fdims$hgt > 182) / length(fdims$hgt)
+```{r probability}
+fdims %>%
+ filter(hgt > 182) %>%
+ summarise(percent = n() / nrow(fdims))
```
Although the probabilities are not exactly the same, they are reasonably close.
@@ -197,16 +212,16 @@ The closer that your distribution is to being normal, the more accurate the
theoretical probabilities will be.
6. Write out two probability questions that you would like to answer; one
- regarding female heights and one regarding female weights. Calculate the
+ regarding female heights and one regarding female weights. Calculate
those probabilities using both the theoretical normal distribution as well
as the empirical distribution (four probabilities in all). Which variable,
height or weight, had a closer agreement between the two methods?
* * *
-## On Your Own
+## More Practice
-- Now let's consider some of the other variables in the body dimensions data
+7. Now let's consider some of the other variables in the body dimensions data
set. Using the figures at the end of the exercises, match the histogram to
its normal probability plot. All of the variables have been standardized
(first subtract the mean, then divide by the standard deviation), so the
@@ -225,17 +240,94 @@ theoretical probabilities will be.
**d.** The histogram for female chest depth (`che.de`) belongs to normal
probability plot letter ____.
-- Note that normal probability plots C and D have a slight stepwise pattern.
+8. Note that normal probability plots C and D have a slight stepwise pattern.
Why do you think this is the case?
-- As you can see, normal probability plots can be used both to assess
+9. As you can see, normal probability plots can be used both to assess
normality and visualize skewness. Make a normal probability plot for female
knee diameter (`kne.di`). Based on this normal probability plot, is this
variable left skewed, symmetric, or right skewed? Use a histogram to confirm
your findings.
-
+```{r hists-and-qqs, echo=FALSE, eval=FALSE}
+sdata <- fdims %>%
+ mutate(sdata = (bii.di - mean(bii.di))/sd(bii.di)) %>%
+ select(sdata)
+p1 <- ggplot(sdata, aes(x = sdata)) +
+ geom_histogram() +
+ ggtitle("Histogram for female bii.di")
+p4 <- qplot(sample = sdata, data = sdata, stat = "qq") +
+ ggtitle("Normal QQ plot B")
+sdata <- fdims %>%
+ mutate(sdata = (elb.di - mean(elb.di))/sd(elb.di)) %>%
+ select(sdata)
+p3 <- ggplot(sdata, aes(x = sdata)) +
+ geom_histogram() +
+ ggtitle("Histogram for female elb.di")
+p6 <- qplot(sample = sdata, data = sdata, stat = "qq") +
+ ggtitle("Normal QQ plot C")
+sdata <- bdims %>%
+ mutate(sdata = (age - mean(age))/sd(age)) %>%
+ select(sdata)
+p5 <- ggplot(sdata, aes(x = sdata)) +
+ geom_histogram() +
+ ggtitle("Histogram for general age")
+p8 <- qplot(sample = sdata, data = sdata, stat = "qq") +
+ ggtitle("Normal QQ plot D")
+sdata <- fdims %>%
+ mutate(sdata = (che.de - mean(che.de))/sd(che.de)) %>%
+ select(sdata)
+p7 <- ggplot(sdata, aes(x = sdata)) +
+ geom_histogram() +
+ ggtitle("Histogram for general age")
+p2 <- qplot(sample = sdata, data = sdata, stat = "qq") +
+ ggtitle("Normal QQ plot A")
+
+multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
+ library(grid)
+
+ # Make a list from the ... arguments and plotlist
+ plots <- c(list(...), plotlist)
+
+ numPlots = length(plots)
+
+ # If layout is NULL, then use 'cols' to determine layout
+ if (is.null(layout)) {
+ # Make the panel
+ # ncol: Number of columns of plots
+ # nrow: Number of rows needed, calculated from # of cols
+ layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
+ ncol = cols, nrow = ceiling(numPlots/cols))
+ }
+
+ if (numPlots==1) {
+ print(plots[[1]])
+
+ } else {
+ # Set up the page
+ grid.newpage()
+ pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
+
+ # Make each plot, in the correct location
+ for (i in 1:numPlots) {
+ # Get the i,j matrix positions of the regions that contain this subplot
+ matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
+
+ print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
+ layout.pos.col = matchidx$col))
+ }
+ }
+}
+
+png("more/histQQmatch.png", height = 1600, width = 1200, res = 150)
+multiplot(p1, p2, p3, p4, p5, p6, p7, p8,
+ layout = matrix(1:8, ncol = 2, byrow = TRUE))
+dev.off()
+```
+
+
+
This is a product of OpenIntro that is released under a
diff --git a/03_normal_distribution/normal_distribution.html b/03_normal_distribution/normal_distribution.html
new file mode 100644
index 0000000..1a378c8
--- /dev/null
+++ b/03_normal_distribution/normal_distribution.html
@@ -0,0 +1,372 @@
The normal distribution
+
+
+
+
+
In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
+
+
The Data
+
This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults. Let’s take a quick peek at the first few rows of the data.
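+
The load-data chunk from the Rmd source:
+
library(mosaic)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+data(bdims)
+head(bdims)
+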
You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. You can learn about what the variable names mean by bringing up the help page.
+
?bdims
+
We’ll be focusing on just three columns to get started: weight in kg (wgt), height in cm (hgt), and sex (m indicates male, f indicates female).
+
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
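+
The male-female chunk from the Rmd source:
+
mdims <- bdims %>%
+  filter(sex == "m")
+fdims <- bdims %>%
+  filter(sex == "f")
+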
Make a plot (or plots) to visualize the distributions of men’s and women’s heights.
+How do their centers, shapes, and spreads compare?
+
+
+
+
The normal distribution
+
In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.
+
To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
+
fhgtmean <- mean(~hgt, data = fdims)
+fhgtsd <- sd(~hgt, data = fdims)
+
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.
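+
The hist-height chunk from the Rmd source:
+
qplot(x = hgt, data = fdims, geom = "blank") +
+  geom_histogram(aes(y = ..density..)) +
+  stat_function(fun = dnorm, args = c(mean = fhgtmean, sd = fhgtsd), col = "tomato")
+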
After initializing a blank plot with the first command, the ggplot2 package allows us to add additional layers. The first layer is a density histogram. The second layer is a statistical function – the density of the normal curve, dnorm. We specify that we want the curve to have the same mean and standard deviation as the column of female heights. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.
+
+
Based on this plot, does it appear that the data follow a nearly normal distribution?
+
+
+
+
Evaluating the normal distribution
+
Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
+
qplot(sample = hgt, data = fdims, geom = "qq")
+
The x-axis values correspond to the quantiles of a theoretically normal curve with mean 0 and standard deviation 1 (i.e., the standard normal distribution). The y-axis values correspond to the quantiles of the original unstandardized sample data. However, even if we were to standardize the sample data values, the Q-Q plot would look identical. A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line. Any deviations from normality lead to deviations of these points from that line.
+
The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. We’re left with the same problem that we encountered with the histogram above: how close is close enough?
+
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
+
sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd)
+
The first argument indicates how many numbers you’d like to generate, which we specify to be the same number of heights in the fdims data set using the nrow() function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, sim_norm, as well as its normal probability plot.
+
+
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)
+
+
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It shows the Q-Q plot corresponding to the original data in the top left corner, and the Q-Q plots of 8 different data sets simulated from a normal distribution. It may be helpful to click the zoom button in the plot window.
+
qqnormsim(sample = hgt, data = fdims)
+
+
Does the normal probability plot for female heights look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
+
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
+
+
+
+
Normal probabilities
+
Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should we care?
+
It turns out that statisticians know a lot about the normal distribution. Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” (The study that published this data set is clear to point out that the sample was not random and therefore inference to a general population is not suggested. We do so here only as an exercise.)
+
If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().
+
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
+
Note that the function pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
+
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
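+
The probability chunk from the Rmd source:
+
fdims %>%
+  filter(hgt > 182) %>%
+  summarise(percent = n() / nrow(fdims))
+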
Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
+
+
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
+
+
+
+
+
More Practice
+
+
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
+
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.
+
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.
+
c. The histogram for general age (age) belongs to normal probability plot letter ____.
+
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.
+
Note that normal probability plots C and D have a slight stepwise pattern.
+Why do you think this is the case?
+
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
+
+
+
+
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
+
diff --git a/probability/more/calc_streak.R b/04_probability/more/calc_streak.R
similarity index 100%
rename from probability/more/calc_streak.R
rename to 04_probability/more/calc_streak.R
diff --git a/probability/more/kobe-readme.txt b/04_probability/more/kobe-readme.txt
similarity index 100%
rename from probability/more/kobe-readme.txt
rename to 04_probability/more/kobe-readme.txt
diff --git a/probability/more/kobe.RData b/04_probability/more/kobe.RData
similarity index 100%
rename from probability/more/kobe.RData
rename to 04_probability/more/kobe.RData
diff --git a/probability/more/kobe.csv b/04_probability/more/kobe.csv
similarity index 100%
rename from probability/more/kobe.csv
rename to 04_probability/more/kobe.csv
diff --git a/probability/more/kobe_data.xls b/04_probability/more/kobe_data.xls
similarity index 100%
rename from probability/more/kobe_data.xls
rename to 04_probability/more/kobe_data.xls
diff --git a/probability/probability.Rmd b/04_probability/probability.Rmd
similarity index 66%
rename from probability/probability.Rmd
rename to 04_probability/probability.Rmd
index 0f8e958..cb83570 100644
--- a/probability/probability.Rmd
+++ b/04_probability/probability.Rmd
@@ -2,19 +2,28 @@
title: "Probability"
output:
html_document:
- theme: cerulean
- highlight: pygments
css: ../lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
---
-## Hot Hands
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+## The Hot Hand
Basketball players who make several baskets in succession are described as
having a *hot hand*. Fans and players have long believed in the hot hand
phenomenon, which refutes the assumption that each shot is independent of the
-next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence
+next. However, [a 1985 paper](http://www.sciencedirect.com/science/article/pii/0010028585900106) by Gilovich, Vallone, and Tversky collected evidence
that contradicted this belief and showed that successive shots are independent
-events ([http://psych.cornell.edu/sites/default/files/Gilo.Vallone.Tversky.pdf](http://psych.cornell.edu/sites/default/files/Gilo.Vallone.Tversky.pdf)). This paper started a great controversy that continues to this day, as you can
+events. This paper started a great controversy that continues to this day, as you can
see by Googling *hot hand basketball*.
We do not expect to resolve this controversy today. However, in this lab we'll
@@ -23,38 +32,41 @@ to (1) think about the effects of independent and dependent events, (2) learn
how to simulate shooting streaks in R, and (3) to compare a simulation to actual
data in order to determine if the hot hand phenomenon appears to be real.
-## Saving your code
+## Getting Started
-Click on File -> New -> R Script. This will open a blank document above the
-console. As you go along you can copy and paste your code here and save it. This
-is a good way to keep track of your code and be able to reuse it later. To run
-your code from this document you can either copy and paste it into the console,
-highlight the code and hit the Run button, or highlight the code and hit
-command+enter or a mac or control+enter on a PC.
+### Load packages
-You'll also want to save this script (code document). To do so click on the disk
-icon. The first time you hit save, RStudio will ask for a file name; you can
-name it anything you like. Once you hit save you'll see the file appear under
-the Files tab in the lower right panel. You can reopen this file anytime by
-simply clicking on it.
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
-## Getting Started
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
-Our investigation will focus on the performance of one player: Kobe Bryant of
-the Los Angeles Lakers. His performance against the Orlando Magic in the 2009
-NBA finals earned him the title *Most Valuable Player* and many spectators
-commented on how he appeared to show a hot hand. Let's load some data from those
-games and look at the first several rows.
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/kobe.RData", destfile = "kobe.RData")
-load("kobe.RData")
-head(kobe)
+### Data
+
+Our investigation will focus on the performance of one player: [Kobe Bryant](https://en.wikipedia.org/wiki/Kobe_Bryant) of
+the Los Angeles Lakers. His performance against the Orlando Magic in the [2009
+NBA Finals](https://en.wikipedia.org/wiki/2009_NBA_Finals) earned him the title *Most Valuable Player* and many spectators
+commented on how he appeared to show a hot hand. Let's load some necessary files
+that we will need for this lab.
+
+```{r load-data}
+data(kobe_basket)
```
-In this data frame, every row records a shot taken by Kobe Bryant. If he hit the
-shot (made a basket), a hit, `H`, is recorded in the column named `basket`,
-otherwise a miss, `M`, is recorded.
+This data frame contains 133 observations and 6 variables, where every
+row records a shot taken by Kobe Bryant. The `shot` variable in this dataset
+indicates whether the shot was a hit (`H`) or a miss (`M`).
Just looking at the string of hits and misses, it can be difficult to gauge
whether or not it seems like Kobe was shooting with a hot hand. One way we can
@@ -67,11 +79,7 @@ his nine shot attempts in the first quarter:
\[ \textrm{H M | M | H H M | M | M | M} \]
-To verify this use the following command:
-
-```{r first9, eval=FALSE}
-kobe$basket[1:9]
-```
+You can verify this by viewing the first 9 rows of the data in the data viewer.
Within the nine shot attempts, there are six streaks, which are separated by a
"|" above. Their lengths are one, zero, two, zero, zero, zero (in order of
@@ -80,26 +88,28 @@ occurrence).
1. What does a streak length of 1 mean, i.e. how many hits and misses are in a
streak of 1? What about a streak length of 0?
-The custom function `calc_streak`, which was loaded in with the data, may be
-used to calculate the lengths of all shooting streaks and then look at the
-distribution.
+Counting streak lengths manually for all 133 shots would get tedious, so we'll
+use the custom function `calc_streak` to calculate them, and store the results
+in a data frame called `kobe_streak` as the `length` variable.
-```{r calc-streak-kobe, eval=FALSE}
-kobe_streak <- calc_streak(kobe$basket)
-barplot(table(kobe_streak))
+```{r calc-streak-kobe}
+kobe_streak <- calc_streak(kobe_basket$shot)
```
-Note that instead of making a histogram, we chose to make a bar plot from a
-table of the streak data. A bar plot is preferable here since our variable is
-discrete -- counts -- instead of continuous.
+We can then take a look at the distribution of these streak lengths.
+
+```{r plot-streak-kobe}
+qplot(data = kobe_streak, x = length, geom = "bar")
+```
2. Describe the distribution of Kobe's streak lengths from the 2009 NBA finals.
- What was his typical streak length? How long was his longest streak of baskets?
+ What was his typical streak length? How long was his longest streak of
+ baskets? Make sure to include the accompanying plot in your answer.
## Compared to What?
We've shown that Kobe had some long shooting streaks, but are they long enough
-to support the belief that he had hot hands? What can we compare them to?
+to support the belief that he had a hot hand? What can we compare them to?
To answer these questions, let's return to the idea of *independence*. Two
processes are independent if the outcome of one process doesn't affect the outcome
@@ -135,8 +145,8 @@ same probability of hitting every shot regardless of his past shots: 45%.
Now that we've phrased the situation in terms of independent shots, let's return
to the question: how do we tell if Kobe's shooting streaks are long enough to
-indicate that he has hot hands? We can compare his streak lengths to someone
-without hot hands: an independent shooter.
+indicate that he has a hot hand? We can compare his streak lengths to someone
+without a hot hand: an independent shooter.
## Simulations in R
@@ -146,12 +156,12 @@ ground rules of a random process and then the computer uses random numbers to
generate an outcome that adheres to those rules. As a simple example, you can
simulate flipping a fair coin with the following.
-```{r head-tail, eval=FALSE}
-outcomes <- c("heads", "tails")
-sample(outcomes, size = 1, replace = TRUE)
+```{r head-tail}
+coin_outcomes <- c("heads", "tails")
+sample(coin_outcomes, size = 1, replace = TRUE)
```
-The vector `outcomes` can be thought of as a hat with two slips of paper in it:
+The vector `coin_outcomes` can be thought of as a hat with two slips of paper in it:
one slip says `heads` and the other says `tails`. The function `sample` draws
one slip from the hat and tells us if it was a head or a tail.
@@ -165,25 +175,26 @@ governs how many samples to draw (the `replace = TRUE` argument indicates we put
the slip of paper back in the hat before drawing again). Save the resulting
vector of heads and tails in a new object called `sim_fair_coin`.
-```{r sim-fair-coin, eval=FALSE}
-sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
+```{r sim-fair-coin}
+sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
```
To view the results of this simulation, type the name of the object and then use
`table` to count up the number of heads and tails.
-```{r table-sim-fair-coin, eval=FALSE}
+```{r table-sim-fair-coin}
sim_fair_coin
table(sim_fair_coin)
```
-Since there are only two elements in `outcomes`, the probability that we "flip"
+Since there are only two elements in `coin_outcomes`, the probability that we "flip"
a coin and it lands heads is 0.5. Say we're trying to simulate an unfair coin
that we know only lands heads 20% of the time. We can adjust for this by adding
an argument called `prob`, which provides a vector of two probability weights.
-```{r sim-unfair-coin, eval=FALSE}
-sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8))
+```{r sim-unfair-coin}
+sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE,
+ prob = c(0.2, 0.8))
```
`prob=c(0.2, 0.8)` indicates that for the two elements in the `outcomes` vector,
@@ -194,7 +205,25 @@ think of the outcome space as a bag of 10 chips, where 2 chips are labeled
chip that says "head"" is 20%, and "tail" is 80%.
3. In your simulation of flipping the unfair coin 100 times, how many flips
- came up heads?
+ came up heads? Include the code for sampling the unfair coin in your response.
+ Since the markdown file will run the code, and generate a new sample each time
+ you *Knit* it, you should also "set a seed" **before** you sample. Read more
+ about setting a seed below.
+
+
+**A note on setting a seed:** Setting a seed will cause R to select the same
+sample each time you knit your document. This will make sure your results don't
+change each time you knit, and it will also ensure reproducibility of your work
+(by setting the same seed it will be possible to reproduce your results). You can
+set a seed like this:
+```{r set-seed}
+set.seed(35797) # make sure to change the seed
+```
+The number above is completely arbitrary. If you need inspiration, you can use your
+ID, birthday, or just a random string of numbers. The important thing is that you
+use each seed only once. Remember to do this **before** you sample in the exercise
+above.
+
In a sense, we've shrunken the size of the slip of paper that says "heads",
making it less likely to be drawn and we've increased the size of the slip of
@@ -206,7 +235,7 @@ an equal probability of being drawn.
If you want to learn more about `sample` or any other function, recall that you
can always check out its help file.
-```{r help-sample, eval=FALSE,tidy = FALSE}
+```{r help-sample,tidy = FALSE}
?sample
```
@@ -216,9 +245,9 @@ Simulating a basketball player who has independent shots uses the same mechanism
that we use to simulate a coin flip. To simulate a single shot from an
independent shooter with a shooting percentage of 50% we type,
-```{r sim-basket, eval=FALSE}
-outcomes <- c("H", "M")
-sim_basket <- sample(outcomes, size = 1, replace = TRUE)
+```{r sim-basket}
+shot_outcomes <- c("H", "M")
+sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)
```
To make a valid comparison between Kobe and our simulated independent shooter,
@@ -235,13 +264,7 @@ R overwrites the old object with the new one, so always make sure that you don't
need the information in an old vector before reassigning its name.
With the results of the simulation saved as `sim_basket`, we have the data
-necessary to compare Kobe to our independent shooter. We can look at Kobe's data
-alongside our simulated data.
-
-```{r compare-basket, eval=FALSE}
-kobe$basket
-sim_basket
-```
+necessary to compare Kobe to our independent shooter.
Both data sets represent the results of 133 shot attempts, each with the same
shooting percentage of 45%. We know that our simulated data is from a shooter
@@ -250,26 +273,31 @@ a hot hand.
* * *
-## On your own
+## More Practice
### Comparing Kobe Bryant to the Independent Shooter
-Using `calc_streak`, compute the streak lengths of `sim_basket`.
+5. Using `calc_streak`, compute the streak lengths of `sim_basket`, and
+ save the results in a data frame called `sim_streak`.
-- Describe the distribution of streak lengths. What is the typical streak
+6. Describe the distribution of streak lengths. What is the typical streak
length for this simulated independent shooter with a 45% shooting percentage?
- How long is the player's longest streak of baskets in 133 shots?
+ How long is the player's longest streak of baskets in 133 shots? Make sure
+ to include a plot in your answer.
-- If you were to run the simulation of the independent shooter a second time,
+7. If you were to run the simulation of the independent shooter a second time,
how would you expect its streak distribution to compare to the distribution
from the question above? Exactly the same? Somewhat similar? Totally
different? Explain your reasoning.
-- How does Kobe Bryant's distribution of streak lengths compare to the
+8. How does Kobe Bryant's distribution of streak lengths compare to the
distribution of streak lengths for the simulated shooter? Using this
comparison, do you have evidence that the hot hand model fits Kobe's
shooting patterns? Explain.
+
+
+
This is a product of OpenIntro that is released under a
[Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
diff --git a/04_probability/probability.html b/04_probability/probability.html
new file mode 100644
index 0000000..e5de7d6
--- /dev/null
+++ b/04_probability/probability.html
@@ -0,0 +1,395 @@
Probability
+
+
+
+
+
+
The Hot Hand
+
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
+
We do not expect to resolve this controversy today. However, in this lab we’ll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.
+
+
+
Data
+
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA Finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let’s load some necessary files that we will need for this lab.
+
data(kobe_basket)
+
This data frame contains 133 observations and 6 variables, where every row records a shot taken by Kobe Bryant. The shot variable in this dataset indicates whether the shot was a hit (H) or a miss (M).
+
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
+
For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
+
\[ \textrm{H M | M | H H M | M | M | M} \]
+
You can verify this by viewing the first 9 rows of the data in the data viewer.
+
Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
+
+
What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?
+
+
Counting streak lengths manually for all 133 shots would get tedious, so we’ll use the custom function calc_streak to calculate them, and store the results in a data frame called kobe_streak as the length variable.
+
kobe_streak <- calc_streak(kobe_basket$shot)
+
We can then take a look at the distribution of these streak lengths.
+
qplot(data = kobe_streak, x = length, geom = "bar")
+
+
Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets? Make sure to include the accompanying plot in your answer.
+
+
+
+
+
Compared to What?
+
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had a hot hand? What can we compare them to?
+
To answer these questions, let’s return to the idea of independence. Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
+
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.
+
Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,
+
\[ P(\textrm{shot 1 = H}) = 0.45 \]
+
If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, let’s say, 60%,
+
\[ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.60 \]
+
As a result of these increased probabilities, you’d expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.
+
\[ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.45 \]
+
In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.
+
Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has a hot hand? We can compare his streak lengths to someone without a hot hand: an independent shooter.
+
+
+
Simulations in R
+
While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.
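+
The head-tail chunk from the Rmd source:
+
coin_outcomes <- c("heads", "tails")
+sample(coin_outcomes, size = 1, replace = TRUE)
+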
The vector coin_outcomes can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample draws one slip from the hat and tells us if it was a head or a tail.
+
Run the second command listed above several times. Just like when flipping a coin, sometimes you’ll get a heads, sometimes you’ll get a tails, but in the long run, you’d expect to get roughly equal numbers of each.
+
If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw (the replace = TRUE argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called sim_fair_coin.
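+
The sim-fair-coin chunk from the Rmd source:
+
sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
+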
To view the results of this simulation, type the name of the object and then use table to count up the number of heads and tails.
+
sim_fair_coin
+table(sim_fair_coin)
+
Since there are only two elements in coin_outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights.
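+
The sim-unfair-coin chunk from the Rmd source:
+
sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE,
+                          prob = c(0.2, 0.8))
+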
prob=c(0.2, 0.8) indicates that for the two elements in the coin_outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled “head” and 8 chips “tail”. Therefore at each draw, the probability of drawing a chip that says “head” is 20%, and “tail” is 80%.
+
+
In your simulation of flipping the unfair coin 100 times, how many flips came up heads? Include the code for sampling the unfair coin in your response. Since the markdown file will run the code, and generate a new sample each time you Knit it, you should also “set a seed” before you sample. Read more about setting a seed below.
+
+
+
A note on setting a seed: Setting a seed will cause R to select the same sample each time you knit your document. This will make sure your results don’t change each time you knit, and it will also ensure reproducibility of your work (by setting the same seed it will be possible to reproduce your results). You can set a seed like this:
+
set.seed(35797) # make sure to change the seed
+
The number above is completely arbitrary. If you need inspiration, you can use your ID, birthday, or just a random string of numbers. The important thing is that you use each seed only once. Remember to do this before you sample in the exercise above.
+
+
In a sense, we’ve shrunk the size of the slip of paper that says “heads”, making it less likely to be drawn, and we’ve increased the size of the slip of paper saying “tails”, making it more likely to be drawn. When we simulated the fair coin, both slips of paper were the same size. This happens by default if you don’t provide a prob argument; all elements in the outcomes vector have an equal probability of being drawn.
+
If you want to learn more about sample or any other function, recall that you can always check out its help file.
+
?sample
+
+
+
Simulating the Independent Shooter
+
Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50%, we type:
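+
Something along these lines (the outcome labels H for hit and M for miss, and the vector name shot_outcomes, are illustrative; the result is stored in sim_basket, the name used in the rest of the lab):
+
shot_outcomes <- c("H", "M")
+sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)
+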
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.
+
+
What change needs to be made to the sample function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called sim_basket.
+
+
Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.
+
With the results of the simulation saved as sim_basket, we have the data necessary to compare Kobe to our independent shooter.
+
Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a hot hand.
+
+
+
+
More Practice
+
+
Comparing Kobe Bryant to the Independent Shooter
+
+
Using calc_streak, compute the streak lengths of sim_basket, and save the results in a data frame called sim_streak.
+
Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player’s longest streak of baskets in 133 shots? Make sure to include a plot in your answer.
+
If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.
+
How does Kobe Bryant’s distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe’s shooting patterns? Explain.
+
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
diff --git a/confidence_intervals/more/AmesHousing.csv b/05_sampling_distributions/more/AmesHousing.csv
similarity index 100%
rename from confidence_intervals/more/AmesHousing.csv
rename to 05_sampling_distributions/more/AmesHousing.csv
diff --git a/confidence_intervals/more/AmesHousing.xls b/05_sampling_distributions/more/AmesHousing.xls
similarity index 100%
rename from confidence_intervals/more/AmesHousing.xls
rename to 05_sampling_distributions/more/AmesHousing.xls
diff --git a/confidence_intervals/more/ames-readme.txt b/05_sampling_distributions/more/ames-readme.txt
similarity index 100%
rename from confidence_intervals/more/ames-readme.txt
rename to 05_sampling_distributions/more/ames-readme.txt
diff --git a/confidence_intervals/more/ames.csv b/05_sampling_distributions/more/ames.csv
similarity index 100%
rename from confidence_intervals/more/ames.csv
rename to 05_sampling_distributions/more/ames.csv
diff --git a/confidence_intervals/more/ames_dataprep.R b/05_sampling_distributions/more/ames_dataprep.R
similarity index 100%
rename from confidence_intervals/more/ames_dataprep.R
rename to 05_sampling_distributions/more/ames_dataprep.R
diff --git a/05_sampling_distributions/sampling_distributions.Rmd b/05_sampling_distributions/sampling_distributions.Rmd
new file mode 100644
index 0000000..5adcdb6
--- /dev/null
+++ b/05_sampling_distributions/sampling_distributions.Rmd
@@ -0,0 +1,363 @@
+---
+title: "Foundations for statistical inference - Sampling distributions"
+runtime: shiny
+output:
+ html_document:
+ css: lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+data(ames)
+```
+
+In this lab, we investigate the ways in which the statistics from a random
+sample of data can serve as point estimates for population parameters. We're
+interested in formulating a *sampling distribution* of our estimate in order
+to learn about the properties of the estimate, such as its distribution.
+
+
+**Setting a seed:** We will take some random samples and build sampling distributions
+in this lab, which means you should set a seed at the top of your lab. If this concept
+is new to you, review the lab concerning probability.
+
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The data
+
+We consider real estate data from the city of Ames, Iowa. The details of
+every real estate transaction in Ames are recorded by the City Assessor's
+office. Our particular focus for this lab will be all residential home sales
+in Ames between 2006 and 2010. This collection represents our population of
+interest. In this lab we would like to learn about these home sales by taking
+smaller samples from the full population. Let's load the data.
+
+```{r load-data}
+data(ames)
+```
+
+We see that there are quite a few variables in the data set, enough to do a
+very in-depth analysis. For this lab, we'll restrict our attention to just
+two of the variables: the above ground living area of the house in square feet
+(`area`) and the sale price (`price`).
+
+We can explore the distribution of areas of homes in the population of home
+sales visually and with summary statistics. Let's first create a visualization,
+a histogram:
+
+```{r area-hist}
+qplot(data = ames, x = area, binwidth = 250, geom = "histogram")
+```
+
+Let's also obtain some summary statistics. Note that we can do this using the
+`summarise` function. We can calculate as many statistics as we want using this
+function, and just string along the results. Some of the functions below should
+be self-explanatory (like `mean`, `median`, `sd`, `IQR`, `min`, and `max`). A
+new function here is the `quantile` function which we can use to calculate
+values corresponding to specific percentile cutoffs in the distribution. For
+example `quantile(x, 0.25)` will yield the cutoff value for the 25th percentile (Q1)
+in the distribution of x. Finding these values is useful for describing the
+distribution, as we can use them for descriptions like *"the middle 50% of the
+homes have areas between such and such square feet"*.
+
+```{r area-stats}
+ames %>%
+ summarise(mu = mean(area), pop_med = median(area),
+ sigma = sd(area), pop_iqr = IQR(area),
+ pop_min = min(area), pop_max = max(area),
+ pop_q1 = quantile(area, 0.25), # first quartile, 25th percentile
+ pop_q3 = quantile(area, 0.75)) # third quartile, 75th percentile
+```
+
+1. Describe this population distribution using a visualization and these summary
+ statistics. You don't have to use all of the summary statistics in your
+   description; you will need to decide which ones are relevant based on the
+ shape of the distribution. Make sure to include the plot and the summary
+ statistics output in your report along with your narrative.
+
+## The unknown sampling distribution
+
+In this lab we have access to the entire population, but this is rarely the
+case in real life. Gathering information on an entire population is often
+extremely costly or impossible. Because of this, we often take a sample of
+the population and use that to understand the properties of the population.
+
+If we were interested in estimating the mean living area in Ames based on a
+sample, we could use the `sample_n` command to survey the population.
+
+```{r samp1}
+samp1 <- ames %>%
+ sample_n(50)
+```
+
+This command collects a simple random sample of size 50 from the `ames` dataset,
+and assigns the result to `samp1`. This is like going into the City
+Assessor's database and pulling up the files on 50 random home sales. Working
+with these 50 files would be considerably simpler than working with all 2930
+home sales.
+
+1. Describe the distribution of area in this sample. How does it compare to the
+ distribution of the population? **Hint:** the `sample_n` function takes a random
+   sample of observations (i.e. rows) from the dataset, so you can still refer to
+ the variables in the dataset with the same names. Code you used in the
+   previous exercise will also be helpful for visualizing and summarizing the sample;
+   however, be careful not to label values `mu` and `sigma` anymore since these
+ are sample statistics, not population parameters. You can customize the labels
+ of any of the statistics to indicate that these come from the sample.
+
+If we're interested in estimating the average living area in homes in Ames
+using the sample, our best single guess is the sample mean.
+
+```{r mean-samp1}
+samp1 %>%
+ summarise(x_bar = mean(area))
+```
+
+Depending on which 50 homes you selected, your estimate could be a bit above
+or a bit below the true population mean of `r round(mean(ames$area),2)` square feet. In general,
+though, the sample mean turns out to be a pretty good estimate of the average
+living area, and we were able to get it by sampling less than 3\% of the
+population.
+
+1. Would you expect the mean of your sample to match the mean of another team's
+ sample? Why, or why not? If the answer is no, would you expect the means to
+ just be somewhat different or very different? Ask a neighboring team to confirm
+ your answer.
+
+1. Take a second sample, also of size 50, and call it `samp2`. How does the
+ mean of `samp2` compare with the mean of `samp1`? Suppose we took two
+ more samples, one of size 100 and one of size 1000. Which would you think
+ would provide a more accurate estimate of the population mean?
+
+Not surprisingly, every time we take another random sample, we get a different
+sample mean. It's useful to get a sense of just how much variability we
+should expect when estimating the population mean this way. The distribution
+of sample means, called the *sampling distribution (of the mean)*, can help us understand
+this variability. In this lab, because we have access to the population, we
+can build up the sampling distribution for the sample mean by repeating the
+above steps many times. Here we will generate 15,000 samples and compute the
+sample mean of each. Note that we specify that
+`replace = TRUE` since sampling distributions are constructed by sampling
+with replacement.
+
+```{r loop}
+sample_means50 <- ames %>%
+ rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
+ summarise(x_bar = mean(area))
+
+qplot(data = sample_means50, x = x_bar)
+```
+
+Here we use R to take 15,000 different samples of size 50 from the population, calculate
+the mean of each sample, and store each result in a vector called
+`sample_means50`. Next, we review how this set of code works.
+
+1. How many elements are there in `sample_means50`? Describe the sampling
+ distribution, and be sure to specifically note its center. Make sure to include
+ a plot of the distribution in your answer.
+
+## Interlude: Sampling distributions
+
+The idea behind the `rep_sample_n` function is *repetition*. Earlier we took
+a single sample of size `n` (50) from the population of all houses in Ames. With
+this new function we are able to repeat this sampling procedure `reps` times in order
+to build a distribution of a series of sample statistics, which is called the
+**sampling distribution**.
+
+Note that in practice one rarely gets to build true sampling distributions,
+because we rarely have access to data from the entire population.
+
+Without the `rep_sample_n` function, this would be painful. We would have to
+manually run the following code 15,000 times
+```{r sample-code, eval=FALSE}
+ames %>%
+ sample_n(size = 50) %>%
+ summarise(x_bar = mean(area))
+```
+as well as store the resulting sample means each time in a separate vector.
+
+Note that for each of the 15,000 times we computed a mean, we did so from a
+**different** sample!
+
+1. To make sure you understand how sampling distributions are built, and exactly
+ what the `rep_sample_n` function does, try modifying the code to create a
+ sampling distribution of **25 sample means** from **samples of size 10**,
+ and put them in a data frame named `sample_means_small`. Print the output.
+ How many observations are there in this object called `sample_means_small`?
+ What does each observation represent?
+
+## Sample size and the sampling distribution
+
+Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to
+compute a sampling distribution, specifically, the sampling distribution of the
+mean home area for samples of 50 houses.
+
+```{r hist}
+qplot(data = sample_means50, x = x_bar, geom = "histogram")
+```
+
+The sampling distribution that we computed tells us much about estimating
+the average living area in homes in Ames. Because the sample mean is an
+unbiased estimator, the sampling distribution is centered at the true average
+living area of the population, and the spread of the distribution
+indicates how much variability is incurred by sampling only 50 home sales.
+
+In the remainder of this section we will work on getting a sense of the effect that
+sample size has on our sampling distribution.
+
+1. Use the app below to create sampling distributions of means of `area`s from
+ samples of size 10, 50, and 100. Use 5,000 simulations. What does each
+ observation in the sampling distribution represent? How does the mean, standard
+ error, and shape of the sampling distribution change as the sample size
+ increases? How (if at all) do these values change if you increase the number
+ of simulations? (You do not need to include plots in your answer.)
+
+```{r shiny, echo=FALSE, eval=TRUE}
+shinyApp(
+ ui <- fluidPage(
+
+ # Sidebar with a slider input for number of bins
+ sidebarLayout(
+ sidebarPanel(
+
+ selectInput("selected_var",
+ "Variable:",
+ choices = list("area", "price"),
+ selected = "area"),
+
+ numericInput("n_samp",
+ "Sample size:",
+ min = 1,
+ max = nrow(ames),
+ value = 30),
+
+ numericInput("n_sim",
+ "Number of samples:",
+ min = 1,
+ max = 30000,
+ value = 15000)
+
+ ),
+
+ # Show a plot of the generated distribution
+ mainPanel(
+ plotOutput("sampling_plot"),
+ verbatimTextOutput("sampling_mean"),
+ verbatimTextOutput("sampling_se")
+ )
+ )
+ ),
+
+ # Define server logic required to draw a histogram
+ server <- function(input, output) {
+
+ # create sampling distribution
+ sampling_dist <- reactive({
+ ames[[input$selected_var]] %>%
+ sample(size = input$n_samp * input$n_sim, replace = TRUE) %>%
+ matrix(ncol = input$n_samp) %>%
+ rowMeans() %>%
+ data.frame(x_bar = .)
+ #ames %>%
+ # rep_sample_n(size = input$n_samp, reps = input$n_sim, replace = TRUE) %>%
+ # summarise_(x_bar = mean(input$selected_var))
+ })
+
+ # plot sampling distribution
+ output$sampling_plot <- renderPlot({
+ x_min <- quantile(ames[[input$selected_var]], 0.1)
+ x_max <- quantile(ames[[input$selected_var]], 0.9)
+
+ ggplot(sampling_dist(), aes(x = x_bar)) +
+ geom_histogram() +
+ xlim(x_min, x_max) +
+ ylim(0, input$n_sim * 0.35) +
+ ggtitle(paste0("Sampling distribution of mean ",
+ input$selected_var, " (n = ", input$n_samp, ")")) +
+ xlab(paste("mean", input$selected_var)) +
+ theme(plot.title = element_text(face = "bold", size = 16))
+ })
+
+ # mean of sampling distribution
+ output$sampling_mean <- renderText({
+ paste0("mean of sampling distribution = ", round(mean(sampling_dist()$x_bar), 2))
+ })
+
+  # standard error of sampling distribution
+ output$sampling_se <- renderText({
+ paste0("SE of sampling distribution = ", round(sd(sampling_dist()$x_bar), 2))
+ })
+ },
+
+ options = list(height = 500)
+)
+```
+
+
+* * *
+
+## More Practice
+
+So far, we have only focused on estimating the mean living area in homes in
+Ames. Now you'll try to estimate the mean home price.
+
+Note that while you might be able to answer some of these questions using the app,
+you are expected to write the required code and produce the necessary plots and
+summary statistics. You are welcome to use the app for exploration.
+
+1. Take a sample of size 15 from the population and calculate the mean `price`
+ of the homes in this sample. Using this sample, what is your best point estimate
+ of the population mean of prices of homes?
+
+1. Since you have access to the population, simulate the sampling
+   distribution of $\overline{price}$ for samples of size 15 by taking 2000
+   samples of size 15 from the population and computing 2000 sample means.
+ Store these means
+ in a vector called `sample_means15`. Plot the data, then describe the
+ shape of this sampling distribution. Based on this sampling distribution,
+ what would you guess the mean home price of the population to be? Finally,
+ calculate and report the population mean.
+
+1. Change your sample size from 15 to 150, then compute the sampling
+ distribution using the same method as above, and store these means in a
+ new vector called `sample_means150`. Describe the shape of this sampling
+ distribution, and compare it to the sampling distribution for a sample
+ size of 15. Based on this sampling distribution, what would you guess to
+ be the mean sale price of homes in Ames?
+
+1. Of the sampling distributions from 2 and 3, which has a smaller spread? If
+ we're concerned with making estimates that are more often close to the
+ true value, would we prefer a sampling distribution with a large or small spread?
+
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
+
\ No newline at end of file
diff --git a/06_confidence_intervals/confidence_intervals.Rmd b/06_confidence_intervals/confidence_intervals.Rmd
new file mode 100644
index 0000000..3c38cf4
--- /dev/null
+++ b/06_confidence_intervals/confidence_intervals.Rmd
@@ -0,0 +1,239 @@
+---
+title: 'Foundations for statistical inference - Confidence intervals'
+output:
+ html_document:
+ css: ../lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+If you have access to data on an entire population, say the size of every
+house in Ames, Iowa, it's straightforward to answer questions like, "How big
+is the typical house in Ames?" and "How much variation is there in sizes of
+houses?". If you have access to only a sample of the population, as is often
+the case, the task becomes more complicated. What is your best guess for the
+typical size if you only know the sizes of several dozen houses? This sort of
+situation requires that you use your sample to make inference on what your
+population looks like.
+
+
+**Setting a seed:** We will take some random samples and build sampling distributions
+in this lab, which means you should set a seed at the top of your lab. If this concept
+is new to you, review the lab concerning probability.
+
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for the OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The data
+
+We consider real estate data from the city of Ames, Iowa. This is the same
+dataset used in the previous lab. The details of
+every real estate transaction in Ames are recorded by the City Assessor's
+office. Our particular focus for this lab will be all residential home sales
+in Ames between 2006 and 2010. This collection represents our population of
+interest. In this lab we would like to learn about these home sales by taking
+smaller samples from the full population. Let's load the data.
+
+```{r load-data}
+data(ames)
+```
+
+In this lab we'll start with a simple random sample of size 60 from the
+population.
+
+```{r sample}
+n <- 60
+samp <- sample_n(ames, n)
+```
+
+Note that
+the data set has information on many housing variables, but for the first
+portion of the lab we'll focus on the size of the house, represented by the
+variable `area`.
+
+1. Describe the distribution of house area in your sample. What would you say is the
+ "typical" size within your sample? Also state precisely what you interpreted
+ "typical" to mean.
+
+1. Would you expect another student's distribution to be identical to yours?
+ Would you expect it to be similar? Why or why not?
+
+## Confidence intervals
+
+Return for a moment to the question that first motivated this lab: based on
+this sample, what can we infer about the population? Based only on this single
+sample, the best estimate of the average living area of houses sold in Ames
+would be the sample mean, usually denoted as $\bar{x}$ (here we're calling it
+`x_bar`). That serves as a good **point estimate** but it would be useful
+to also communicate how uncertain we are of that estimate. This uncertainty
+can be quantified using a **confidence interval**.
+
+A confidence interval for a population mean is of the following form
+\[ \bar{x} \pm z^\star \frac{s}{\sqrt{n}} \]
+
+You should by now be comfortable with calculating the mean and standard deviation of
+a sample in R. And we know that the sample size is 60. So the only remaining building
+block is finding the appropriate critical value for a given confidence level. We can
+use the `qnorm` function for this task, which will give the critical value associated
+with a given percentile under the normal distribution. Remember that confidence levels
+and percentiles are not equivalent. For example, a 95% confidence level refers to the
+middle 95% of the distribution, and the critical value associated with this area will
+correspond to the 97.5th percentile.
+
+We can find the critical value for a 95% confidence interval using
+```{r z_star_95}
+z_star_95 <- qnorm(0.975)
+z_star_95
+```
+which is roughly equal to the critical value 1.96 that you're likely
+familiar with by now.
+
+Let's finally calculate the confidence interval:
+```{r ci}
+samp %>%
+ summarise(x_bar = mean(area),
+ se = sd(area) / sqrt(n),
+ me = z_star_95 * se,
+ lower = x_bar - me,
+ upper = x_bar + me)
+```
+
+To recap: even though we don't know what the full population looks like, we're 95%
+confident that the true average size of houses in Ames lies between the values `lower`
+and `upper`. There are a few conditions that must be met for this interval to be valid.
+
+1. For the confidence interval to be valid, the sample mean must be normally
+ distributed and have standard error $s / \sqrt{n}$. What conditions must be
+ met for this to be true?
+
+## Confidence levels
+
+1. What does "95% confidence" mean?
+
+In this case we have the rare luxury of knowing the true population mean since we
+have data on the entire population. Let's calculate this value so that
+we can determine if our confidence intervals actually capture it. We'll store it in a
+data frame called `params` (short for population parameters), and name it `mu`.
+
+```{r pop-mean}
+params <- ames %>%
+ summarise(mu = mean(area))
+```
+
+1. Does your confidence interval capture the true average size of houses in
+ Ames? If you are working on this lab in a classroom, does your neighbor's
+ interval capture this value?
+
+1. Each student should have gotten a slightly different confidence interval. What
+ proportion of those intervals would you expect to capture the true population
+ mean? Why?
+
+Using R, we're going to collect many samples to learn more about how sample
+means and confidence intervals vary from one sample to another.
+
+Here is the rough outline:
+
+- Obtain a random sample.
+- Calculate the sample's mean and standard deviation, and use these to calculate
+and store the lower and upper bounds of the confidence intervals.
+- Repeat these steps 50 times.
+
+We can accomplish this using the `rep_sample_n` function. The following lines of
+code take 50 random samples of size `n` from the population (remember that we defined
+$n = 60$ earlier) and compute the upper and lower bounds of the confidence intervals based on these samples.
+
+```{r calculate-50-cis}
+ci <- ames %>%
+ rep_sample_n(size = n, reps = 50, replace = TRUE) %>%
+ summarise(x_bar = mean(area),
+ se = sd(area) / sqrt(n),
+ me = z_star_95 * se,
+ lower = x_bar - me,
+ upper = x_bar + me)
+```
+
+Let's view the first five intervals:
+
+```{r first-five-intervals}
+ci %>%
+ slice(1:5)
+```
+
+Next we'll create a plot similar to Figure 4.8 on page 175 of [OpenIntro Statistics, 3rd
+Edition](https://www.openintro.org/os). The first step will be to create a new variable in
+the `ci` data frame that indicates whether the interval does or does not capture the
+true population mean. Note that capturing this value would mean the lower bound of the
+confidence interval is below the value and the upper bound of the confidence interval is
+above the value. Remember that we create new variables using the `mutate` function.
+
+```{r capture-mu}
+ci <- ci %>%
+ mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no"))
+```
+
+The `ifelse` function is new. It takes three arguments: first is a logical statement,
+second is the value we want if the logical statement yields a true result, and the
+third is the value we want if the logical statement yields a false result.
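+
+For instance, a quick illustration of how `ifelse` behaves on a simple vector (this chunk
+is not part of the original lab code):
+
+```{r ifelse-example}
+ifelse(c(1, 5, 10) > 3, "yes", "no")
+# returns: "no" "yes" "yes"
+```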
+
+We now have all the information we need to create the plot.
+Note that the `geom_errorbar()` function
+only understands `y` values, and thus we have used the `coord_flip()` function
+to flip the coordinates of the entire plot back to the more familiar vertical
+orientation.
+
+```{r plot-ci}
+qplot(data = ci, x = replicate, y = x_bar, color = capture_mu) +
+ geom_errorbar(aes(ymin = lower, ymax = upper)) +
+ geom_hline(data = params, aes(yintercept = mu), color = "darkgray") + # draw vertical line
+ coord_flip()
+```
+
+1. What proportion of your confidence intervals include the true population mean? Is
+ this proportion exactly equal to the confidence level? If not, explain why. Make
+ sure to include your plot in your answer.
+
+* * *
+
+## More Practice
+
+1. Pick a confidence level of your choosing, provided it is not 95%. What is
+ the appropriate critical value?
+
+1. Calculate 50 confidence intervals at the confidence level you chose in the
+   previous question, plot all intervals on one plot, and calculate the proportion
+ of intervals that include the true population mean. How does this percentage compare
+ to the confidence level selected for the intervals? Make
+ sure to include your plot in your answer.
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
+
diff --git a/sampling_distributions/more/AmesHousing.csv b/06_confidence_intervals/more/AmesHousing.csv
similarity index 100%
rename from sampling_distributions/more/AmesHousing.csv
rename to 06_confidence_intervals/more/AmesHousing.csv
diff --git a/sampling_distributions/more/AmesHousing.xls b/06_confidence_intervals/more/AmesHousing.xls
similarity index 100%
rename from sampling_distributions/more/AmesHousing.xls
rename to 06_confidence_intervals/more/AmesHousing.xls
diff --git a/sampling_distributions/more/ames-readme.txt b/06_confidence_intervals/more/ames-readme.txt
similarity index 100%
rename from sampling_distributions/more/ames-readme.txt
rename to 06_confidence_intervals/more/ames-readme.txt
diff --git a/confidence_intervals/more/ames.RData b/06_confidence_intervals/more/ames.RData
similarity index 100%
rename from confidence_intervals/more/ames.RData
rename to 06_confidence_intervals/more/ames.RData
diff --git a/sampling_distributions/more/ames.csv b/06_confidence_intervals/more/ames.csv
similarity index 100%
rename from sampling_distributions/more/ames.csv
rename to 06_confidence_intervals/more/ames.csv
diff --git a/sampling_distributions/more/ames_dataprep.R b/06_confidence_intervals/more/ames_dataprep.R
similarity index 100%
rename from sampling_distributions/more/ames_dataprep.R
rename to 06_confidence_intervals/more/ames_dataprep.R
diff --git a/07_inf_for_numerical_data/inf_for_numerical_data.Rmd b/07_inf_for_numerical_data/inf_for_numerical_data.Rmd
new file mode 100644
index 0000000..e137245
--- /dev/null
+++ b/07_inf_for_numerical_data/inf_for_numerical_data.Rmd
@@ -0,0 +1,192 @@
+---
+title: 'Inference for numerical data'
+output:
+ html_document:
+ css: ../lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The data
+
+In 2004, the state of North Carolina released a large data set containing
+information on births recorded in this state. This data set is useful to
+researchers studying the relation between habits and practices of expectant
+mothers and the birth of their children. We will work with a random sample of
+observations from this data set.
+
+Load the `nc` data set into our workspace.
+
+```{r load-data}
+data(nc)
+```
+
+We have observations on 13 different variables, some categorical and some
+numerical. The meaning of each variable can be found by bringing up the help file:
+
+```{r help-nc}
+?nc
+```
+
+
+1. What are the cases in this data set? How many cases are there in our sample?
+
+Remember that you can answer this question by viewing the data in the data viewer or
+by using the following command:
+
+```{r str}
+glimpse(nc)
+```
+
+## Exploratory data analysis
+
+We will first start with analyzing the weight gained by mothers throughout the
+pregnancy: `gained`.
+
+Using visualization and summary statistics, describe the distribution of weight
+gained by mothers during pregnancy. The `favstats` function from `mosaic` can be useful.
+
+```{r summary}
+library(mosaic)
+favstats(~gained, data = nc)
+```
+
+1. How many mothers are we missing weight gain data from?
+
+Next, consider the possible relationship between a mother's smoking habit and the
+weight of her baby. Plotting the data is a useful first step because it helps
+us quickly visualize trends, identify strong associations, and develop research
+questions.
+
+2. Make a side-by-side boxplot of `habit` and `weight`. What does the plot
+highlight about the relationship between these two variables?
+
+The box plots show how the medians of the two distributions compare, but we can
+also compare the means of the distributions using the following chunk, which
+first groups the data by the `habit` variable and then calculates the mean
+`weight` in these groups using the `mean` function.
+
+```{r by-means}
+nc %>%
+ group_by(habit) %>%
+ summarise(mean_weight = mean(weight))
+```
+
+There is an observed difference, but is this difference statistically
+significant? In order to answer this question we will conduct a hypothesis test.
+
+## Inference
+
+3. Are all conditions necessary for inference satisfied? Comment on each. You can
+compute the group sizes with the `summarize` command above by defining a new variable
+with the definition `n()`.
+
+4. Write the hypotheses for testing if the average weights of babies born to
+smoking and non-smoking mothers are different.
+
+Next, we introduce a new function, `inference`, that we will use for conducting
+hypothesis tests and constructing confidence intervals.
+
+```{r inf-weight-habit-ht, tidy=FALSE}
+inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0,
+ alternative = "twosided", method = "theoretical")
+```
+
+Let's pause for a moment to go through the arguments of this custom function.
+The first argument is `y`, which is the response variable that we are
+interested in: `weight`. The second argument is the explanatory variable,
+`x`, which is the variable that splits the data into two groups, smokers and
+non-smokers: `habit`. The third argument, `data`, is the data frame these
+variables are stored in. Next is `statistic`, which is the sample statistic
+we're using, or similarly, the population parameter we're estimating. In future labs
+we'll also work with "median" and "proportion". Next we decide on the `type` of inference
+we want: a hypothesis test (`"ht"`) or a confidence interval (`"ci"`). When performing a
+hypothesis test, we also need to supply the `null` value, which in this case is `0`,
+since the null hypothesis sets the two population means equal to each other.
+The `alternative` hypothesis can be `"less"`, `"greater"`, or `"twosided"`.
+Lastly, the `method` of inference can be `"theoretical"` or `"simulation"` based.
+
+For more information on the inference function see the help file with `?inference`.
+
+5. Change the `type` argument to `"ci"` to construct and record a confidence
+interval for the difference between the weights of babies born to nonsmoking and
+smoking mothers, and interpret this interval in context of the data. Note that by
+default you'll get a 95% confidence interval. If you want to change the
+confidence level, add a new argument (`conf_level`) which takes on a value
+between 0 and 1. Also note that when doing a confidence interval arguments like
+`null` and `alternative` are not useful, so make sure to remove them.
+
+By default the function reports an interval for ($\mu_{nonsmoker} - \mu_{smoker}$).
+We can easily change this order by using the `order` argument:
+
+```{r inf-weight-habit-ci, tidy=FALSE}
+inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci",
+ method = "theoretical", order = c("smoker","nonsmoker"))
+```
+
+* * *
+
+## More Practice
+
+6. Calculate a 95% confidence interval for the average length of pregnancies
+(`weeks`) and interpret it in context. Note that since you're doing inference
+on a single population parameter, there is no explanatory variable, so you can
+omit the `x` variable from the function.
+
+7. Calculate a new confidence interval for the same parameter at the 90%
+confidence level. You can change the confidence level by adding a new argument
+to the function: `conf_level = 0.90`. Comment on the width of this interval versus
+the one obtained in the previous exercise.
+
+8. Conduct a hypothesis test evaluating whether the average weight gained by
+younger mothers is different than the average weight gained by mature mothers.
+
+9. Now, a non-inference task: Determine the age cutoff for younger and mature
+mothers. Use a method of your choice, and explain how your method works.
+
+10. Pick a pair of variables: one numerical (response) and one categorical (explanatory).
+Come up with a research question evaluating the relationship between these variables.
+Formulate the question in a way that it can be answered using a hypothesis test
+and/or a confidence interval. Answer your question using the `inference`
+function, report the statistical results, and also provide an explanation in
+plain language. Be sure to check all assumptions, state your $\alpha$ level, and conclude
+in context. (Note: Picking your own variables, coming up with a research question,
+and analyzing the data to answer this question is basically what you'll need to do for
+your project as well.)
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab
+written by the faculty and TAs of UCLA Statistics.
+
\ No newline at end of file
diff --git a/07_inf_for_numerical_data/inf_for_numerical_data.html b/07_inf_for_numerical_data/inf_for_numerical_data.html
new file mode 100644
index 0000000..84c77bf
--- /dev/null
+++ b/07_inf_for_numerical_data/inf_for_numerical_data.html
@@ -0,0 +1,361 @@
diff --git a/inf_for_numerical_data/more/1000births.xlsx b/07_inf_for_numerical_data/more/1000births.xlsx
similarity index 100%
rename from inf_for_numerical_data/more/1000births.xlsx
rename to 07_inf_for_numerical_data/more/1000births.xlsx
diff --git a/inf_for_numerical_data/more/2000births.txt b/07_inf_for_numerical_data/more/2000births.txt
similarity index 100%
rename from inf_for_numerical_data/more/2000births.txt
rename to 07_inf_for_numerical_data/more/2000births.txt
diff --git a/inf_for_numerical_data/more/data-processing-code.R b/07_inf_for_numerical_data/more/data-processing-code.R
similarity index 100%
rename from inf_for_numerical_data/more/data-processing-code.R
rename to 07_inf_for_numerical_data/more/data-processing-code.R
diff --git a/inf_for_numerical_data/more/nc.RData b/07_inf_for_numerical_data/more/nc.RData
similarity index 100%
rename from inf_for_numerical_data/more/nc.RData
rename to 07_inf_for_numerical_data/more/nc.RData
diff --git a/inf_for_numerical_data/more/nc.csv b/07_inf_for_numerical_data/more/nc.csv
similarity index 100%
rename from inf_for_numerical_data/more/nc.csv
rename to 07_inf_for_numerical_data/more/nc.csv
diff --git a/inf_for_numerical_data/more/ncbirths.csv b/07_inf_for_numerical_data/more/ncbirths.csv
similarity index 100%
rename from inf_for_numerical_data/more/ncbirths.csv
rename to 07_inf_for_numerical_data/more/ncbirths.csv
diff --git a/08_inf_for_categorical_data/inf_for_categorical_data.Rmd b/08_inf_for_categorical_data/inf_for_categorical_data.Rmd
new file mode 100644
index 0000000..e672b6f
--- /dev/null
+++ b/08_inf_for_categorical_data/inf_for_categorical_data.Rmd
@@ -0,0 +1,288 @@
+---
+title: "Inference for categorical data"
+runtime: shiny
+output:
+ html_document:
+ css: www/lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+```
+
+In August of 2012, news outlets ranging from the [Washington Post](http://www.washingtonpost.com/national/on-faith/poll-shows-atheism-on-the-rise-in-the-us/2012/08/13/90020fd6-e57d-11e1-9739-eef99c5fb285_story.html) to the [Huffington Post](http://www.huffingtonpost.com/2012/08/14/atheism-rise-religiosity-decline-in-america_n_1777031.html) ran a story about the rise of atheism in America. The source for the story was a poll that asked people, "Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person, or a convinced atheist?" This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what's at play when making inference about population proportions using categorical data.
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE, eval=TRUE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The survey
+
+The press release for the poll, conducted by WIN-Gallup International, can be accessed [here](http://www.wingia.com/web/files/richeditor/filemanager/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf).
+
+Take a moment to review the report then address the following questions.
+
+1. In the first paragraph, several key findings are reported. Do these
+ percentages appear to be *sample statistics* (derived from the data
+ sample) or *population parameters*? Explain your reasoning.
+
+1. The title of the report is "Global Index of Religiosity and Atheism". To
+ generalize the report's findings to the global human population, what must
+ we assume about the sampling method? Does that seem like a reasonable
+ assumption?
+
+### The data
+
+Turn your attention to Table 6 (pages 15 and 16), which reports the
+sample size and response percentages for all 57 countries. While this is
+a useful format to summarize the data, we will base our analysis on the
+original data set of individual responses to the survey. Load this data
+set into R with the following command.
+
+```{r head-data}
+data(atheism)
+```
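+
+To get a feel for the individual responses before answering, you can take a
+quick look at the data. A minimal sketch (`glimpse()` is a `dplyr` function):
+
+```{r glimpse-data, eval=FALSE}
+glimpse(atheism)
+```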
+
+1. What does each row of Table 6 correspond to? What does each row of
+ `atheism` correspond to?
+
+To investigate the link between these two ways of organizing this data, take a
+look at the estimated proportion of atheists in the United States. Towards
+the bottom of Table 6, we see that this is 5%. We should be able to come to
+the same number using the `atheism` data.
+
+1. Using the command below, create a new data frame called `us12` that contains
+ only the rows in `atheism` associated with respondents to the 2012 survey
+ from the United States. Next, calculate the proportion of atheist
+ responses. Does it agree with the percentage in Table 6? If not, why?
+
+```{r us-atheism}
+us12 <- atheism %>%
+ filter(nationality == "United States", year == "2012")
+```
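+
+One possible way to carry out the calculation asked for above, shown only as a
+sketch (the exercise expects you to work this out and interpret it yourself):
+
+```{r us-atheism-prop, eval=FALSE}
+us12 %>%
+  summarize(p_hat = mean(response == "atheist"))
+```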
+
+## Inference on proportions
+
+As was hinted earlier, Table 6 provides *statistics*, that is,
+calculations made from the sample of 51,927 people. What we'd like, though, is
+insight into the population *parameters*. You answer the question "What
+proportion of people in your sample reported being atheists?" with a
+statistic, while the question "What proportion of people on earth would report
+being atheists?" is answered with an estimate of the parameter.
+
+The inferential tools for estimating population proportion are analogous to
+those used for means in the last chapter: the confidence interval and the
+hypothesis test.
+
+1. Write out the conditions for inference to construct a 95% confidence
+ interval for the proportion of atheists in the United States in 2012.
+ Are you confident all conditions are met?
+
+If the conditions for inference are reasonable, we can calculate
+the standard error and construct the interval in R.
+
+```{r us-atheism-ci, tidy = FALSE}
+us12 %>%
+ summarize(N = n(), atheist = sum(response == "atheist")) %>%
+ mutate(p_hat = atheist / N,
+ se = sqrt(p_hat * (1 - p_hat) / N),
+ me = qnorm(0.975) * se,
+ lower = p_hat - me,
+ upper = p_hat + me)
+```
+
+Note that since the goal is to construct an interval estimate for a
+proportion, it's necessary to specify what constitutes a "success", which here
+is a response of `"atheist"`. Also, the `qnorm` function gives us the critical
+value (the number of standard errors from the mean) that our confidence
+interval needs to span in order to achieve a 95% confidence level. Since the
+normal distribution is symmetric, cutting off the smallest 2.5% and the largest
+2.5% leaves the middle 95%; by changing the argument to `qnorm` we can find
+intervals that correspond to different confidence levels.
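+
+For reference, the critical values used for common confidence levels can be
+checked directly (a small aside; these `qnorm` calls are not required for the
+lab):
+
+```{r critical-values, eval=FALSE}
+qnorm(0.975)  # critical value for a 95% confidence interval, about 1.96
+qnorm(0.995)  # critical value for a 99% confidence interval, about 2.58
+```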
+
+Although formal confidence intervals and hypothesis tests don't show up in the
+report, suggestions of inference appear at the bottom of page 7: "In general,
+the error margin for surveys of this kind is $\pm$ 3--5% at 95% confidence".
+
+1. Based on the R output, what is the margin of error for the estimate of the
+ proportion of atheists in the US in 2012?
+
+1. Calculate confidence intervals for the
+ proportion of atheists in 2012 in two other countries of your choice, and
+ report the associated margins of error. Be sure to note whether the
+ conditions for inference are met, and interpret the interval in the context of the data.
+ It may be helpful to create new data sets for each of the two countries first, and
+ then use these data sets to construct the confidence
+ intervals.
+
+## How does the proportion affect the margin of error?
+
+Imagine you've set out to survey 1000 people on two questions: are you female?
+and are you left-handed? Since both of these sample proportions were
+calculated from the same sample size, they should have the same margin of
+error, right? Wrong! While the margin of error does change with sample size,
+it is also affected by the proportion.
+
+Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This
+is then used in the formula for the margin of error for a 95% confidence
+interval:
+$$
+ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,.
+$$
+Since the
+population proportion $p$ is in this $ME$ formula, it should make sense that
+the margin of error is in some way dependent on the population proportion. We
+can visualize this relationship by creating a plot of $ME$ vs. $p$.
+
+Since we are interested in the effect of the proportion alone, we hold the
+sample size fixed at $n = 1000$ and use this value in the following calculations:
+
+```{r n-for-me-plot}
+n <- 1000
+```
+
+The first step is to make a variable `p` that is a sequence from 0 to 1 in
+increments of 0.01. We can then create a variable `me` containing the margin of
+error associated with each of these values of `p`, using the familiar
+approximate formula ($ME = 2 \times SE$).
+
+```{r p-me}
+p <- seq(from = 0, to = 1, by = 0.01)
+me <- 2 * sqrt(p * (1 - p)/n)
+```
+
+Lastly, we plot the two variables against each other to reveal their relationship.
+To do so, we first need to put these variables in a data frame that we can
+pass to the `qplot` function.
+
+```{r me-plot}
+dd <- data.frame(p = p, me = me)
+qplot(x = p, y = me, data = dd,
+ ylab = "Margin of Error",
+ xlab = "Population Proportion") +
+ geom_line()
+```
+
+1. Describe the relationship between `p` and `me`. Include the margin of
+ error vs. population proportion plot you constructed in your answer. For
+ a given sample size, for which value of `p` is the margin of error maximized?
+
+## Success-failure condition
+
+We have emphasized that you must always check conditions before making
+inference. For inference on proportions, the sample proportion can be assumed
+to be nearly normal if it is based upon a random sample of independent
+observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of
+thumb is easy enough to follow, but it makes one wonder: what's so special
+about the number 10?
+
+The short answer is: nothing. You could argue that we would be fine with 9 or
+that we really should be using 11. The "best" value for such a rule of
+thumb is, at least to some degree, arbitrary. However, when $np$ and $n(1-p)$
+reach 10, the sampling distribution is sufficiently normal to use confidence
+intervals and hypothesis tests that are based on that approximation.
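+
+As a quick check of this condition for the US sample, you could count the
+successes and failures directly (a sketch; both counts should be at least 10):
+
+```{r check-success-failure, eval=FALSE}
+us12 %>%
+  summarize(successes = sum(response == "atheist"),
+            failures  = sum(response != "atheist"))
+```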
+
+We can investigate the interplay between $n$ and $p$ and the shape of the
+sampling distribution by using simulations. Play around with the following
+app to investigate how the shape, center, and spread of the distribution of
+$\hat{p}$ change as $n$ and $p$ change.
+
+```{r sf-app, echo=FALSE, eval=TRUE}
+inputPanel(
+ numericInput("n", label = "Sample size:", value = 300),
+
+ sliderInput("p", label = "Population proportion:",
+ min = 0, max = 1, value = 0.1, step = 0.01),
+
+ numericInput("x_min", label = "Min for x-axis:", value = 0, min = 0, max = 1),
+ numericInput("x_max", label = "Max for x-axis:", value = 1, min = 0, max = 1)
+)
+
+renderPlot({
+  # simulate 5000 samples of size n and record the sample proportion from each
+  pp <- data.frame(p_hat = rep(0, 5000))
+  for(i in 1:5000){
+    samp <- sample(c(TRUE, FALSE), input$n, replace = TRUE,
+                   prob = c(input$p, 1 - input$p))
+    pp$p_hat[i] <- sum(samp == TRUE) / input$n
+  }
+  # plot the simulated sampling distribution, using roughly 30 bars across its range
+  bw <- diff(range(pp$p_hat)) / 30
+  ggplot(data = pp, aes(x = p_hat)) +
+    geom_histogram(binwidth = bw) +
+    xlim(input$x_min, input$x_max) +
+    ggtitle(paste0("Distribution of p_hats, drawn from p = ", input$p, ", n = ", input$n))
+})
+```
+
+1. Describe the sampling distribution of sample proportions at $n = 300$ and
+ $p = 0.1$. Be sure to note the center, spread, and shape.
+
+1. Keep $n$ constant and change $p$. How do the shape, center, and spread
+ of the sampling distribution vary as $p$ changes? You might want to adjust
+ the min and max for the $x$-axis for a better view of the distribution.
+
+1. Now also change $n$. How does $n$ appear to affect the distribution of $\hat{p}$?
+
+1. If you refer to Table 6, you'll find that Australia has a sample
+ proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample
+ proportion of 0.02 on 400 subjects. Let's suppose for this exercise that
+ these point estimates are actually the truth. Construct their sampling
+ distributions by using these values as inputs in the app. Do you think it
+ is sensible to proceed with inference and report margins of error, as the
+ report does?
+
+* * *
+
+## More Practice
+
+The question of atheism was asked by WIN-Gallup International in a similar
+survey that was conducted in 2005. (We assume here that sample sizes have
+remained the same.) Table 4 on page 13 of the report summarizes survey results
+from 2005 and 2012 for 39 countries.
+
+
+1. Is there convincing evidence that Spain has seen a change in its atheism index
+ between 2005 and 2012? As always, write out the hypotheses for any tests you
+ conduct and outline the status of the conditions for inference. If you find a
+ significant difference, also quantify this difference with a confidence interval. \
+ *Hint:* Use the difference of two proportions methodology (i.e., find the
+ observed difference, compute the standard error, compute the z-score, etc.);
+ a rough sketch of this setup appears after these exercises.
+
+1. Is there convincing evidence that the US has seen a change in its atheism index
+ between 2005 and 2012? As always, write out the hypotheses for any tests you
+ conduct and outline the status of the conditions for inference. If you find a
+ significant difference, also quantify this difference with a confidence interval.
+
+1. If in fact there has been no change in the atheism index in the countries
+ listed in Table 4, in how many of those countries would you expect to
+ detect a change (at a significance level of 0.05) simply by chance?\
+ *Hint:* Review the definition of the Type 1 error.
+
+1. Suppose you're hired by the local government to estimate the proportion of
+ residents that attend a religious service on a weekly basis. According to
+ the guidelines, the estimate must have a margin of error no greater than
+ 1% with 95% confidence. You have no idea what to expect for $p$. How many
+ people would you have to sample to ensure that you are within the
+ guidelines?\
+ *Hint:* Refer to your plot of the relationship between $p$ and margin of
+ error. This question does not require using the dataset.
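+
+As mentioned in the hint to the first exercise above, here is a rough sketch of
+how you might set up the two-proportion comparison for Spain. It is only a
+starting point, not a complete solution; the standard error and z-score are
+left for you to compute:
+
+```{r spain-setup, eval=FALSE}
+# sample proportions of atheists in Spain, by survey year
+atheism %>%
+  filter(nationality == "Spain") %>%
+  group_by(year) %>%
+  summarize(N = n(), p_hat = mean(response == "atheist"))
+# then combine the two rows: se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
+# and z = (p1 - p2) / se
+```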
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
+
diff --git a/inf_for_categorical_data/inf_for_categorical_data.html b/08_inf_for_categorical_data/inf_for_categorical_data.html
similarity index 98%
rename from inf_for_categorical_data/inf_for_categorical_data.html
rename to 08_inf_for_categorical_data/inf_for_categorical_data.html
index 29fe1f8..b451e90 100644
--- a/inf_for_categorical_data/inf_for_categorical_data.html
+++ b/08_inf_for_categorical_data/inf_for_categorical_data.html
[tag-stripped rendered HTML diff; the content tracks the Inference for categorical data lab, whose current source is inf_for_categorical_data.Rmd above]
diff --git a/inf_for_categorical_data/more/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf b/08_inf_for_categorical_data/more/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf
similarity index 100%
rename from inf_for_categorical_data/more/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf
rename to 08_inf_for_categorical_data/more/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf
diff --git a/inf_for_categorical_data/more/atheism.RData b/08_inf_for_categorical_data/more/atheism.RData
similarity index 100%
rename from inf_for_categorical_data/more/atheism.RData
rename to 08_inf_for_categorical_data/more/atheism.RData
diff --git a/inf_for_categorical_data/more/atheism.csv b/08_inf_for_categorical_data/more/atheism.csv
similarity index 100%
rename from inf_for_categorical_data/more/atheism.csv
rename to 08_inf_for_categorical_data/more/atheism.csv
diff --git a/inf_for_categorical_data/more/atheism05.csv b/08_inf_for_categorical_data/more/atheism05.csv
similarity index 100%
rename from inf_for_categorical_data/more/atheism05.csv
rename to 08_inf_for_categorical_data/more/atheism05.csv
diff --git a/inf_for_categorical_data/more/atheism12.csv b/08_inf_for_categorical_data/more/atheism12.csv
similarity index 100%
rename from inf_for_categorical_data/more/atheism12.csv
rename to 08_inf_for_categorical_data/more/atheism12.csv
diff --git a/inf_for_categorical_data/more/dataprep-code.R b/08_inf_for_categorical_data/more/dataprep-code.R
similarity index 100%
rename from inf_for_categorical_data/more/dataprep-code.R
rename to 08_inf_for_categorical_data/more/dataprep-code.R
diff --git a/inf_for_categorical_data/more/table4.csv b/08_inf_for_categorical_data/more/table4.csv
similarity index 100%
rename from inf_for_categorical_data/more/table4.csv
rename to 08_inf_for_categorical_data/more/table4.csv
diff --git a/inf_for_categorical_data/more/table6.csv b/08_inf_for_categorical_data/more/table6.csv
similarity index 100%
rename from inf_for_categorical_data/more/table6.csv
rename to 08_inf_for_categorical_data/more/table6.csv
diff --git a/08_inf_for_categorical_data/www/lab.css b/08_inf_for_categorical_data/www/lab.css
new file mode 100644
index 0000000..567e82e
--- /dev/null
+++ b/08_inf_for_categorical_data/www/lab.css
@@ -0,0 +1,87 @@
+body {
+ counter-reset: li; /* initialize counter named li */
+}
+
+h1 {
+ font-family:Arial, Helvetica, sans-serif;
+ font-weight:bold;
+}
+
+h2 {
+ font-family:Arial, Helvetica, sans-serif;
+ font-weight:bold;
+ margin-top: 24px;
+}
+
+ol {
+ margin-left:0; /* Remove the default left margin */
+ padding-left:0; /* Remove the default left padding */
+}
+ol > li {
+ position:relative; /* Create a positioning context */
+ margin:0 0 10px 2em; /* Give each list item a left margin to make room for the numbers */
+ padding:10px 80px; /* Add some spacing around the content */
+ list-style:none; /* Disable the normal item numbering */
+ border-top:2px solid #317EAC;
+ background:rgba(49, 126, 172, 0.1);
+}
+ol > li:before {
+ content:"Exercise " counter(li); /* Use the counter as content */
+ counter-increment:li; /* Increment the counter by 1 */
+ /* Position and style the number */
+ position:absolute;
+ top:-2px;
+ left:-2em;
+ -moz-box-sizing:border-box;
+ -webkit-box-sizing:border-box;
+ box-sizing:border-box;
+ width:7em;
+ /* Some space between the number and the content in browsers that support
+ generated content but not positioning it (Camino 2 is one example) */
+ margin-right:8px;
+ padding:4px;
+ border-top:2px solid #317EAC;
+ color:#fff;
+ background:#317EAC;
+ font-weight:bold;
+ font-family:"Helvetica Neue", Arial, sans-serif;
+ text-align:center;
+}
+li ol,
+li ul {margin-top:6px;}
+ol ol li:last-child {margin-bottom:0;}
+
+.oyo ul {
+ list-style-type:decimal;
+}
+
+hr {
+ border: 1px solid #357FAA;
+}
+
+div#boxedtext {
+ background-color: rgba(86, 155, 189, 0.2);
+ padding: 20px;
+ margin-bottom: 20px;
+ font-size: 10pt;
+}
+
+div#template {
+ margin-top: 30px;
+ margin-bottom: 30px;
+ color: #808080;
+ border:1px solid #808080;
+ padding: 10px 10px;
+ background-color: rgba(128, 128, 128, 0.2);
+ border-radius: 5px;
+}
+
+div#license {
+ margin-top: 30px;
+ margin-bottom: 30px;
+ color: #4C721D;
+ border:1px solid #4C721D;
+ padding: 10px 10px;
+ background-color: rgba(76, 114, 29, 0.2);
+ border-radius: 5px;
+}
\ No newline at end of file
diff --git a/simple_regression/more/dataprep-code.R b/09_simple_regression/more/dataprep-code.R
similarity index 100%
rename from simple_regression/more/dataprep-code.R
rename to 09_simple_regression/more/dataprep-code.R
diff --git a/simple_regression/more/mlb09.csv b/09_simple_regression/more/mlb09.csv
similarity index 100%
rename from simple_regression/more/mlb09.csv
rename to 09_simple_regression/more/mlb09.csv
diff --git a/simple_regression/more/mlb11-readme.txt b/09_simple_regression/more/mlb11-readme.txt
similarity index 100%
rename from simple_regression/more/mlb11-readme.txt
rename to 09_simple_regression/more/mlb11-readme.txt
diff --git a/simple_regression/more/mlb11.RData b/09_simple_regression/more/mlb11.RData
similarity index 100%
rename from simple_regression/more/mlb11.RData
rename to 09_simple_regression/more/mlb11.RData
diff --git a/simple_regression/more/mlb11.csv b/09_simple_regression/more/mlb11.csv
similarity index 100%
rename from simple_regression/more/mlb11.csv
rename to 09_simple_regression/more/mlb11.csv
diff --git a/simple_regression/more/plot_ss.R b/09_simple_regression/more/plot_ss.R
similarity index 100%
rename from simple_regression/more/plot_ss.R
rename to 09_simple_regression/more/plot_ss.R
diff --git a/simple_regression/simple_regression.Rmd b/09_simple_regression/simple_regression.Rmd
similarity index 66%
rename from simple_regression/simple_regression.Rmd
rename to 09_simple_regression/simple_regression.Rmd
index 5e552f8..5d865b0 100644
--- a/simple_regression/simple_regression.Rmd
+++ b/09_simple_regression/simple_regression.Rmd
@@ -5,15 +5,18 @@ output:
css: ../lab.css
highlight: pygments
theme: cerulean
- pdf_document: default
---
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+```
+
## Batter up
-The movie [Moneyball](http://en.wikipedia.org/wiki/Moneyball_(film)) focuses on
+The movie [*Moneyball*](http://en.wikipedia.org/wiki/Moneyball_(film)) focuses on
the "quest for the secret of success in baseball". It follows a low-budget team,
the Oakland Athletics, who believed that underused statistics, such as a player's
-ability to get on base, betterpredict the ability to score runs than typical
+ability to get on base, better predict the ability to score runs than typical
statistics like home runs, RBIs (runs batted in), and batting average. Obtaining
players who excelled in these underused statistics turned out to be much more
affordable for the team.
@@ -24,21 +27,40 @@ of other player statistics. Our aim will be to summarize these relationships
both graphically and numerically in order to find which variable, if any, helps
us best predict a team's runs scored in a season.
-## The data
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document from a template (in RStudio: File, New File, R Markdown..., From Template) and select Lab Report from the `oilabs` package.
+
+### The data
Let's load up the data for the 2011 season.
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
-load("mlb11.RData")
+```{r load-data}
+data(mlb11)
```
-In addition to runs scored, there are seven traditionally used variables in the
+In addition to runs scored, there are seven [traditionally-used variables](https://en.wikipedia.org/wiki/Baseball_statistics#Commonly_used_statistics) in the
data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases,
and wins. There are also three newer variables: on-base percentage, slugging
percentage, and on-base plus slugging. For the first portion of the analysis
we'll consider the seven traditional variables. At the end of the lab, you'll
-work with the newer variables on your own.
+work with the three newer variables on your own.
1. What type of plot would you use to display the relationship between `runs`
and one of the other numerical variables? Plot this relationship using the
@@ -49,12 +71,20 @@ work with the newer variables on your own.
If the relationship looks linear, we can quantify the strength of the
relationship with the correlation coefficient.
-```{r cor, eval=FALSE}
-cor(mlb11$runs, mlb11$at_bats)
+```{r cor}
+mlb11 %>%
+ summarise(cor(runs, at_bats))
```
## Sum of squared residuals
+
+In this section you will use an interactive function to investigate what we mean by "sum
+of squared residuals". You will need to run this function in your console, not in your
+markdown document. Running the function also requires that the `mlb11` dataset is loaded
+in your environment.
+
+
Think back to the way that we described the distribution of a single variable.
Recall that we discussed characteristics such as center, spread, and shape. It's
also useful to be able to describe the relationship of two numerical variables,
@@ -71,7 +101,7 @@ function to select the line that you think does the best job of going through
the cloud of points.
```{r plotss-atbats-runs, eval=FALSE}
-plot_ss(x = mlb11$at_bats, y = mlb11$runs)
+plot_ss(x = at_bats, y = runs, data = mlb11)
```
After running this command, you'll be prompted to click two points on the plot
@@ -89,7 +119,7 @@ the sum of squared residuals. To visualize the squared residuals, you can rerun
the plot command and add the argument `showSquares = TRUE`.
```{r plotss-atbats-runs-squares, eval=FALSE}
-plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
+plot_ss(x = at_bats, y = runs, data = mlb11, showSquares = TRUE)
```
Note that the output from the `plot_ss` function provides you with the slope and
@@ -106,7 +136,7 @@ line that minimizes the sum of squared residuals, through trial and error.
Instead we can use the `lm` function in R to fit the linear model (a.k.a.
regression line).
-```{r m1, eval=FALSE}
+```{r m1}
m1 <- lm(runs ~ at_bats, data = mlb11)
```
@@ -119,7 +149,7 @@ The output of `lm` is an object that contains all of the information we need
about the linear model that was just fit. We can access this information using
the summary function.
-```{r summary-m1, eval=FALSE}
+```{r summary-m1}
summary(m1)
```
@@ -131,7 +161,7 @@ With this table, we can write down the least squares regression line for the
linear model:
\[
- \hat{y} = -2789.2429 + 0.6305 * atbats
+ \hat{y} = -2789.2429 + 0.6305 \times at\_bats
\]
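+
+If you would rather pull these estimates out of the fitted model than read them
+off the summary table, one option is the `coef` function (a small aside, not
+required for the lab):
+
+```{r m1-coefficients, eval=FALSE}
+coef(m1)  # intercept and slope of the least squares line
+```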
One last piece of information we will discuss from the summary output is the
@@ -147,23 +177,25 @@ explained by at-bats.
## Prediction and prediction errors
-Let's create a scatterplot with the least squares line laid on top.
+Let's create a scatterplot with the least squares line for `m1` laid on top.
-```{r reg-with-line, eval=FALSE}
-plot(mlb11$runs ~ mlb11$at_bats)
-abline(m1)
+```{r reg-with-line}
+qplot(x = at_bats, y = runs, data = mlb11, geom = "point") +
+ geom_smooth(method = "lm", se = FALSE)
```
-The function `abline` plots a line based on its slope and intercept. Here, we
-used a shortcut by providing the model `m1`, which contains both parameter
-estimates. This line can be used to predict $y$ at any value of $x$. When
+Here we are literally adding a layer on top of our plot. `geom_smooth` creates
+the line by fitting a linear model. It can also show us the standard error `se`
+associated with our line, but we'll suppress that for now.
+
+This line can be used to predict $y$ at any value of $x$. When
predictions are made for values of $x$ that are beyond the range of the observed
data, it is referred to as *extrapolation* and is not usually recommended.
However, predictions made within the range of the data are more reliable.
They're also used to compute the residuals.
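+
+For instance, one way to obtain the model's prediction at a chosen number of
+at-bats is with `predict` (a sketch; the value 5500 below is just an
+illustration, not part of the exercise):
+
+```{r predict-sketch, eval=FALSE}
+predict(m1, newdata = data.frame(at_bats = 5500))
+```
+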
5. If a team manager saw the least squares regression line and not the actual
- data, how many runs would he or she predict for a team with 5,578 at-bats?
+ data, how many runs would he or she predict for a team with 5,579 at-bats?
Is this an overestimate or an underestimate, and by how much? In other
words, what is the residual for this prediction?
@@ -172,50 +204,61 @@ They're also used to compute the residuals.
To assess whether the linear model is reliable, we need to check for (1)
linearity, (2) nearly normal residuals, and (3) constant variability.
-*Linearity*: You already checked if the relationship between runs and at-bats
+**Linearity**: You already checked if the relationship between runs and at-bats
is linear using a scatterplot. We should also verify this condition with a plot
-of the residuals vs. at-bats. Recall that any code following a *#* is intended
-to be a comment that helps understand the code but is ignored by R.
+of the residuals vs. fitted (predicted) values.
-```{r residuals, eval=FALSE}
-plot(m1$residuals ~ mlb11$at_bats)
-abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
+```{r residuals}
+qplot(x = .fitted, y = .resid, data = m1) +
+ geom_hline(yintercept = 0, linetype = "dashed") +
+ xlab("Fitted values") +
+ ylab("Residuals")
```
+Notice here that our model object `m1` can also serve as a data set because stored within it are the fitted values ($\hat{y}$) and the residuals. Also note that we're getting fancy with the code here. After creating the scatterplot on the first layer (first line of code), we overlay a horizontal dashed line at $y = 0$ (to help us check whether residuals are distributed around 0), and we also adjust the axis labels to be more informative.
+
6. Is there any apparent pattern in the residuals plot? What does this indicate
- about the linearity of the relationship between runs and at-bats?
+ about the linearity of the relationship between runs and at-bats?
-*Nearly normal residuals*: To check this condition, we can look at a histogram
+
-```{r hist-res, eval=FALSE}
-hist(m1$residuals)
+**Nearly normal residuals**: To check this condition, we can look at a histogram
+
+```{r hist-res}
+qplot(x = .resid, data = m1, geom = "histogram", binwidth = 25) +
+ xlab("Residuals")
```
or a normal probability plot of the residuals.
-```{r qq-res, eval=FALSE}
-qqnorm(m1$residuals)
-qqline(m1$residuals) # adds diagonal line to the normal prob plot
+```{r qq-res}
+qplot(sample = .resid, data = m1, stat = "qq")
```
+Note that the syntax for making a normal probability plot is a bit different than what you're used to seeing: we set `sample` equal to the residuals instead of `x`, and we set a statistical method `qq`, which stands for "quantile-quantile", another name commonly used for normal probability plots.
+
7. Based on the histogram and the normal probability plot, does the nearly
normal residuals condition appear to be met?
-*Constant variability*:
+
+
+**Constant variability**:
+
+8. Based on the residuals vs. fitted plot, does the constant variability condition
+ appear to be met?
-8. Based on the plot in (1), does the constant variability condition appear to
- be met?
-
* * *
-## On Your Own
+## On your own
+
-- Choose another traditional variable from `mlb11` that you think might be a
- good predictor of `runs`. Produce a scatterplot of the two variables and fit
- a linear model. At a glance, does there seem to be a linear relationship?
+- Choose another one of the seven traditional variables from `mlb11` besides
+ `at_bats` that you think might be a good predictor of `runs`. Produce a
+ scatterplot of the two variables and fit a linear model. At a glance, does
+ there seem to be a linear relationship?
- How does this relationship compare to the relationship between `runs` and
- `at_bats`? Use the R$^2$ values from the two model summaries to compare.
+ `at_bats`? Use the $R^2$ values from the two model summaries to compare.
Does your variable seem to predict `runs` better than `at_bats`? How can you
tell?
@@ -226,8 +269,8 @@ qqline(m1$residuals) # adds diagonal line to the normal prob plot
the sake of conciseness, only include output for the best variable, not all
five).
-- Now examine the three newer variables. These are the statistics used by the
- author of *Moneyball* to predict a teams success. In general, are they more
+- Now examine the three newer variables. These are the statistics used by [the
+ central character](https://en.wikipedia.org/wiki/Paul_DePodesta) in *Moneyball* to predict a team's success. In general, are they more
or less effective at predicting runs than the old variables? Explain using
appropriate graphical and numerical evidence. Of all ten variables we've
analyzed, which seems to be the best predictor of `runs`? Using the limited
@@ -236,6 +279,8 @@ qqline(m1$residuals) # adds diagonal line to the normal prob plot
- Check the model diagnostics for the regression model with the variable you
decided was the best predictor for runs.
+
+
This is a product of OpenIntro that is released under a [Creative Commons
diff --git a/09_simple_regression/simple_regression.html b/09_simple_regression/simple_regression.html
new file mode 100644
index 0000000..95b2717
--- /dev/null
+++ b/09_simple_regression/simple_regression.html
@@ -0,0 +1,307 @@
+[tag-stripped rendered HTML of the Introduction to linear regression lab; duplicates the content of simple_regression.Rmd above]
diff --git a/multiple_regression/more/code.R b/10_multiple_regression/more/code.R
similarity index 100%
rename from multiple_regression/more/code.R
rename to 10_multiple_regression/more/code.R
diff --git a/multiple_regression/more/evals-readme.txt b/10_multiple_regression/more/evals-readme.txt
similarity index 100%
rename from multiple_regression/more/evals-readme.txt
rename to 10_multiple_regression/more/evals-readme.txt
diff --git a/multiple_regression/more/evals.RData b/10_multiple_regression/more/evals.RData
similarity index 100%
rename from multiple_regression/more/evals.RData
rename to 10_multiple_regression/more/evals.RData
diff --git a/multiple_regression/more/evals.csv b/10_multiple_regression/more/evals.csv
similarity index 100%
rename from multiple_regression/more/evals.csv
rename to 10_multiple_regression/more/evals.csv
diff --git a/multiple_regression/multiple_regression.Rmd b/10_multiple_regression/multiple_regression.Rmd
similarity index 68%
rename from multiple_regression/multiple_regression.Rmd
rename to 10_multiple_regression/multiple_regression.Rmd
index c5fec27..09663cd 100644
--- a/multiple_regression/multiple_regression.Rmd
+++ b/10_multiple_regression/multiple_regression.Rmd
@@ -5,9 +5,15 @@ output:
css: ../lab.css
highlight: pygments
theme: cerulean
- pdf_document: default
---
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(GGally)
+```
+
## Grading the professor
Many college courses conclude by giving students the opportunity to evaluate
@@ -16,54 +22,58 @@ evaluations as an indicator of course quality and teaching effectiveness is
often criticized because these measures may reflect the influence of
non-teaching related characteristics, such as the physical appearance of the
instructor. The article titled, "Beauty in the classroom: instructors'
-pulchritude and putative pedagogical productivity" (Hamermesh and Parker, 2005)
+pulchritude and putative pedagogical productivity" by Hamermesh and Parker
found that instructors who are viewed to be better looking receive higher
-instructional ratings. (Daniel S. Hamermesh, Amy Parker, Beauty in the
-classroom: instructors pulchritude and putative pedagogical productivity,
-*Economics of Education Review*, Volume 24, Issue 4, August 2005, Pages 369-376,
-ISSN 0272-7757, 10.1016/j.econedurev.2004.07.013. [http://www.sciencedirect.com/science/article/pii/S0272775704001165](http://www.sciencedirect.com/science/article/pii/S0272775704001165).)
+instructional ratings.
In this lab we will analyze the data from this study in order to learn what goes
into a positive professor evaluation.
-## The data
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for the OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+library(GGally)
+```
+
+This is the first time we're using the `GGally` package. We will be using the
+`ggpairs` function from this package later in the lab.
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. Make sure that all
+necessary packages are loaded in your R Markdown document.
+
+### The data
The data were gathered from end of semester student evaluations for a large
sample of professors from the University of Texas at Austin. In addition, six
-students rated the professors' physical appearance. (This is aslightly modified
-version of the original data set that was released as part of the replication
-data for *Data Analysis Using Regression and Multilevel/Hierarchical Models*
-(Gelman and Hill, 2007).) The result is a data frame where each row contains a
-different course and columns represent variables about the courses and professors.
-
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
-load("evals.RData")
+students rated the professors' physical appearance. The result is a data frame
+where each row contains a different course and columns represent variables about
+the courses and professors.
+
+Let's load the data:
+
+```{r load-data, message=FALSE}
+data(evals)
```
-variable | description
----------------- | -----------
-`score` | average professor evaluation score: (1) very unsatisfactory - (5) excellent.
-`rank` | rank of professor: teaching, tenure track, tenured.
-`ethnicity` | ethnicity of professor: not minority, minority.
-`gender` | gender of professor: female, male.
-`language` | language of school where professor received education: english or non-english.
-`age` | age of professor.
-`cls_perc_eval` | percent of students in class who completed evaluation.
-`cls_did_eval` | number of students in class who completed evaluation.
-`cls_students` | total number of students in class.
-`cls_level` | class level: lower, upper.
-`cls_profs` | number of professors teaching sections in course in sample: single, multiple.
-`cls_credits` | number of credits of class: one credit (lab, PE, etc.), multi credit.
-`bty_f1lower` | beauty rating of professor from lower level female: (1) lowest - (10) highest.
-`bty_f1upper` | beauty rating of professor from upper level female: (1) lowest - (10) highest.
-`bty_f2upper` | beauty rating of professor from second upper level female: (1) lowest - (10) highest.
-`bty_m1lower` | beauty rating of professor from lower level male: (1) lowest - (10) highest.
-`bty_m1upper` | beauty rating of professor from upper level male: (1) lowest - (10) highest.
-`bty_m2upper` | beauty rating of professor from second upper level male: (1) lowest - (10) highest.
-`bty_avg` | average beauty rating of professor.
-`pic_outfit` | outfit of professor in picture: not formal, formal.
-`pic_color` | color of professor's picture: color, black & white.
+We have observations on 21 different variables, some categorical and some
+numerical. The meaning of each variable can be found by bringing up the help file:
+
+```{r help-nc}
+?evals
+```
## Exploring the data
@@ -77,8 +87,7 @@ variable | description
see? Why, or why not?
3. Excluding `score`, select two other variables and describe their relationship
- using an appropriate visualization (scatterplot, side-by-side boxplots, or
- mosaic plot).
+ with each other using an appropriate visualization.
## Simple linear regression
@@ -86,25 +95,41 @@ The fundamental phenomenon suggested by the study is that better looking teacher
are evaluated more favorably. Let's create a scatterplot to see if this appears
to be the case:
-```{r scatter-score-bty_avg, eval = FALSE}
-plot(evals$score ~ evals$bty_avg)
+```{r scatter-score-bty_avg}
+qplot(data = evals, x = bty_avg, y = score)
```
-
Before we draw conclusions about the trend, compare the number of observations
in the data frame with the approximate number of points on the scatterplot.
Is anything awry?
-4. Replot the scatterplot, but this time use the function `jitter()` on the
- $y$- or the $x$-coordinate. (Use `?jitter` to learn more.) What was
+4. Replot the scatterplot, but this time use `geom = "jitter"`. What was
misleading about the initial scatterplot?
+```{r scatter-score-bty_avg-jitter}
+qplot(data = evals, x = bty_avg, y = score, geom = "jitter")
+```
+
5. Let's see if the apparent trend in the plot is something more than
natural variation. Fit a linear model called `m_bty` to predict average
- professor score by average beauty rating and add the line to your plot
- using `abline(m_bty)`. Write out the equation for the linear model and
- interpret the slope. Is average beauty score a statistically significant
+ professor score by average beauty rating. Write out the equation for the linear
+ model and interpret the slope. Is average beauty score a statistically significant
predictor? Does it appear to be a practically significant predictor?
+
+Add the line of the best fit model to your plot using the following:
+
+```{r scatter-score-bty_avg-line-se}
+qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
+ geom_smooth(method = "lm")
+```
+
+The blue line is the model. The shaded gray area around the line tells us about the
+variability we might expect in our predictions. To turn that off, use `se = FALSE`.
+
+```{r scatter-score-bty_avg-line}
+qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
+ geom_smooth(method = "lm", se = FALSE)
+```
6. Use residual plots to evaluate whether the conditions of least squares
regression are reasonable. Provide plots and comments for each one (see
@@ -118,18 +143,21 @@ physical appearance of the professors and the average of these six scores. Let's
take a look at the relationship between one of these scores and the average
beauty score.
-```{r bty-rel, eval = FALSE}
-plot(evals$bty_avg ~ evals$bty_f1lower)
-cor(evals$bty_avg, evals$bty_f1lower)
+```{r bty-rel}
+qplot(data = evals, x = bty_f1lower, y = bty_avg)
+evals %>%
+ summarise(cor(bty_avg, bty_f1lower))
```
-As expected the relationship is quite strong - after all, the average score is
-calculated using the individual scores. We can actually take a look at the
+As expected, the relationship is quite strong---after all, the average score is
+calculated using the individual scores. We can actually look at the
relationships between all beauty variables (columns 13 through 19) using the
following command:
-```{r bty-rels, eval = FALSE}
-plot(evals[,13:19])
+```{r bty-rels}
+evals %>%
+ select(contains("bty")) %>%
+ ggpairs()
```
These variables are collinear (correlated), and adding more than one of these
@@ -141,7 +169,7 @@ In order to see if beauty is still a significant predictor of professor score
after we've accounted for the gender of the professor, we can add the gender
term into the model.
-```{r scatter-score-bty_avg_gender, eval = FALSE}
+```{r scatter-score-bty_avg_gender}
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
```
@@ -169,12 +197,13 @@ the intercept and slope form familiar from simple regression.
&= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned}
\]
-We can plot this line and the line corresponding to males with the following
-custom function.
+
+
-```{r twoLines, eval = FALSE}
-multiLines(m_bty_gen)
-```
+
+
+
+
9. What is the equation of the line corresponding to males? (*Hint:* For
males, the parameter estimate is multiplied by 1.) For two professors
@@ -184,7 +213,7 @@ multiLines(m_bty_gen)
The decision to call the indicator variable `gendermale` instead of `genderfemale`
has no deeper meaning. R simply codes the category that comes first
alphabetically as a $0$. (You can change the reference level of a categorical
-variable, which is the level that is coded as a 0, using the`relevel` function.
+variable, which is the level that is coded as a 0, using the `relevel()` function.
Use `?relevel` to learn more.)
10. Create a new model called `m_bty_rank` with `gender` removed and `rank`
@@ -213,7 +242,7 @@ picture color.
Let's run the model...
-```{r m_full, eval = FALSE, tidy = FALSE}
+```{r m_full, tidy = FALSE}
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
@@ -256,4 +285,6 @@ summary(m_full)
This is a product of OpenIntro that is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). This lab was written by
Mine Çetinkaya-Rundel and Andrew Bray.
-
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” by Hamermesh and Parker found that instructors who are viewed to be better looking receive higher instructional ratings.
+
In this lab we will analyze the data from this study in order to learn what goes into a positive professor evaluation.
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for the OpenIntro labs, oilabs.
This is the first time we’re using the GGally package. We will be using the ggpairs function from this package later in the lab.
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package. Make sure that all necessary packages are loaded in your R Markdown document.
+
+
+
The data
+
The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance. The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.
+
Let’s load the data:
+
data(evals)
+
We have observations on 21 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
+
?evals
+
+
+
+
Exploring the data
+
+
Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
+
Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
+
Excluding score, select two other variables and describe their relationship with each other using an appropriate visualization.
+
+
+
+
Simple linear regression
+
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:
+
qplot(data = evals, x = bty_avg, y = score)
+
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
+
+
Replot the scatterplot, but this time use geom = "jitter". What was misleading about the initial scatterplot?
+
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter")
+
+
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
+
+
Add the line of the best fit model to your plot using the following:
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
+  geom_smooth(method = "lm")
+
The blue line is the model. The shaded gray area around the line tells us about the variability we might expect in our predictions. To turn that off, use se = FALSE.
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
+  geom_smooth(method = "lm", se = FALSE)
+
+
Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
+
+
+
+
Multiple linear regression
+
The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.
+
qplot(data = evals, x = bty_f1lower, y = bty_avg)
+evals %>%
+  summarise(cor(bty_avg, bty_f1lower))
+
As expected, the relationship is quite strong—after all, the average score is calculated using the individual scores. We can actually look at the relationships between all beauty variables (columns 13 through 19) using the following command:
+
evals %>%
+  select(contains("bty")) %>%
+  ggpairs()
+
These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
+
In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.
+
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
+summary(m_bty_gen)
+
+
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.
+
Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?
+
+
Note that the estimate for gender is now called gendermale. You’ll see this name change whenever you introduce a categorical variable. The reason is that R recodes gender from having the values of female and male to being an indicator variable called gendermale that takes a value of \(0\) for females and a value of \(1\) for males. (Such variables are often referred to as “dummy” variables.)
+
As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?
+
+
The decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning. R simply codes the category that comes first alphabetically as a \(0\). (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the relevel() function. Use ?relevel to learn more.)
+
+
Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.
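One way the model in this exercise could be specified (a sketch; the formula follows directly from the exercise statement):

m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)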
+
+
The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for bty_avg reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher while holding all other variables constant. In this case, that translates into considering only professors of the same rank with bty_avg scores that are one point apart.
+
+
+
The search for the best model
+
We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.
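For reference, this full model corresponds to the lm() call shown in the Rmd source earlier in this diff:

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg
             + pic_outfit + pic_color, data = evals)
summary(m_full)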
+
+
Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable would you expect to not have any association with the professor score.
Check your suspicions from the previous exercise. Include the model output in your response.
+
Interpret the coefficient associated with the ethnicity variable.
+
Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
+
Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on. (A sketch of a single elimination step appears after this list of exercises.)
+
Verify that the conditions for this model are reasonable using diagnostic plots.
+
The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
+
Based on your final model, describe the characteristics of a professor and course at the University of Texas at Austin that would be associated with a high evaluation score.
+
Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?
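As a sketch of a single backward-elimination step (the choice to drop cls_profs here is purely illustrative, not the answer to the exercise), update() refits m_full with one term removed:

m_step1 <- update(m_full, . ~ . - cls_profs)
summary(m_step1)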
+
+
+
+
+
+
+
+
diff --git a/README.md b/README.md
index 7500768..af072df 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-OpenIntro Labs
+OpenIntro Labs - dplyr and ggplot2
==============
OpenIntro Labs promote the understanding and application of statistics through
@@ -7,16 +7,24 @@ particular chapters in all three versions of OpenIntro Statistics, a free and
open-source textbook. The textbook as well as the html version of the labs can
be found at [http://www.openintro.org/stat/labs.php](http://www.openintro.org/stat/labs.php).
+This repository is a fork of the original labs that is intended to incorporate
+the syntax from the dplyr package. The conversion will occur over summer 2015,
+with classroom testing in fall 2015, and release on
+[www.openintro.org](www.openintro.org) in January 2016.
+
We currently support our source files in the .Rmd format, which can be output into
html format (though output to pdf is also possible). The source files are processed
-using the [knitr](http://yihui.name/knitr/) package in R. If you are using RStudio,
-be sure that you've adjusted your settings to compile source files using this
-package. If you are unfamiliar with working with these file types, you may consider
-creating your own copy of the Google Doc version of each lab, which are archived at
-[http://www.openintro.org/stat/labs.php](http://www.openintro.org/stat/labs.php).
-These versions are not currently supported, so they will differ slightly from the
-.Rmd labs found here.
+using the [knitr](http://yihui.name/knitr/) package in R.
It is our hope that these materials are useful for instructors and students of
statistics. If you end up developing some interesting variants of these labs or
-creating new ones, please let us know!
\ No newline at end of file
+creating new ones, please let us know!
+
+This branch contains a fork of the OpenIntro Labs that has been infused with **mosaic**. Project MOSAIC is a community of educators working to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities. [The **mosaic** package](https://github.com/rpruim/mosaic) brings the full power of R to statistics students.
+
+## Feedback / collaboration
+
+Your feedback is most welcome! If you have suggestions for minor updates (fixing
+typos, etc.), please do not hesitate to issue a pull request. If you have ideas for
+a major revamp of a lab (replacing outdated code with a modern version, overhauling
+the pedagogy, etc.), please create an issue to start the conversation.
\ No newline at end of file
diff --git a/confidence_intervals/confidence_intervals.Rmd b/confidence_intervals/confidence_intervals.Rmd
deleted file mode 100644
index 01b0822..0000000
--- a/confidence_intervals/confidence_intervals.Rmd
+++ /dev/null
@@ -1,185 +0,0 @@
----
-title: 'Foundations for statistical inference - Confidence intervals'
-output:
- html_document:
- css: ../lab.css
- highlight: pygments
- theme: cerulean
- pdf_document: default
----
-
-## Sampling from Ames, Iowa
-
-If you have access to data on an entire population, say the size of every
-house in Ames, Iowa, it's straight forward to answer questions like, "How big
-is the typical house in Ames?" and "How much variation is there in sizes of
-houses?". If you have access to only a sample of the population, as is often
-the case, the task becomes more complicated. What is your best guess for the
-typical size if you only know the sizes of several dozen houses? This sort of
-situation requires that you use your sample to make inference on what your
-population looks like.
-
-## The data
-
-In the previous lab, ``Sampling Distributions'', we looked at the population data
-of houses from Ames, Iowa. Let's start by loading that data set.
-
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
-load("ames.RData")
-```
-
-In this lab we'll start with a simple random sample of size 60 from the
-population. Specifically, this is a simple random sample of size 60. Note that
-the data set has information on many housing variables, but for the first
-portion of the lab we'll focus on the size of the house, represented by the
-variable `Gr.Liv.Area`.
-
-```{r sample, eval=FALSE}
-population <- ames$Gr.Liv.Area
-samp <- sample(population, 60)
-```
-
-1. Describe the distribution of your sample. What would you say is the
-"typical" size within your sample? Also state precisely what you interpreted
-"typical" to mean.
-
-2. Would you expect another student's distribution to be identical to yours?
-Would you expect it to be similar? Why or why not?
-
-## Confidence intervals
-
-One of the most common ways to describe the typical or central value of a
-distribution is to use the mean. In this case we can calculate the mean of the
-sample using,
-
-```{r sample-mean, eval=FALSE}
-sample_mean <- mean(samp)
-```
-
-Return for a moment to the question that first motivated this lab: based on
-this sample, what can we infer about the population? Based only on this single
-sample, the best estimate of the average living area of houses sold in Ames
-would be the sample mean, usually denoted as $\bar{x}$ (here we're calling it
-`sample_mean`). That serves as a good *point estimate* but it would be useful
-to also communicate how uncertain we are of that estimate. This can be
-captured by using a *confidence interval*.
-
-We can calculate a 95% confidence interval for a sample mean by adding and
-subtracting 1.96 standard errors to the point estimate (See Section 4.2.3 if
-you are unfamiliar with this formula).
-
-```{r ci, eval=FALSE}
-se <- sd(samp) / sqrt(60)
-lower <- sample_mean - 1.96 * se
-upper <- sample_mean + 1.96 * se
-c(lower, upper)
-```
-
-This is an important inference that we've just made: even though we don't know
-what the full population looks like, we're 95% confident that the true
-average size of houses in Ames lies between the values *lower* and *upper*.
-There are a few conditions that must be met for this interval to be valid.
-
-3. For the confidence interval to be valid, the sample mean must be normally
-distributed and have standard error $s / \sqrt{n}$. What conditions must be
-met for this to be true?
-
-## Confidence levels
-
-4. What does "95% confidence" mean? If you're not sure, see Section 4.2.2.
-
-In this case we have the luxury of knowing the true population mean since we
-have data on the entire population. This value can be calculated using the
-following command:
-
-```{r pop-mean, eval=FALSE}
-mean(population)
-```
-
-5. Does your confidence interval capture the true average size of houses in
-Ames? If you are working on this lab in a classroom, does your neighbor's
-interval capture this value?
-
-6. Each student in your class should have gotten a slightly different
-confidence interval. What proportion of those intervals would you expect to
-capture the true population mean? Why? If you are working in this lab in a
-classroom, collect data on the intervals created by other students in the
-class and calculate the proportion of intervals that capture the true
-population mean.
-
-Using R, we're going to recreate many samples to learn more about how sample
-means and confidence intervals vary from one sample to another. *Loops* come
-in handy here (If you are unfamiliar with loops, review the [Sampling Distribution Lab](http://htmlpreview.github.io/?https://github.com/andrewpbray/oiLabs/blob/master/sampling_distributions/sampling_distributions.html)).
-
-Here is the rough outline:
-
-- Obtain a random sample.
-- Calculate and store the sample's mean and standard deviation.
-- Repeat steps (1) and (2) 50 times.
-- Use these stored statistics to calculate many confidence intervals.
-
-
-But before we do all of this, we need to first create empty vectors where we
-can save the means and standard deviations that will be calculated from each
-sample. And while we're at it, let's also store the desired sample size as `n`.
-
-```{r set-up, eval=FALSE}
-samp_mean <- rep(NA, 50)
-samp_sd <- rep(NA, 50)
-n <- 60
-```
-
-Now we're ready for the loop where we calculate the means and standard deviations of 50 random samples.
-
-```{r loop, eval=FALSE, tidy = FALSE}
-for(i in 1:50){
- samp <- sample(population, n) # obtain a sample of size n = 60 from the population
- samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
- samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
-}
-```
-
-Lastly, we construct the confidence intervals.
-
-```{r ci50, eval=FALSE}
-lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
-upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
-```
-
-Lower bounds of these 50 confidence intervals are stored in `lower_vector`,
-and the upper bounds are in `upper_vector`. Let's view the first interval.
-
-```{r first-interval, eval=FALSE}
-c(lower_vector[1], upper_vector[1])
-```
-
-* * *
-
-## On your own
-
-- Using the following function (which was downloaded with the data set),
- plot all intervals. What proportion of your confidence intervals include
- the true population mean? Is this proportion exactly equal to the
- confidence level? If not, explain why.
-
- ```{r plot-ci, eval=FALSE}
- plot_ci(lower_vector, upper_vector, mean(population))
- ```
-
-- Pick a confidence level of your choosing, provided it is not 95%. What is
- the appropriate critical value?
-
-- Calculate 50 confidence intervals at the confidence level you chose in the
- previous question. You do not need to obtain new samples, simply calculate
- new intervals based on the sample means and standard deviations you have
- already collected. Using the `plot_ci` function, plot all intervals and
- calculate the proportion of intervals that include the true population
- mean. How does this percentage compare to the confidence level selected for
- the intervals?
-
-
-This is a product of OpenIntro that is released under a [Creative Commons
-Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
-
Foundations for statistical inference - Confidence intervals
-
-
-
-
-
Sampling from Ames, Iowa
-
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straight forward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inference on what your population looks like.
-
-
-
The data
-
In the previous lab, ``Sampling Distributions’’, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.
In this lab we’ll start with a simple random sample of size 60 from the population. Specifically, this is a simple random sample of size 60. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
-
population <-ames$Gr.Liv.Area
-samp <-sample(population, 60)
-
-
Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
-
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
-
-
-
-
Confidence intervals
-
One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,
-
sample_mean <-mean(samp)
-
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
-
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (See Section 4.2.3 if you are unfamiliar with this formula).
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
-
-
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?
-
-
-
-
Confidence levels
-
-
What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.
-
-
In this case we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:
-
mean(population)
-
-
Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
-
Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.
-
-
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
-
Here is the rough outline:
-
-
Obtain a random sample.
-
Calculate and store the sample’s mean and standard deviation.
-
Repeat steps (1) and (2) 50 times.
-
Use these stored statistics to calculate many confidence intervals.
-
-
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as n.
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
-
for(i in 1:50){
- samp <-sample(population, n) # obtain a sample of size n = 60 from the population
- samp_mean[i] <-mean(samp) # save sample mean in ith element of samp_mean
- samp_sd[i] <-sd(samp) # save sample sd in ith element of samp_sd
-}
Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.
-
c(lower_vector[1], upper_vector[1])
-
-
-
-
On your own
-
-
Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
-
Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
Foundations for statistical inference - Confidence intervals
+
+
+
+
+
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inference on what your population looks like.
+
+
Setting a seed: We will take some random samples and build sampling distributions in this lab, which means you should set a seed at the top of your lab. If this concept is new to you, review the lab concerning probability.
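A seed can be set with a single line; the specific number below is an arbitrary choice, any integer works:

set.seed(35797)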
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for the OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.
+
+
+
The data
+
We consider real estate data from the city of Ames, Iowa. This is the same dataset used in the previous lab. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
+
data(ames)
+
In this lab we’ll start with a simple random sample of size 60 from the population.
+
n <- 60
+samp <- sample_n(ames, n)
+
Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable area.
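To get a feel for the sample before answering the next exercise, a histogram of the sampled areas can be drawn with the qplot syntax used in earlier labs (the binwidth here is an arbitrary choice):

qplot(data = samp, x = area, geom = "histogram", binwidth = 250)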
+
+
Describe the distribution of house area in your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
+
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
+
+
+
+
+
Confidence intervals
+
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it x_bar). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This uncertainty can be quantified using a confidence interval.
+
A confidence interval for a population mean is of the following form \[ \bar{x} \pm z^\star \frac{s}{\sqrt{n}} \]
+
You should by now be comfortable with calculating the mean and standard deviation of a sample in R. And we know that the sample size is 60. So the only remaining building block is finding the appropriate critical value for a given confidence level. We can use the qnorm function for this task, which will give the critical value associated with a given percentile under the normal distribution. Remember that confidence levels and percentiles are not equivalent. For example, a 95% confidence level refers to the middle 95% of the distribution, and the critical value associated with this area will correspond to the 97.5th percentile.
+
We can find the critical value for a 95% confidence interval using
+
z_star_95 <- qnorm(0.975)
+z_star_95
+
which is roughly equal to the critical value of 1.96 that you’re likely familiar with by now.
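With the critical value in hand, the bounds referred to below as lower and upper can be computed from the sample; a minimal sketch, with x_bar, se, and me named after the pieces of the formula above:

samp %>%
  summarise(x_bar = mean(area),
            se = sd(area) / sqrt(n),
            me = z_star_95 * se,
            lower = x_bar - me,
            upper = x_bar + me)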
To recap: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
+
+
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?
+
+
+
+
Confidence levels
+
+
What does “95% confidence” mean?
+
+
In this case we have the rare luxury of knowing the true population mean since we have data on the entire population. Let’s calculate this value so that we can determine if our confidence intervals actually capture it. We’ll store it in a data frame called params (short for population parameters), and name it mu.
+
params <- ames %>%
+  summarise(mu = mean(area))
+
+
Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
+
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?
+
+
Using R, we’re going to collect many samples to learn more about how sample means and confidence intervals vary from one sample to another.
+
Here is the rough outline:
+
+
Obtain a random sample.
+
Calculate the sample’s mean and standard deviation, and use these to calculate and store the lower and upper bounds of the confidence intervals.
+
Repeat these steps 50 times.
+
+
We can accomplish this using the rep_sample_n function. The following lines of code take 50 random samples of size n from the population (remember we defined \(n = 60\) earlier) and compute the upper and lower bounds of the confidence intervals based on these samples.
+
ci <- ames %>%
+  rep_sample_n(size = n, reps = 50, replace = TRUE) %>%
+  summarise(x_bar = mean(area),
+            se = sd(area) / sqrt(n),
+            me = z_star_95 * se,
+            lower = x_bar - me,
+            upper = x_bar + me)
+
Let’s view the first five intervals:
+
ci %>%
+  slice(1:5)
+
Next we’ll create a plot similar to Figure 4.8 on page 175 of OpenIntro Statistics, 3rd Edition. The first step will be to create a new variable in the ci data frame that indicates whether the interval does or does not capture the true population mean. Note that capturing this value would mean the lower bound of the confidence interval is below the value and the upper bound of the confidence interval is above the value. Remember that we create new variables using the mutate function.
+
ci <- ci %>%
+  mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no"))
+
The ifelse function is new. It takes three arguments: the first is a logical statement, the second is the value we want if the logical statement yields a true result, and the third is the value we want if it yields a false result.
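A quick illustration of ifelse (not part of the original lab):

ifelse(c(1, 5, 10) > 3, "yes", "no")
# returns "no" "yes" "yes"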
+
We now have all the information we need to create the plot. Note that the geom_errorbar() function only understands y values, and thus we have used the coord_flip() function to flip the coordinates of the entire plot back to the more familiar vertical orientation.
+
qplot(data = ci, x = replicate, y = x_bar, color = capture_mu) +
+  geom_errorbar(aes(ymin = lower, ymax = upper)) +
+  geom_hline(data = params, aes(yintercept = mu), color = "darkgray") + # draw vertical line
+  coord_flip()
+
+
What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
+
+
+
+
+
More Practice
+
+
Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
+
Calculate 50 confidence intervals at the confidence level you chose in the previous question, plot all intervals on one plot, and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals? Make sure to include your plot in your answer.
+
+
+
+
+
+
+
+
diff --git a/docs/inf_for_categorical_data.Rmd b/docs/inf_for_categorical_data.Rmd
new file mode 100644
index 0000000..e672b6f
--- /dev/null
+++ b/docs/inf_for_categorical_data.Rmd
@@ -0,0 +1,288 @@
+---
+title: "Inference for categorical data"
+runtime: shiny
+output:
+ html_document:
+ css: www/lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+```
+
+In August of 2012, news outlets ranging from the [Washington Post](http://www.washingtonpost.com/national/on-faith/poll-shows-atheism-on-the-rise-in-the-us/2012/08/13/90020fd6-e57d-11e1-9739-eef99c5fb285_story.html) to the [Huffington Post](http://www.huffingtonpost.com/2012/08/14/atheism-rise-religiosity-decline-in-america_n_1777031.html) ran a story about the rise of atheism in America. The source for the story was a poll that asked people, "Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person, or a convinced atheist?" This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what's at play when making inference about population proportions using categorical data.
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE, eval=TRUE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The survey
+
+The press release for the poll, conducted by WIN-Gallup International, can be accessed [here](http://www.wingia.com/web/files/richeditor/filemanager/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf).
+
+Take a moment to review the report then address the following questions.
+
+1. In the first paragraph, several key findings are reported. Do these
+ percentages appear to be *sample statistics* (derived from the data
+ sample) or *population parameters*? Explain your reasoning.
+
+1. The title of the report is "Global Index of Religiosity and Atheism". To
+ generalize the report's findings to the global human population, what must
+ we assume about the sampling method? Does that seem like a reasonable
+ assumption?
+
+### The data
+
+Turn your attention to Table 6 (pages 15 and 16), which reports the
+sample size and response percentages for all 57 countries. While this is
+a useful format to summarize the data, we will base our analysis on the
+original data set of individual responses to the survey. Load this data
+set into R with the following command.
+
+```{r head-data}
+data(atheism)
+```
+
+1. What does each row of Table 6 correspond to? What does each row of
+ `atheism` correspond to?
+
+To investigate the link between these two ways of organizing this data, take a
+look at the estimated proportion of atheists in the United States. Towards
+the bottom of Table 6, we see that this is 5%. We should be able to come to
+the same number using the `atheism` data.
+
+1. Using the command below, create a new dataframe called `us12` that contains
+ only the rows in `atheism` associated with respondents to the 2012 survey
+ from the United States. Next, calculate the proportion of atheist
+ responses. Does it agree with the percentage in Table 6? If not, why?
+
+```{r us-atheism}
+us12 <- atheism %>%
+ filter(nationality == "United States", year == "2012")
+```
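+One way to compute the proportion of atheist responses from `us12` (a sketch; the
+chunk below is one of several equally valid approaches):
+
+```{r us-atheism-prop}
+us12 %>%
+  summarize(p_hat = mean(response == "atheist"))
+```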
+
+## Inference on proportions
+
+As was hinted earlier, Table 6 provides *statistics*, that is,
+calculations made from the sample of 51,927 people. What we'd like, though, is
+insight into the population *parameters*. You answer the question, "What
+proportion of people in your sample reported being atheists?" with a
+statistic, while the question "What proportion of people on earth would report
+being atheists?" is answered with an estimate of the parameter.
+
+The inferential tools for estimating population proportion are analogous to
+those used for means in the last chapter: the confidence interval and the
+hypothesis test.
+
+1. Write out the conditions for inference to construct a 95% confidence
+ interval for the proportion of atheists in the United States in 2012.
+ Are you confident all conditions are met?
+
+If the conditions for inference are reasonable, we can calculate
+the standard error and construct the interval in R.
+
+```{r us-atheism-ci, tidy = FALSE}
+us12 %>%
+ summarize(N = n(), atheist = sum(response == "atheist")) %>%
+ mutate(p_hat = atheist / N,
+ se = sqrt(p_hat * (1 - p_hat) / N),
+ me = qnorm(0.975) * se,
+ lower = p_hat - me,
+ upper = p_hat + me)
+```
+
+Note that since the goal is to construct an interval estimate for a
+proportion, it's necessary to specify what constitutes a "success", which here
+is a response of `"atheist"`.
+
+Although formal confidence intervals and hypothesis tests don't show up in the
+report, suggestions of inference appear at the bottom of page 7: "In general,
+the error margin for surveys of this kind is $\pm$ 3--5% at 95% confidence".
+
+1. Based on the R output, what is the margin of error for the estimate of the
+ proportion of atheists in the US in 2012?
+
+1. Calculate confidence intervals for the
+ proportion of atheists in 2012 in two other countries of your choice, and
+ report the associated margins of error. Be sure to note whether the
+ conditions for inference are met, and interpret the interval in context of the data.
+ It may be helpful to create new data sets for each of the two countries first, and
+ then use these data sets to construct the confidence
+ intervals (see the sketch below).
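+As a sketch, a data set for another country can be created the same way `us12` was;
+Spain (which appears later in this lab) is used here only as an example:
+
+```{r spain-atheism}
+spain12 <- atheism %>%
+  filter(nationality == "Spain", year == "2012")
+```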
+
+## How does the proportion affect the margin of error?
+
+Imagine you've set out to survey 1000 people on two questions: are you female?
+and are you left-handed? Since both of these sample proportions were
+calculated from the same sample size, they should have the same margin of
+error, right? Wrong! While the margin of error does change with sample size,
+it is also affected by the proportion.
+
+Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This
+is then used in the formula for the margin of error for a 95% confidence
+interval:
+$$
+ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,.
+$$
+Since the
+population proportion $p$ is in this $ME$ formula, it should make sense that
+the margin of error is in some way dependent on the population proportion. We
+can visualize this relationship by creating a plot of $ME$ vs. $p$.
+
+Since sample size is irrelevant to this discussion, let's just set it to
+some value ($n = 1000$) and use this value in the following calculations:
+
+```{r n-for-me-plot}
+n <- 1000
+```
+
+The first step is to make a variable `p` that is a sequence from 0 to 1 with
+each number incremented by 0.01. We can then create a variable of the margin of
+error (`me`) associated with each of these values of `p` using the familiar
+approximate formula ($ME = 2 \times SE$).
+
+```{r p-me}
+p <- seq(from = 0, to = 1, by = 0.01)
+me <- 2 * sqrt(p * (1 - p)/n)
+```
+
+Lastly, we plot the two variables against each other to reveal their relationship.
+To do so, we need to first put these variables in a data frame that we can
+call in the `qplot` function.
+
+```{r me-plot}
+dd <- data.frame(p = p, me = me)
+qplot(x = p, y = me, data = dd,
+ ylab = "Margin of Error",
+ xlab = "Population Proportion") +
+ geom_line()
+```
+
+1. Describe the relationship between `p` and `me`. Include the margin of
+ error vs. population proportion plot you constructed in your answer. For
+ a given sample size, for which value of `p` is the margin of error maximized?
+
+## Success-failure condition
+
+We have emphasized that you must always check conditions before making
+inference. For inference on proportions, the sample proportion can be assumed
+to be nearly normal if it is based upon a random sample of independent
+observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of
+thumb is easy enough to follow, but it makes one wonder: what's so special
+about the number 10?
+
+The short answer is: nothing. You could argue that we would be fine with 9 or
+that we really should be using 11. The "best" value for such a rule of
+thumb is, at least to some degree, arbitrary. However, once $np$ and $n(1-p)$
+reach 10, the sampling distribution is sufficiently normal to use confidence
+intervals and hypothesis tests that are based on that approximation.
+
+We can investigate the interplay between $n$ and $p$ and the shape of the
+sampling distribution by using simulations. Play around with the following
+app to investigate how the shape, center, and spread of the distribution of
+$\hat{p}$ change as $n$ and $p$ change.
+
+```{r sf-app, echo=FALSE, eval=TRUE}
+inputPanel(
+ numericInput("n", label = "Sample size:", value = 300),
+
+ sliderInput("p", label = "Population proportion:",
+ min = 0, max = 1, value = 0.1, step = 0.01),
+
+ numericInput("x_min", label = "Min for x-axis:", value = 0, min = 0, max = 1),
+ numericInput("x_max", label = "Max for x-axis:", value = 1, min = 0, max = 1)
+)
+
+renderPlot({
+ pp <- data.frame(p_hat = rep(0, 5000))
+ for(i in 1:5000){
+ samp <- sample(c(TRUE, FALSE), input$n, replace = TRUE,
+ prob = c(input$p, 1 - input$p))
+ pp$p_hat[i] <- sum(samp == TRUE) / input$n
+ }
+ bw <- diff(range(pp$p_hat)) / 30
+ ggplot(data = pp, aes(x = p_hat)) +
+ geom_histogram(binwidth = bw) +
+ xlim(input$x_min, input$x_max) +
+ ggtitle(paste0("Distribution of p_hats, drawn from p = ", input$p, ", n = ", input$n))
+})
+```
+
+1. Describe the sampling distribution of sample proportions at $n = 300$ and
+ $p = 0.1$. Be sure to note the center, spread, and shape.
+
+1. Keep $n$ constant and change $p$. How does the shape, center, and spread
+ of the sampling distribution vary as $p$ changes? You might want to adjust
+ min and max for the $x$-axis for a better view of the distribution.
+
+1. Now also change $n$. How does $n$ appear to affect the distribution of $\hat{p}$?
+
+1. If you refer to Table 6, you'll find that Australia has a sample
+ proportion of 0.1 in a sample size of 1040, and that Ecuador has a sample
+ proportion of 0.02 on 400 subjects. Let's suppose for this exercise that
+ these point estimates are actually the truth. Construct their sampling
+ distributions by using these values as inputs in the app. Do you think it
+ is sensible to proceed with inference and report margins of error, as the
+ report does?
+
+* * *
+
+## More Practice
+
+The question of atheism was asked by WIN-Gallup International in a similar
+survey that was conducted in 2005. (We assume here that sample sizes have
+remained the same.) Table 4 on page 13 of the report summarizes survey results
+from 2005 and 2012 for 39 countries.
+
+
+1. Is there convincing evidence that Spain has seen a change in its atheism index
+ between 2005 and 2012? As always, write out the hypotheses for any tests you
+ conduct and outline the status of the conditions for inference. If you find a
+ significant difference, also quantify this difference with a confidence interval. \
+ *Hint:* Use the difference of two proportions methodology (i.e. find the
+ observed difference, compute the standard error, compute the z-score, etc.)
+
+1. Is there convincing evidence that the US has seen a change in its atheism index
+ between 2005 and 2012? As always, write out the hypotheses for any tests you
+ conduct and outline the status of the conditions for inference. If you find a
+ significant difference, also quantify this difference with a confidence interval.
+
+1. If in fact there has been no change in the atheism index in the countries
+ listed in Table 4, in how many of those countries would you expect to
+ detect a change (at a significance level of 0.05) simply by chance?\
+ *Hint:* Review the definition of the Type 1 error.
+
+1. Suppose you're hired by the local government to estimate the proportion of
+ residents that attend a religious service on a weekly basis. According to
+ the guidelines, the estimate must have a margin of error no greater than
+ 1% with 95% confidence. You have no idea what to expect for $p$. How many
+ people would you have to sample to ensure that you are within the
+ guidelines?\
+ *Hint:* Refer to your plot of the relationship between $p$ and margin of
+ error. This question does not require using the dataset.
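+
+As a reference for the *Hint* in the first exercise above, here is a minimal
+sketch of the difference-of-two-proportions mechanics. The sample proportions
+and sample sizes below are placeholders to be replaced with the values you
+read off Table 4; only the calculations are shown.
+
+```{r two-prop-sketch, eval=FALSE}
+p_hat_05 <- 0.10   # placeholder: 2005 sample proportion from Table 4
+p_hat_12 <- 0.09   # placeholder: 2012 sample proportion from Table 4
+n_05     <- 1000   # placeholder: 2005 sample size
+n_12     <- 1000   # placeholder: 2012 sample size
+
+# pooled proportion and standard error under the null hypothesis of no change
+p_pool <- (p_hat_05 * n_05 + p_hat_12 * n_12) / (n_05 + n_12)
+se_ht  <- sqrt(p_pool * (1 - p_pool) * (1 / n_05 + 1 / n_12))
+z      <- (p_hat_12 - p_hat_05) / se_ht
+2 * pnorm(-abs(z))                        # two-sided p-value
+
+# unpooled standard error for the confidence interval of the difference
+se_ci <- sqrt(p_hat_05 * (1 - p_hat_05) / n_05 + p_hat_12 * (1 - p_hat_12) / n_12)
+(p_hat_12 - p_hat_05) + c(-1, 1) * qnorm(0.975) * se_ci   # 95% CI
+```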
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
+
diff --git a/sampling_distributions/sampling_distributions.html b/docs/inf_for_categorical_data.html
similarity index 97%
rename from sampling_distributions/sampling_distributions.html
rename to docs/inf_for_categorical_data.html
index e1da1a6..b451e90 100644
--- a/sampling_distributions/sampling_distributions.html
+++ b/docs/inf_for_categorical_data.html
@@ -10,7 +10,7 @@
-Foundations for statistical inference - Sampling distributions
+Inference for categorical data
@@ -47,7 +47,7 @@
-
+
@@ -72,129 +72,102 @@
-
Foundations for statistical inference - Sampling distributions
+
Inference for categorical data
-
In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
-
-
The data
-
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
-
area <-ames$Gr.Liv.Area
-price <-ames$SalePrice
-
Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.
-
summary(area)
-hist(area)
+
In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?” This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what’s at play when making inference about population proportions using categorical data.
+
+
The survey
+
Take a moment to review the press release for the poll conducted by WIN-Gallup International then address the following questions.
-
Describe this population distribution.
+
In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
+
The title of the report is “Global Index of Religiosity and Atheism.” To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
-
-
The unknown sampling distribution
-
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
-
If we were interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.
-
samp1 <-sample(area, 50)
-
This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
-
-
Describe the distribution of this sample. How does it compare to the distribution of the population?
-
-
If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.
-
mean(samp1)
-
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
+
+
The data
+
Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load the necessary packages and the data set into R with the following command.
+
library(oilabs)
+library(mosaic)
+data(atheism)
-
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
+
What does each row of Table 6 correspond to? What does each row of atheism correspond to?
-
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.
If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.
-
hist(sample_means50, breaks =25)
-
Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. On the next page, we’ll review how this set of code works.
+
To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We should be able to come to the same number using the atheism data.
-
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
+
Using the command below, create a new data frame called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
Let’s take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area then save the mean of that sample into the sample_means50 vector. Without the for loop, this would be painful:
With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.
Let’s consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 zeros called sample_means50. This vector will store values generated within the for loop.
-
The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i = 1, then once when i = 2, and so on up to i = 5000.
-
The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)th element of sample_means50.
-
In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.
-
The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.
+
+
Inference on proportions
+
+As was hinted at in Exercise 1, Table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. You answer the question, “What proportion of people in your sample reported being atheists?” with a statistic; the question “What proportion of people on earth would report being atheists?” is answered with an estimate of the parameter.
+
The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.
-
To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
+
Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
-
-
-
Sample size and the sampling distribution
-
Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.
-
hist(sample_means50)
-
The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
-
To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one.
-
To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.
The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
+
If the conditions for inference are reasonable (check this!), we can calculate the standard error and construct the confidence interval.
Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to specify what constitutes a “success”, which here is a response of "atheist". Secondly, the qnorm function helps us find the width (in terms of the number of standard deviations from the mean) that our confidence interval needs to be in order to achieve a 95% confidence level. Note that since the normal distribution is symmetric, by cutting off the smallest 2.5% and the largest 2.5%, we’re left with the middle 95%. By changing the argument to qnorm we can find intervals that correspond to different confidence levels.
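+A sketch of what that calculation could look like (not necessarily the lab's original chunk; the response column of us12 being coded "atheist" is an assumption):
+
+p_hat <- sum(us12$response == "atheist") / nrow(us12)   # sample proportion
+se    <- sqrt(p_hat * (1 - p_hat) / nrow(us12))         # standard error
+me    <- qnorm(0.975) * se                              # margin of error at 95% confidence
+c(p_hat - me, p_hat + me)                               # confidence interval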
+
Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is \(\pm\) 3-5% at 95% confidence.”
-
When the sample size is larger, what happens to the center? What about the spread?
+
+Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?
+
Calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets to construct the confidence intervals.
+
+
+
+
How does the proportion affect the margin of error?
+
Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
+
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \(ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n}\). Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\) for all values of \(p\) between 0 and 1.
+
The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01. We can then create a vector of the margin of error (ME) associated with each of these values of p using the familiar approximate formula (\(ME = 1.96 \times SE\)). Lastly, we plot the two vectors against each other to reveal their relationship.
+
+n <- 1000
+p <- seq(from = 0, to = 1, by = 0.01)
+ME <- 2 * sqrt(p * (1 - p) / n)
+xyplot(ME ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
+
+
Describe the relationship between p and ME.
+
+
+
+
Success-failure condition
+
The textbook emphasizes that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes one wonder: what’s so special about the number 10?
+
+The short answer is: nothing. You could argue that we would be fine with 9 or that we really should be using 11. The “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
+
We can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of 0.1. For each of the 5000 samples we compute \(\hat{p}\) and then plot a histogram to visualize their distribution.
+
+Here, we first resample \(n\) times from the list of responses, with the probability of drawing "atheist" equal to \(p = 0.1\). Then we tally the proportion of those responses that are "atheist". We do this 5,000 times.
+
+p <- 0.1
+n <- 1040
+responses <- c("atheist", "non_atheist")
+
+p_hats <- do(5000) *
+  responses %>%
+  resample(size = n, prob = c(p, 1 - p)) %>%
+  tally(format = "proportion")
+
+histogram(~atheist, data = p_hats, main = "p = 0.1, n = 1040")
+
These commands build up the sampling distribution of \(\hat{p}\) using the familiar do loop. You can read the sampling procedure for the inner bit of code as, “take a sample of size \(n\) with replacement from the choices of atheist and non-atheist with probabilities \(p\) and \(1 - p\), respectively.” The tally command says, “calculate the proportion of atheists in this sample and record this value.” The loop allows us to repeat this process 5,000 times to build a good representation of the sampling distribution.
+
+
Describe the sampling distribution of sample proportions at \(n = 1040\) and \(p = 0.1\). Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.
+
Repeat the above simulation three more times but with modified sample sizes and proportions: for \(n = 400\) and \(p = 0.1\), \(n = 1040\) and \(p = 0.02\), and \(n = 400\) and \(p = 0.02\). Plot all four histograms. Describe the three new sampling distributions. Based on these limited plots, how does \(n\) appear to affect the distribution of \(\hat{p}\)? How does \(p\) affect the sampling distribution?
+
+If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
On your own
-
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
+
The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.
-
Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
-
Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
-
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
-
Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
+
Answer the following two questions. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
+
+a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
+
b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
+
If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance? Hint: Look in the textbook index under Type 1 error.
+
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between \(p\) and margin of error. Do not use the data set to answer this question.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
diff --git a/docs/inf_for_numerical_data.html b/docs/inf_for_numerical_data.html
new file mode 100644
index 0000000..84c77bf
--- /dev/null
+++ b/docs/inf_for_numerical_data.html
@@ -0,0 +1,361 @@
+
+Inference for numerical data
+
Inference for numerical data
+
+
+
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.
+
+
+
The data
+
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
+
Load the nc data set into our workspace.
+
data(nc)
+
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
+
?nc
+
+
What are the cases in this data set? How many cases are there in our sample?
+
+
Remember that you can answer this question by viewing the data in the data viewer or by using the following command:
+
glimpse(nc)
+
+
+
+
Exploratory data analysis
+
We will start by analyzing the weight gained by mothers throughout the pregnancy: gained.
+
Using visualization and summary statistics, describe the distribution of weight gained by mothers during pregnancy. The favstats function from mosaic can be useful.
+
library(mosaic)
+favstats(~gained, data = nc)
+
+
How many mothers are we missing weight gain data from?
+
+
Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
+
+
Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
+
+
+The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions. The following chunk first groups the data by the habit variable, and then calculates the mean weight in these groups using the mean function.
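+A minimal dplyr sketch of that comparison (assuming the nc data frame loaded above):
+
+nc %>%
+  group_by(habit) %>%
+  summarise(mean_weight = mean(weight))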
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
+
+
+
Inference
+
+
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
+
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
+
+
Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.
+
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0,
+          alternative = "twosided", method = "theoretical")
+
Let’s pause for a moment to go through the arguments of this custom function. The first argument is y, which is the response variable that we are interested in: weight. The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, smokers and non-smokers: habit. The third argument, data, is the data frame these variables are stored in. Next is statistic, which is the sample statistic we’re using, or similarly, the population parameter we’re estimating. In future labs we’ll also work with “median” and “proportion”. Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci"). When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other. The alternative hypothesis can be "less", "greater", or "twosided". Lastly, the method of inference can be "theoretical" or "simulation" based.
+
For more information on the inference function see the help file with ?inference.
+
+
Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to nonsmoking and smoking mothers, and interpret this interval in context of the data. Note that by default you’ll get a 95% confidence interval. If you want to change the confidence level, add a new argument (conf_level) which takes on a value between 0 and 1. Also note that when doing a confidence interval arguments like null and alternative are not useful, so make sure to remove them.
+
+
+By default the function reports an interval for (\(\mu_{nonsmoker} - \mu_{smoker}\)). We can easily change this order by using the order argument:
+
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci",
+          method = "theoretical", order = c("smoker", "nonsmoker"))
+
+
+
+
More Practice
+
+
Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
+
Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conf_level = 0.90. Comment on the width of this interval versus the one obtained in the previous exercise.
+
Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.
+
Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
+
Pick a pair of variables: one numerical (response) and one categorical (explanatory). Come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context. (Note: Picking your own variables, coming up with a research question, and analyzing the data to answer this question is basically what you’ll need to do for your project as well.)
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
+
+
Getting started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+
Creating a reproducible lab report
+
Remember that we will be using R Markdown to create reproducible lab reports. See the following video describing how to get started with creating these reports for this lab, and all future labs:
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes available transportation data, such as the flights data we will be working with in this lab.
+
We begin by loading the nycflights data frame. Type the following in your console to load the data:
+
data(nycflights)
+
The data set nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs. For this data set, each observation is a single flight.
+
To view the names of the variables, type the command
+
names(nycflights)
+
This returns the names of the variables in this data frame. The codebook (description of the variables) can be accessed by pulling up the help file:
+
?nycflights
+
One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.
+
+
carrier: Two letter carrier abbreviation.
+
+
9E: Endeavor Air Inc.
+
AA: American Airlines Inc.
+
AS: Alaska Airlines Inc.
+
B6: JetBlue Airways
+
DL: Delta Air Lines Inc.
+
EV: ExpressJet Airlines Inc.
+
F9: Frontier Airlines Inc.
+
FL: AirTran Airways Corporation
+
HA: Hawaiian Airlines Inc.
+
MQ: Envoy Air
+
OO: SkyWest Airlines Inc.
+
UA: United Air Lines Inc.
+
US: US Airways Inc.
+
VX: Virgin America
+
WN: Southwest Airlines Co.
+
YV: Mesa Airlines Inc.
+
+
+
A very useful function for taking a quick peek at your data frame and viewing its dimensions and data types is str, which stands for structure.
+
str(nycflights)
+
The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:
+
+
How delayed were flights that were headed to Los Angeles?
+
How do departure delays vary over months?
+
Which of the three major NYC airports has a better on time percentage for departing flights?
+
+
+
+
+
Analysis
+
+
Lab report
+
To record your analysis in a reproducible format, you can adapt the general Lab Report template from the oilabs package. Watch the video above to learn how.
+
+
+
Departure delays
+
Let’s start by examing the distribution of departure delays of all flights with a histogram.
+
qplot(x = dep_delay, data = nycflights, geom = "histogram")
+
This function says to plot the dep_delay variable from the nycflights data frame on the x-axis. It also defines a geom (short for geometric object), which describes the type of plot you will produce.
+
Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:
+
qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 15)
+qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 150)
+
+
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
+
+
If we want to focus only on departure delays of flights headed to Los Angeles, we need to first filter the data for flights with that destination (dest == "LAX") and then make a histogram of the departure delays of only those flights.
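+A sketch of those two commands, consistent with the description that follows (the lax_flights name is taken from the summaries further down):
+
+lax_flights <- nycflights %>%
+  filter(dest == "LAX")
+qplot(x = dep_delay, data = lax_flights, geom = "histogram")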
Let’s decipher these two commands (OK, so it might look like three lines, but the first two physical lines of code are actually part of the same command. It’s common to add a break to a new line after %>% to help readability).
+
+
Command 1: Take the nycflights data frame, filter for flights headed to LAX, and save the result as a new data frame called lax_flights.
+
+
== means “if it’s equal to”.
+
LAX is in quotation marks since it is a character string.
+
+
Command 2: Basically the same qplot call from earlier for making a histogram, except that it uses the smaller data frame for flights headed to LAX instead of all flights.
+
+
+
Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so we use the filter function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:
+
+
== means “equal to”
+
!= means “not equal to”
+
> or < means “greater than” or “less than”
+
>= or <= means “greater than or equal to” or “less than or equal to”
+
+
+
We can also obtain numerical summaries for these flights:
+
lax_flights %>%
+  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
+
Note that in the summarise function we created a list of three different numerical summaries that we were interested in. The names of these elements are user defined, like mean_dd, median_dd, n, and you could customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Note that n() reports the sample size.
+
+
Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:
+
+
mean
+
median
+
sd
+
var
+
IQR
+
min
+
max
+
+
Note that each of these functions takes a single vector as an argument and returns a single value.
+
+
We can also filter based on multiple criteria. Suppose we are interested in flights headed to San Francisco (SFO) in February:
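+A sketch of such a filter (assuming month is numeric, with February coded as 2):
+
+nycflights %>%
+  filter(dest == "SFO", month == 2)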
Note that we can separate the conditions using commas if we want flights that are both headed to SFO and in February. If we are interested in either flights headed to SFO or in February we can use the | instead of the comma.
+
+
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
+
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
+
+
Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can modify the above command using the group_by function to get the same summary stats for each origin airport:
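+A sketch of that modification (applying it to the sfo_feb_flights data frame from the exercises above is an assumption):
+
+sfo_feb_flights %>%
+  group_by(origin) %>%
+  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())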
Here, we first grouped the data by origin, and then calculated the summary statistics.
+
+
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
+
+
+
+
Departure delays over months
+
Which month would you expect to have the highest average delay departing from an NYC airport?
+
Let’s think about how we would answer this question:
+
+
First, calculate monthly averages for departure delays. With the new language we are learning, we need to
+
+
group_by months, then
+
summarise mean departure delays.
+
+
Then, we need to arrange these average delays in descending order
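+A sketch of those steps chained together (the summary name mean_dd is a placeholder):
+
+nycflights %>%
+  group_by(month) %>%
+  summarise(mean_dd = mean(dep_delay)) %>%
+  arrange(desc(mean_dd))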
Suppose you really dislike departure delays, and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
+
+
+
+
+
On time departure rate for NYC airports
+
Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Suppose also that for you a flight that is delayed for less than 5 minutes is basically “on time”. You consider any flight delayed for 5 minutes or more to be “delayed”.
+
In order to determine which airport has the best on time departure rate, we need to
+
+
first classify each flight as “on time” or “delayed”,
+
then group flights by origin airport,
+
then calculate on time departure rates for each origin airport,
+
and finally arrange the airports in descending order for on time departure percentage.
+
+
Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable with the mutate function.
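+A sketch of that step, consistent with the description that follows:
+
+nycflights <- nycflights %>%
+  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))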
The first argument in the mutate function is the name of the new variable we want to create, in this case dep_type. Then, if dep_delay < 5, we classify the flight as "on time"; otherwise, i.e. if the flight is delayed for 5 or more minutes, we classify it as "delayed".
+
Note that we are also overwriting the nycflights data frame with the new version of this data frame that includes the new dep_type variable.
+
We can handle all the remaining steps in one code chunk:
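+For example, a sketch of those remaining steps (the name ot_dep_rate is a placeholder):
+
+nycflights %>%
+  group_by(origin) %>%
+  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
+  arrange(desc(ot_dep_rate))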
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
+
+
We can also visualize the distribution of on time departure rate across the three airports using a segmented bar plot.
+
qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar")
+
+
+
+
+
More Practice
+
+
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
+
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom = "point".
+
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.
+
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
+
Go ahead and launch RStudio. You should see a window that looks like the image shown below.
+
+
The panel on the lower left is where the action happens. It’s called the console. Everytime you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
+
The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered.
+
Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.
+
+
R Packages
+
R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:
+
+
dplyr: for data wrangling
+
ggplot2: for data visualization
+
oilabs: for data and custom functions with the OpenIntro labs
+
+
If these packages are not already available in your R environment, install them by typing the following three lines of code into the console of your RStudio session, pressing the enter/return key after each one. Note that you can check to see which packages (and which versions) are installed by inspecting the Packages tab in the lower right panel of RStudio.
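+The three lines are presumably plain install.packages calls, one per package; a minimal sketch:
+
+install.packages("dplyr")
+install.packages("ggplot2")
+install.packages("oilabs")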
You may need to select a server from which to download; any of them will work. Next, you need to load these packages in your working environment. We do this with the library function. Run the following three lines in your console.
+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
Note that you only need to install packages once, but you need to load them each time you relaunch RStudio.
+
+
+
Creating a reproducible lab report
+
We will be using R Markdown to create reproducible lab reports. See the following videos describing why and how:
Going forward you should refrain from typing your code directly in the console, and instead type any code (final correct answer, or anything you’re just trying out) in the R Markdown file and run the chunk using either the Run button on the chunk (green sideways triangle) or by highlighting the code and clicking Run on the top right corner of the R Markdown editor. If at any point you need to start over, you can Run All Chunks above the chunk you’re working in by clicking on the down arrow in the code chunk.
+
+
+
+
Dr. Arbuthnot’s Baptism Records
+
To get you started, run the following command to load the data.
+
data(arbuthnot)
+
You can do this by
+
+
clicking on the green arrow at the top right of the code chunk in the R Markdown (Rmd) file, or
+
putting your cursor on this line, and hit the Run button on the upper right corner of the pane, or
+
hitting Ctrl-Shift-Enter, or
+
typing the code in the console.
+
+
This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
+
The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can view the data by typing its name into the console.
+
arbuthnot
+
However printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.
+
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.
+
Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame.
+
You can see the dimensions of this data frame as well as the names of the variables and the first few observations by typing:
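+A sketch of that command, judging from the reference to glimpse in the next paragraph:
+
+glimpse(arbuthnot)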
We can see that there are 82 observations and 3 variables in this dataset. The variable names are year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The glimpse command, for example, took a single argument, the name of a data frame.
+
+
+
Some Exploration
+
Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like
+
arbuthnot$boys
+
This command will only show the number of boys baptized each year. The dollar sign basically says “go to the data frame that comes before me, and find the variable that comes after me”.
+
+
What command would you use to extract just the counts of girls baptized? Try it!
+
+
Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
+
+
Data visualization
+
R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command
+
qplot(x = year, y = girls, data = arbuthnot)
+
The qplot() function (meaning “quick plot”) considers the type of data you have provided it and makes the decision to visualize it with a scatterplot. The plot should appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with three arguments separated by commas. The first two arguments in the qplot() function specify the variables for the x-axis and the y-axis and the third provides the name of the data set where they can be found. If we wanted to connect the data points with lines, we could add a fourth argument to specify the geometry that we’d like.
+
+qplot(x = year, y = girls, data = arbuthnot, geom = "line")
+
You might wonder how you are supposed to know that it was possible to add that fourth argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following.
+
?qplot
+
Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
+
+
Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)
+
+
+
+
R as a big calculator
+
Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like
+
5218 + 4683
+
to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously.
+
arbuthnot$boys + arbuthnot$girls
+
What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right.
+
+
+
Adding a new variable to the data frame
+
We’ll be using this new vector to generate some plots, so we’ll want to save it as a permanent column in our data frame.
+
arbuthnot <- arbuthnot %>%
+  mutate(total = boys + girls)
+
The %>% operator is called the piping operator. It takes the output of the previous expression and pipes it into the first argument of the function in the following one. To continue our analogy with mathematical functions, x %>% f(y) is equivalent to f(x, y).
+
+
A note on piping: Note that we can read these three lines of code as the following:
+
“Take the arbuthnot dataset and pipe it into the mutate function. Mutate the arbuthnot data set by creating a new variable called total that is the sum of the variables called boys and girls. Then assign the resulting dataset to the object called arbuthnot, i.e. overwrite the old arbuthnot dataset with the new one containing the new variable.”
+
This is equivalent to going through each row and adding up the boys and girls counts for that year and recording that value in a new column called total.
+
+
+
Where is the new variable? When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.
+
+
You’ll see that there is now a new column called total that has been tacked on to the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
+
We can make a plot of the total number of baptisms per year with the command
+
qplot(x = year, y = total, data = arbuthnot, geom = "line")
+
Similarly to how we computed the total number of births, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with
+
5218 / 4683
+
or we can act on the complete columns with the expression
+
arbuthnot <- arbuthnot %>%
+  mutate(boy_to_girl_ratio = boys / girls)
+
We can also compute the proportion of newborns that are boys in 1629
+
5218 / (5218 + 4683)
+
+or we can compute this proportion for all years simultaneously and append it to the dataset:
+
arbuthnot <- arbuthnot %>%
+  mutate(boy_ratio = boys / total)
+
Note that we are using the new total variable we created earlier in our calculations.
+
+
Now, generate a plot of the proportion of boys born over time. What do you see?
+
+
+
Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
+
+
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression
+
arbuthnot <- arbuthnot %>%
+  mutate(more_boys = boys > girls)
+
This command adds a new variable to the arbuthnot data frame containing either TRUE, if that year had more boys than girls, or FALSE, if that year did not (the answer may surprise you). This variable contains a different kind of data than we have encountered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
+
+
+
+
+
More Practice
+
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. Load the present day data with the following command.
+
data(present)
+
The data are stored in a data frame called present.
+
+
What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
+
How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
+
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Ex 3 above, just replace the dataframe name.
+
In what year did we see the largest total number of births in the U.S.? Hint: First calculate the totals and save them as a new variable. Then, sort your dataset in descending order based on the total column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in your report you will need to use two new functions: arrange (for sorting) and desc (for descending order). Sample code is provided below.
+
+
present %>%
+arrange(desc(total))
+
These data come from reports by the Centers for Disease Control. You can learn more about them by bringing up the help file using the command ?present.
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
+
+
+
+
+
Resources for learning R and working in RStudio
+
That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.
+
In this course we will be using R packages called dplyr for data wrangling and ggplot2 for data visualization. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with ggplot2”.
+
These cheatsheets may come in handy throughout the semester:
Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown here. It includes examples of working with R Markdown files in RStudio, recorded as GIFs.
+
Note that some of the code on these cheatsheets may be too advanced for this course; however, the majority of it will become useful throughout the semester.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/lab.css b/docs/lab.css
new file mode 100644
index 0000000..567e82e
--- /dev/null
+++ b/docs/lab.css
@@ -0,0 +1,87 @@
+body {
+ counter-reset: li; /* initialize counter named li */
+}
+
+h1 {
+ font-family:Arial, Helvetica, sans-serif;
+ font-weight:bold;
+}
+
+h2 {
+ font-family:Arial, Helvetica, sans-serif;
+ font-weight:bold;
+ margin-top: 24px;
+}
+
+ol {
+ margin-left:0; /* Remove the default left margin */
+ padding-left:0; /* Remove the default left padding */
+}
+ol > li {
+ position:relative; /* Create a positioning context */
+ margin:0 0 10px 2em; /* Give each list item a left margin to make room for the numbers */
+ padding:10px 80px; /* Add some spacing around the content */
+ list-style:none; /* Disable the normal item numbering */
+ border-top:2px solid #317EAC;
+ background:rgba(49, 126, 172, 0.1);
+}
+ol > li:before {
+ content:"Exercise " counter(li); /* Use the counter as content */
+ counter-increment:li; /* Increment the counter by 1 */
+ /* Position and style the number */
+ position:absolute;
+ top:-2px;
+ left:-2em;
+ -moz-box-sizing:border-box;
+ -webkit-box-sizing:border-box;
+ box-sizing:border-box;
+ width:7em;
+ /* Some space between the number and the content in browsers that support
+ generated content but not positioning it (Camino 2 is one example) */
+ margin-right:8px;
+ padding:4px;
+ border-top:2px solid #317EAC;
+ color:#fff;
+ background:#317EAC;
+ font-weight:bold;
+ font-family:"Helvetica Neue", Arial, sans-serif;
+ text-align:center;
+}
+li ol,
+li ul {margin-top:6px;}
+ol ol li:last-child {margin-bottom:0;}
+
+.oyo ul {
+ list-style-type:decimal;
+}
+
+hr {
+ border: 1px solid #357FAA;
+}
+
+div#boxedtext {
+ background-color: rgba(86, 155, 189, 0.2);
+ padding: 20px;
+ margin-bottom: 20px;
+ font-size: 10pt;
+}
+
+div#template {
+ margin-top: 30px;
+ margin-bottom: 30px;
+ color: #808080;
+ border:1px solid #808080;
+ padding: 10px 10px;
+ background-color: rgba(128, 128, 128, 0.2);
+ border-radius: 5px;
+}
+
+div#license {
+ margin-top: 30px;
+ margin-bottom: 30px;
+ color: #4C721D;
+ border:1px solid #4C721D;
+ padding: 10px 10px;
+ background-color: rgba(76, 114, 29, 0.2);
+ border-radius: 5px;
+}
\ No newline at end of file
diff --git a/docs/multiple_regression.html b/docs/multiple_regression.html
new file mode 100644
index 0000000..23f8ba0
--- /dev/null
+++ b/docs/multiple_regression.html
@@ -0,0 +1,310 @@
Multiple linear regression
+
+
+
+
+
+
Grading the professor
+
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" by Hamermesh and Parker found that instructors who are viewed to be better looking receive higher instructional ratings.
+
In this lab we will analyze the data from this study in order to learn what goes into a positive professor evaluation.
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for the OpenIntro labs, oilabs.
This is the first time we’re using the GGally package. We will be using the ggpairs function from this package later in the lab.
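As in the other labs, these packages can be loaded with library() calls; a minimal sketch (assuming all four packages are already installed) might look like the following.

library(dplyr)
library(ggplot2)
library(oilabs)
library(GGally)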
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package. Make sure that all necessary packages are loaded in your R Markdown document.
+
+
+
The data
+
The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance. The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.
+
Let’s load the data:
+
data(evals)
+
We have observations on 21 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
+
?evals
+
+
+
+
Exploring the data
+
+
Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
+
Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
+
Excluding score, select two other variables and describe their relationship with each other using an appropriate visualization.
+
+
+
+
Simple linear regression
+
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:
+
qplot(data = evals, x = bty_avg, y = score)
+
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
+
+
Replot the scatterplot, but this time use geom = "jitter". What was misleading about the initial scatterplot?
+
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter")
+
+
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
+
+
Add the line of the best fit model to your plot using the following:
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
  geom_smooth(method = "lm")
+
The blue line is the model. The shaded gray area around the line tells us about the variability we might expect in our predictions. To turn that off, use se = FALSE.
+
qplot(data = evals, x = bty_avg, y = score, geom = "jitter") +
  geom_smooth(method = "lm", se = FALSE)
+
+
Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
+
+
+
+
Multiple linear regression
+
The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.
+
qplot(data = evals, x = bty_f1lower, y = bty_avg)

evals %>%
  summarise(cor(bty_avg, bty_f1lower))
+
As expected the relationship is quite strong—after all, the average score is calculated using the individual scores. We can actually look at the relationships between all beauty variables (columns 13 through 19) using the following command:
+
evals %>%
  select(contains("bty")) %>%
  ggpairs()
+
These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
+
In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.
+
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
+
+
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.
+
Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?
+
+
Note that the estimate for gender is now called gendermale. You’ll see this name change whenever you introduce a categorical variable. The reason is that R recodes gender from having the values of female and male to being an indicator variable called gendermale that takes a value of \(0\) for females and a value of \(1\) for males. (Such variables are often referred to as “dummy” variables.)
+
As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?
+
+
The decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning. R simply codes the category that comes first alphabetically as a \(0\). (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the relevel() function. Use ?relevel to learn more.)
+
+
Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.
+
+
The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for bty_avg reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher while holding all other variables constant. In this case, that translates into considering only professors of the same rank with bty_avg scores that are one point apart.
+
+
+
The search for the best model
+
We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.
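A sketch of what fitting such a full model might look like is shown below; the column names (for example, cls_perc_eval for the proportion of students that filled out evaluations) are assumed from the evals help file, and m_full is just a convenient name, so check ?evals before running it.

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval +
             cls_students + cls_level + cls_profs + cls_credits + bty_avg +
             pic_outfit + pic_color, data = evals)
summary(m_full)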
+
+
Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable you would expect not to have any association with the professor score.
Check your suspicions from the previous exercise. Include the model output in your response.
+
Interpret the coefficient associated with the ethnicity variable.
+
Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
+
Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.
+
Verify that the conditions for this model are reasonable using diagnostic plots.
+
The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
+
Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.
+
Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?
In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
+
+
The Data
+
This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults. Let’s take a quick peek at the first few rows of the data.
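A minimal sketch of loading and peeking at the data, assuming the data set is the bdims data frame from the oilabs package (the help file referenced below uses that name):

data(bdims)
head(bdims)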
You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. You can learn about what the variable names mean by bringing up the help page.
+
?bdims
+
We’ll be focusing on just three columns to get started: weight in kg (wgt), height in cm (hgt), and sex (m indicates male, f indicates female).
+
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
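One way to do this with dplyr's filter() is sketched below; mdims and fdims are the names used here for the male-only and female-only subsets (fdims is the name the rest of the lab refers to), and the code assumes sex is coded as "m" and "f" as described above.

mdims <- bdims %>%
  filter(sex == "m")
fdims <- bdims %>%
  filter(sex == "f")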
Make a plot (or plots) to visualize the distributions of men’s and women’s heights.
+How do their centers, shapes, and spreads compare?
+
+
+
+
The normal distribution
+
In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.
+
To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
+
fhgtmean <- mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
+
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.
After initializing a blank plot with the first command, the ggplot2 package allows us to add additional layers. The first layer is a density histogram. The second layer is a statistical function – the density of the normal curve, dnorm. We specify that we want the curve to have the same mean and standard deviation as the column of female heights. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.
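A sketch of those layers is given below; the exact arguments may differ slightly depending on your version of ggplot2, and the color choice is arbitrary.

qplot(x = hgt, data = fdims, geom = "blank") +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = fhgtmean, sd = fhgtsd), col = "tomato")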
+
+
Based on this plot, does it appear that the data follow a nearly normal distribution?
+
+
+
+
Evaluating the normal distribution
+
Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
+
qplot(sample = hgt, data = fdims, geom = "qq")
+
The x-axis values correspond to the quantiles of a theoretical normal curve with mean 0 and standard deviation 1 (i.e., the standard normal distribution). The y-axis values correspond to the quantiles of the original unstandardized sample data. However, even if we were to standardize the sample data values, the Q-Q plot would look identical. A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line. Any deviations from normality lead to deviations of these points from that line.
+
The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. We’re left with the same problem that we encountered with the histogram above: how close is close enough?
+
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
+
sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd)
+
The first argument indicates how many numbers you'd like to generate, which we specify to be the same as the number of heights in the fdims data set using the nrow() function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, sim_norm, as well as its normal probability plot.
+
+
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)
+
+
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It shows the Q-Q plot corresponding to the original data in the top left corner, and the Q-Q plots of 8 different data sets simulated from a normal distribution. It may be helpful to click the zoom button in the plot window.
+
qqnormsim(sample = hgt, data = fdims)
+
+
Does the normal probability plot for female heights look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
+
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
+
+
+
+
Normal probabilities
+
Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should we care?
+
It turns out that statisticians know a lot about the normal distribution. Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” (The study that published this data set is clear to point out that the sample was not random and therefore inference to a general population is not suggested. We do so here only as an exercise.)
+
If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().
+
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
+
Note that the function pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
+
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
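One way to compute this empirical probability with dplyr is sketched below, using the fdims data frame from earlier.

fdims %>%
  filter(hgt > 182) %>%
  summarise(percent = n() / nrow(fdims))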
Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
+
+
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
+
+
+
+
+
More Practice
+
+
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
+
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.
+
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.
+
c. The histogram for general age (age) belongs to normal probability plot letter ____.
+
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.
+
Note that normal probability plots C and D have a slight stepwise pattern.
+Why do you think this is the case?
+
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
+
+
+
+
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
+
We do not expect to resolve this controversy today. However, in this lab we’ll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
library(ggplot2)
library(oilabs)
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.
+
+
+
Data
+
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA Finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let’s load some necessary files that we will need for this lab.
+
data(kobe_basket)
+
This data frame contains 133 observations and 6 variables, where every row records a shot taken by Kobe Bryant. The shot variable in this dataset indicates whether the shot was a hit (H) or a miss (M).
+
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
+
For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
+
\[ \textrm{H M | M | H H M | M | M | M} \]
+
You can verify this by viewing the first 9 rows of the data in the data viewer.
+
Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
+
+
What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?
+
+
Counting streak lengths manually for all 133 shots would get tedious, so we’ll use the custom function calc_streak to calculate them, and store the results in a data frame called kobe_streak as the length variable.
+
kobe_streak <- calc_streak(kobe_basket$shot)
+
We can then take a look at the distribution of these streak lengths.
+
qplot(data = kobe_streak, x = length, geom = "bar")
+
+
Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets? Make sure to include the accompanying plot in your answer.
+
+
+
+
+
Compared to What?
+
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had a hot hand? What can we compare them to?
+
To answer these questions, let's return to the idea of independence. Two processes are independent if the outcome of one process doesn't affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
+
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.
+
Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,
+
\[ P(\textrm{shot 1 = H}) = 0.45 \]
+
If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, let's say, 60%,

\[ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.60 \]

As a result of these increased probabilities, you'd expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.
In other words, making the first shot did nothing to affect the probability that he'd make his second shot. If Kobe's shots are independent, then he'd have the same probability of hitting every shot regardless of his past shots: 45%.
+
Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has a hot hand? We can compare his streak lengths to someone without a hot hand: an independent shooter.
+
+
+
Simulations in R
+
While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.
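A minimal sketch of such a simulation is shown below; the object name coin_outcomes is the one referred to in the next paragraph.

coin_outcomes <- c("heads", "tails")
sample(coin_outcomes, size = 1, replace = TRUE)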
The vector coin_outcomes can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample draws one slip from the hat and tells us if it was a head or a tail.
+
Run the second command listed above several times. Just like when flipping a coin, sometimes you’ll get a heads, sometimes you’ll get a tails, but in the long run, you’d expect to get roughly equal numbers of each.
+
If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw (the replace = TRUE argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called sim_fair_coin.
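A sketch of that call:

sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)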
To view the results of this simulation, type the name of the object and then use table to count up the number of heads and tails.
+
sim_fair_coin
table(sim_fair_coin)
+
Since there are only two elements in coin_outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights.
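For example, a sketch of the unfair-coin simulation (the object name sim_unfair_coin is chosen here for the result):

sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8))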
prob=c(0.2, 0.8) indicates that for the two elements in the outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled "head" and 8 chips "tail". Therefore at each draw, the probability of drawing a chip that says "head" is 20%, and "tail" is 80%.
+
+
In your simulation of flipping the unfair coin 100 times, how many flips came up heads? Include the code for sampling the unfair coin in your response. Since the markdown file will run the code, and generate a new sample each time you Knit it, you should also “set a seed” before you sample. Read more about setting a seed below.
+
+
+
A note on setting a seed: Setting a seed will cause R to select the same sample each time you knit your document. This will make sure your results don’t change each time you knit, and it will also ensure reproducibility of your work (by setting the same seed it will be possible to reproduce your results). You can set a seed like this:
+
set.seed(35797) # make sure to change the seed
+
The number above is completely arbitrary. If you need inspiration, you can use your ID, birthday, or just a random string of numbers. The important thing is that you use each seed only once. Remember to do this before you sample in the exercise above.
+
+
In a sense, we’ve shrunken the size of the slip of paper that says “heads”, making it less likely to be drawn and we’ve increased the size of the slip of paper saying “tails”, making it more likely to be drawn. When we simulated the fair coin, both slips of paper were the same size. This happens by default if you don’t provide a prob argument; all elements in the outcomes vector have an equal probability of being drawn.
+
If you want to learn more about sample or any other function, recall that you can always check out its help file.
+
?sample
+
+
+
Simulating the Independent Shooter
+
Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type,
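something along the lines of the sketch below; the vector name shot_outcomes is chosen here, and the result is saved as sim_basket, the name referred to later on.

shot_outcomes <- c("H", "M")
sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)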
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.
+
+
What change needs to be made to the sample function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called sim_basket.
+
+
Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.
+
With the results of the simulation saved as sim_basket, we have the data necessary to compare Kobe to our independent shooter.
+
Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a hot hand.
+
+
+
+
More Practice
+
+
Comparing Kobe Bryant to the Independent Shooter
+
+
Using calc_streak, compute the streak lengths of sim_basket, and save the results in a data frame called sim_streak.
+
Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player’s longest streak of baskets in 133 shots? Make sure to include a plot in your answer.
+
If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.
+
How does Kobe Bryant’s distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe’s shooting patterns? Explain.
+
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/sampling_distributions.Rmd b/docs/sampling_distributions.Rmd
new file mode 100644
index 0000000..5adcdb6
--- /dev/null
+++ b/docs/sampling_distributions.Rmd
@@ -0,0 +1,363 @@
+---
+title: "Foundations for statistical inference - Sampling distributions"
+runtime: shiny
+output:
+ html_document:
+ css: lab.css
+ highlight: pygments
+ theme: cerulean
+ toc: true
+ toc_float: true
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(eval = FALSE)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+data(ames)
+```
+
+In this lab, we investigate the ways in which the statistics from a random
+sample of data can serve as point estimates for population parameters. We're
+interested in formulating a *sampling distribution* of our estimate in order
+to learn about the properties of the estimate, such as its distribution.
+
+
+**Setting a seed:** We will take some random samples and build sampling distributions
+in this lab, which means you should set a seed on top of your lab. If this concept
+is new to you, review the lab concerning probability.
+
+
+## Getting Started
+
+### Load packages
+
+In this lab we will explore the data using the `dplyr` package and visualize it
+using the `ggplot2` package for data visualization. The data can be found in the
+companion package for OpenIntro labs, `oilabs`.
+
+Let's load the packages.
+
+```{r load-packages, message=FALSE}
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+```
+
+### Creating a reproducible lab report
+
+To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package.
+
+### The data
+
+We consider real estate data from the city of Ames, Iowa. The details of
+every real estate transaction in Ames is recorded by the City Assessor's
+office. Our particular focus for this lab will be all residential home sales
+in Ames between 2006 and 2010. This collection represents our population of
+interest. In this lab we would like to learn about these home sales by taking
+smaller samples from the full population. Let's load the data.
+
+```{r load-data}
+data(ames)
+```
+
+We see that there are quite a few variables in the data set, enough to do a
+very in-depth analysis. For this lab, we'll restrict our attention to just
+two of the variables: the above ground living area of the house in square feet
+(`area`) and the sale price (`price`).
+
+We can explore the distribution of areas of homes in the population of home
+sales visually and with summary statistics. Let's first create a visualization,
+a histogram:
+
+```{r area-hist}
+qplot(data = ames, x = area, binwidth = 250, geom = "histogram")
+```
+
+Let's also obtain some summary statistics. Note that we can do this using the
+`summarise` function. We can calculate as many statistics as we want using this
+function, and just string along the results. Some of the functions below should
+be self explanatory (like `mean`, `median`, `sd`, `IQR`, `min`, and `max`). A
+new function here is the `quantile` function which we can use to calculate
+values corresponding to specific percentile cutoffs in the distribution. For
+example `quantile(x, 0.25)` will yield the cutoff value for the 25th percentile (Q1)
+in the distribution of x. Finding these values is useful for describing the
+distribution, as we can use them for descriptions like *"the middle 50% of the
+homes have areas between such and such square feet"*.
+
+```{r area-stats}
+ames %>%
+ summarise(mu = mean(area), pop_med = median(area),
+ sigma = sd(area), pop_iqr = IQR(area),
+ pop_min = min(area), pop_max = max(area),
+ pop_q1 = quantile(area, 0.25), # first quartile, 25th percentile
+ pop_q3 = quantile(area, 0.75)) # third quartile, 75th percentile
+```
+
+1. Describe this population distribution using a visualization and these summary
+ statistics. You don't have to use all of the summary statistics in your
+ description, you will need to decide which ones are relevant based on the
+ shape of the distribution. Make sure to include the plot and the summary
+ statistics output in your report along with your narrative.
+
+## The unknown sampling distribution
+
+In this lab we have access to the entire population, but this is rarely the
+case in real life. Gathering information on an entire population is often
+extremely costly or impossible. Because of this, we often take a sample of
+the population and use that to understand the properties of the population.
+
+If we were interested in estimating the mean living area in Ames based on a
+sample, we can use the `sample_n` command to survey the population.
+
+```{r samp1}
+samp1 <- ames %>%
+ sample_n(50)
+```
+
+This command collects a simple random sample of size 50 from the `ames` dataset,
+and assigns the result to `samp1`. This is like going into the City
+Assessor's database and pulling up the files on 50 random home sales. Working
+with these 50 files would be considerably simpler than working with all 2930
+home sales.
+
+1. Describe the distribution of area in this sample. How does it compare to the
+ distribution of the population? **Hint:** the `sample_n` function takes a random
+ sample of observations (i.e. rows) from the dataset, so you can still refer to
+ the variables in the dataset with the same names. Code you used in the
+ previous exercise will also be helpful for visualizing and summarizing the sample;
+ however, be careful not to label the values `mu` and `sigma` anymore, since these
+ are sample statistics, not population parameters. You can customize the labels
+ of any of the statistics to indicate that these come from the sample.
+
+If we're interested in estimating the average living area in homes in Ames
+using the sample, our best single guess is the sample mean.
+
+```{r mean-samp1}
+samp1 %>%
+ summarise(x_bar = mean(area))
+```
+
+Depending on which 50 homes you selected, your estimate could be a bit above
+or a bit below the true population mean of `r round(mean(ames$area),2)` square feet. In general,
+though, the sample mean turns out to be a pretty good estimate of the average
+living area, and we were able to get it by sampling less than 3\% of the
+population.
+
+1. Would you expect the mean of your sample to match the mean of another team's
+ sample? Why, or why not? If the answer is no, would you expect the means to
+ just be somewhat different or very different? Ask a neighboring team to confirm
+ your answer.
+
+1. Take a second sample, also of size 50, and call it `samp2`. How does the
+ mean of `samp2` compare with the mean of `samp1`? Suppose we took two
+ more samples, one of size 100 and one of size 1000. Which would you think
+ would provide a more accurate estimate of the population mean?
+
+Not surprisingly, every time we take another random sample, we get a different
+sample mean. It's useful to get a sense of just how much variability we
+should expect when estimating the population mean this way. The distribution
+of sample means, called the *sampling distribution (of the mean)*, can help us understand
+this variability. In this lab, because we have access to the population, we
+can build up the sampling distribution for the sample mean by repeating the
+above steps many times. Here we will generate 15,000 samples and compute the
+sample mean of each. Note that we specify that
+`replace = TRUE` since sampling distributions are constructed by sampling
+with replacement.
+
+```{r loop}
+sample_means50 <- ames %>%
+ rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
+ summarise(x_bar = mean(area))
+
+qplot(data = sample_means50, x = x_bar)
+```
+
+Here we use R to take 15,000 different samples of size 50 from the population, calculate
+the mean of each sample, and store each result in a vector called
+`sample_means50`. Next, we review how this set of code works.
+
+1. How many elements are there in `sample_means50`? Describe the sampling
+ distribution, and be sure to specifically note its center. Make sure to include
+ a plot of the distribution in your answer.
+
+## Interlude: Sampling distributions
+
+The idea behind the `rep_sample_n` function is *repetition*. Earlier we took
+a single sample of size `n` (50) from the population of all houses in Ames. With
+this new function we are able to repeat this sampling procedure `rep` times in order
+to build a distribution of a series of sample statistics, which is called the
+**sampling distribution**.
+
+Note that in practice one rarely gets to build true sampling distributions,
+because we rarely have access to data from the entire population.
+
+Without the `rep_sample_n` function, this would be painful. We would have to
+manually run the following code 15,000 times
+```{r sample-code, eval=FALSE}
+ames %>%
+ sample_n(size = 50) %>%
+ summarise(x_bar = mean(area))
+```
+as well as store the resulting sample means each time in a separate vector.
+
+Note that for each of the 15,000 times we computed a mean, we did so from a
+**different** sample!
+
+1. To make sure you understand how sampling distributions are built, and exactly
+ what the `rep_sample_n` function does, try modifying the code to create a
+ sampling distribution of **25 sample means** from **samples of size 10**,
+ and put them in a data frame named `sample_means_small`. Print the output.
+ How many observations are there in this object called `sample_means_small`?
+ What does each observation represent?
+
+## Sample size and the sampling distribution
+
+Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to
+compute a sampling distribution, specifically, the sampling distribution of the
+mean home area for samples of 50 houses.
+
+```{r hist}
+qplot(data = sample_means50, x = x_bar, geom = "histogram")
+```
+
+The sampling distribution that we computed tells us much about estimating
+the average living area in homes in Ames. Because the sample mean is an
+unbiased estimator, the sampling distribution is centered at the true average
+living area of the population, and the spread of the distribution
+indicates how much variability is incurred by sampling only 50 home sales.
+
+In the remainder of this section we will work on getting a sense of the effect that
+sample size has on our sampling distribution.
+
+1. Use the app below to create sampling distributions of means of `area`s from
+ samples of size 10, 50, and 100. Use 5,000 simulations. What does each
+ observation in the sampling distribution represent? How does the mean, standard
+ error, and shape of the sampling distribution change as the sample size
+ increases? How (if at all) do these values change if you increase the number
+ of simulations? (You do not need to include plots in your answer.)
+
+```{r shiny, echo=FALSE, eval=TRUE}
+shinyApp(
+ ui <- fluidPage(
+
+ # Sidebar with inputs for the variable, sample size, and number of samples
+ sidebarLayout(
+ sidebarPanel(
+
+ selectInput("selected_var",
+ "Variable:",
+ choices = list("area", "price"),
+ selected = "area"),
+
+ numericInput("n_samp",
+ "Sample size:",
+ min = 1,
+ max = nrow(ames),
+ value = 30),
+
+ numericInput("n_sim",
+ "Number of samples:",
+ min = 1,
+ max = 30000,
+ value = 15000)
+
+ ),
+
+ # Show a plot of the generated distribution
+ mainPanel(
+ plotOutput("sampling_plot"),
+ verbatimTextOutput("sampling_mean"),
+ verbatimTextOutput("sampling_se")
+ )
+ )
+ ),
+
+ # Define server logic required to draw a histogram
+ server <- function(input, output) {
+
+ # create sampling distribution
+ sampling_dist <- reactive({
+ ames[[input$selected_var]] %>%
+ sample(size = input$n_samp * input$n_sim, replace = TRUE) %>%
+ matrix(ncol = input$n_samp) %>%
+ rowMeans() %>%
+ data.frame(x_bar = .)
+ #ames %>%
+ # rep_sample_n(size = input$n_samp, reps = input$n_sim, replace = TRUE) %>%
+ # summarise_(x_bar = mean(input$selected_var))
+ })
+
+ # plot sampling distribution
+ output$sampling_plot <- renderPlot({
+ x_min <- quantile(ames[[input$selected_var]], 0.1)
+ x_max <- quantile(ames[[input$selected_var]], 0.9)
+
+ ggplot(sampling_dist(), aes(x = x_bar)) +
+ geom_histogram() +
+ xlim(x_min, x_max) +
+ ylim(0, input$n_sim * 0.35) +
+ ggtitle(paste0("Sampling distribution of mean ",
+ input$selected_var, " (n = ", input$n_samp, ")")) +
+ xlab(paste("mean", input$selected_var)) +
+ theme(plot.title = element_text(face = "bold", size = 16))
+ })
+
+ # mean of sampling distribution
+ output$sampling_mean <- renderText({
+ paste0("mean of sampling distribution = ", round(mean(sampling_dist()$x_bar), 2))
+ })
+
+ # standard error of sampling distribution
+ output$sampling_se <- renderText({
+ paste0("SE of sampling distribution = ", round(sd(sampling_dist()$x_bar), 2))
+ })
+ },
+
+ options = list(height = 500)
+)
+```
+
+
+* * *
+
+## More Practice
+
+So far, we have only focused on estimating the mean living area in homes in
+Ames. Now you'll try to estimate the mean home price.
+
+Note that while you might be able to answer some of these questions using the app
+you are expected to write the required code and produce the necessary plots and
+summary statistics. You are welcome to use the app for exploration.
+
+1. Take a sample of size 15 from the population and calculate the mean `price`
+ of the homes in this sample. Using this sample, what is your best point estimate
+ of the population mean of prices of homes?
+
+1. Since you have access to the population, simulate the sampling
+ distribution of $\overline{price}$ for samples of size 15 by taking 2000
+ samples of size 15 from the population and computing 2000 sample means.
+ Store these means
+ in a vector called `sample_means15`. Plot the data, then describe the
+ shape of this sampling distribution. Based on this sampling distribution,
+ what would you guess the mean home price of the population to be? Finally,
+ calculate and report the population mean.
+
+1. Change your sample size from 15 to 150, then compute the sampling
+ distribution using the same method as above, and store these means in a
+ new vector called `sample_means150`. Describe the shape of this sampling
+ distribution, and compare it to the sampling distribution for a sample
+ size of 15. Based on this sampling distribution, what would you guess to
+ be the mean sale price of homes in Ames?
+
+1. Of the sampling distributions from 2 and 3, which has a smaller spread? If
+ we're concerned with making estimates that are more often close to the
+ true value, would we prefer a sampling distribution with a large or small spread?
+
+
+
+This is a product of OpenIntro that is released under a [Creative Commons
+Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
+This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
+
\ No newline at end of file
diff --git a/docs/simple_regression.html b/docs/simple_regression.html
new file mode 100644
index 0000000..95b2717
--- /dev/null
+++ b/docs/simple_regression.html
@@ -0,0 +1,307 @@
Introduction to linear regression
+
+
+
+
+
+
Batter up
+
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
+
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
+
+
+
Getting Started
+
+
Load packages
+
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
+
Let’s load the packages.
+
library(dplyr)
library(ggplot2)
library(oilabs)
+
+
+
Creating a reproducible lab report
+
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.
+
+
+
The data
+
Let’s load up the data for the 2011 season.
+
data(mlb11)
+
In addition to runs scored, there are seven traditionally-used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the three newer variables on your own.
+
+
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
+
+
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
+
mlb11 %>%
  summarise(cor(runs, at_bats))
+
+
+
+
Sum of squared residuals
+
+
In this section you will use an interactive function to investigate what we mean by “sum of squared residuals”. You will need to run this function in your console, not in your markdown document. Running the function also requires that the mlb11 dataset is loaded in your environment.
+
+
Think back to the way that we described the distribution of a single variable. Recall that we discussed characteristics such as center, spread, and shape. It’s also useful to be able to describe the relationship of two numerical variables, such as runs and at_bats above.
+
+
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
+
+
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
+
plot_ss(x = at_bats, y = runs, data = mlb11)
+
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:
+
\[
+ e_i = y_i - \hat{y}_i
+\]
+
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
+
plot_ss(x = at_bats, y = runs, data = mlb11, showSquares = TRUE)
+
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
+
+
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
+
+
+
+
The linear model
+
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
+
m1 <- lm(runs ~ at_bats, data = mlb11)
+
The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
+
The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
+
summary(m1)
+
Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:
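The exact numbers come from your own summary output; in general form, with \(b_0\) and \(b_1\) standing in for the estimated intercept and slope in the Coefficients table, the line is

\[ \hat{y} = b_0 + b_1 \times \textrm{at\_bats} \]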
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
+
+
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
+
+
+
+
Prediction and prediction errors
+
Let’s create a scatterplot with the least squares line for m1 laid on top.
+
qplot(x = at_bats, y = runs, data = mlb11, geom = "point") +
  geom_smooth(method = "lm", se = FALSE)
+
Here we are literally adding a layer on top of our plot. geom_smooth creates the line by fitting a linear model. It can also show us the standard error se associated with our line, but we’ll suppress that for now.
+
This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
+
+
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,579 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
+
+
+
+
Model diagnostics
+
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
+
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.
+
qplot(x = .fitted, y = .resid, data = m1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
+
Notice here that our model object m1 can also serve as a data set because stored within it are the fitted values (\(\hat{y}\)) and the residuals. Also note that we’re getting fancy with the code here. After creating the scatterplot on the first layer (first line of code), we overlay a horizontal dashed line at \(y = 0\) (to help us check whether residuals are distributed around 0), and we also adjust the axis labels to be more informative.
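+
If you want to peek at those stored quantities yourself, a quick sketch using the standard accessors for `lm` objects is:

```r
# First few fitted values and residuals stored in the model object.
head(m1$fitted.values)
head(resid(m1))
```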
+
+
Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
+
+
+
Nearly normal residuals: To check this condition, we can look at a histogram of the residuals and a normal probability plot.
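+
A sketch of both plots, in the same qplot style used above and described in the note that follows (the binwidth here is just a reasonable guess, not a value from the original lab):

```r
# Histogram of the residuals.
qplot(x = .resid, data = m1, geom = "histogram", binwidth = 25) +
  xlab("Residuals")

# Normal probability (quantile-quantile) plot of the residuals.
qplot(sample = .resid, data = m1, stat = "qq")
```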
Note that the syntax for making a normal probability plot is a bit different than what you’re used to seeing: we set sample equal to the residuals instead of x, and we set a statistical method qq, which stands for “quantile-quantile”, another name commonly used for normal probability plots.
+
+
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
+
+
+
Constant variability:
+
+
Based on the residuals vs. fitted plot, does the constant variability condition appear to be met?
+
+
+
+
+
On your own
+
+
+
Choose another one of the seven traditional variables from mlb11 besides at_bats that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
+
How does this relationship compare to the relationship between runs and at_bats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
+
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
+
Now examine the three newer variables. These are the statistics used by the central character in Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
+
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
+
+
+
+
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
diff --git a/index.Rmd b/index.Rmd
new file mode 100644
index 0000000..bc81287
--- /dev/null
+++ b/index.Rmd
@@ -0,0 +1,17 @@
+---
+title: "Index"
+output: html_document
+---
+
+
+- [Intro to R](intro_to_r.html)
+- [Intro to Data](intro_to_data.html)
+- [Sampling Distributions](sampling_distributions.html)
+- [Confidence Intervals](confidence_intervals.html)
+- [Simple Linear Regression](simple_regression.html)
+- [Probability](probability.html)
+- [Inference for Categorical Data](inf_for_categorical_data.html)
+- [Inference for Numerical Data](inf_for_numerical_data.html)
+- [Normal Distribution](normal_distribution.html)
+- [Multiple Regression](multiple_regression.html)
+
diff --git a/inf_for_categorical_data/inf_for_categorical_data.Rmd b/inf_for_categorical_data/inf_for_categorical_data.Rmd
deleted file mode 100644
index 9aedfe8..0000000
--- a/inf_for_categorical_data/inf_for_categorical_data.Rmd
+++ /dev/null
@@ -1,254 +0,0 @@
----
-title: 'Inference for categorical data'
-output:
- html_document:
- css: ../lab.css
- highlight: pygments
- theme: cerulean
- pdf_document: default
----
-
-In August of 2012, news outlets ranging from the [Washington
-Post](http://www.washingtonpost.com/national/on-faith/poll-shows-atheism-on-the-rise-in-the-us/2012/08/13/90020fd6-e57d-11e1-9739-eef99c5fb285_story.html) to the [Huffington
-Post](http://www.huffingtonpost.com/2012/08/14/atheism-rise-religiosity-decline-in-america_n_1777031.html)
-ran a story about the rise of atheism in America. The source for the story was
-a poll that asked people, "Irrespective of whether you attend a place of
-worship or not, would you say you are a religious person, not a religious
-person or a convinced atheist?" This type of question, which asks people to
-classify themselves in one way or another, is common in polling and generates
-categorical data. In this lab we take a look at the atheism survey and explore
-what's at play when making inference about population proportions using
-categorical data.
-
-## The survey
-
-To access the press release for the poll, conducted by WIN-Gallup
-International, click on the following link:
-
-**
-
-Take a moment to review the report then address the following questions.
-
-1. In the first paragraph, several key findings are reported. Do these
- percentages appear to be *sample statistics* (derived from the data
- sample) or *population parameters*?
-
-2. The title of the report is "Global Index of Religiosity and Atheism". To
- generalize the report's findings to the global human population, what must
- we assume about the sampling method? Does that seem like a reasonable
- assumption?
-
-## The data
-
-Turn your attention to Table 6 (pages 15 and 16), which reports the
-sample size and response percentages for all 57 countries. While this is
-a useful format to summarize the data, we will base our analysis on the
-original data set of individual responses to the survey. Load this data
-set into R with the following command.
-
-```{r head-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
-load("atheism.RData")
-```
-
-3. What does each row of Table 6 correspond to? What does each row of
- `atheism` correspond to?
-
-To investigate the link between these two ways of organizing this data, take a
-look at the estimated proportion of atheists in the United States. Towards
-the bottom of Table 6, we see that this is 5%. We should be able to come to
-the same number using the `atheism` data.
-
-4. Using the command below, create a new dataframe called `us12` that contains
- only the rows in `atheism` associated with respondents to the 2012 survey
- from the United States. Next, calculate the proportion of atheist
- responses. Does it agree with the percentage in Table 6? If not, why?
-
-```{r us-atheism, eval=FALSE}
-us12 <- subset(atheism, nationality == "United States" & year == "2012")
-```
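
As a sketch of the follow-up calculation the exercise asks for (an editorial aside, not part of the original lab file), the proportion of atheist responses can be computed with something like:

```r
# Relative frequency of each response in the US 2012 subset;
# the "atheist" entry is the sample proportion of interest.
table(us12$response) / nrow(us12)
```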
-
-## Inference on proportions
-
-As was hinted at in Exercise 1, Table 6 provides *statistics*, that is,
-calculations made from the sample of 51,927 people. What we'd like, though, is
-insight into the population *parameters*. You answer the question, "What
-proportion of people in your sample reported being atheists?" with a
-statistic; while the question "What proportion of people on earth would report
-being atheists" is answered with an estimate of the parameter.
-
-The inferential tools for estimating population proportion are analogous to
-those used for means in the last chapter: the confidence interval and the
-hypothesis test.
-
-5. Write out the conditions for inference to construct a 95% confidence
- interval for the proportion of atheists in the United States in 2012.
- Are you confident all conditions are met?
-
-If the conditions for inference are reasonable, we can either calculate
-the standard error and construct the interval by hand, or allow the `inference`
-function to do it for us.
-
-```{r us-atheism-ci, eval=FALSE, tidy = FALSE}
-inference(us12$response, est = "proportion", type = "ci", method = "theoretical",
- success = "atheist")
-```
-
-Note that since the goal is to construct an interval estimate for a
-proportion, it's necessary to specify what constitutes a "success", which here
-is a response of `"atheist"`.
-
-Although formal confidence intervals and hypothesis tests don't show up in the
-report, suggestions of inference appear at the bottom of page 7: "In general,
-the error margin for surveys of this kind is $\pm$ 3-5% at 95% confidence".
-
-6. Based on the R output, what is the margin of error for the estimate of the
-   proportion of atheists in the US in 2012?
-
-7. Using the `inference` function, calculate confidence intervals for the
- proportion of atheists in 2012 in two other countries of your choice, and
- report the associated margins of error. Be sure to note whether the
- conditions for inference are met. It may be helpful to create new data
- sets for each of the two countries first, and then use these data sets in
- the `inference` function to construct the confidence intervals.
-
-## How does the proportion affect the margin of error?
-
-Imagine you've set out to survey 1000 people on two questions: are you female?
-and are you left-handed? Since both of these sample proportions were
-calculated from the same sample size, they should have the same margin of
-error, right? Wrong! While the margin of error does change with sample size,
-it is also affected by the proportion.
-
-Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This
-is then used in the formula for the margin of error for a 95% confidence
-interval: $ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n}$. Since the
-population proportion $p$ is in this $ME$ formula, it should make sense that
-the margin of error is in some way dependent on the population proportion. We
-can visualize this relationship by creating a plot of $ME$ vs. $p$.
-
-The first step is to make a vector `p` that is a sequence from 0 to 1 with
-each number separated by 0.01. We can then create a vector of the margin of
-error (`me`) associated with each of these values of `p` using the familiar
-approximate formula ($ME = 2 \times SE$). Lastly, we plot the two vectors
-against each other to reveal their relationship.
-
-```{r me-plot, eval=FALSE}
-n <- 1000
-p <- seq(0, 1, 0.01)
-me <- 2 * sqrt(p * (1 - p)/n)
-plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
-```
-
-8. Describe the relationship between `p` and `me`.
-
-## Success-failure condition
-
-The textbook emphasizes that you must always check conditions before making
-inference. For inference on proportions, the sample proportion can be assumed
-to be nearly normal if it is based upon a random sample of independent
-observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of
-thumb is easy enough to follow, but it makes one wonder: what's so special
-about the number 10?
-
-The short answer is: nothing. You could argue that we would be fine with 9 or
-that we really should be using 11. The "best" value for such a rule of
-thumb is, at least to some degree, arbitrary. However, when $np$ and $n(1-p)$
-reach 10, the sampling distribution is sufficiently normal to use confidence
-intervals and hypothesis tests that are based on that approximation.
-
-We can investigate the interplay between $n$ and $p$ and the shape of the
-sampling distribution by using simulations. To start off, we simulate the
-process of drawing 5000 samples of size 1040 from a population with a true
-atheist proportion of 0.1. For each of the 5000 samples we compute $\hat{p}$
-and then plot a histogram to visualize their distribution.
-
-```{r sim-np, eval=FALSE}
-p <- 0.1
-n <- 1040
-p_hats <- rep(0, 5000)
-
-for(i in 1:5000){
- samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
- p_hats[i] <- sum(samp == "atheist")/n
-}
-
-hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
-```
-
-These commands build up the sampling distribution of $\hat{p}$ using the
-familiar `for` loop. You can read the sampling procedure for the first line of
-code inside the `for` loop as, "take a sample of size $n$ with replacement
-from the choices of atheist and non-atheist with probabilities $p$ and $1 - p$,
-respectively." The second line in the loop says, "calculate the proportion of
-atheists in this sample and record this value." The loop allows us to repeat
-this process 5,000 times to build a good representation of the sampling
-distribution.
-
-9. Describe the sampling distribution of sample proportions at $n = 1040$ and
- $p = 0.1$. Be sure to note the center, spread, and shape.\
- *Hint:* Remember that R has functions such as `mean` to calculate summary
- statistics.
-
-10. Repeat the above simulation three more times but with modified sample
- sizes and proportions: for $n = 400$ and $p = 0.1$, $n = 1040$ and
- $p = 0.02$, and $n = 400$ and $p = 0.02$. Plot all four histograms
- together by running the `par(mfrow = c(2, 2))` command before creating the
- histograms. You may need to expand the plot window to accommodate the
- larger two-by-two plot. Describe the three new sampling distributions.
- Based on these limited plots, how does $n$ appear to affect the
- distribution of $\hat{p}$? How does $p$ affect the sampling distribution?
-
-Once you're done, you can reset the layout of the plotting window by using the
-command `par(mfrow = c(1, 1))` command or clicking on "Clear All" above the
-plotting window (if using RStudio). Note that the latter will get rid of all
-your previous plots.
-
-11. If you refer to Table 6, you'll find that Australia has a sample
-    proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample
-    proportion of 0.02 on 400 subjects. Let's suppose for this exercise that
-    these point estimates are actually the truth. Then given the shape of
-    their respective sampling distributions, do you think it is sensible to
-    proceed with inference and report margins of error, as the report does?
-
-* * *
-## On your own
-
-The question of atheism was asked by WIN-Gallup International in a similar
-survey that was conducted in 2005. (We assume here that sample sizes have
-remained the same.) Table 4 on page 13 of the report summarizes survey results
-from 2005 and 2012 for 39 countries.
-
-- Answer the following two questions using the `inference` function. As
- always, write out the hypotheses for any tests you conduct and outline the
- status of the conditions for inference.
-
- **a.** Is there convincing evidence that Spain has seen a change in its
- atheism index between 2005 and 2012?\
- *Hint:* Create a new data set for respondents from Spain. Form
-     confidence intervals for the true proportion of atheists in both
- years, and determine whether they overlap.
-
- **b.** Is there convincing evidence that the United States has seen a
- change in its atheism index between 2005 and 2012?
-
-- If in fact there has been no change in the atheism index in the countries
- listed in Table 4, in how many of those countries would you expect to
- detect a change (at a significance level of 0.05) simply by chance?\
- *Hint:* Look in the textbook index under Type 1 error.
-
-- Suppose you're hired by the local government to estimate the proportion of
- residents that attend a religious service on a weekly basis. According to
- the guidelines, the estimate must have a margin of error no greater than
- 1% with 95% confidence. You have no idea what to expect for $p$. How many
- people would you have to sample to ensure that you are within the
- guidelines?\
- *Hint:* Refer to your plot of the relationship between $p$ and margin of
- error. Do not use the data set to answer this question.
-
-
-
-This is a product of OpenIntro that is released under a [Creative Commons
-Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
-
\ No newline at end of file
diff --git a/inf_for_numerical_data/inf_for_numerical_data.Rmd b/inf_for_numerical_data/inf_for_numerical_data.Rmd
deleted file mode 100644
index 7eb6c82..0000000
--- a/inf_for_numerical_data/inf_for_numerical_data.Rmd
+++ /dev/null
@@ -1,156 +0,0 @@
----
-title: 'Inference for numerical data'
-output:
- html_document:
- css: ../lab.css
- highlight: pygments
- theme: cerulean
- pdf_document: default
----
-
-## North Carolina births
-
-In 2004, the state of North Carolina released a large data set containing
-information on births recorded in this state. This data set is useful to
-researchers studying the relation between habits and practices of expectant
-mothers and the birth of their children. We will work with a random sample of
-observations from this data set.
-
-## Exploratory analysis
-
-Load the `nc` data set into our workspace.
-
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
-load("nc.RData")
-```
-
-We have observations on 13 different variables, some categorical and some
-numerical. The meaning of each variable is as follows.
-
-variable | description
----------------- | -----------
-`fage` | father's age in years.
-`mage` | mother's age in years.
-`mature` | maturity status of mother.
-`weeks` | length of pregnancy in weeks.
-`premie` | whether the birth was classified as premature (premie) or full-term.
-`visits` | number of hospital visits during pregnancy.
-`marital` | whether mother is `married` or `not married` at birth.
-`gained` | weight gained by mother during pregnancy in pounds.
-`weight` | weight of the baby at birth in pounds.
-`lowbirthweight` | whether baby was classified as low birthweight (`low`) or not (`not low`).
-`gender` | gender of the baby, `female` or `male`.
-`habit` | status of the mother as a `nonsmoker` or a `smoker`.
-`whitemom` | whether mom is `white` or `not white`.
-
-1. What are the cases in this data set? How many cases are there in our sample?
-
-As a first step in the analysis, we should consider summaries of the data. This
-can be done using the `summary` command:
-
-```{r summary, eval=FALSE}
-summary(nc)
-```
-
-As you review the variable summaries, consider which variables are categorical
-and which are numerical. For numerical variables, are there outliers? If you
-aren't sure or want to take a closer look at the data, make a graph.
-
-Consider the possible relationship between a mother's smoking habit and the
-weight of her baby. Plotting the data is a useful first step because it helps
-us quickly visualize trends, identify strong associations, and develop research
-questions.
-
-2. Make a side-by-side boxplot of `habit` and `weight`. What does the plot
-highlight about the relationship between these two variables?
-
-The box plots show how the medians of the two distributions compare, but we can
-also compare the means of the distributions using the following function to
-split the `weight` variable into the `habit` groups, then take the mean of each
-using the `mean` function.
-
-```{r by-means, eval=FALSE}
-by(nc$weight, nc$habit, mean)
-```
-
-There is an observed difference, but is this difference statistically
-significant? In order to answer this question we will conduct a hypothesis
-test.
-
-## Inference
-
-3. Check if the conditions necessary for inference are satisfied. Note that
-you will need to obtain sample sizes to check the conditions. You can compute
-the group size using the same `by` command above but replacing `mean` with
-`length`.
-
-4. Write the hypotheses for testing if the average weights of babies born to
-smoking and non-smoking mothers are different.
-
-Next, we introduce a new function, `inference`, that we will use for conducting
-hypothesis tests and constructing confidence intervals.
-
-```{r inf-weight-habit-ht, eval=FALSE, tidy=FALSE}
-inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
- alternative = "twosided", method = "theoretical")
-```
-
-Let's pause for a moment to go through the arguments of this custom function.
-The first argument is `y`, which is the response variable that we are
-interested in: `nc$weight`. The second argument is the explanatory variable,
-`x`, which is the variable that splits the data into two groups, smokers and
-non-smokers: `nc$habit`. The third argument, `est`, is the parameter we're
-interested in: `"mean"` (other options are `"median"`, or `"proportion"`.) Next
-we decide on the `type` of inference we want: a hypothesis test (`"ht"`) or a
-confidence interval (`"ci"`). When performing a hypothesis test, we also need
-to supply the `null` value, which in this case is `0`, since the null
-hypothesis sets the two population means equal to each other. The `alternative`
-hypothesis can be `"less"`, `"greater"`, or `"twosided"`. Lastly, the `method`
-of inference can be `"theoretical"` or `"simulation"` based.
-
-5. Change the `type` argument to `"ci"` to construct and record a confidence
-interval for the difference between the weights of babies born to smoking and
-non-smoking mothers.
-
-By default the function reports an interval for ($\mu_{nonsmoker} - \mu_{smoker}$).
-We can easily change this order by using the `order` argument:
-
-```{r inf-weight-habit-ci, eval=FALSE, tidy=FALSE}
-inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
- alternative = "twosided", method = "theoretical",
- order = c("smoker","nonsmoker"))
-```
-
-* * *
-
-## On your own
-
-- Calculate a 95% confidence interval for the average length of pregnancies
-(`weeks`) and interpret it in context. Note that since you're doing inference
-on a single population parameter, there is no explanatory variable, so you can
-omit the `x` variable from the function.
-
-- Calculate a new confidence interval for the same parameter at the 90%
-confidence level. You can change the confidence level by adding a new argument
-to the function: `conflevel = 0.90`.
-
-- Conduct a hypothesis test evaluating whether the average weight gained by
-younger mothers is different than the average weight gained by mature mothers.
-
-- Now, a non-inference task: Determine the age cutoff for younger and mature
-mothers. Use a method of your choice, and explain how your method works.
-
-- Pick a pair of numerical and categorical variables and come up with a
-research question evaluating the relationship between these variables.
-Formulate the question in a way that it can be answered using a hypothesis test
-and/or a confidence interval. Answer your question using the `inference`
-function, report the statistical results, and also provide an explanation in
-plain language.
-
-
-This is a product of OpenIntro that is released under a [Creative Commons
-Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab
-written by the faculty and TAs of UCLA Statistics.
-
\ No newline at end of file
diff --git a/inf_for_numerical_data/inf_for_numerical_data.html b/inf_for_numerical_data/inf_for_numerical_data.html
deleted file mode 100644
index e30a1ca..0000000
--- a/inf_for_numerical_data/inf_for_numerical_data.html
+++ /dev/null
@@ -1,221 +0,0 @@
Inference for numerical data
-
-
-
-
-
North Carolina births
-
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.
-
variable | description
---------------- | -----------
fage | father’s age in years.
mage | mother’s age in years.
mature | maturity status of mother.
weeks | length of pregnancy in weeks.
premie | whether the birth was classified as premature (premie) or full-term.
visits | number of hospital visits during pregnancy.
marital | whether mother is married or not married at birth.
gained | weight gained by mother during pregnancy in pounds.
weight | weight of the baby at birth in pounds.
lowbirthweight | whether baby was classified as low birthweight (low) or not (not low).
gender | gender of the baby, female or male.
habit | status of the mother as a nonsmoker or a smoker.
whitemom | whether mom is white or not white.
-
What are the cases in this data set? How many cases are there in our sample?
-
-
As a first step in the analysis, we should consider summaries of the data. This can be done using the summary command:
-
summary(nc)
-
As you review the variable summaries, consider which variables are categorical and which are numerical. For numerical variables, are there outliers? If you aren’t sure or want to take a closer look at the data, make a graph.
-
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
-
-
Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
-
-
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following function to split the weight variable into the habit groups, then take the mean of each using the mean function.
-
by(nc$weight, nc$habit, mean)
-
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
-
-
-
Inference
-
-
Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
-
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
-
-
Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.
-
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
-
Let’s pause for a moment to go through the arguments of this custom function. The first argument is y, which is the response variable that we are interested in: nc$weight. The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, smokers and non-smokers: nc$habit. The third argument, est, is the parameter we’re interested in: "mean" (other options are "median", or "proportion".) Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci"). When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other. The alternative hypothesis can be "less", "greater", or "twosided". Lastly, the method of inference can be "theoretical" or "simulation" based.
-
-
Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
-
-
By default the function reports an interval for (\(\mu_{nonsmoker} - \mu_{smoker}\)). We can easily change this order by using the order argument:
-
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical",
          order = c("smoker", "nonsmoker"))
-
-
-
-
On your own
-
-
Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function.
-
Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conflevel = 0.90.
-
Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.
-
Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
-
Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language.
-
-
-This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
diff --git a/intro_to_data/intro_to_data.Rmd b/intro_to_data/intro_to_data.Rmd
deleted file mode 100644
index cc1937a..0000000
--- a/intro_to_data/intro_to_data.Rmd
+++ /dev/null
@@ -1,465 +0,0 @@
----
-title: "Introduction to data"
-output:
- html_document:
- theme: cerulean
- highlight: pygments
- css: ../lab.css
----
-
-Some define Statistics as the field that focuses on turning information into
-knowledge. The first step in that process is to summarize and describe the raw
-information - the data. In this lab, you will gain insight into public health
-by generating simple graphical and numerical summaries of a data set collected
-by the Centers for Disease Control and Prevention (CDC). As this is a large
-data set, along the way you'll also learn the indispensable skills of data
-processing and subsetting.
-
-
-## Getting started
-
-The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone
-survey of 350,000 people in the United States. As its name implies, the BRFSS
-is designed to identify risk factors in the adult population and report
-emerging health trends. For example, respondents are asked about their diet and
-weekly physical activity, their HIV/AIDS status, possible tobacco use, and even
-their level of healthcare coverage. The BRFSS Web site
-([http://www.cdc.gov/brfss](http://www.cdc.gov/brfss)) contains a complete
-description of the survey, including the research questions that motivate the
-study and many interesting results derived from the data.
-
-We will focus on a random sample of 20,000 people from the BRFSS survey
-conducted in 2000. While there are over 200 variables in this data set, we will
-work with a small subset.
-
-We begin by loading the data set of 20,000 observations into the R workspace.
-After launching RStudio, enter the following command.
-
-```{r load-data, eval=FALSE}
-source("http://www.openintro.org/stat/data/cdc.R")
-```
-
-The data set `cdc` that shows up in your workspace is a *data matrix*, with each
-row representing a *case* and each column representing a *variable*. R calls
-this data format a *data frame*, which is a term that will be used throughout
-the labs.
-
-To view the names of the variables, type the command
-
-```{r names, eval=FALSE}
-names(cdc)
-```
-
-This returns the names `genhlth`, `exerany`, `hlthplan`, `smoke100`, `height`,
-`weight`, `wtdesire`, `age`, and `gender`. Each one of these variables
-corresponds to a question that was asked in the survey. For example, for
-`genhlth`, respondents were asked to evaluate their general health, responding
-either excellent, very good, good, fair or poor. The `exerany` variable
-indicates whether the respondent exercised in the past month (1) or did not (0).
-Likewise, `hlthplan` indicates whether the respondent had some form of health
-coverage (1) or did not (0). The `smoke100` variable indicates whether the
-respondent had smoked at least 100 cigarettes in her lifetime. The other
-variables record the respondent's `height` in inches, `weight` in pounds as well
-as their desired weight, `wtdesire`, `age` in years, and `gender`.
-
-1. How many cases are there in this data set? How many variables? For each
- variable, identify its data type (e.g. categorical, discrete).
-
-We can have a look at the first few entries (rows) of our data with the command
-
-```{r head, eval=FALSE}
-head(cdc)
-```
-
-and similarly we can look at the last few by typing
-
-```{r tail, eval=FALSE}
-tail(cdc)
-```
-
-You could also look at *all* of the data frame at once by typing its name into
-the console, but that might be unwise here. We know `cdc` has 20,000 rows, so
-viewing the entire data set would mean flooding your screen. It's better to
-take small peeks at the data with `head`, `tail` or the subsetting techniques
-that you'll learn in a moment.
-
-## Summaries and tables
-
-The BRFSS questionnaire is a massive trove of information. A good first step in
-any analysis is to distill all of that information into a few summary statistics
-and graphics. As a simple example, the function `summary` returns a numerical
-summary: minimum, first quartile, median, mean, third quartile, and maximum.
-For `weight` this is
-
-```{r summary-weight, eval=FALSE}
-summary(cdc$weight)
-```
-
-R also functions like a very fancy calculator. If you wanted to compute the
-interquartile range for the respondents' weight, you would look at the output
-from the summary command above and then enter
-
-```{r weight-range-arith, eval=FALSE}
-190 - 140
-```
-
-R also has built-in functions to compute summary statistics one by one. For
-instance, to calculate the mean, median, and variance of `weight`, type
-
-```{r weight-mean-var-median, eval=FALSE}
-mean(cdc$weight)
-var(cdc$weight)
-median(cdc$weight)
-```
-
-While it makes sense to describe a quantitative variable like `weight` in terms
-of these statistics, what about categorical data? We would instead consider the
-sample frequency or relative frequency distribution. The function `table` does
-this for you by counting the number of times each kind of response was given.
-For example, to see the number of people who have smoked 100 cigarettes in their
-lifetime, type
-
-```{r table-smoke, eval=FALSE}
-table(cdc$smoke100)
-```
-
-or instead look at the relative frequency distribution by typing
-
-```{r table-smoke-prop, eval=FALSE}
-table(cdc$smoke100)/20000
-```
-
-Notice how R automatically divides all entries in the table by 20,000 in the
-command above. This is similar to something we observed in the Introduction to R;
-when we multiplied or divided a vector by a number, R applied that action across
-entries in the vector. As we see above, this also works for tables. Next, we
-make a bar plot of the entries in the table by putting the table inside the
-`barplot` command.
-
-```{r table-smoke-barplot, eval=FALSE}
-barplot(table(cdc$smoke100))
-```
-
-Notice what we've done here! We've computed the table of `cdc$smoke100` and then
-immediately applied the graphical function, `barplot`. This is an important
-idea: R commands can be nested. You could also break this into two steps by
-typing the following:
-
-```{r table-smoke-barplot-twosteps, eval=FALSE}
-smoke <- table(cdc$smoke100)
-
-barplot(smoke)
-```
-
-Here, we've made a new object, a table, called `smoke` (the contents of which we
-can see by typing `smoke` into the console) and then used it as the input for
-`barplot`. The special symbol `<-` performs an *assignment*, taking the output
-of one line of code and saving it into an object in your workspace. This is
-another important idea that we'll return to later.
-
-2. Create a numerical summary for `height` and `age`, and compute the
- interquartile range for each. Compute the relative frequency distribution for
- `gender` and `exerany`. How many males are in the sample? What proportion of
- the sample reports being in excellent health?
-
-The `table` command can be used to tabulate any number of variables that you
-provide. For example, to examine which participants have smoked across each
-gender, we could use the following.
-
-```{r table-smoke-gender, eval=FALSE}
-table(cdc$gender,cdc$smoke100)
-```
-
-Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has
-smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic
-plot of this table, we would enter the following command.
-
-```{r mosaic-smoke-gender, eval=FALSE}
-mosaicplot(table(cdc$gender,cdc$smoke100))
-```
-
-We could have accomplished this in two steps by saving the table in one line and
-applying `mosaicplot` in the next (see the table/barplot example above).
-
-3. What does the mosaic plot reveal about smoking habits and gender?
-
-## Interlude: How R thinks about data
-
-We mentioned that R stores data in data frames, which you might think of as a
-type of spreadsheet. Each row is a different observation (a different respondent)
-and each column is a different variable (the first is `genhlth`, the second
-`exerany` and so on). We can see the size of the data frame next to the object
-name in the workspace or we can type
-
-```{r dim, eval=FALSE}
-dim(cdc)
-```
-
-which will return the number of rows and columns. Now, if we want to access a
-subset of the full data frame, we can use row-and-column notation. For example,
-to see the sixth variable of the 567th respondent, use the format
-
-```{r cdc-row567-column6, eval=FALSE}
-cdc[567,6]
-```
-
-which means we want the element of our data set that is in the 567th
-row (meaning the 567th person or observation) and the 6th
-column (in this case, weight). We know that `weight` is the 6th variable
-because it is the 6th entry in the list of variable names
-
-```{r names-again, eval=FALSE}
-names(cdc)
-```
-
-To see the weights for the first 10 respondents we can type
-
-```{r first-10-rows-sixth-column, eval=FALSE}
-cdc[1:10,6]
-```
-
-In this expression, we have asked just for rows in the range 1 through 10. R
-uses the `:` to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6,
-7, 8, 9, 10. You can see this by entering
-
-```{r numbers-1to10, eval=FALSE}
-1:10
-```
-
-Finally, if we want all of the data for the first 10 respondents, type
-
-```{r first-10-rows, eval=FALSE}
-cdc[1:10,]
-```
-
-By leaving out an index or a range (we didn't type anything between the comma
-and the square bracket), we get all the columns. When starting out in R, this is
-a bit counterintuitive. As a rule, we omit the column number to see all columns
-in a data frame. Similarly, if we leave out an index or range for the rows, we
-would access all the observations, not just the 567th, or rows 1
-through 10. Try the following to see the weights for all 20,000 respondents fly
-by on your screen
-
-```{r 6th-column, eval=FALSE}
-cdc[,6]
-```
-
-Recall that column 6 represents respondents' weight, so the command above
-reported all of the weights in the data set. An alternative method to access the
-weight data is by referring to the name. Previously, we typed `names(cdc)` to
-see all the variables contained in the cdc data set. We can use any of the
-variable names to select items in our data set.
-
-```{r weight, eval=FALSE}
-cdc$weight
-```
-
-The dollar-sign tells R to look in data frame `cdc` for the column called
-`weight`. Since that's a single vector, we can subset it with just a single
-index inside square brackets. We see the weight for the 567th
-respondent by typing
-
-```{r weight-567, eval=FALSE}
-cdc$weight[567]
-```
-
-Similarly, for just the first 10 respondents
-
-```{r weight-first10, eval=FALSE}
-cdc$weight[1:10]
-```
-
-The command above returns the same result as the `cdc[1:10,6]` command. Both
-row-and-column notation and dollar-sign notation are widely used; which one you
-choose depends on your personal preference.
-
-## A little more on subsetting
-
-It's often useful to extract all individuals (cases) in a data set that have
-specific characteristics. We accomplish this through *conditioning* commands.
-First, consider expressions like
-
-```{r true-male, eval=FALSE}
-cdc$gender == "m"
-```
-
-or
-
-```{r true-over30, eval=FALSE}
-cdc$age > 30
-```
-
-These commands produce a series of `TRUE` and `FALSE` values. There is one
-value for each respondent, where `TRUE` indicates that the person was male (via
-the first command) or older than 30 (second command).
-
-Suppose we want to extract just the data for the men in the sample, or just for
-those over 30. We can use the R function `subset` to do that for us. For example,
-the command
-
-```{r males, eval=FALSE}
-mdata <- subset(cdc, cdc$gender == "m")
-```
-
-will create a new data set called `mdata` that contains only the men from the
-`cdc` data set. In addition to finding it in your workspace alongside its
-dimensions, you can take a peek at the first several rows as usual
-
-```{r head-males, eval=FALSE}
-head(mdata)
-```
-
-This new data set contains all the same variables but just under half the rows.
-It is also possible to tell R to keep only specific variables, which is a topic
-we'll discuss in a future lab. For now, the important thing is that we can carve
-up the data based on values of one or more variables.
-
-As an aside, you can use several of these conditions together with `&` and `|`.
-The `&` is read "and" so that
-
-```{r males-and-over30, eval=FALSE}
-m_and_over30 <- subset(cdc, gender == "m" & age > 30)
-```
-
-will give you the data for men over the age of 30. The `|` character is read
-"or" so that
-
-```{r males-or-over30, eval=FALSE}
-m_or_over30 <- subset(cdc, gender == "m" | age > 30)
-```
-
-will take people who are men or over the age of 30 (why that's an interesting
-group is hard to say, but right now the mechanics of this are the important
-thing). In principle, you may use as many "and" and "or" clauses as you like
-when forming a subset.
-
-4. Create a new object called `under23_and_smoke` that contains all observations
- of respondents under the age of 23 that have smoked 100 cigarettes in their
- lifetime. Write the command you used to create the new object as the answer
- to this exercise.
-
-## Quantitative data
-
-With our subsetting tools in hand, we'll now return to the task of the day:
-making basic summaries of the BRFSS questionnaire. We've already looked at
-categorical data such as `smoke` and `gender` so now let's turn our attention to
-quantitative data. Two common ways to visualize quantitative data are with box
-plots and histograms. We can construct a box plot for a single variable with
-the following command.
-
-```{r boxplot-height, eval=FALSE}
-boxplot(cdc$height)
-```
-
-You can compare the locations of the components of the box by examining the
-summary statistics.
-
-```{r summary-height, eval=FALSE}
-summary(cdc$height)
-```
-
-Confirm that the median and upper and lower quartiles reported in the numerical
-summary match those in the graph. The purpose of a boxplot is to provide a
-thumbnail sketch of a variable, which is useful for comparing across several
-categories. So we can, for example, compare the heights of men and women with
-
-```{r boxplot-height-gender, eval=FALSE}
-boxplot(cdc$height ~ cdc$gender)
-```
-
-The notation here is new. The `~` character can be read *versus* or
-*as a function of*. So we're asking R to give us box plots of heights where
-the groups are defined by gender.
-
-Next let's consider a new variable that doesn't show up directly in this data
-set: Body Mass Index (BMI)
-([http://en.wikipedia.org/wiki/Body_mass_index](http://en.wikipedia.org/wiki/Body_mass_index)).
-BMI is a weight to height ratio and can be calculated as:
-
-\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]
-
-703 is the approximate conversion factor to change units from metric (meters and
-kilograms) to imperial (inches and pounds).
-
-The following two lines first make a new object called `bmi` and then create
-box plots of these values, defining groups by the variable `cdc$genhlth`.
-
-```{r boxplot-bmi, eval=FALSE}
-bmi <- (cdc$weight / cdc$height^2) * 703
-boxplot(bmi ~ cdc$genhlth)
-```
-
-Notice that the first line above is just some arithmetic, but it's applied to
-all 20,000 numbers in the `cdc` data set. That is, for each of the 20,000
-participants, we take their weight, divide by their height-squared and then
-multiply by 703. The result is 20,000 BMI values, one for each respondent. This
-is one reason why we like R: it lets us perform computations like this using
-very simple expressions.
-
-5. What does this box plot show? Pick another categorical variable from the
- data set and see how it relates to BMI. List the variable you chose, why you
- might think it would have a relationship to BMI, and indicate what the
- figure seems to suggest.
-
-Finally, let's make some histograms. We can look at the histogram for the age of
-our respondents with the command
-
-```{r hist-age, eval=FALSE}
-hist(cdc$age)
-```
-
-Histograms are generally a very good way to see the shape of a single
-distribution, but that shape can change depending on how the data is split
-between the different bins. You can control the number of bins by adding an
-argument to the command. In the next two lines, we first make a default
-histogram of `bmi` and then one with 50 breaks.
-
-```{r hist-bmi, eval=FALSE}
-hist(bmi)
-hist(bmi, breaks = 50)
-```
-
-Note that you can flip between plots that you've created by clicking the forward
-and backward arrows in the lower right region of RStudio, just above the plots.
-How do these two histograms compare?
-
-At this point, we've done a good first pass at analyzing the information in the
-BRFSS questionnaire. We've found an interesting association between smoking and
-gender, and we can say something about the relationship between people's
-assessment of their general health and their own BMI. We've also picked up
-essential computing tools -- summary statistics, subsetting, and plots -- that
-will serve us well throughout this course.
-
-* * *
-
-## On Your Own
-
-- Make a scatterplot of weight versus desired weight. Describe the
- relationship between these two variables.
-
-- Let's consider a new variable: the difference between desired weight
- (`wtdesire`) and current weight (`weight`). Create this new variable by
- subtracting the two columns in the data frame and assigning them to a new
- object called `wdiff`.
-
-- What type of data is `wdiff`? If `wdiff` is 0 for an observation, what does
-  this mean about the person's weight and desired weight? What if `wdiff` is
- positive or negative?
-
-- Describe the distribution of `wdiff` in terms of its center, shape, and
- spread, including any plots you use. What does this tell us about how people
- feel about their current weight?
-
-- Using numerical summaries and a side-by-side box plot, determine if men tend
- to view their weight differently than women.
-
-- Now it's time to get creative. Find the mean and standard deviation of
- `weight` and determine what proportion of the weights are within one
- standard deviation of the mean.
-
-
-This is a product of OpenIntro that is released under a
-[Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel
-from a lab written by Mark Hansen of UCLA Statistics.
-
\ No newline at end of file
diff --git a/intro_to_data/intro_to_data.html b/intro_to_data/intro_to_data.html
deleted file mode 100644
index bba2a4f..0000000
--- a/intro_to_data/intro_to_data.html
+++ /dev/null
@@ -1,243 +0,0 @@
Introduction to data
-
-
-
-
Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
-
-
Getting started
-
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
-
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
-
We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.
The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
-
To view the names of the variables, type the command
-
names(cdc)
-
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.
-
-
How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
-
-
We can have a look at the first few entries (rows) of our data with the command
-
head(cdc)
-
and similarly we can look at the last few by typing
-
tail(cdc)
-
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
-
-
-
Summaries and tables
-
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is
-
summary(cdc$weight)
-
R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter
-
190 - 140
-
R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type
-
mean(cdc$weight)
var(cdc$weight)
median(cdc$weight)
-
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
-
table(cdc$smoke100)
-
or instead look at the relative frequency distribution by typing
-
table(cdc$smoke100)/20000
-
Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.
-
barplot(table(cdc$smoke100))
-
Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
-
smoke <- table(cdc$smoke100)
-
barplot(smoke)
-
Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.
-
-
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
-
-
The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.
-
table(cdc$gender,cdc$smoke100)
-
Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.
-
mosaicplot(table(cdc$gender,cdc$smoke100))
-
We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).
-
-
What does the mosaic plot reveal about smoking habits and gender?
-
-
-
-
Interlude: How R thinks about data
-
We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type
-
dim(cdc)
-
which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format
-
cdc[567,6]
-
which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names
-
names(cdc)
-
To see the weights for the first 10 respondents we can type
-
cdc[1:10,6]
-
In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering
-
1:10
-
Finally, if we want all of the data for the first 10 respondents, type
-
cdc[1:10,]
-
By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen
-
cdc[,6]
-
Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.
-
cdc$weight
-
The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing
-
cdc$weight[567]
-
Similarly, for just the first 10 respondents
-
cdc$weight[1:10]
-
The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.
-
-
-
A little more on subsetting
-
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like
-
cdc$gender == "m"
-
or
-
cdc$age > 30
-
These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).
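-
If you want to peek at just a few of these values rather than print all 20,000 of them, you can wrap the expression in head(), the same function used below to preview data frames:
-
head(cdc$gender == "m", 10)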
-
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command
-
mdata <-subset(cdc, cdc$gender == "m")
-
will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
-
head(mdata)
-
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.
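-
As a small preview of that future topic, the subset function also takes a select argument for keeping only certain columns; the columns chosen here are arbitrary, just to show the idea:
-
head(subset(cdc, gender == "m", select = c(height, weight, age)))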
-
As an aside, you can use several of these conditions together with & and |. The & is read “and” so that
-
m_and_over30 <- subset(cdc, gender == "m" & age > 30)
-
will give you the data for men over the age of 30. The | character is read “or” so that
-
m_or_over30 <- subset(cdc, gender == "m" | age > 30)
-
will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
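-
For instance, chaining three conditions (an arbitrary combination, and an arbitrary object name, just to show the syntax):
-
f_over30_smoke <- subset(cdc, gender == "f" & age > 30 & smoke100 == 1)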
-
-
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
-
-
-
-
Quantitative data
-
With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.
-
boxplot(cdc$height)
-
You can compare the locations of the components of the box by examining the summary statistics.
-
summary(cdc$height)
-
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable, which makes it a handy tool for comparing across several categories. So we can, for example, compare the heights of men and women with
-
boxplot(cdc$height ~ cdc$gender)
-
The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.
-
Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight-to-height ratio and can be calculated as:
-
BMI = (weight (lb) / height (in)^2) * 703
-
The following two lines first make a new object called bmi and then create box plots of these values, defining groups by the variable cdc$genhlth.
-
bmi <- (cdc$weight / cdc$height^2) * 703
-boxplot(bmi ~ cdc$genhlth)
-
Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
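-
A quick way to confirm that bmi really does contain one value per respondent is to check its length and a numerical summary:
-
length(bmi)
-summary(bmi)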
-
-
What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
-
-
Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the command
-
hist(cdc$age)
-
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.
-
hist(bmi)
-hist(bmi, breaks = 50)
-
Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the lower right region of RStudio, just above the plots. How do these two histograms compare?
-
At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.
-
-
-
-
On Your Own
-
-
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
-
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
-
What type of data is wdiff? If an observation’s wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
-
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
-
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
-
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
-
-
-This is a product of OpenIntro that is released under a
-[Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel
-from a lab written by Mark Hansen of UCLA Statistics.
-
\ No newline at end of file
diff --git a/intro_to_r/intro_to_r.html b/intro_to_r/intro_to_r.html
deleted file mode 100644
index 493f124..0000000
--- a/intro_to_r/intro_to_r.html
+++ /dev/null
@@ -1,187 +0,0 @@
Introduction to R and RStudio
-
-
-
-
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
-
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
-
-
The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.
-
The panel on the left is where the action happens. It’s called the console. Everytime you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
-
To get you started, enter the following command at the R prompt (i.e. right after > on the console). You can either type it in manually or copy and paste it from this document.
-
source("http://www.openintro.org/stat/data/arbuthnot.R")
-
This command instructs R to access the OpenIntro website and fetch some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed. Note that because you are accessing data from the web, this command (and the entire assignment) will work in a computer lab, in the library, or in your dorm room; anywhere you have access to the Internet.
-
-
The Data: Dr. Arbuthnot’s Baptism Records
-
The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.
-
arbuthnot
-
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.
-
Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame.
-
You can see the dimensions of this data frame by typing:
-
dim(arbuthnot)
-
## [1] 82 3
-
This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we’ll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:
-
names(arbuthnot)
-
## [1] "year" "boys" "girls"
-
You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
-
One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.
-
-
-
Some Exploration
-
Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like
-
arbuthnot$boys
-
This command will only show the number of boys baptized each year.
-
-
What command would you use to extract just the counts of girls baptized? Try it!
-
-
Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
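-
Those bracketed positions also show how you can pull out particular entries yourself; for example, the first entry or the first five entries of the vector:
-
arbuthnot$boys[1]
-arbuthnot$boys[1:5]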
-
R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command
-
plot(x = arbuthnot$year, y = arbuthnot$girls)
-
By default, R creates a scatterplot with each x,y pair indicated by an open circle. The plot itself should appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument in the plot function specifies the variable for the x-axis and the second for the y-axis. If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.
-
plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")
-
You might wonder how you are supposed to know that it was possible to add that third argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following.
-
?plot
-
Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
-
-
Is there an apparent trend in the number of girls baptized over the years? How would you describe it?
-
-
Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like
-
5218 + 4683
-
to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys and girls, R will compute all sums simultaneously.
-
arbuthnot$boys + arbuthnot$girls
-
What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right. Therefore, we can make a plot of the total number of baptisms per year with the command
-
plot(arbuthnot$year, arbuthnot$boys + arbuthnot$girls, type = "l")
-
This time, note that we left out the names of the first two arguments. We can do this because the help file shows that the default for plot is for the first argument to be the x-variable and the second argument to be the y-variable.
-
We can also compute the ratio of the number of boys to the number of girls baptized in 1629 with
-
5218 / 4683
-
or we can act on the complete vectors with the expression
-
arbuthnot$boys / arbuthnot$girls
-
The proportion of newborns that are boys
-
5218 / (5218 + 4683)
-
or this may also be computed for all years simultaneously:
-
arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
-
Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
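-
To see the difference, compare the two expressions below; only the second one is a proportion.
-
5218 / 5218 + 4683
-5218 / (5218 + 4683)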
-
-
Now, make a plot of the proportion of boys over time. What do you see? Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
-
-
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression
-
arbuthnot$boys > arbuthnot$girls
-
This command returns 82 values of either TRUE if that year had more boys than girls, or FALSE if that year did not (the answer may surprise you). This output shows a different kind of data than we have considered so far. In the arbuthnot data frame our values are numerical (the year, the number of boys and girls). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
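-
As an aside, R treats TRUE as 1 and FALSE as 0, so you can count how many of the 82 years had more boys than girls by summing the logical values:
-
sum(arbuthnot$boys > arbuthnot$girls)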
-
This seems like a fair bit for your first lab, so let’s stop here. To exit RStudio you can click the x in the upper right corner of the whole window. You will be prompted to save your workspace. If you click save, RStudio will save the history of your commands and all the objects in your workspace so that the next time you launch RStudio, you will see arbuthnot and you will have access to the commands you typed in your previous session. For now, click save, then start up RStudio again.
-
-
-
-
On Your Own
-
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. Load up the present day data with the following command.
-
source("http://www.openintro.org/stat/data/present.R")
-
The data are stored in a data frame called present.
-
-
What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
-
How do these counts compare to Arbuthnot’s? Are they on a similar scale?
-
Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.
These data come from a report by the Centers for Disease Control http://www.cdc.gov/nchs/data/nvsr/nvsr53/nvsr53_20.pdf. Check it out if you would like to read more about an analysis of sex ratios at birth in the United States.
-
That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.
-
-This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
-
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors who are viewed to be better looking receive higher instructional ratings. (Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, August 2005, Pages 369-376, ISSN 0272-7757, 10.1016/j.econedurev.2004.07.013. http://www.sciencedirect.com/science/article/pii/S0272775704001165.)
-
In this lab we will analyze the data from this study in order to learn what goes into a positive professor evaluation.
-
-
-
The data
-
The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance. (This is a slightly modified version of the original data set that was released as part of the replication data for Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2007).) The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.
score: average professor evaluation score: (1) very unsatisfactory - (5) excellent.
rank: rank of professor: teaching, tenure track, tenured.
ethnicity: ethnicity of professor: not minority, minority.
gender: gender of professor: female, male.
language: language of school where professor received education: english or non-english.
age: age of professor.
cls_perc_eval: percent of students in class who completed evaluation.
cls_did_eval: number of students in class who completed evaluation.
cls_students: total number of students in class.
cls_level: class level: lower, upper.
cls_profs: number of professors teaching sections in course in sample: single, multiple.
cls_credits: number of credits of class: one credit (lab, PE, etc.), multi credit.
bty_f1lower: beauty rating of professor from lower level female: (1) lowest - (10) highest.
bty_f1upper: beauty rating of professor from upper level female: (1) lowest - (10) highest.
bty_f2upper: beauty rating of professor from second upper level female: (1) lowest - (10) highest.
bty_m1lower: beauty rating of professor from lower level male: (1) lowest - (10) highest.
bty_m1upper: beauty rating of professor from upper level male: (1) lowest - (10) highest.
bty_m2upper: beauty rating of professor from second upper level male: (1) lowest - (10) highest.
bty_avg: average beauty rating of professor.
pic_outfit: outfit of professor in picture: not formal, formal.
pic_color: color of professor’s picture: color, black & white.
-
-
-
-
-
-
Exploring the data
-
-
Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
-
Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
-
Excluding score, select two other variables and describe their relationship using an appropriate visualization (scatterplot, side-by-side boxplots, or mosaic plot).
-
-
-
-
Simple linear regression
-
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:
-
plot(evals$score ~ evals$bty_avg)
-
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
-
-
Replot the scatterplot, but this time use the function jitter() on the \(y\)- or the \(x\)-coordinate. (Use ?jitter to learn more.) What was misleading about the initial scatterplot?
-
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating and add the line to your plot using abline(m_bty). Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
-
Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
-
-
-
-
Multiple linear regression
-
The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.
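-
For instance, we could plot the average score against one of the individual ratings, say bty_f1lower (any of the six individual scores would do for this illustration):
-
plot(evals$bty_avg ~ evals$bty_f1lower)
-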
As expected, the relationship is quite strong; after all, the average score is calculated using the individual scores. We can actually take a look at the relationships between all beauty variables (columns 13 through 19) using the following command:
-
plot(evals[,13:19])
-
These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
-
In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.
-
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
-summary(m_bty_gen)
-
-
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.
-
Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?
-
-
Note that the estimate for gender is now called gendermale. You’ll see this name change whenever you introduce a categorical variable. The reason is that R recodes gender from having the values of female and male to being an indicator variable called gendermale that takes a value of \(0\) for females and a value of \(1\) for males. (Such variables are often referred to as “dummy” variables.)
-
As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression:
-
\(\widehat{score} = \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) = \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\)
-
We can plot this line and the line corresponding to males with the following custom function.
-
multiLines(m_bty_gen)
-
-
What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?
-
-
The decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning. R simply codes the category that comes first alphabetically as a \(0\). (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the relevel function. Use ?relevel to learn more.)
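-
For example, assuming gender is stored as a factor (which it must be for R to have created the gendermale indicator), you could make male the reference level with:
-
evals$gender <- relevel(evals$gender, ref = "male")
-
Refitting the model after this change would report a genderfemale coefficient instead.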
-
-
Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.
-
-
The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for bty_avg reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher while holding all other variables constant. In this case, that translates into considering only professors of the same rank with bty_avg scores that are one point apart.
-
-
-
The search for the best model
-
We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.
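-
A sketch of that full model, assuming m_full as the name for the fitted object and using the variable names from the codebook above:
-
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg + pic_outfit + pic_color, data = evals)
-summary(m_full)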
-
-
Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable you would expect to have no association with the professor score.
-
Check your suspicions from the previous exercise. Include the model output in your response.
-
Interpret the coefficient associated with the ethnicity variable.
-
Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
-
Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.
-
Verify that the conditions for this model are reasonable using diagnostic plots.
-
The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
-
Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.
-
Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?
In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
-
-
The Data
-
This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
Let’s take a quick peek at the first few rows of the data.
-
head(bdims)
-
You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. A key to the variable names can be found at http://www.openintro.org/stat/data/bdims.php, but we’ll be focusing on just three columns to get started: weight in kg (wgt), height in cm (hgt), and sex (1 indicates male, 0 indicates female).
-
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
-
mdims <- subset(bdims, sex == 1)
-fdims <- subset(bdims, sex == 0)
-
-
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
-
-
-
-
The normal distribution
-
In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.
-
To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
-
fheight <- fdims$hgt
-fhgtmean <- mean(fheight)
-fhgtsd <- sd(fheight)
-
Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.
-
hist(fdims$hgt, probability = TRUE)
-x <- 140:190
-y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
-lines(x = x, y = y, col = "blue")
-
After plotting the density histogram with the first command, we create the x- and y-coordinates for the normal curve. We chose the x range as 140 to 190 in order to span the entire range of fheight. To create y, we use dnorm to calculate the density of each of those x-values in a distribution that is normal with mean fhgtmean and standard deviation fhgtsd. The final command draws a curve on the existing plot (the density histogram) by connecting each of the points specified by x and y. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.
-
The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: ylim = c(0, 0.06).
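-
For example, to redraw the histogram with more vertical room and overlay the curve again (reusing the x and y objects created above):
-
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06))
-lines(x = x, y = y, col = "blue")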
-
-
Based on this plot, does it appear that the data follow a nearly normal distribution?
-
-
-
-
Evaluating the normal distribution
-
Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.
-
qqnorm(fdims$hgt)
-qqline(fdims$hgt)
-
A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line. The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. We’re left with the same problem that we encountered with the histogram above: how close is close enough?
-
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.
-
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
-
The first argument indicates how many numbers you’d like to generate, which we specify to be the same number of heights in the fdims data set using the length function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, sim_norm, as well as its normal probability plot.
-
-
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
-
-
Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It may be helpful to click the zoom button in the plot window.
-
qqnormsim(fdims$hgt)
-
-
Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?
-
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
-
-
-
-
Normal probabilities
-
Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should we care?
-
It turns out that statisticians know a lot about the normal distribution. Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” (The study that published this data set is clear to point out that the sample was not random and therefore inference to a general population is not suggested. We do so here only as an exercise.)
-
If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm.
-
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
-
Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
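-
Equivalently, pnorm can return the upper tail directly through its lower.tail argument, which saves the subtraction:
-
pnorm(q = 182, mean = fhgtmean, sd = fhgtsd, lower.tail = FALSE)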
-
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
-
sum(fdims$hgt > 182) / length(fdims$hgt)
-
Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
-
-
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
-
-
-
-
-
On Your Own
-
-
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
-
a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.
-
b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.
-
c. The histogram for general age (age) belongs to normal probability plot letter ____.
-
d. The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.
-
Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?
-
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
-
-
-
-This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
diff --git a/oiLabs-mosaic.Rproj b/oiLabs-mosaic.Rproj
new file mode 100644
index 0000000..8e3c2eb
--- /dev/null
+++ b/oiLabs-mosaic.Rproj
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
diff --git a/oldLatex/lab0/labintro.Rnw b/oldLatex/lab0/labintro.Rnw
index 7aca129..3a89de9 100644
--- a/oldLatex/lab0/labintro.Rnw
+++ b/oldLatex/lab0/labintro.Rnw
@@ -27,7 +27,6 @@ To get you started, enter the following command at the R prompt (i.e. right afte
source("http://www.openintro.org/stat/data/arbuthnot.R")
@
-This command instructs R to access the OpenIntro website and fetch some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called \hlstd{arbuthnot} that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed. Note that because you are accessing data from the web, this command (and the entire assignment) will work in a computer lab, in the library, or in your dorm room; anywhere you have access to the Internet.
\subsection*{The Data: Dr. Arbuthnot's Baptism Records}
@@ -47,15 +46,12 @@ You can see the dimensions of this data frame by typing:
dim(arbuthnot)
@
-This command should output \hlstd{[1] 82 3}, indicating that there are 82 rows and 3 columns (we'll get to what the \hlstd{[1]} means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:
<>=
names(arbuthnot)
@
-You should see that the data frame contains the columns \hlstd{year}, \hlstd{boys}, and \hlstd{girls}. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The \hlkwd{dim} and \hlkwd{names} commands, for example, each took a single argument, the name of a data frame.
-One advantage of RStudio is that it comes with a built-in data viewer. Click on the name \hlstd{arbuthnot} in the upper right window that lists the objects in your workspace. This will bring up an alternative display of the Arbuthnot counts in the upper left window. You can close the data viewer by clicking on the ``x'' in the upper lefthand corner.
\subsection*{Some Exploration}
Let's start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like
@@ -70,30 +66,48 @@ This command will only show the number of boys baptized each year.
What command would you use to extract just the counts of girls baptized? Try it!
\end{exercise}
-Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called \emph{vectors}; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, \hlstd{5218} follows \hlstd{[1]}, indicating that \hlstd{5218} is the first entry in the vector. And if \hlstd{[43]} starts a line, then that would mean the first number on that line would represent the 43$^{\textrm{rd}}$ entry in the vector.
-R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command
+\paragraph{mosaic}
+
+There is an additional package for R called \textbf{mosaic} that streamlines all of the commands that you will need in this course. This package is specifically designed by a team of NSF-funded educators to make R more accessible to introductory statistics students like you. The \textbf{mosaic} package doesn't provide new functionality so much as it makes existing functionality more logical and consistent, all the while emphasizing important concepts in statistics.
+
+To use the package, you will first need to install it.
+
+<>=
+install.packages("mosaic")
+@
+
+Note that you will only have to do this \emph{once}. However, once the package is installed, you will have to load it into the current workspace before it can be used.
+
+<>=
+require(mosaic)
+@
+
+Note that you will have to do this \emph{every} time you start a new R session.
+
+R has some powerful functions for making graphics.
+
+The centerpiece of the \textbf{mosaic} syntax is the use of the \emph{modeling language}. This involves the use of a tilde (~), which can be read as ``is a function of". For example, to plot the number of girls as a function of the year:
<>=
-plot(x = arbuthnot$year, y = arbuthnot$girls)
+# plot(x = arbuthnot$year, y = arbuthnot$girls) # basic R plot command
+xyplot(girls ~ year, data=arbuthnot) # mosaic-style syntax
@
-By default, R creates a scatterplot with each x,y pair indicated by an open circle. The plot itself should appear under the ``Plots'' tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument in the plot function specifies the variable for the x-axis and the second for the y-axis. If we wanted to connect the data points with lines, we could add a third argument, the letter ``l'' for \underline{l}ine.
<>=
-plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")
+xyplot(girls ~ year, data=arbuthnot, type="l")
@
You might wonder how you are supposed to know that it was possible to add that third argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you're interested in. Try the following.
<>=
-?plot
+?xyplot
@
Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
\begin{exercise}
-Is there an apparent trend in the number of girls baptized over the years? How would you describe it?
\end{exercise}
Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like
@@ -108,13 +122,18 @@ to see the total number of baptisms in 1629. We could repeat this once for each
arbuthnot$boys + arbuthnot$girls
@
+You can also use \texttt{with()} to avoid repeatedly typing the name of the data frame. \texttt{with()} instructs R to interpret everything else from within the data frame that you specify.
+
+<>=
+with(arbuthnot, boys + girls)
+@
+
What you will see are 82 numbers (in that packed display, because we aren't looking at a data frame here), each one representing the sum we're after. Take a look at a few of them and verify that they are right. Therefore, we can make a plot of the total number of baptisms per year with the command
<>=
-plot(arbuthnot$year, arbuthnot$boys + arbuthnot$girls, type = "l")
+xyplot((boys + girls) ~ year, data=arbuthnot, type="l")
@
-This time, note that we left out the names of the first two arguments. We can do this because the help file shows that the default for \hlkwd{plot} is for the first argument to be the x-variable and the second argument to be the y-variable.
Similarly to how we computed the proportion of boys, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with
@@ -125,7 +144,8 @@ Similarly to how we computed the proportion of boys, we can compute the ratio of
or we can act on the complete vectors with the expression
<>=
-arbuthnot$boys / arbuthnot$girls
+# arbuthnot$boys / arbuthnot$girls
+with(arbuthnot, boys / girls)
@
The proportion of newborns that are boys
@@ -137,7 +157,8 @@ The proportion of newborns that are boys
or this may also be computed for all years simultaneously:
<>=
-arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
+# arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
+with(arbuthnot, boys / (boys + girls))
@
Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
@@ -149,12 +170,11 @@ Now, make a plot of the proportion of boys over time. What do you see? Tip: If y
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, $>$, less than, $<$, and equality, $==$. For example, we can ask if boys outnumber girls in each year with the expression
<>=
-arbuthnot$boys > arbuthnot$girls
+# arbuthnot$boys > arbuthnot$girls
+with(arbuthnot, boys > girls)
@
-This command returns 82 values of either \hlnum{TRUE} if that year had more boys than girls, or \hlnum{FALSE} if that year did not (the answer may surprise you). This output shows a different kind of data than we have considered so far. In the \hlstd{arbuthnot} data frame our values are numerical (the year, the number of boys and girls). Here, we've asked R to create \emph{logical} data, data where the values are either \hlnum{TRUE} or \hlnum{FALSE}. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
-This seems like a fair bit for your first lab, so let's stop here. To exit RStudio you can click the ``x'' in the upper right corner of the whole window. You will be prompted to save your workspace. If you click ``save'', RStudio will save the history of your commands and all the objects in your workspace so that the next time you launch RStudio, you will see \hlstd{arbuthnot} and you will have access to the commands you typed in your previous session. For now, click ``save'', then start up RStudio again.
\vspace{2cm}
@@ -165,11 +185,10 @@ In the previous few pages, you recreated some of the displays and preliminary an
<>=
source("http://www.openintro.org/stat/data/present.R")
@
-The data are stored in a data frame called \hlstd{present}.
\begin{enumerate}
\item What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
-\item How do these counts compare to Arbuthnot's? Are they on a similar scale?
+\item How do the counts of boys and girls in the present day birth records compare to Arbuthnot's? Are they on a similar scale?
\item Does Arbuthnot's observation about boys being born in greater proportion than girls hold up in the U.S.?
\item Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see?
\item In what year did we see the most total number of births in the U.S.? You can refer to the help files or the R reference card (\web{http://cran.r-project.org/doc/contrib/Short-refcard.pdf}) to find helpful commands. \\
diff --git a/oldLatex/lab1/lab1.Rnw b/oldLatex/lab1/lab1.Rnw
index 1fe8622..a44ea47 100644
--- a/oldLatex/lab1/lab1.Rnw
+++ b/oldLatex/lab1/lab1.Rnw
@@ -72,30 +72,35 @@ R also functions like a very fancy calculator. If you wanted to compute the int
R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of \hlstd{weight}, type
-<>=
-mean(cdc$weight)
-
-var(cdc$weight)
-
-median(cdc$weight)
+<>=
+# mean(cdc$weight)
+require(mosaic)
+mean(~weight, data=cdc)
+# var(cdc$weight)
+var(~weight, data=cdc)
+# median(cdc$weight)
+median(~weight, data=cdc)
@
While it makes sense to describe a quantitative variable like \hlstd{weight} in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function \hlkwd{table} does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
<>=
-table(cdc$smoke100)
+# table(cdc$smoke100)
+tally(~smoke100, data=cdc)
@
or instead look at the relative frequency distribution by typing
<>=
-table(cdc$smoke100)/20000
+# table(cdc$smoke100)/20000
+tally(~smoke100, data=cdc, format="proportion")
@
Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the last lab; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.
<>=
-barplot(table(cdc$smoke100))
+# barplot(table(cdc$smoke100))
+barchart(tally(~smoke100, data=cdc, margins=FALSE), horizontal=FALSE)
@
Notice what we've done here! We've computed the table of \hlstd{cdc\$smoke100} and then immediately applied the graphical function, \hlkwd{barplot}. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
@@ -115,13 +120,15 @@ Create a numerical summary for \hlstd{height} and \hlstd{age}, and compute the i
The \hlkwd{table} command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.
<>=
-table(cdc$gender,cdc$smoke100)
+# table(cdc$gender,cdc$smoke100)
+tally(gender ~ smoke100, data=cdc, format="count")
@
Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.
<>=
-mosaicplot(table(cdc$gender,cdc$smoke100))
+# mosaicplot(table(cdc$gender,cdc$smoke100))
+mosaicplot(tally(gender ~ smoke100, data=cdc, margins=FALSE))
@
We could have accomplished this in two steps by saving the table in one line and applying \hlkwd{mosaicplot} in the next (see the table/barplot example above).
@@ -211,7 +218,7 @@ These commands produce a series of \hlnum{TRUE} and \hlnum{FALSE} values. There
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function \hlkwd{subset} to do that for us. For example, the command
<>=
-mdata <- subset(cdc, cdc$gender == "m")
+mdata <- subset(cdc, gender == "m")
@
will create a new data set called \hlstd{mdata} that contains only the men from the \hlstd{cdc} data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
@@ -225,13 +232,13 @@ This new data set contains all the same variables but just under half the rows.
As an aside, you can use several of these conditions together with \texttt{\&} and \texttt{|}. The \texttt{\&} is read ``and'' so that
<>=
-m_and_over30 <- subset(cdc, cdc$gender == "m" & cdc$age > 30)
+m_and_over30 <- subset(cdc, gender == "m" & age > 30)
@
will give you the data for men over the age of 30. The \texttt{|} character is read ``or'' so that
<>=
-m_or_over30 <- subset(cdc, cdc$gender == "m" | cdc$age > 30)
+m_or_over30 <- subset(cdc, gender == "m" | age > 30)
@
will take people who are men or over the age of 30 (why that's an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many ``and'' and ``or'' clauses as you like when forming a subset.
@@ -245,7 +252,8 @@ Create a new object called \hlstd{under23\_and\_smoke} that contains all observa
With our subsetting tools in hand, we'll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We've already looked at categorical data such as \hlstd{smoke} and \hlstd{gender} so now let's turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.
<>=
-boxplot(cdc$height)
+# boxplot(cdc$height)
+bwplot(~height, data=cdc)
@
You can compare the locations of the components of the box by examining the summary statistics.
@@ -257,7 +265,8 @@ summary(cdc$height)
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with
<>=
-boxplot(cdc$height ~ cdc$gender)
+# boxplot(cdc$height ~ cdc$gender)
+bwplot(height ~ gender, data=cdc)
@
The notation here is new. The \texttildelow~character can be read ``versus'' or ``as a function of''. So we're asking R to give us box plots of heights where the groups are defined by gender.
@@ -269,9 +278,11 @@ Next let's consider a new variable that doesn't show up directly in this data se
The following two lines first create a new variable called \hlstd{bmi} in the \hlstd{cdc} data set and then create box plots of these values, defining groups by the variable \hlstd{genhlth}.
<>=
-bmi <- (cdc$weight / cdc$height^2) * 703
+# bmi <- (cdc$weight / cdc$height^2) * 703
+# boxplot(bmi ~ cdc$genhlth)
-boxplot(bmi ~ cdc$genhlth)
+cdc = transform(cdc, bmi = (weight / height^2) * 703)
+bwplot(bmi ~ genhlth, data=cdc)
@
Notice that the first line above is just some arithmetic, but it's applied to all 20,000 numbers in the \hlstd{cdc} data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
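To see this elementwise behaviour on a smaller scale, here is a tiny sketch with made-up numbers:

<>=
w <- c(150, 200)    # two made-up weights, in pounds
h <- c(65, 70)      # two made-up heights, in inches
(w / h^2) * 703     # one BMI value per person
@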
@@ -283,15 +294,17 @@ What does this box plot show? Pick another categorical variable from the data se
Finally, let's make some histograms. We can look at the histogram for the age of our respondents with the command
<>=
-hist(cdc$age)
+# hist(cdc$age)
+histogram(~age, data=cdc)
@
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of \hlstd{bmi} and then one with 50 breaks.
<>=
-hist(bmi)
-
-hist(bmi, breaks = 50)
+# hist(bmi)
+histogram(~bmi, data=cdc)
+# hist(bmi, breaks = 50)
+histogram(~bmi, data=cdc, nint=50)
@
Note that you can flip between plots that you've created by clicking the forward and backward arrows in the lower right region of RStudio, just above the plots. How do these two histograms compare?
diff --git a/oldLatex/lab1/lab1.tex b/oldLatex/lab1/lab1.tex
index 20ece62..4f683a0 100644
--- a/oldLatex/lab1/lab1.tex
+++ b/oldLatex/lab1/lab1.tex
@@ -142,11 +142,13 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{mean}\hlstd{(cdc}\hlopt{$}\hlstd{weight)}
-
-\hlkwd{var}\hlstd{(cdc}\hlopt{$}\hlstd{weight)}
-
-\hlkwd{median}\hlstd{(cdc}\hlopt{$}\hlstd{weight)}
+\hlcom{# mean(cdc$weight)}
+\hlkwd{require}\hlstd{(mosaic)}
+\hlkwd{mean}\hlstd{(}\hlopt{\mytilde}\hlstd{weight,} \hlkwc{data} \hlstd{= cdc)}
+\hlcom{# var(cdc$weight)}
+\hlkwd{var}\hlstd{(}\hlopt{\mytilde}\hlstd{weight,} \hlkwc{data} \hlstd{= cdc)}
+\hlcom{# median(cdc$weight)}
+\hlkwd{median}\hlstd{(}\hlopt{\mytilde}\hlstd{weight,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -156,7 +158,8 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{table}\hlstd{(cdc}\hlopt{$}\hlstd{smoke100)}
+\hlcom{# table(cdc$smoke100)}
+\hlkwd{tally}\hlstd{(}\hlopt{\mytilde}\hlstd{smoke100,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -166,7 +169,8 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{table}\hlstd{(cdc}\hlopt{$}\hlstd{smoke100)}\hlopt{/}\hlnum{20000}
+\hlcom{# table(cdc$smoke100)/20000}
+\hlkwd{tally}\hlstd{(}\hlopt{\mytilde}\hlstd{smoke100,} \hlkwc{data} \hlstd{= cdc,} \hlkwc{format} \hlstd{=} \hlstr{"proportion"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -176,7 +180,8 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{barplot}\hlstd{(}\hlkwd{table}\hlstd{(cdc}\hlopt{$}\hlstd{smoke100))}
+\hlcom{# barplot(table(cdc$smoke100))}
+\hlkwd{barchart}\hlstd{(}\hlkwd{tally}\hlstd{(}\hlopt{\mytilde}\hlstd{smoke100,} \hlkwc{data} \hlstd{= cdc,} \hlkwc{margins} \hlstd{=} \hlnum{FALSE}\hlstd{),} \hlkwc{horizontal} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -204,7 +209,8 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{table}\hlstd{(cdc}\hlopt{$}\hlstd{gender,cdc}\hlopt{$}\hlstd{smoke100)}
+\hlcom{# table(cdc$gender,cdc$smoke100)}
+\hlkwd{tally}\hlstd{(gender} \hlopt{\mytilde} \hlstd{smoke100,} \hlkwc{data} \hlstd{= cdc,} \hlkwc{format} \hlstd{=} \hlstr{"count"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -214,7 +220,8 @@ \subsection*{Summaries and tables}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{mosaicplot}\hlstd{(}\hlkwd{table}\hlstd{(cdc}\hlopt{$}\hlstd{gender,cdc}\hlopt{$}\hlstd{smoke100))}
+\hlcom{# mosaicplot(table(cdc$gender,cdc$smoke100))}
+\hlkwd{mosaicplot}\hlstd{(}\hlkwd{tally}\hlstd{(gender} \hlopt{\mytilde} \hlstd{smoke100,} \hlkwc{data} \hlstd{= cdc,} \hlkwc{margins} \hlstd{=} \hlnum{FALSE}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -356,7 +363,7 @@ \subsection*{A little more on subsetting}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{mdata} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, cdc}\hlopt{$}\hlstd{gender} \hlopt{==} \hlstr{"m"}\hlstd{)}
+\hlstd{mdata} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, gender} \hlopt{==} \hlstr{"m"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -378,7 +385,7 @@ \subsection*{A little more on subsetting}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{m_and_over30} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, cdc}\hlopt{$}\hlstd{gender} \hlopt{==} \hlstr{"m"} \hlopt{&} \hlstd{cdc}\hlopt{$}\hlstd{age} \hlopt{>} \hlnum{30}\hlstd{)}
+\hlstd{m_and_over30} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, gender} \hlopt{==} \hlstr{"m"} \hlopt{&} \hlstd{age} \hlopt{>} \hlnum{30}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -388,7 +395,7 @@ \subsection*{A little more on subsetting}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{m_or_over30} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, cdc}\hlopt{$}\hlstd{gender} \hlopt{==} \hlstr{"m"} \hlopt{|} \hlstd{cdc}\hlopt{$}\hlstd{age} \hlopt{>} \hlnum{30}\hlstd{)}
+\hlstd{m_or_over30} \hlkwb{<-} \hlkwd{subset}\hlstd{(cdc, gender} \hlopt{==} \hlstr{"m"} \hlopt{|} \hlstd{age} \hlopt{>} \hlnum{30}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -406,7 +413,8 @@ \subsection*{Quantitative data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{boxplot}\hlstd{(cdc}\hlopt{$}\hlstd{height)}
+\hlcom{# boxplot(cdc$height)}
+\hlkwd{bwplot}\hlstd{(}\hlopt{\mytilde}\hlstd{height,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -426,7 +434,8 @@ \subsection*{Quantitative data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{boxplot}\hlstd{(cdc}\hlopt{$}\hlstd{height} \hlopt{\mytilde} \hlstd{cdc}\hlopt{$}\hlstd{gender)}
+\hlcom{# boxplot(cdc$height \mytilde cdc$gender)}
+\hlkwd{bwplot}\hlstd{(height} \hlopt{\mytilde} \hlstd{gender,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -442,9 +451,10 @@ \subsection*{Quantitative data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{bmi} \hlkwb{<-} \hlstd{(cdc}\hlopt{$}\hlstd{weight} \hlopt{/} \hlstd{cdc}\hlopt{$}\hlstd{height}\hlopt{^}\hlnum{2}\hlstd{)} \hlopt{*} \hlnum{703}
+\hlcom{# bmi <- (cdc$weight / cdc$height^2) * 703 boxplot(bmi \mytilde cdc$genhlth)}
-\hlkwd{boxplot}\hlstd{(bmi} \hlopt{\mytilde} \hlstd{cdc}\hlopt{$}\hlstd{genhlth)}
+\hlstd{cdc} \hlkwb{=} \hlkwd{transform}\hlstd{(cdc,} \hlkwc{bmi} \hlstd{= (weight}\hlopt{/}\hlstd{height}\hlopt{^}\hlnum{2}\hlstd{)} \hlopt{*} \hlnum{703}\hlstd{)}
+\hlkwd{bwplot}\hlstd{(bmi} \hlopt{\mytilde} \hlstd{genhlth,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -460,7 +470,8 @@ \subsection*{Quantitative data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(cdc}\hlopt{$}\hlstd{age)}
+\hlcom{# hist(cdc$age)}
+\hlkwd{histogram}\hlstd{(}\hlopt{\mytilde}\hlstd{age,} \hlkwc{data} \hlstd{= cdc)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -470,9 +481,10 @@ \subsection*{Quantitative data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(bmi)}
-
-\hlkwd{hist}\hlstd{(bmi,} \hlkwc{breaks} \hlstd{=} \hlnum{50}\hlstd{)}
+\hlcom{# hist(bmi)}
+\hlkwd{histogram}\hlstd{(}\hlopt{\mytilde}\hlstd{bmi,} \hlkwc{data} \hlstd{= cdc)}
+\hlcom{# hist(bmi, breaks = 50)}
+\hlkwd{histogram}\hlstd{(}\hlopt{\mytilde}\hlstd{bmi,} \hlkwc{data} \hlstd{= cdc,} \hlkwc{nint} \hlstd{=} \hlnum{50}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
diff --git a/oldLatex/lab2/lab2.Rnw b/oldLatex/lab2/lab2.Rnw
index 598e651..301dceb 100644
--- a/oldLatex/lab2/lab2.Rnw
+++ b/oldLatex/lab2/lab2.Rnw
@@ -107,8 +107,9 @@ While we don't have any data from a shooter we know to have independent shots, t
<>=
outcomes <- c("heads", "tails")
-
-sample(outcomes, size = 1, replace = TRUE)
+require(mosaic)
+# sample(outcomes, size = 1, replace = TRUE)
+resample(outcomes, size = 1)
@
The vector \hlstd{outcomes} can be thought of as a hat with two slips of paper in it: one slip says ``heads" and the other says ``tails". The function \hlkwd{resample} (a mosaic wrapper around \hlkwd{sample} that always samples with replacement) draws one slip from the hat and tells us if it was a head or a tail.
@@ -118,7 +119,8 @@ Run the second command listed above several times. Just like when flipping a coi
If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the \hlkwc{size} argument, which governs how many samples to draw (\hlkwd{resample} always puts the slip of paper back in the hat before drawing again, which is what the \hlkwc{replace = }\hlnum{TRUE} argument to \hlkwd{sample} does). Save the resulting vector of heads and tails in a new object called \hlstd{sim\_fair\_coin}.
<>=
-sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
+# sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
+sim_fair_coin <- resample(outcomes, size = 100)
@
To view the results of this simulation, type the name of the object and then use \hlkwd{table} to count up the number of heads and tails.
@@ -132,7 +134,8 @@ table(sim_fair_coin)
Since there are only two elements in \hlstd{outcomes}, the probability that we ``flip'' a coin and it lands heads is 0.5. Say we're trying to simulate an unfair coin that we know only lands heads 20\% of the time. We can adjust for this by adding an argument called \hlkwc{prob}, which provides a vector of two probability weights.
<>=
-sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8))
+# sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8))
+sim_unfair_coin <- resample(outcomes, size = 100, prob = c(0.2, 0.8))
@
\hlkwc{prob}\hlkwc{=}\hlkwd{c}\hlnum{(0.2,0.8)} indicates that for the two elements in the \hlstd{outcomes} vector, we want to select the first one, \hlstr{heads}, with probability 0.2 and the second one, \hlstr{tails}, with probability 0.8.\symbolfootnote[2]{Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled ``head" and 8 chips ``tail". Therefore at each draw, the probability of drawing a chip that says ``head" is 20\%, and ``tail" is 80\%.}
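As a quick sanity check (a sketch), tabulating the simulated flips should show roughly 20 heads and 80 tails:

<>=
table(sim_unfair_coin)   # counts of "heads" and "tails"; expect roughly 20 and 80
@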
@@ -156,8 +159,8 @@ Simulating a basketball player who has independent shots uses the same mechanism
<>=
outcomes <- c("H", "M")
-
-sim_basket <- sample(outcomes, size = 1, replace = TRUE)
+# sim_basket <- sample(outcomes, size = 1, replace = TRUE)
+sim_basket <- resample(outcomes, size = 1)
@
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.
diff --git a/oldLatex/lab2/lab2.pdf b/oldLatex/lab2/lab2.pdf
index f3049d2..f00a097 100644
Binary files a/oldLatex/lab2/lab2.pdf and b/oldLatex/lab2/lab2.pdf differ
diff --git a/oldLatex/lab2/lab2.tex b/oldLatex/lab2/lab2.tex
index 3a39b98..a00212f 100644
--- a/oldLatex/lab2/lab2.tex
+++ b/oldLatex/lab2/lab2.tex
@@ -173,8 +173,9 @@ \subsection*{Simulations in R}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{outcomes} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"heads"}\hlstd{,} \hlstr{"tails"}\hlstd{)}
-
-\hlkwd{sample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{)}
+\hlkwd{require}\hlstd{(mosaic)}
+\hlcom{# sample(outcomes, size = 1, replace = TRUE)}
+\hlkwd{resample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -189,7 +190,8 @@ \subsection*{Simulations in R}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{sim_fair_coin} \hlkwb{<-} \hlkwd{sample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{100}\hlstd{,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{)}
+\hlcom{# sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)}
+\hlstd{sim_fair_coin} \hlkwb{<-} \hlkwd{resample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{100}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -213,8 +215,9 @@ \subsection*{Simulations in R}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{sim_unfair_coin} \hlkwb{<-} \hlkwd{sample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{100}\hlstd{,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{prob} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0.2}\hlstd{,}
- \hlnum{0.8}\hlstd{))}
+\hlcom{# sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob =}
+\hlcom{# c(0.2, 0.8))}
+\hlstd{sim_unfair_coin} \hlkwb{<-} \hlkwd{resample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{100}\hlstd{,} \hlkwc{prob} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{0.2}\hlstd{,} \hlnum{0.8}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -248,8 +251,8 @@ \subsection*{Simulating the Independent Shooter}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{outcomes} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"H"}\hlstd{,} \hlstr{"M"}\hlstd{)}
-
-\hlstd{sim_basket} \hlkwb{<-} \hlkwd{sample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{)}
+\hlcom{# sim_basket <- sample(outcomes, size = 1, replace = TRUE)}
+\hlstd{sim_basket} \hlkwb{<-} \hlkwd{resample}\hlstd{(outcomes,} \hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
diff --git a/oldLatex/lab3/lab3.Rnw b/oldLatex/lab3/lab3.Rnw
index cd540ab..1590890 100644
--- a/oldLatex/lab3/lab3.Rnw
+++ b/oldLatex/lab3/lab3.Rnw
@@ -52,29 +52,35 @@ fhgtsd <- sd(fdims$hgt)
Next we make a density histogram to use as the backdrop and use the \hlkwd{lines} function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the \emph{heights} of the bars add up to the total number of observations, in a density histogram the \emph{areas} of the bars add up to 1. The area of each bar can be calculated as simply the height $\times$ the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.
-<>=
-hist(fdims$hgt, probability = TRUE)
-
-x <- 140:190
-
-y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
-
-lines(x = x, y = y, col = "blue")
+<>=
+# hist(fdims$hgt, probability = TRUE)
+# x <- 140:190
+# y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
+# lines(x = x, y = y, col = "blue")
+require(mosaic)
+histogram(~hgt, data=fdims, fit="normal", nint=8)
@
+Notice that adding the \hlnum{fit} argument to \hlkwd{histogram} overlays a normal distribution on the plot. The optional \hlnum{nint} argument simply specifies the number of bins.
+
After plotting the density histogram with the first command, we create the x- and y-coordinates for the normal curve. We chose the \hlstd{x} range as 140 to 190 in order to span the entire range of \hlstd{fheight}. To create \hlstd{y}, we use \hlkwd{dnorm} to calculate the density of each of those x-values in a distribution that is normal with mean \hlstd{fhgtmean} and standard deviation \hlstd{fhgtsd}. The final command draws a curve on the existing plot (the density histogram) by connecting each of the points specified by \hlstd{x} and \hlstd{y}. The argument \hlkwc{col} simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.\symbolfootnote[2]{The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: \texttt{hist(fdims\$hgt, probability = TRUE, ylim = c(0, 0.06))}.}
+
+
\begin{exercise}
Based on this plot, does it appear that the data follow a nearly normal distribution?
\end{exercise}
+
+
\subsection*{Evaluating the normal distribution}
Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for ``quantile-quantile''.
<>=
-qqnorm(fdims$hgt)
+# qqnorm(fdims$hgt)
-qqline(fdims$hgt)
+# qqline(fdims$hgt)
+qqmath(~hgt, data=fdims, type=c("p", "r"))
@
A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line. The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. We're left with the same problem that we encountered with the histogram above: how close is close enough?
@@ -82,7 +88,7 @@ A data set that is nearly normal will result in a probability plot where the poi
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I \emph{know} came from a normal distribution? We can answer this by simulating data from a normal distribution using \hlkwd{rnorm}.
<>=
-sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
+sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd)
@
The first argument indicates how many numbers you'd like to generate, which we specify to be the same as the number of heights in the \hlstd{fdims} data set using the \hlkwd{nrow} function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, \hlstd{sim\_norm}, as well as its normal probability plot.
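A sketch of those two plots in the lattice style used above:

<>=
histogram(~sim_norm)                     # shape of the simulated heights
qqmath(~sim_norm, type = c("p", "r"))    # normal probability plot, as for hgt above
@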
@@ -118,6 +124,13 @@ If we assume that female heights are normally distributed (a very close approxim
Note that the function \hlkwd{pnorm} gives the area under the normal curve below a given value, \hlkwc{q}, with a given mean and standard deviation. Since we're interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
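Spelled out with the quantities defined earlier in the lab (a quick sketch), that calculation is:

<>=
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
@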
+For a graphical interpretation, try:
+
+<>=
+xpnorm(q = 182, mean = fhgtmean, sd = fhgtsd, lower.tail=FALSE)
+@
+
+
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
<>=
diff --git a/oldLatex/lab3/lab3.pdf b/oldLatex/lab3/lab3.pdf
index 606de49..f9e02d4 100644
Binary files a/oldLatex/lab3/lab3.pdf and b/oldLatex/lab3/lab3.pdf differ
diff --git a/oldLatex/lab3/lab3.tex b/oldLatex/lab3/lab3.tex
index 48dca0f..a3257d2 100644
--- a/oldLatex/lab3/lab3.tex
+++ b/oldLatex/lab3/lab3.tex
@@ -125,33 +125,37 @@ \subsection*{The normal distribution}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(fdims}\hlopt{$}\hlstd{hgt,} \hlkwc{probability} \hlstd{=} \hlnum{TRUE}\hlstd{)}
-
-\hlstd{x} \hlkwb{<-} \hlnum{140}\hlopt{:}\hlnum{190}
-
-\hlstd{y} \hlkwb{<-} \hlkwd{dnorm}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{mean} \hlstd{= fhgtmean,} \hlkwc{sd} \hlstd{= fhgtsd)}
-
-\hlkwd{lines}\hlstd{(}\hlkwc{x} \hlstd{= x,} \hlkwc{y} \hlstd{= y,} \hlkwc{col} \hlstd{=} \hlstr{"blue"}\hlstd{)}
+\hlcom{# hist(fdims$hgt, probability = TRUE) x <- 140:190 y <- dnorm(x = x, mean =}
+\hlcom{# fhgtmean, sd = fhgtsd) lines(x = x, y = y, col = 'blue')}
+\hlkwd{require}\hlstd{(mosaic)}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{hgt,} \hlkwc{data} \hlstd{= fdims,} \hlkwc{fit} \hlstd{=} \hlstr{"normal"}\hlstd{,} \hlkwc{nint} \hlstd{=} \hlnum{8}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
+Notice that adding the \hlnum{fit} argument to \hlkwd{histogram} overlays a normal distribution on the plot. The optional \hlnum{nint} argument simply specifies the number of bins.
+
After plotting the density histogram with the first command, we create the x- and y-coordinates for the normal curve. We chose the \hlstd{x} range as 140 to 190 in order to span the entire range of \hlstd{fheight}. To create \hlstd{y}, we use \hlkwd{dnorm} to calculate the density of each of those x-values in a distribution that is normal with mean \hlstd{fhgtmean} and standard deviation \hlstd{fhgtsd}. The final command draws a curve on the existing plot (the density histogram) by connecting each of the points specified by \hlstd{x} and \hlstd{y}. The argument \hlkwc{col} simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.\symbolfootnote[2]{The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram. To adjust the y-axis you can add a third argument to the histogram function: \texttt{hist(fdims\$hgt, probability = TRUE, ylim = c(0, 0.06))}.}
+
+
\begin{exercise}
Based on this plot, does it appear that the data follow a nearly normal distribution?
\end{exercise}
+
+
\subsection*{Evaluating the normal distribution}
Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for ``quantile-quantile''.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{qqnorm}\hlstd{(fdims}\hlopt{$}\hlstd{hgt)}
+\hlcom{# qqnorm(fdims$hgt)}
-\hlkwd{qqline}\hlstd{(fdims}\hlopt{$}\hlstd{hgt)}
+\hlcom{# qqline(fdims$hgt)}
+\hlkwd{qqmath}\hlstd{(}\hlopt{~}\hlstd{hgt,} \hlkwc{data} \hlstd{= fdims,} \hlkwc{type} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"p"}\hlstd{,} \hlstr{"r"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -164,7 +168,7 @@ \subsection*{Evaluating the normal distribution}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{sim_norm} \hlkwb{<-} \hlkwd{rnorm}\hlstd{(}\hlkwc{n} \hlstd{=} \hlkwd{length}\hlstd{(fdims}\hlopt{$}\hlstd{hgt),} \hlkwc{mean} \hlstd{= fhgtmean,} \hlkwc{sd} \hlstd{= fhgtsd)}
+\hlstd{sim_norm} \hlkwb{<-} \hlkwd{rnorm}\hlstd{(}\hlkwc{n} \hlstd{=} \hlkwd{nrow}\hlstd{(fdims),} \hlkwc{mean} \hlstd{= fhgtmean,} \hlkwc{sd} \hlstd{= fhgtsd)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -213,6 +217,18 @@ \subsection*{Normal probabilities}
Note that the function \hlkwd{pnorm} gives the area under the normal curve below a given value, \hlkwc{q}, with a given mean and standard deviation. Since we're interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.
+For a graphical interpretation, try:
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlkwd{xpnorm}\hlstd{(}\hlkwc{q} \hlstd{=} \hlnum{182}\hlstd{,} \hlkwc{mean} \hlstd{= fhgtmean,} \hlkwc{sd} \hlstd{= fhgtsd,} \hlkwc{lower.tail} \hlstd{=} \hlnum{FALSE}\hlstd{)}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.
\begin{knitrout}
diff --git a/oldLatex/lab4/lab4A.Rnw b/oldLatex/lab4/lab4A.Rnw
index 4b1a38a..df652cd 100644
--- a/oldLatex/lab4/lab4A.Rnw
+++ b/oldLatex/lab4/lab4A.Rnw
@@ -16,6 +16,7 @@ We consider real estate data from the city of Ames, Iowa. The details of every
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
+require(mosaic)
@
We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we'll restrict our attention to just two of the variables: the above ground living area of the house in square feet (\hlstd{Gr.Liv.Area}) and the sale price (\hlstd{SalePrice}). To save some effort throughout the lab, create two variables with short names that represent these two variables.
@@ -31,7 +32,8 @@ Let's look at the distribution of area in our population of home sales by calcul
<>=
summary(area)
-hist(area)
+# hist(area)
+histogram(area)
@
\begin{exercise}
@@ -78,10 +80,20 @@ for(i in 1:5000){
hist(sample_means50)
@
+If you are familiar with computer programming, you know that a \hlnum{for} loop is a fundamental control-flow construct. However, you may find it easier to write loops using the \hlnum{mosaic} function \hlnum{do}, which simply repeats a statement and collects the results in a data frame.
+
+<>=
+sample_means50 = do(5000) * mean(sample(area, 50))
+histogram(~result, data=sample_means50)
+@
+
+
+
If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the number of bins (the \hlkwc{breaks} argument of \hlkwd{hist}, or \hlkwc{nint} in \hlkwd{histogram}).
<>=
-hist(sample_means50, breaks = 25)
+# hist(sample_means50, breaks = 25)
+histogram(~result, data=sample_means50, nint=25)
@
Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called \hlstd{sample\_means50}. On the next page, we'll review how this set of code works.
@@ -90,6 +102,8 @@ Here we use R to take 5000 samples of size 50 from the population, calculate the
How many elements are there in \hlstd{sample\_means50}? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
\end{exercise}
+\paragraph{Nota Bene} The \hlnum{for} loop is an invaluable programming construct. However, everything that you need for this class can be done with \hlnum{do}. Please use whichever is most comfortable for you.
+
\subsection*{Interlude: The \texttt{for} loop}
Let's take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is \emph{iteration}: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from \hlstd{area} then save the mean of that sample into the \hlstd{sample\_means50} vector. Without the for loop, this would be painful:
@@ -141,7 +155,8 @@ To make sure you understand what you've done in this loop, try running a smaller
Mechanics aside, let's return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.
<>=
-hist(sample_means50)
+# hist(sample_means50)
+histogram(~result, data=sample_means50)
@
The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
@@ -150,11 +165,14 @@ To get a sense of the effect that sample size has on our distribution, let's bui
<>=
sample_means10 <- rep(0, 5000)
+sample_means50 <- rep(0, 5000)
sample_means100 <- rep(0, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
+ samp <- sample(area, 50)
+ sample_means50[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
@@ -182,6 +200,34 @@ par(mfrow = c(1, 1))
}. The \hlkwc{breaks} argument specifies the number of bins used in constructing the histogram. The \hlkwc{xlim} argument specifies the range of the x-axis of the histogram, and by setting it equal to \hlstd{xlimits} for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
+An alternative, shorter version of the above process works as follows. First, generate the three sets of sample means of different sizes.
+
+<>=
+sample_means10 <- do(5000) * mean(sample(area, 10))
+sample_means50 <- do(5000) * mean(sample(area, 50))
+sample_means100 <- do(5000) * mean(sample(area, 100))
+@
+
+Next, combine these three $5000 \times 1$ data frames into one $15000 \times 1$ data frame.
+
+<>=
+samp.dist = rbind(sample_means10, sample_means50, sample_means100)
+@
+
+Then add a new column \hlnum{sample.size} to the resulting data frame that indicates the sample size in each case. This new variable is simply $10$ repeated 5000 times, followed by $50$ repeated 5000 times, followed by $100$ repeated 5000 times. The use of the \hlnum{factor} function will ensure that \hlnum{R} considers this to be a categorical variable, and not a numerical one.
+
+<>=
+samp.dist$sample.size = factor(rep(c(10, 50, 100), each=5000))
+@
+
+Finally, draw the histogram using the \hlnum{|} formula notation. If you want to have the histograms stacked vertically rather than horizontally, use the \hlkwc{layout} argument.
+
+<>=
+histogram(~result | sample.size, data=samp.dist, nint=20, layout=c(1,3))
+@
+
+
+
\begin{exercise}
When the sample size is larger, what happens to the center? What about the spread?
\end{exercise}
diff --git a/oldLatex/lab4/lab4A.tex b/oldLatex/lab4/lab4A.tex
index 64e3c1b..c4c3017 100644
--- a/oldLatex/lab4/lab4A.tex
+++ b/oldLatex/lab4/lab4A.tex
@@ -68,6 +68,7 @@ \subsection*{The data}
\hlkwd{download.file}\hlstd{(}\hlstr{"http://www.openintro.org/stat/data/ames.RData"}\hlstd{,} \hlkwc{destfile} \hlstd{=} \hlstr{"ames.RData"}\hlstd{)}
\hlkwd{load}\hlstd{(}\hlstr{"ames.RData"}\hlstd{)}
+\hlkwd{require}\hlstd{(mosaic)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -93,7 +94,8 @@ \subsection*{The data}
\begin{alltt}
\hlkwd{summary}\hlstd{(area)}
-\hlkwd{hist}\hlstd{(area)}
+\hlcom{# hist(area)}
+\hlkwd{histogram}\hlstd{(area)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -158,12 +160,27 @@ \subsection*{The unknown sampling distribution}
\end{knitrout}
+If you are familiar with computer programming, you know that a \hlnum{for} loop is a fundamental control-flow construct. However, you may find it easier to write loops using the \hlnum{mosaic} function \hlnum{do}, which simply repeats a statement and collects the results in a data frame.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlstd{sample_means50} \hlkwb{=} \hlkwd{do}\hlstd{(}\hlnum{5000}\hlstd{)} \hlopt{*} \hlkwd{mean}\hlstd{(}\hlkwd{sample}\hlstd{(area,} \hlnum{50}\hlstd{))}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{result,} \hlkwc{data} \hlstd{= sample_means50)}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+
+
If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the number of bins (the \hlkwc{breaks} argument of \hlkwd{hist}, or \hlkwc{nint} in \hlkwd{histogram}).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(sample_means50,} \hlkwc{breaks} \hlstd{=} \hlnum{25}\hlstd{)}
+\hlcom{# hist(sample_means50, breaks = 25)}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{result,} \hlkwc{data} \hlstd{= sample_means50,} \hlkwc{nint} \hlstd{=} \hlnum{25}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -175,6 +192,8 @@ \subsection*{The unknown sampling distribution}
How many elements are there in \hlstd{sample\_means50}? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
\end{exercise}
+\paragraph{Nota Bene} The \hlnum{for} loop is an invaluable programming construct. However, everything that you need for this class can be done with \hlnum{do}. Please use whichever is most comfortable for you.
+
\subsection*{Interlude: The \texttt{for} loop}
Let's take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is \emph{iteration}: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from \hlstd{area} then save the mean of that sample into the \hlstd{sample\_means50} vector. Without the for loop, this would be painful:
@@ -238,7 +257,8 @@ \subsection*{Sample size and the sampling distribution}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(sample_means50)}
+\hlcom{# hist(sample_means50)}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{result,} \hlkwc{data} \hlstd{= sample_means50)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -252,11 +272,14 @@ \subsection*{Sample size and the sampling distribution}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{sample_means10} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{5000}\hlstd{)}
+\hlstd{sample_means50} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{5000}\hlstd{)}
\hlstd{sample_means100} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{5000}\hlstd{)}
\hlkwa{for} \hlstd{(i} \hlkwa{in} \hlnum{1}\hlopt{:}\hlnum{5000}\hlstd{) \{}
\hlstd{samp} \hlkwb{<-} \hlkwd{sample}\hlstd{(area,} \hlnum{10}\hlstd{)}
\hlstd{sample_means10[i]} \hlkwb{<-} \hlkwd{mean}\hlstd{(samp)}
+ \hlstd{samp} \hlkwb{<-} \hlkwd{sample}\hlstd{(area,} \hlnum{50}\hlstd{)}
+ \hlstd{sample_means50[i]} \hlkwb{<-} \hlkwd{mean}\hlstd{(samp)}
\hlstd{samp} \hlkwb{<-} \hlkwd{sample}\hlstd{(area,} \hlnum{100}\hlstd{)}
\hlstd{sample_means100[i]} \hlkwb{<-} \hlkwd{mean}\hlstd{(samp)}
\hlstd{\}}
@@ -297,6 +320,55 @@ \subsection*{Sample size and the sampling distribution}
}. The \hlkwc{breaks} argument specifies the number of bins used in constructing the histogram. The \hlkwc{xlim} argument specifies the range of the x-axis of the histogram, and by setting it equal to \hlstd{xlimits} for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
+An alternative, shorter version of the above process works as follows. First, generate the three sets of sample means of different sizes.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlstd{sample_means10} \hlkwb{<-} \hlkwd{do}\hlstd{(}\hlnum{5000}\hlstd{)} \hlopt{*} \hlkwd{mean}\hlstd{(}\hlkwd{sample}\hlstd{(area,} \hlnum{10}\hlstd{))}
+\hlstd{sample_means50} \hlkwb{<-} \hlkwd{do}\hlstd{(}\hlnum{5000}\hlstd{)} \hlopt{*} \hlkwd{mean}\hlstd{(}\hlkwd{sample}\hlstd{(area,} \hlnum{50}\hlstd{))}
+\hlstd{sample_means100} \hlkwb{<-} \hlkwd{do}\hlstd{(}\hlnum{5000}\hlstd{)} \hlopt{*} \hlkwd{mean}\hlstd{(}\hlkwd{sample}\hlstd{(area,} \hlnum{100}\hlstd{))}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+Next, combine these three $5000 \times 1$ data frames into one $15000 \times 1$ data frame.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlstd{samp.dist} \hlkwb{=} \hlkwd{rbind}\hlstd{(sample_means10, sample_means50, sample_means100)}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+Then add a new column \hlnum{sample.size} to the resulting data frame that indicates the sample size in each case. This new variable is simply $10$ repeated 5000 times, followed by $50$ repeated 5000 times, followed by $100$ repeated 5000 times. The use of the \hlnum{factor} function will ensure that \hlnum{R} considers this to be a categorical variable, and not a numerical one.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlstd{samp.dist}\hlopt{$}\hlstd{sample.size} \hlkwb{=} \hlkwd{factor}\hlstd{(}\hlkwd{rep}\hlstd{(}\hlkwd{c}\hlstd{(}\hlnum{10}\hlstd{,} \hlnum{50}\hlstd{,} \hlnum{100}\hlstd{),} \hlkwc{each} \hlstd{=} \hlnum{5000}\hlstd{))}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+Finally, draw the histogram using the \hlnum{|} formula notation. If you want to have the histograms stacked vertically rather than horizontally, use the \hlkwc{layout} argument.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{result} \hlopt{|} \hlstd{sample.size,} \hlkwc{data} \hlstd{= samp.dist,} \hlkwc{nint} \hlstd{=} \hlnum{20}\hlstd{,} \hlkwc{layout} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,}
+ \hlnum{3}\hlstd{))}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+
+
\begin{exercise}
When the sample size is larger, what happens to the center? What about the spread?
\end{exercise}
diff --git a/oldLatex/lab4/lab4B.Rnw b/oldLatex/lab4/lab4B.Rnw
index 572c37e..fc25c4d 100644
--- a/oldLatex/lab4/lab4B.Rnw
+++ b/oldLatex/lab4/lab4B.Rnw
@@ -117,15 +117,15 @@ for(i in 1:50){
Lastly, we construct the confidence intervals.
<>=
-lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
+lower <- samp_mean - 1.96 * samp_sd / sqrt(n)
-upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
+upper <- samp_mean + 1.96 * samp_sd / sqrt(n)
@
-Lower bounds of these 50 confidence intervals are stored in \hlkwd{lower\_vector}, and the upper bounds are in \hlkwd{upper\_vector}. Let's view the first interval.
+Lower bounds of these 50 confidence intervals are stored in \hlnum{lower}, and the upper bounds are in \hlnum{upper}. Let's view the first interval.
<>=
-c(lower_vector[1],upper_vector[1])
+c(lower[1],upper[1])
@
\vspace{1.5cm}
@@ -137,7 +137,7 @@ c(lower_vector[1],upper_vector[1])
\item Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.\symbolfootnote[2]{This figure should look familiar (See Section 4.2.2.)}
<>=
-plot_ci(lower_vector, upper_vector, mean(population))
+plot_ci(lower, upper, mean(population))
@
\item Pick a confidence level of your choosing, provided it is not 95\%. What is the appropriate critical value?
diff --git a/oldLatex/lab4/lab4B.pdf b/oldLatex/lab4/lab4B.pdf
index 0932a8f..3914985 100644
Binary files a/oldLatex/lab4/lab4B.pdf and b/oldLatex/lab4/lab4B.pdf differ
diff --git a/oldLatex/lab4/lab4B.tex b/oldLatex/lab4/lab4B.tex
index b236edf..fa2366f 100644
--- a/oldLatex/lab4/lab4B.tex
+++ b/oldLatex/lab4/lab4B.tex
@@ -204,20 +204,20 @@ \subsection*{Confidence levels}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{lower_vector} \hlkwb{<-} \hlstd{samp_mean} \hlopt{-} \hlnum{1.96} \hlopt{*} \hlstd{samp_sd}\hlopt{/}\hlkwd{sqrt}\hlstd{(n)}
+\hlstd{lower} \hlkwb{<-} \hlstd{samp_mean} \hlopt{-} \hlnum{1.96} \hlopt{*} \hlstd{samp_sd}\hlopt{/}\hlkwd{sqrt}\hlstd{(n)}
-\hlstd{upper_vector} \hlkwb{<-} \hlstd{samp_mean} \hlopt{+} \hlnum{1.96} \hlopt{*} \hlstd{samp_sd}\hlopt{/}\hlkwd{sqrt}\hlstd{(n)}
+\hlstd{upper} \hlkwb{<-} \hlstd{samp_mean} \hlopt{+} \hlnum{1.96} \hlopt{*} \hlstd{samp_sd}\hlopt{/}\hlkwd{sqrt}\hlstd{(n)}
\end{alltt}
\end{kframe}
\end{knitrout}
-Lower bounds of these 50 confidence intervals are stored in \hlkwd{lower\_vector}, and the upper bounds are in \hlkwd{upper\_vector}. Let's view the first interval.
+Lower bounds of these 50 confidence intervals are stored in \hlnum{lower}, and the upper bounds are in \hlnum{upper}. Let's view the first interval.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{c}\hlstd{(lower_vector[}\hlnum{1}\hlstd{], upper_vector[}\hlnum{1}\hlstd{])}
+\hlkwd{c}\hlstd{(lower[}\hlnum{1}\hlstd{], upper[}\hlnum{1}\hlstd{])}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -234,7 +234,7 @@ \subsection*{On your own}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{plot_ci}\hlstd{(lower_vector, upper_vector,} \hlkwd{mean}\hlstd{(population))}
+\hlkwd{plot_ci}\hlstd{(lower, upper,} \hlkwd{mean}\hlstd{(population))}
\end{alltt}
\end{kframe}
\end{knitrout}
diff --git a/oldLatex/lab5/lab5.Rnw b/oldLatex/lab5/lab5.Rnw
index 5039c5e..b2f9fdb 100644
--- a/oldLatex/lab5/lab5.Rnw
+++ b/oldLatex/lab5/lab5.Rnw
@@ -61,8 +61,10 @@ Make a side-by-side boxplot of \hlstd{habit} and \hlstd{weight}. What does the p
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions. The command below splits the \hlkwd{weight} variable into the \hlkwd{habit} groups and then takes the mean of each group with the \hlkwd{mean} function.
-<>=
-by(nc$weight, nc$habit, mean)
+<>=
+# by(nc$weight, nc$habit, mean)
+require(mosaic)
+mean(weight ~ habit, data=nc)
@
% mean color of function
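The same formula interface works for other summaries as well; for example (a sketch, assuming these formula methods are supplied by the mosaic package, as they are for \hlkwd{mean}):

<>=
median(weight ~ habit, data = nc)   # group-wise medians
sd(weight ~ habit, data = nc)       # group-wise standard deviations
@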
diff --git a/oldLatex/lab5/lab5.pdf b/oldLatex/lab5/lab5.pdf
index 8822d83..cd87bd4 100644
Binary files a/oldLatex/lab5/lab5.pdf and b/oldLatex/lab5/lab5.pdf differ
diff --git a/oldLatex/lab5/lab5.tex b/oldLatex/lab5/lab5.tex
index 5fb55d5..26dce91 100644
--- a/oldLatex/lab5/lab5.tex
+++ b/oldLatex/lab5/lab5.tex
@@ -124,7 +124,9 @@ \subsection*{Exploratory analysis}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{by}\hlstd{(nc}\hlopt{$}\hlstd{weight, nc}\hlopt{$}\hlstd{habit, mean)}
+\hlcom{# by(nc$weight, nc$habit, mean)}
+\hlkwd{require}\hlstd{(mosaic)}
+\hlkwd{mean}\hlstd{(weight} \hlopt{~} \hlstd{habit,} \hlkwc{data} \hlstd{= nc)}
\end{alltt}
\end{kframe}
\end{knitrout}
diff --git a/oldLatex/lab6/lab6.Rnw b/oldLatex/lab6/lab6.Rnw
index 2e412d7..c182380 100644
--- a/oldLatex/lab6/lab6.Rnw
+++ b/oldLatex/lab6/lab6.Rnw
@@ -48,8 +48,17 @@ To investigate the link between these two ways of organizing this data, take a l
Using the command below, create a new data frame called \hlkwd{us12} that contains only the rows in \hlkwd{atheism} associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table~6? If not, why?
<>=
-us12 <- subset(atheism, atheism$nationality == "United States" & atheism$year == "2012")
+us12 = subset(atheism, nationality == "United States" & year == "2012")
@
+%sum(us12$response == "atheist") / length(us12$response)
+<>=
+require(mosaic)
+tally(~response, data=us12, format="proportion")
+@
+
+\emph{Hint:} Consider using functions like \hlnum{tally, table, sum} and/or \hlnum{length}.
+
+%Note that the above piece of code first counts how many atheists there are in the sample, and then divides this number by the total sample size.
\end{exercise}
@@ -65,7 +74,7 @@ Write out the conditions for inference to construct a 95\% confidence interval f
If the conditions for inference are reasonable, we can either calculate the standard error and construct the interval by hand, or allow the \hlkwd{inference} function to do it for us.
<>=
-inference(y = us12$response, est = "proportion", type = "ci", method = "theoretical",
+inference(us12$response, est = "proportion", type = "ci", method = "theoretical",
success = "atheist")
@
@@ -92,7 +101,7 @@ The first step is to make a vector \hlkwd{p} that is a sequence from $0$ to $1$
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2*sqrt(p*(1 - p)/n)
-plot(me ~ p)
+xyplot(me ~ p, ylab="Margin of Error", xlab="Population Proportion")
@
\begin{exercise}
@@ -102,6 +111,8 @@ Describe the relationship between \hlkwd{p} and \hlkwd{me}.
\subsection*{Success-failure condition}
The textbook emphasizes that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of thumb is easy enough to follow, but it makes one wonder: what's so special about the number 10? The short answer is: nothing. You could argue that we would be fine with 9 or that we really should be using 11. What is the ``best'' value for such a rule of thumb is, at least to some degree, arbitrary.
+%However, when $np$ and $n(1-p)$ reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
+
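For the first simulation described below ($n = 1040$, $p = 0.1$), the two quantities in the rule of thumb are easy to check directly (a quick sketch):

<>=
n <- 1040
p <- 0.1
c(n * p, n * (1 - p))   # 104 and 936, both well above 10
@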
We can investigate the interplay between $n$ and $p$ and the shape of the sampling distribution by using simulations. To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of $0.1$. For each of the 5000 samples we compute $\hat{p}$ and then plot a histogram to visualize their distribution.
<>=
@@ -130,6 +141,22 @@ Replicate the above simulation three more times but with modified sample sizes a
Once you're done, you can reset the layout of the plotting window by using the command \hlkwd{par(}\hlkwc{mfrow = }\hlkwd{c(}\hlkwd{1,1}\hlkwd{))} or clicking on ``Clear All" above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.
+Here is another way to perform the above simulations and compare the resulting histograms. Note that the code to generate the \hlnum{p\_hats} is exactly the same -- only the code to do the loop and draw the plots is different.
+
+<>=
+n = 1040
+p = 0.1
+sim1 = do(1000) * c(sample.size = n, proportion = p, p_hats = sum(sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p)) == "atheist") / n)
+n = 400
+p = 0.2
+sim2 = do(1000) * c(sample.size = n, proportion = p, p_hats = sum(sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p)) == "atheist") / n)
+sim = rbind(sim1, sim2)
+sim = transform(sim, sample.size = as.factor(sample.size))
+sim = transform(sim, proportion = as.factor(proportion))
+histogram(~p_hats | proportion + sample.size, data=sim)
+@
+
+
\begin{exercise}
If you refer to Table 6, you'll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let's suppose for this exercise that these point estimates are actually the truth. Then, given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
\end{exercise}
@@ -142,7 +169,7 @@ The question of atheism was asked by WIN-Gallup International in a similar surve
\item Answer the following two questions using the \hlkwd{inference} function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
\begin{enumerate}[(a)]
\item Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? \\
-\textit{Hint:} Create a new data set for respondents from Spain. Then use their responses as the first input on the \hlkwd{inference}, and use \hlkwd{year} as the grouping variable.
+\textit{Hint:} Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
\item Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
\end{enumerate}
diff --git a/oldLatex/lab6/lab6.tex b/oldLatex/lab6/lab6.tex
index 50cd01e..4cac2cb 100644
--- a/oldLatex/lab6/lab6.tex
+++ b/oldLatex/lab6/lab6.tex
@@ -105,12 +105,18 @@ \subsection*{The data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlstd{us12} \hlkwb{<-} \hlkwd{subset}\hlstd{(atheism, atheism}\hlopt{$}\hlstd{nationality} \hlopt{==} \hlstr{"United States"} \hlopt{&} \hlstd{atheism}\hlopt{$}\hlstd{year} \hlopt{==}
- \hlstr{"2012"}\hlstd{)}
+\hlstd{us12} \hlkwb{=} \hlkwd{subset}\hlstd{(atheism, nationality} \hlopt{==} \hlstr{"United States"} \hlopt{&} \hlstd{year} \hlopt{==} \hlstr{"2012"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
+%sum(us12$response == "atheist") / length(us12$response)
+
+
+
+\emph{Hint:} Consider using functions like \hlnum{tally, table, sum} and/or \hlnum{length}.
+
+%Note that the above piece of code first counts how many atheists there are in the sample, and then divides this number by the total sample size.
\end{exercise}
@@ -128,7 +134,7 @@ \subsection*{Inference on proportions}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{inference}\hlstd{(}\hlkwc{y} \hlstd{= us12}\hlopt{$}\hlstd{response,} \hlkwc{est} \hlstd{=} \hlstr{"proportion"}\hlstd{,} \hlkwc{type} \hlstd{=} \hlstr{"ci"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"theoretical"}\hlstd{,}
+\hlkwd{inference}\hlstd{(us12}\hlopt{$}\hlstd{response,} \hlkwc{est} \hlstd{=} \hlstr{"proportion"}\hlstd{,} \hlkwc{type} \hlstd{=} \hlstr{"ci"}\hlstd{,} \hlkwc{method} \hlstd{=} \hlstr{"theoretical"}\hlstd{,}
\hlkwc{success} \hlstd{=} \hlstr{"atheist"}\hlstd{)}
\end{alltt}
\end{kframe}
@@ -160,7 +166,7 @@ \subsection*{How does the proportion affect the margin of error?}
\hlstd{n} \hlkwb{<-} \hlnum{1000}
\hlstd{p} \hlkwb{<-} \hlkwd{seq}\hlstd{(}\hlnum{0}\hlstd{,} \hlnum{1}\hlstd{,} \hlnum{0.01}\hlstd{)}
\hlstd{me} \hlkwb{<-} \hlnum{2} \hlopt{*} \hlkwd{sqrt}\hlstd{(p} \hlopt{*} \hlstd{(}\hlnum{1} \hlopt{-} \hlstd{p)}\hlopt{/}\hlstd{n)}
-\hlkwd{plot}\hlstd{(me} \hlopt{~} \hlstd{p)}
+\hlkwd{xyplot}\hlstd{(me} \hlopt{~} \hlstd{p,} \hlkwc{ylab} \hlstd{=} \hlstr{"Margin of Error"}\hlstd{,} \hlkwc{xlab} \hlstd{=} \hlstr{"Population Proportion"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -173,6 +179,8 @@ \subsection*{How does the proportion affect the margin of error?}
\subsection*{Success-failure condition}
The textbook emphasizes that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of thumb is easy enough to follow, but it makes one wonder: what's so special about the number 10? The short answer is: nothing. You could argue that we would be fine with 9 or that we really should be using 11. What is the ``best'' value for such a rule of thumb is, at least to some degree, arbitrary.
+%However, when $np$ and $n(1-p)$ reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
+
We can investigate the interplay between $n$ and $p$ and the shape of the sampling distribution by using simulations. To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of $0.1$. For each of the 5000 samples we compute $\hat{p}$ and then plot a histogram to visualize their distribution.
\begin{knitrout}
@@ -207,6 +215,29 @@ \subsection*{Success-failure condition}
Once you're done, you can reset the layout of the plotting window by using the command \hlkwd{par(}\hlkwc{mfrow = }\hlkwd{c(}\hlkwd{1,1}\hlkwd{))} or clicking on ``Clear All" above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.
+Here is another way to perform the above simulations and compare the resulting histograms. Note that the code to generate the \hlnum{p\_hats} is exactly the same -- only the code to do the loop and draw the plots is different.
+
+\begin{knitrout}
+\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
+\begin{alltt}
+\hlstd{n} \hlkwb{=} \hlnum{1040}
+\hlstd{p} \hlkwb{=} \hlnum{0.1}
+\hlstd{sim1} \hlkwb{=} \hlkwd{do}\hlstd{(}\hlnum{1000}\hlstd{)} \hlopt{*} \hlkwd{c}\hlstd{(}\hlkwc{sample.size} \hlstd{= n,} \hlkwc{proportion} \hlstd{= p,} \hlkwc{p_hats} \hlstd{=} \hlkwd{sum}\hlstd{(}\hlkwd{sample}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"atheist"}\hlstd{,}
+ \hlstr{"non_atheist"}\hlstd{), n,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{prob} \hlstd{=} \hlkwd{c}\hlstd{(p,} \hlnum{1} \hlopt{-} \hlstd{p))} \hlopt{==} \hlstr{"atheist"}\hlstd{)}\hlopt{/}\hlstd{n)}
+\hlstd{n} \hlkwb{=} \hlnum{400}
+\hlstd{p} \hlkwb{=} \hlnum{0.2}
+\hlstd{sim2} \hlkwb{=} \hlkwd{do}\hlstd{(}\hlnum{1000}\hlstd{)} \hlopt{*} \hlkwd{c}\hlstd{(}\hlkwc{sample.size} \hlstd{= n,} \hlkwc{proportion} \hlstd{= p,} \hlkwc{p_hats} \hlstd{=} \hlkwd{sum}\hlstd{(}\hlkwd{sample}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"atheist"}\hlstd{,}
+ \hlstr{"non_atheist"}\hlstd{), n,} \hlkwc{replace} \hlstd{=} \hlnum{TRUE}\hlstd{,} \hlkwc{prob} \hlstd{=} \hlkwd{c}\hlstd{(p,} \hlnum{1} \hlopt{-} \hlstd{p))} \hlopt{==} \hlstr{"atheist"}\hlstd{)}\hlopt{/}\hlstd{n)}
+\hlstd{sim} \hlkwb{=} \hlkwd{rbind}\hlstd{(sim1, sim2)}
+\hlstd{sim} \hlkwb{=} \hlkwd{transform}\hlstd{(sim,} \hlkwc{sample.size} \hlstd{=} \hlkwd{as.factor}\hlstd{(sample.size))}
+\hlstd{sim} \hlkwb{=} \hlkwd{transform}\hlstd{(sim,} \hlkwc{proportion} \hlstd{=} \hlkwd{as.factor}\hlstd{(proportion))}
+\hlkwd{histogram}\hlstd{(}\hlopt{~}\hlstd{p_hats} \hlopt{|} \hlstd{proportion} \hlopt{+} \hlstd{sample.size,} \hlkwc{data} \hlstd{= sim)}
+\end{alltt}
+\end{kframe}
+\end{knitrout}
+
+
+
\begin{exercise}
If you refer to Table 6, you'll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let's suppose for this exercise that these point estimates are actually the truth. Then, given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
\end{exercise}
@@ -219,7 +250,7 @@ \subsection*{On your own}
\item Answer the following two questions using the \hlkwd{inference} function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
\begin{enumerate}[(a)]
\item Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? \\
-\textit{Hint:} Create a new data set for respondents from Spain. Then use their responses as the first input on the \hlkwd{inference}, and use \hlkwd{year} as the grouping variable.
+\textit{Hint:} Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
\item Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
\end{enumerate}
diff --git a/oldLatex/lab7/lab7.Rnw b/oldLatex/lab7/lab7.Rnw
index f6b39fc..33526da 100644
--- a/oldLatex/lab7/lab7.Rnw
+++ b/oldLatex/lab7/lab7.Rnw
@@ -39,8 +39,10 @@ What type of plot would you use to display the relationship between \hlkwd{runs}
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
-<>=
-cor(mlb11$runs, mlb11$at_bats)
+<>=
+# cor(mlb11$runs, mlb11$at_bats)
+require(mosaic)
+cor(runs ~ at_bats, data=mlb11)
@
\subsection*{Sum of squared residuals}
@@ -104,9 +106,10 @@ Fit a new model that uses \hlkwd{homeruns} to predict \hlkwd{runs}. Using the e
Let's create a scatterplot with the least squares line laid on top.
<>=
-plot(mlb11$runs ~ mlb11$at_bats)
-
-abline(m1)
+# plot(mlb11$runs ~ mlb11$at_bats)
+# abline(m1)
+xyplot(runs ~ at_bats, data=mlb11)
+ladd(panel.abline(m1))
@
The function \hlkwd{abline} plots a line based on its slope and intercept. Here, we used a shortcut by providing the model \hlkwd{m1}, which contains both parameter estimates. This line can be used to predict $y$ at any value of $x$. When predictions are made for values of $x$ that are beyond the range of the observed data, it is referred to as \emph{extrapolation} and is not usually recommended. However, predictions made within the range of the data are more reliable. They're also used to compute the residuals.
@@ -125,9 +128,9 @@ To assess whether the linear model is reliable, we need to check for (1) linear
\item Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a \hlcom{\#} is intended to be a comment that helps understand the code but is ignored by R.
<>=
-plot(m1$residuals ~ mlb11$at_bats)
-
-abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
+# plot(m1$residuals ~ mlb11$at_bats)
+# abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
+xyplot(residuals(m1) ~ at_bats, data=mlb11, type=c("p", "r"), lty=3)
@
\begin{exercise}
@@ -137,15 +140,15 @@ Is there any apparent pattern in the residuals plot? What does this indicate abo
\item Nearly normal residuals: To check this condition, we can look at a histogram
<>=
-hist(m1$residuals)
+# hist(m1$residuals)
+histogram(~residuals(m1), fit="normal")
@
or a normal probability plot of the residuals.
<>=
-qqnorm(m1$residuals)
-
-qqline(m1$residuals) # adds diagonal line to the normal prob plot
+qqnorm(residuals(m1))
+qqline(residuals(m1)) # adds diagonal line to the normal prob plot
@
\begin{exercise}
diff --git a/oldLatex/lab7/lab7.pdf b/oldLatex/lab7/lab7.pdf
deleted file mode 100644
index dd96f18..0000000
Binary files a/oldLatex/lab7/lab7.pdf and /dev/null differ
diff --git a/oldLatex/lab7/lab7.tex b/oldLatex/lab7/lab7.tex
index 23f3e28..5662eb7 100644
--- a/oldLatex/lab7/lab7.tex
+++ b/oldLatex/lab7/lab7.tex
@@ -92,7 +92,9 @@ \subsection*{The data}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{cor}\hlstd{(mlb11}\hlopt{$}\hlstd{runs, mlb11}\hlopt{$}\hlstd{at_bats)}
+\hlcom{# cor(mlb11$runs, mlb11$at_bats)}
+\hlkwd{require}\hlstd{(mosaic)}
+\hlkwd{cor}\hlstd{(runs} \hlopt{\mytilde} \hlstd{at_bats,} \hlkwc{data} \hlstd{= mlb11)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -181,9 +183,9 @@ \subsection*{Prediction and prediction errors}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{plot}\hlstd{(mlb11}\hlopt{$}\hlstd{runs} \hlopt{\mytilde} \hlstd{mlb11}\hlopt{$}\hlstd{at_bats)}
-
-\hlkwd{abline}\hlstd{(m1)}
+\hlcom{# plot(mlb11$runs \mytilde mlb11$at_bats) abline(m1)}
+\hlkwd{xyplot}\hlstd{(runs} \hlopt{\mytilde} \hlstd{at_bats,} \hlkwc{data} \hlstd{= mlb11)}
+\hlkwd{ladd}\hlstd{(}\hlkwd{panel.abline}\hlstd{(m1))}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -207,9 +209,9 @@ \subsection*{Model diagnostics}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{plot}\hlstd{(m1}\hlopt{$}\hlstd{residuals} \hlopt{\mytilde} \hlstd{mlb11}\hlopt{$}\hlstd{at_bats)}
-
-\hlkwd{abline}\hlstd{(}\hlkwc{h} \hlstd{=} \hlnum{0}\hlstd{,} \hlkwc{lty} \hlstd{=} \hlnum{3}\hlstd{)} \hlcom{# adds a horizontal dashed line at y = 0}
+\hlcom{# plot(m1$residuals \mytilde mlb11$at_bats) abline(h = 0, lty = 3) # adds a}
+\hlcom{# horizontal dashed line at y = 0}
+\hlkwd{xyplot}\hlstd{(}\hlkwd{residuals}\hlstd{(m1)} \hlopt{\mytilde} \hlstd{at_bats,} \hlkwc{data} \hlstd{= mlb11,} \hlkwc{type} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"p"}\hlstd{,} \hlstr{"r"}\hlstd{),} \hlkwc{lty} \hlstd{=} \hlnum{3}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -224,7 +226,8 @@ \subsection*{Model diagnostics}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{hist}\hlstd{(m1}\hlopt{$}\hlstd{residuals)}
+\hlcom{# hist(m1$residuals)}
+\hlkwd{histogram}\hlstd{(}\hlopt{\mytilde}\hlkwd{residuals}\hlstd{(m1),} \hlkwc{fit} \hlstd{=} \hlstr{"normal"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
@@ -235,9 +238,8 @@ \subsection*{Model diagnostics}
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
-\hlkwd{qqnorm}\hlstd{(m1}\hlopt{$}\hlstd{residuals)}
-
-\hlkwd{qqline}\hlstd{(m1}\hlopt{$}\hlstd{residuals)} \hlcom{# adds diagonal line to the normal prob plot}
+\hlkwd{qqnorm}\hlstd{(}\hlkwd{residuals}\hlstd{(m1))}
+\hlkwd{qqline}\hlstd{(}\hlkwd{residuals}\hlstd{(m1))} \hlcom{# adds diagonal line to the normal prob plot}
\end{alltt}
\end{kframe}
\end{knitrout}
diff --git a/oldLatex/lab8/lab8.Rnw b/oldLatex/lab8/lab8.Rnw
index da13401..0a90bee 100644
--- a/oldLatex/lab8/lab8.Rnw
+++ b/oldLatex/lab8/lab8.Rnw
@@ -78,7 +78,8 @@ Excluding \hlstd{score}, select two other variables and describe their relations
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let's create a scatterplot to see if this appears to be the case:
<>=
-plot(evals$score ~ evals$bty_avg)
+# plot(evals$score ~ evals$bty_avg)
+plot(score ~ bty_avg, data=evals)
@
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
@@ -100,8 +101,8 @@ Use residual plots to evaluate whether the conditions of least squares regressio
The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let's take a look at the relationship between one of these scores and the average beauty score.
<>=
-plot(evals$bty_avg ~ evals$bty_f1lower)
-cor(evals$bty_avg, evals$bty_f1lower)
+plot(bty_avg ~ bty_f1lower, data=evals)
+with(evals, cor(bty_avg, bty_f1lower))
@
As expected the relationship is quite strong -- after all, the average score is calculated using the individual scores. We can actually take a look at the relationships between all beauty variables (columns 13 through 19) using the following command:
diff --git a/probability/probability.html b/probability/probability.html
deleted file mode 100644
index 6af0b92..0000000
--- a/probability/probability.html
+++ /dev/null
@@ -1,204 +0,0 @@
-Probability
-
Hot Hands
-
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events (http://psych.cornell.edu/sites/default/files/Gilo.Vallone.Tversky.pdf). This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
-
We do not expect to resolve this controversy today. However, in this lab we’ll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
-
-
-
Saving your code
-
Click on File -> New -> R Script. This will open a blank document above the console. As you go along you can copy and paste your code here and save it. This is a good way to keep track of your code and be able to reuse it later. To run your code from this document you can either copy and paste it into the console, highlight the code and hit the Run button, or highlight the code and hit command+enter on a Mac or control+enter on a PC.
-
You’ll also want to save this script (code document). To do so click on the disk icon. The first time you hit save, RStudio will ask for a file name; you can name it anything you like. Once you hit save you’ll see the file appear under the Files tab in the lower right panel. You can reopen this file anytime by simply clicking on it.
-
-
-
Getting Started
-
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let’s load some data from those games and look at the first several rows.
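
The code chunk that loads the data did not survive this rendering. Assuming the data frame is named kobe, as the commands later in the lab suggest, previewing the first few rows would look something like:

head(kobe)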
In this data frame, every row records a shot taken by Kobe Bryant. If he hit the shot (made a basket), a hit, H, is recorded in the column named basket, otherwise a miss, M, is recorded.
-
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
-
For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
-
\[ \textrm{H M | M | H H M | M | M | M} \]
-
To verify this use the following command:
-
kobe$basket[1:9]
-
Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
-
-
What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?
-
-
The custom function calc_streak, which was loaded in with the data, may be used to calculate the lengths of all shooting streaks and then look at the distribution.
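
The calls themselves were lost in this rendering; a sketch of the usage suggested by the surrounding text (the object name kobe_streak is an assumption):

kobe_streak <- calc_streak(kobe$basket)
barplot(table(kobe_streak))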
Note that instead of making a histogram, we chose to make a bar plot from a table of the streak data. A bar plot is preferable here since our variable is discrete – counts – instead of continuous.
-
-
Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets?
-
-
-
-
Compared to What?
-
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands? What can we compare them to?
-
To answer these questions, let’s return to the idea of independence. Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
-
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.
-
Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,
-
\[ P(\textrm{shot 1 = H}) = 0.45 \]
-
If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, let’s say, 60%,

\[ P(\textrm{shot 2 = H} \,|\, \textrm{shot 1 = H}) = 0.60 \]

As a result of these increased probabilities, you’d expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45,

\[ P(\textrm{shot 2 = H} \,|\, \textrm{shot 1 = H}) = 0.45 \]

In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.
-
Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands? We can compare his streak lengths to someone without hot hands: an independent shooter.
-
-
-
Simulations in R
-
While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.
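
The simulation commands themselves were lost in this rendering; they would look something like the following (outcomes is the object referred to in the next paragraph):

outcomes <- c("heads", "tails")
sample(outcomes, size = 1, replace = TRUE)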
The vector outcomes can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample draws one slip from the hat and tells us if it was a head or a tail.
-
Run the second command listed above several times. Just like when flipping a coin, sometimes you’ll get a heads, sometimes you’ll get a tails, but in the long run, you’d expect to get roughly equal numbers of each.
-
If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw (the replace = TRUE argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called sim_fair_coin.
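
The corresponding command was lost here; assuming the sample call described above, it would look something like:

sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)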
To view the results of this simulation, type the name of the object and then use table to count up the number of heads and tails.
-
sim_fair_coin
-table(sim_fair_coin)
-
Since there are only two elements in outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights.
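
A sketch of what the missing command would look like, based on the description that follows:

sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8))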
prob=c(0.2, 0.8) indicates that for the two elements in the outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled “head” and 8 chips “tail”. Therefore at each draw, the probability of drawing a chip that says “head” is 20%, and “tail” is 80%.
-
-
In your simulation of flipping the unfair coin 100 times, how many flips came up heads?
-
-
In a sense, we’ve shrunk the size of the slip of paper that says “heads”, making it less likely to be drawn, and we’ve increased the size of the slip of paper saying “tails”, making it more likely to be drawn. When we simulated the fair coin, both slips of paper were the same size. This happens by default if you don’t provide a prob argument; all elements in the outcomes vector have an equal probability of being drawn.
-
If you want to learn more about sample or any other function, recall that you can always check out its help file.
-
?sample
-
-
-
Simulating the Independent Shooter
-
Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type,
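
The command was lost in this rendering; a sketch is below (sim_basket matches the name referenced later in the lab, while shot_outcomes is an assumed name for the hit/miss vector):

shot_outcomes <- c("H", "M")
sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)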
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.
-
-
What change needs to be made to the sample function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called sim_basket.
-
-
Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.
-
With the results of the simulation saved as sim_basket, we have the data necessary to compare Kobe to our independent shooter. We can look at Kobe’s data alongside our simulated data.
-
kobe$basket
-sim_basket
-
Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a hot hand.
-
-
-
-
On your own
-
-
Comparing Kobe Bryant to the Independent Shooter
-
Using calc_streak, compute the streak lengths of sim_basket.
-
-
Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player’s longest streak of baskets in 133 shots?
-
If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.
-
How does Kobe Bryant’s distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe’s shooting patterns? Explain.
-
-
-This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
diff --git a/sampling_distributions/more/ames.RData b/sampling_distributions/more/ames.RData
deleted file mode 100644
index 68d6cae..0000000
Binary files a/sampling_distributions/more/ames.RData and /dev/null differ
diff --git a/sampling_distributions/sampling_distributions.Rmd b/sampling_distributions/sampling_distributions.Rmd
deleted file mode 100644
index baa60f8..0000000
--- a/sampling_distributions/sampling_distributions.Rmd
+++ /dev/null
@@ -1,300 +0,0 @@
----
-title: 'Foundations for statistical inference - Sampling distributions'
-output:
- html_document:
- css: ../lab.css
- highlight: pygments
- theme: cerulean
- pdf_document: default
----
-
-In this lab, we investigate the ways in which the statistics from a random
-sample of data can serve as point estimates for population parameters. We're
-interested in formulating a *sampling distribution* of our estimate in order
-to learn about the properties of the estimate, such as its distribution.
-
-## The data
-
-We consider real estate data from the city of Ames, Iowa. The details of
-every real estate transaction in Ames are recorded by the City Assessor's
-office. Our particular focus for this lab will be all residential home sales
-in Ames between 2006 and 2010. This collection represents our population of
-interest. In this lab we would like to learn about these home sales by taking
-smaller samples from the full population. Let's load the data.
-
-```{r load-data, eval=FALSE}
-download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
-load("ames.RData")
-```
-
-We see that there are quite a few variables in the data set, enough to do a
-very in-depth analysis. For this lab, we'll restrict our attention to just
-two of the variables: the above ground living area of the house in square feet
-(`Gr.Liv.Area`) and the sale price (`SalePrice`). To save some effort
-throughout the lab, create two variables with short names that represent these
-two variables.
-
-```{r assign, eval=FALSE}
-area <- ames$Gr.Liv.Area
-price <- ames$SalePrice
-```
-
-Let's look at the distribution of area in our population of home sales by
-calculating a few summary statistics and making a histogram.
-
-```{r area, eval=FALSE}
-summary(area)
-hist(area)
-```
-
-1. Describe this population distribution.
-
-## The unknown sampling distribution
-
-In this lab we have access to the entire population, but this is rarely the
-case in real life. Gathering information on an entire population is often
-extremely costly or impossible. Because of this, we often take a sample of
-the population and use that to understand the properties of the population.
-
-If we are interested in estimating the mean living area in Ames based on a
-sample, we can use the following command to survey the population.
-
-```{r samp1, eval=FALSE}
-samp1 <- sample(area, 50)
-```
-
-This command collects a simple random sample of size 50 from the vector
-`area`, which is assigned to `samp1`. This is like going into the City
-Assessor's database and pulling up the files on 50 random home sales. Working
-with these 50 files would be considerably simpler than working with all 2930
-home sales.
-
-2. Describe the distribution of this sample. How does it compare to the
- distribution of the population?
-
-If we're interested in estimating the average living area in homes in Ames
-using the sample, our best single guess is the sample mean.
-
-```{r mean-samp1, eval=FALSE}
-mean(samp1)
-```
-
-Depending on which 50 homes you selected, your estimate could be a bit above
-or a bit below the true population mean of 1499.69 square feet. In general,
-though, the sample mean turns out to be a pretty good estimate of the average
-living area, and we were able to get it by sampling less than 3\% of the
-population.
-
-3. Take a second sample, also of size 50, and call it `samp2`. How does the
- mean of `samp2` compare with the mean of `samp1`? Suppose we took two
- more samples, one of size 100 and one of size 1000. Which would you think
- would provide a more accurate estimate of the population mean?
-
-Not surprisingly, every time we take another random sample, we get a different
-sample mean. It's useful to get a sense of just how much variability we
-should expect when estimating the population mean this way. The distribution
-of sample means, called the *sampling distribution*, can help us understand
-this variability. In this lab, because we have access to the population, we
-can build up the sampling distribution for the sample mean by repeating the
-above steps many times. Here we will generate 5000 samples and compute the
-sample mean of each.
-
-```{r loop, eval=FALSE}
-sample_means50 <- rep(NA, 5000)
-
-for(i in 1:5000){
- samp <- sample(area, 50)
- sample_means50[i] <- mean(samp)
- }
-
-hist(sample_means50)
-```
-
-If you would like to adjust the bin width of your histogram to show a little
-more detail, you can do so by changing the `breaks` argument.
-
-```{r hist-breaks, eval=FALSE}
-hist(sample_means50, breaks = 25)
-```
-
-Here we use R to take 5000 samples of size 50 from the population, calculate
-the mean of each sample, and store each result in a vector called
-`sample_means50`. On the next page, we'll review how this set of code works.
-
-4. How many elements are there in `sample_means50`? Describe the sampling
- distribution, and be sure to specifically note its center. Would you
- expect the distribution to change if we instead collected 50,000 sample
- means?
-
-## Interlude: The `for` loop
-
-Let's take a break from the statistics for a moment to let that last block of
-code sink in. You have just run your first `for` loop, a cornerstone of
-computer programming. The idea behind the for loop is *iteration*: it allows
-you to execute code as many times as you want without having to type out every
-iteration. In the case above, we wanted to iterate the two lines of code
-inside the curly braces that take a random sample of size 50 from `area` then
-save the mean of that sample into the `sample_means50` vector. Without the
-`for` loop, this would be painful:
-
-```{r loop-long, eval=FALSE}
-sample_means50 <- rep(NA, 5000)
-
-samp <- sample(area, 50)
-sample_means50[1] <- mean(samp)
-
-samp <- sample(area, 50)
-sample_means50[2] <- mean(samp)
-
-samp <- sample(area, 50)
-sample_means50[3] <- mean(samp)
-
-samp <- sample(area, 50)
-sample_means50[4] <- mean(samp)
-```
-
-and so on...
-
-With the for loop, these thousands of lines of code are compressed into a
-handful of lines. We've added one extra line to the code below, which prints
-the variable `i` during each iteration of the `for` loop. Run this code.
-
-```{r loop-again, eval=FALSE}
-sample_means50 <- rep(NA, 5000)
-
-for(i in 1:5000){
- samp <- sample(area, 50)
- sample_means50[i] <- mean(samp)
- print(i)
- }
-```
-
-Let's consider this code line by line to figure out what it does. In the
-first line we *initialized a vector*. In this case, we created a vector of
-5000 empty (`NA`) values called `sample_means50`. This vector will store values
-generated within the `for` loop.
-
-The second line calls the `for` loop itself. The syntax can be loosely read as,
-"for every element `i` from 1 to 5000, run the following lines of code". You
-can think of `i` as the counter that keeps track of which loop you're on.
-Therefore, more precisely, the loop will run once when `i = 1`, then once when
-`i = 2`, and so on up to `i = 5000`.
-
-The body of the `for` loop is the part inside the curly braces, and this set of
-code is run for each value of `i`. Here, on every loop, we take a random
-sample of size 50 from `area`, take its mean, and store it as the
-$i$th element of `sample_means50`.
-
-In order to display that this is really happening, we asked R to print `i` at
-each iteration. This line of code is optional and is only used for displaying
-what's going on while the `for` loop is running.
-
-The `for` loop allows us to not just run the code 5000 times, but to neatly
-package the results, element by element, into the empty vector that we
-initialized at the outset.
-
-5. To make sure you understand what you've done in this loop, try running a
- smaller version. Initialize a vector of 100 zeros called
- `sample_means_small`. Run a loop that takes a sample of size 50 from
- `area` and stores the sample mean in `sample_means_small`, but only
- iterate from 1 to 100. Print the output to your screen (type
- `sample_means_small` into the console and press enter). How many elements
- are there in this object called `sample_means_small`? What does each
- element represent?
-
-## Sample size and the sampling distribution
-
-Mechanics aside, let's return to the reason we used a `for` loop: to compute a
-sampling distribution, specifically, this one.
-
-```{r hist, eval=FALSE}
-hist(sample_means50)
-```
-
-The sampling distribution that we computed tells us much about estimating
-the average living area in homes in Ames. Because the sample mean is an
-unbiased estimator, the sampling distribution is centered at the true average
-living area of the population, and the spread of the distribution
-indicates how much variability is induced by sampling only 50 home sales.
-
-To get a sense of the effect that sample size has on our distribution, let's
-build up two more sampling distributions: one based on a sample size of 10 and
-another based on a sample size of 100.
-
-```{r samp-10-100, eval=FALSE}
-sample_means10 <- rep(NA, 5000)
-sample_means100 <- rep(NA, 5000)
-
-for(i in 1:5000){
- samp <- sample(area, 10)
- sample_means10[i] <- mean(samp)
- samp <- sample(area, 100)
- sample_means100[i] <- mean(samp)
-}
-```
-
-Here we're able to use a single `for` loop to build two distributions by adding
-additional lines inside the curly braces. Don't worry about the fact that
-`samp` is used for the name of two different objects. In the second command
-of the `for` loop, the mean of `samp` is saved to the relevant place in the
-vector `sample_means10`. With the mean saved, we're now free to overwrite the
-object `samp` with a new sample, this time of size 100. In general, anytime
-you create an object using a name that is already in use, the old object will
-get replaced with the new one.
-
-To see the effect that different sample sizes have on the sampling
-distribution, plot the three distributions on top of one another.
-
-```{r plot-samps, eval=FALSE, tidy = FALSE}
-par(mfrow = c(3, 1))
-
-xlimits <- range(sample_means10)
-
-hist(sample_means10, breaks = 20, xlim = xlimits)
-hist(sample_means50, breaks = 20, xlim = xlimits)
-hist(sample_means100, breaks = 20, xlim = xlimits)
-```
-
-The first command specifies that you'd like to divide the plotting area into 3
-rows and 1 column of plots (to return to the default setting of plotting one
-at a time, use `par(mfrow = c(1, 1))`). The `breaks` argument specifies the
-number of bins used in constructing the histogram. The `xlim` argument
-specifies the range of the x-axis of the histogram, and by setting it equal
-to `xlimits` for each histogram, we ensure that all three histograms will be
-plotted with the same limits on the x-axis.
-
-6. When the sample size is larger, what happens to the center? What about the spread?
-
-* * *
-## On your own
-
-So far, we have only focused on estimating the mean living area in homes in
-Ames. Now you'll try to estimate the mean home price.
-
-- Take a random sample of size 50 from `price`. Using this sample, what is
- your best point estimate of the population mean?
-
-- Since you have access to the population, simulate the sampling
- distribution for $\bar{x}_{price}$ by taking 5000 samples from the
- population of size 50 and computing 5000 sample means. Store these means
- in a vector called `sample_means50`. Plot the data, then describe the
- shape of this sampling distribution. Based on this sampling distribution,
- what would you guess the mean home price of the population to be? Finally,
- calculate and report the population mean.
-
-- Change your sample size from 50 to 150, then compute the sampling
- distribution using the same method as above, and store these means in a
- new vector called `sample_means150`. Describe the shape of this sampling
- distribution, and compare it to the sampling distribution for a sample
- size of 50. Based on this sampling distribution, what would you guess to
- be the mean sale price of homes in Ames?
-
-- Of the sampling distributions from 2 and 3, which has a smaller spread? If
- we're concerned with making estimates that are more often close to the
- true value, would we prefer a distribution with a large or small spread?
-
-
-This is a product of OpenIntro that is released under a [Creative Commons
-Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
-This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
-
\ No newline at end of file
diff --git a/simple_regression/simple_regression.html b/simple_regression/simple_regression.html
deleted file mode 100644
index d361996..0000000
--- a/simple_regression/simple_regression.html
+++ /dev/null
@@ -1,201 +0,0 @@
-Introduction to linear regression
-
Batter up
-
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
-
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.
-
-
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
-
-
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
-
cor(mlb11$runs, mlb11$at_bats)
-
-
-
Sum of squared residuals
-
Think back to the way that we described the distribution of a single variable. Recall that we discussed characteristics such as center, spread, and shape. It’s also useful to be able to describe the relationship of two numerical variables, such as runs and at_bats above.
-
-
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
-
-
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
-
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
-
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:
-
\[
- e_i = y_i - \hat{y}_i
-\]
-
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
-
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
-
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
-
-
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
-
-
-
-
The linear model
-
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
-
m1 <- lm(runs ~ at_bats, data = mlb11)
-
The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
-
The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
-
summary(m1)
-
Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:
-
\[
- \hat{y} = -2789.2429 + 0.6305 \times atbats
-\]
-
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
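
If you prefer to pull the \(R^2\) value out of the fitted model directly rather than reading it off the printed summary, it can be extracted from the summary object (a small aside, not part of the original lab):

summary(m1)$r.squared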
-
-
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
-
-
-
-
Prediction and prediction errors
-
Let’s create a scatterplot with the least squares line laid on top.
-
plot(mlb11$runs ~ mlb11$at_bats)
-abline(m1)
-
The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
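
As an aside (not part of the original lab), the slope and intercept that abline uses can be pulled straight out of the fitted model, and predict can compute the fitted value for any chosen number of at-bats; a sketch, where 5600 is just an arbitrary illustration:

coef(m1)
predict(m1, newdata = data.frame(at_bats = 5600))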
-
-
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
-
-
-
-
Model diagnostics
-
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
-
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.
-
plot(m1$residuals ~ mlb11$at_bats)
-abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
-
-
Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
-
-
Nearly normal residuals: To check this condition, we can look at a histogram
-
hist(m1$residuals)
-
or a normal probability plot of the residuals.
-
qqnorm(m1$residuals)
-qqline(m1$residuals) # adds diagonal line to the normal prob plot
-
-
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
-
-
Constant variability:
-
-
Based on the plot in (1), does the constant variability condition appear to be met?
-
-
-
-
-
On Your Own
-
-
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
-
How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
-
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
-
Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
-
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
-
-
-This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
-