diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..48cbb4b --- /dev/null +++ b/.gitignore @@ -0,0 +1,11 @@ +.html +.Rproj.user +.Rhistory +.RData +*.Rproj +.DS_Store +news.html +README.html +*key.Rmd +*key.html + diff --git a/.gitignore.orig b/.gitignore.orig new file mode 100644 index 0000000..1d59b3b --- /dev/null +++ b/.gitignore.orig @@ -0,0 +1,12 @@ +.Rproj.user +.Rhistory +.RData +*.Rproj +.DS_Store +<<<<<<< HEAD +*.html +*.orig +======= +news.html +README.html +>>>>>>> mydplyr/master diff --git a/01_intro_to_r/intro_to_r.Rmd b/01_intro_to_r/intro_to_r.Rmd new file mode 100644 index 0000000..a33fd31 --- /dev/null +++ b/01_intro_to_r/intro_to_r.Rmd @@ -0,0 +1,450 @@ +--- +title: "Introduction to R and RStudio" +output: + html_document: + css: ../lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true +--- + +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +``` + + +## The RStudio Interface + +The goal of this lab is to introduce you to R and RStudio, which you'll be using +throughout the course both to learn the statistical concepts discussed in the +course and to analyze real data and come to informed conclusions. To clarify +which is which: R is the name of the programming language itself and RStudio +is a convenient interface. + +As the labs progress, you are encouraged to explore beyond what the labs dictate; +a willingness to experiment will make you a much better programmer. Before we +get to that stage, however, you need to build some basic fluency in R. Today we +begin with the fundamental building blocks of R and RStudio: the interface, +reading in data, and basic commands. + +Go ahead and launch RStudio. You should see a window that looks like the image +shown below. + +![](more/r-interface-2016.png) +
+ + +The panel on the lower left is where the action happens. It's called the *console*. +Everytime you launch RStudio, it will have the same text at the top of the +console telling you the version of R that you're running. Below that information +is the *prompt*. As its name suggests, this prompt is really a request: a +request for a command. Initially, interacting with R is all about typing commands +and interpreting the output. These commands and their syntax have evolved over +decades (literally) and now provide what many users feel is a fairly natural way +to access data and organize, describe, and invoke statistical computations. + +The panel in the upper right contains your *workspace* as well as a history of +the commands that you've previously entered. + +Any plots that you generate will show up in the panel in the lower right corner. +This is also where you can browse your files, access help, manage packages, etc. + +### R Packages + +R is an open-source programming language, meaning that users can contribute +packages that make our lives easier, and we can use them for free. For this lab, +and many others in the future, we will use the following R packages: + +- `dplyr`: for data wrangling +- `ggplot2`: for data visualization +- `oilabs`: for data and custom functions with the OpenIntro labs + +If these packages are not already available in your R environment, +install them by typing the following three lines of code into +the console of your RStudio session, pressing the enter/return key after each one. +Note that you can check to see which packages (and which versions) are installed by +inspecting the *Packages* tab in the lower right panel of RStudio. + +```{r install-packages, message = FALSE, eval=FALSE} +install.packages("dplyr") +install.packages("ggplot2") +install.packages("oilabs") +``` + +You may need to select a server from which to download; any of them will work. +Next, you need to load these packages in your working environment. We do this with +the `library` function. Run the following three lines in your console. + +```{r load-packages, message = FALSE, eval=TRUE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +Note that you only need to *install* packages once, but you need to *load* +them each time you relaunch RStudio. + + +### Creating a reproducible lab report + +We will be using R Markdown to create reproducible lab reports. See the +following videos describing why and how: + +[**Why use R Markdown for Lab Reports?**](https://youtu.be/lNWVQ2oxNho) + + +[**Using R Markdown for Lab Reports in RStudio**](https://youtu.be/o0h-eVABe9M) + + +Going forward you should refrain from typing your code directly in the console, and +instead type any code (final correct answer, or anything you're just trying out) in +the R Markdown file and run the chunk using either the Run button on the chunk +(green sideways triangle) or by highlighting the code and clicking Run on the top +right corner of the R Markdown editor. If at any point you need to start over, you +can Run All Chunks above the chunk you're working in by clicking on the down +arrow in the code chunk. + +## Dr. Arbuthnot's Baptism Records + +To get you started, run the following command to load the data. 
+ +```{r load-abrbuthnot-data, eval=TRUE} +data(arbuthnot) +``` + +You can do this by + +- clicking on the green arrow at the top right of the code chunk in the R Markdown (Rmd) +file, or +- putting your cursor on this line, and hit the **Run** button on the upper right +corner of the pane, or +- hitting `Ctrl-Shift-Enter`, or +- typing the code in the console. + +This command instructs R to load some data: +the Arbuthnot baptism counts for boys and girls. You should see that the +workspace area in the upper righthand corner of the RStudio window now lists a +data set called `arbuthnot` that has 82 observations on 3 variables. As you +interact with R, you will create a series of objects. Sometimes you load them as +we have done here, and sometimes you create them yourself as the byproduct of a +computation or some analysis you have performed. + +The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century +physician, writer, and mathematician. He was interested in the ratio of newborn +boys to newborn girls, so he gathered the baptism records for children born in +London for every year from 1629 to 1710. We can view the data by +typing its name into the console. + +```{r view-data} +arbuthnot +``` + +However printing the whole dataset in the console is not that useful. +One advantage of RStudio is that it comes with a built-in data viewer. Click on +the name `arbuthnot` in the *Environment* pane (upper right window) that lists +the objects in your workspace. This will bring up an alternative display of the +data set in the *Data Viewer* (upper left window). You can close the data viewer +by clicking on the `x` in the upper lefthand corner. + +What you should see are four columns of numbers, each row representing a +different year: the first entry in each row is simply the row number (an index +we can use to access the data from individual years if we want), the second is +the year, and the third and fourth are the numbers of boys and girls baptized +that year, respectively. Use the scrollbar on the right side of the console +window to examine the complete data set. + +Note that the row numbers in the first column are not part of Arbuthnot's data. +R adds them as part of its printout to help you make visual comparisons. You can +think of them as the index that you see on the left side of a spreadsheet. In +fact, the comparison to a spreadsheet will generally be helpful. R has stored +Arbuthnot's data in a kind of spreadsheet or table called a *data frame*. + +You can see the dimensions of this data frame as well as the names of the variables and the first few observations by typing: + +```{r glimpse-data} +glimpse(arbuthnot) +``` + +This command should output the following + +```{r glimpse-data-result, echo=FALSE, eval=TRUE} +glimpse(arbuthnot) +``` + +We can see that there are 82 observations and 3 variables in this dataset. The variable names are `year`, `boys`, and `girls`. At this point, you might notice +that many of the commands in R look a lot like functions from math class; that +is, invoking R commands means supplying a function with some number of arguments. +The `glimpse` command, for example, took a single argument, the name of a data frame. + +## Some Exploration + +Let's start to examine the data a little more closely. We can access the data in +a single column of a data frame separately using a command like + +```{r view-boys} +arbuthnot$boys +``` + +This command will only show the number of boys baptized each year. 
The dollar +sign basically says "go to the data frame that comes before me, and find the +variable that comes after me". + +1. What command would you use to extract just the counts of girls baptized? Try + it! + +Notice that the way R has printed these data is different. When we looked at the +complete data frame, we saw 82 rows, one on each line of the display. These data +are no longer structured in a table with other variables, so they are displayed +one right after another. Objects that print out in this way are called *vectors*; +they represent a set of numbers. R has added numbers in [brackets] along the left +side of the printout to indicate locations within the vector. For example, 5218 +follows [1], indicating that 5218 is the first entry in the vector. And if [43] +starts a line, then that would mean the first number on that line would represent +the 43rd entry in the vector. + + +### Data visualization + +R has some powerful functions for making graphics. We can create a simple plot +of the number of girls baptized per year with the command + +```{r plot-girls-vs-year} +qplot(x = year, y = girls, data = arbuthnot) +``` + +The `qplot()` function (meaning "quick plot") considers the type of data you have +provided it and makes the decision to visualize it with a scatterplot. The plot +should appear under the *Plots* tab of the lower right panel of RStudio. Notice +that the command above again looks like a function, this time with three arguments +separated by commas. The first two arguments in the `qplot()` function specify +the variables for the x-axis and the y-axis and the third provides the name of the +data set where they can be found. If we wanted to connect the data points with +lines, we could add a fourth argument to specify the geometry that we'd like. + +```{r plot-girls-vs-year-line} +qplot(x = year, y = girls, data = arbuthnot, geom = "line") +``` + +You might wonder how you are supposed to know that it was possible to add that +fourth argument. Thankfully, R documents all of its functions extensively. To +read what a function does and learn the arguments that are available to you, +just type in a question mark followed by the name of the function that you're +interested in. Try the following. + +```{r plot-help, tidy = FALSE} +?qplot +``` + +Notice that the help file replaces the plot in the lower right panel. You can +toggle between plots and help files using the tabs at the top of that panel. + +2. Is there an apparent trend in the number of girls baptized over the years? +How would you describe it? (To ensure that your lab report is comprehensive, +be sure to include the code needed to make the plot as well as your written +interpretation.) + +### R as a big calculator + +Now, suppose we want to plot the total number of baptisms. To compute this, we +could use the fact that R is really just a big calculator. We can type in +mathematical expressions like + +```{r calc-total-bapt-numbers} +5218 + 4683 +``` + +to see the total number of baptisms in 1629. We could repeat this once for each +year, but there is a faster way. If we add the vector for baptisms for boys to +that of girls, R will compute all sums simultaneously. + +```{r calc-total-bapt-vars} +arbuthnot$boys + arbuthnot$girls +``` + +What you will see are 82 numbers (in that packed display, because we aren’t +looking at a data frame here), each one representing the sum we’re after. Take a +look at a few of them and verify that they are right. 
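If you would rather not scan the whole printout, a quick optional check is to look
at just the first few sums and compare the first one to the 1629 total computed
above. This is only a sketch (the chunk label is ours, and it is not required for
the lab):

```{r check-first-totals}
head(arbuthnot$boys + arbuthnot$girls)   # first six yearly totals
arbuthnot$boys[1] + arbuthnot$girls[1]   # should match 5218 + 4683 from above
```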
+ +### Adding a new variable to the data frame + +We'll be using this new vector to generate some plots, so we'll want to save it +as a permanent column in our data frame. + +```{r calc-total-bapt-vars-save} +arbuthnot <- arbuthnot %>% + mutate(total = boys + girls) +``` + +The `%>%` operator is called the **piping** +operator. It takes the output of the previous expression and pipes it into +the first argument of the function in the following one. +To continue our analogy with mathematical functions, `x %>% f(y)` is +equivalent to `f(x, y)`. + +
+**A note on piping: ** Note that we can read these three lines of code as the following: + +*"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function. +Mutate the `arbuthnot` data set by creating a new variable called `total` that is the sum of the variables +called `boys` and `girls`. Then assign the resulting dataset to the object +called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one +containing the new variable."* + +This is equivalent to going through each row and adding up the `boys` +and `girls` counts for that year and recording that value in a new column called +`total`. +
+ +
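If the `x %>% f(y)` analogy feels abstract, here is a tiny optional illustration
using base R's `sum()` function; both lines should return the same value. This
example is ours and is not part of the lab exercises:

```{r pipe-equivalence}
sum(arbuthnot$boys)        # ordinary function call
arbuthnot$boys %>% sum()   # the same call written with the pipe
```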
+**Where is the new variable? ** When you make changes to variables in your dataset, +click on the name of the dataset again to update it in the data viewer. +
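If you prefer to check from the console instead, one option (not required) is to
run `glimpse()` again and look for the new column:

```{r glimpse-total}
glimpse(arbuthnot)   # the new total column should now be listed
```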
+ +You'll see that there is now a new column called `total` that has been tacked on +to the data frame. The special symbol `<-` performs an *assignment*, taking the +output of one line of code and saving it into an object in your workspace. In +this case, you already have an object called `arbuthnot`, so this command updates +that data set with the new mutated column. + +We can make a plot of the total number of baptisms per year with the command + +```{r plot-total-vs-year} +qplot(x = year, y = total, data = arbuthnot, geom = "line") +``` + +Similarly to how we computed the total number of births, we can compute the ratio +of the number of boys to the number of girls baptized in 1629 with + +```{r calc-prop-boys-to-girls-numbers} +5218 / 4683 +``` + +or we can act on the complete columns with the expression + +```{r calc-prop-boys-to-girls-vars} +arbuthnot <- arbuthnot %>% + mutate(boy_to_girl_ratio = boys / girls) +``` + +We can also compute the proportion of newborns that are boys in 1629 + +```{r calc-prop-boys-numbers} +5218 / (5218 + 4683) +``` + +or this may also be computed for all years simultaneously and append it to the dataset: + +```{r calc-prop-boys-vars} +arbuthnot <- arbuthnot %>% + mutate(boy_ratio = boys / total) +``` + +Note that we are using the new `total` variable we created earlier in our calculations. + +3. Now, generate a plot of the proportion of boys born over time. What do you see? + +
+**Tip: ** If you use the up and down arrow keys, you can scroll through your +previous commands, your so-called command history. You can also access it +by clicking on the history tab in the upper right panel. This will save +you a lot of typing in the future. +
Finally, in addition to simple mathematical operators like subtraction and
division, you can ask R to make comparisons like greater than, `>`, less than,
`<`, and equality, `==`. For example, we can ask if boys outnumber girls in each
year with the expression

```{r boys-more-than-girls}
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)
```

This command adds a new variable to the `arbuthnot` data frame containing the values
of either `TRUE` if that year had more boys than girls, or `FALSE` if that year
did not (the answer may surprise you). This variable contains a different kind of
data than we have encountered so far. All other columns in the `arbuthnot` data
frame have values that are numerical (the year, the number of boys and girls). Here,
we've asked R to create *logical* data, data where the values are either `TRUE`
or `FALSE`. In general, data analysis will involve many different kinds of data
types, and one reason for using R is that it is able to represent and compute
with many of them.

* * *

## More Practice

In the previous few pages, you recreated some of the displays and preliminary
analysis of Arbuthnot's baptism data. Your assignment involves repeating these
steps, but for present day birth records in the United States. Load the
present day data with the following command.

```{r load-present-data}
data(present)
```

The data are stored in a data frame called `present`.

4. What years are included in this data set? What are the dimensions of the
   data frame? What are the variable (column) names?

5. How do these counts compare to Arbuthnot's? Are they of a similar magnitude?

6. Make a plot that displays the proportion of boys born over time. What do you see?
   Does Arbuthnot's observation about boys being born in greater proportion than girls
   hold up in the U.S.? Include the plot in your response. *Hint:* You should be
   able to reuse your code from Ex 3 above, just replace the data frame name.

7. In what year did we see the highest total number of births in the U.S.? *Hint:*
   First calculate the totals and save them as a new variable. Then, sort your
   dataset in descending order based on the total column. You can do this
   interactively in the data viewer by clicking on the arrows next to the
   variable names. To include the sorted result in your report you will need
   two new functions: `arrange` (for sorting) and `desc` (for descending order).
   Sample code is provided below.

```{r eval=FALSE}
present %>%
  arrange(desc(total))
```

These data come from reports by the Centers for Disease Control. You can learn more about them
by bringing up the help file using the command `?present`.
+This is a product of OpenIntro that is released under a +[Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). +This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel +from a lab written by Mark Hansen of UCLA Statistics. +
+ +* * * + +## Resources for learning R and working in RStudio + +That was a short introduction to R and RStudio, but we will provide you with more +functions and a more complete sense of the language as the course progresses. + +In this course we will be using R packages called `dplyr` for data wrangling +and `ggplot2` for data visualization. If you are googling for R code, make sure +to also include these package names in your search query. For example, instead +of googling "scatterplot in R", google "scatterplot in R with ggplot2". + +These cheatsheets may come in handy throughout the semester: + +- [RMarkdown cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) +- [Data wrangling cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) +- [Data visualization cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf) + +Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown +[here](https://ismayc.github.io/rbasics-book). It includes examples showing working with R Markdown files +in RStudio recorded as GIFs. + +Note that some of the code on these cheatsheets may be too advanced for this course, +however majority of it will become useful throughout the semester. diff --git a/01_intro_to_r/intro_to_r.html b/01_intro_to_r/intro_to_r.html new file mode 100644 index 0000000..b8dc721 --- /dev/null +++ b/01_intro_to_r/intro_to_r.html @@ -0,0 +1,447 @@ + + + + + + + + + + + + + +Introduction to R and RStudio + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[rendered HTML output of 01_intro_to_r/intro_to_r.Rmd omitted; its text duplicates the R Markdown source above]
+ + + + + + + + diff --git a/intro_to_r/more/arbuthnot-readme.txt b/01_intro_to_r/more/arbuthnot-readme.txt similarity index 100% rename from intro_to_r/more/arbuthnot-readme.txt rename to 01_intro_to_r/more/arbuthnot-readme.txt diff --git a/intro_to_r/more/arbuthnot.r b/01_intro_to_r/more/arbuthnot.r similarity index 100% rename from intro_to_r/more/arbuthnot.r rename to 01_intro_to_r/more/arbuthnot.r diff --git a/intro_to_r/more/present-readme.txt b/01_intro_to_r/more/present-readme.txt similarity index 100% rename from intro_to_r/more/present-readme.txt rename to 01_intro_to_r/more/present-readme.txt diff --git a/intro_to_r/more/present-reference.pdf b/01_intro_to_r/more/present-reference.pdf similarity index 100% rename from intro_to_r/more/present-reference.pdf rename to 01_intro_to_r/more/present-reference.pdf diff --git a/intro_to_r/more/present.R b/01_intro_to_r/more/present.R similarity index 100% rename from intro_to_r/more/present.R rename to 01_intro_to_r/more/present.R diff --git a/01_intro_to_r/more/r-interface-2016.png b/01_intro_to_r/more/r-interface-2016.png new file mode 100644 index 0000000..73f51c2 Binary files /dev/null and b/01_intro_to_r/more/r-interface-2016.png differ diff --git a/01_intro_to_r/more/rmd-from-template.png b/01_intro_to_r/more/rmd-from-template.png new file mode 100644 index 0000000..8abb51f Binary files /dev/null and b/01_intro_to_r/more/rmd-from-template.png differ diff --git a/02_intro_to_data/intro_to_data.Rmd b/02_intro_to_data/intro_to_data.Rmd new file mode 100644 index 0000000..d2fe430 --- /dev/null +++ b/02_intro_to_data/intro_to_data.Rmd @@ -0,0 +1,369 @@ +--- +title: "Introduction to data" +output: + html_document: + theme: cerulean + highlight: pygments + css: ../lab.css + toc: true + toc_float: true +--- + +```{r global-options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +Some define statistics as the field that focuses on turning information into +knowledge. The first step in that process is to summarize and describe the raw +information -- the data. In this lab we explore flights, specifically a random +sample of domestic flights that departed from the three major +New York City airport in 2013. We will generate simple graphical and numerical +summaries of data on these flights and explore delay times. As this is a large +data set, along the way you'll also learn the indispensable skills of data +processing and subsetting. + + +## Getting started + +### Load packages + +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for OpenIntro labs, `oilabs`. + +Let's load the packages. + +```{r load-packages, message=FALSE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report + +Remember that we will be using R Markdown to create reproducible lab reports. +See the following video describing how to get started with creating these +reports for this lab, and all future labs: + +[**Basic R Markdown with an OpenIntro Lab**](https://www.youtube.com/watch?v=Pdc368lS2hk) + + +### The data + +The [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) +(BTS) is a statistical agency that is a part of the Research and Innovative +Technology Administration (RITA). As its name implies, BTS collects and makes +available transportation data, such as the flights data we will be working with +in this lab. 
+ +We begin by loading the `nycflights` data frame. Type the following in your console +to load the data: + +```{r load-data, eval = TRUE} +data(nycflights) +``` + +The data set `nycflights` that shows up in your workspace is a *data matrix*, +with each row representing an *observation* and each column representing a +*variable*. R calls this data format a **data frame**, which is a term that will +be used throughout the labs. For this data set, each *observation* is a single flight. + +To view the names of the variables, type the command + +```{r names} +names(nycflights) +``` + +This returns the names of the variables in this data frame. The **codebook** +(description of the variables) can be accessed by pulling up the help file: + +```{r} +?nycflights +``` + +One of the variables refers to the carrier (i.e. airline) of the flight, which +is coded according to the following system. + +- `carrier`: Two letter carrier abbreviation. + + `9E`: Endeavor Air Inc. + + `AA`: American Airlines Inc. + + `AS`: Alaska Airlines Inc. + + `B6`: JetBlue Airways + + `DL`: Delta Air Lines Inc. + + `EV`: ExpressJet Airlines Inc. + + `F9`: Frontier Airlines Inc. + + `FL`: AirTran Airways Corporation + + `HA`: Hawaiian Airlines Inc. + + `MQ`: Envoy Air + + `OO`: SkyWest Airlines Inc. + + `UA`: United Air Lines Inc. + + `US`: US Airways Inc. + + `VX`: Virgin America + + `WN`: Southwest Airlines Co. + + `YV`: Mesa Airlines Inc. + + +A very useful function for taking a quick peek at your data frame and viewing +its dimensions and data types is `str`, which stands for **str**ucture. + +```{r str} +str(nycflights) +``` + +The `nycflights` data frame is a massive trove of information. Let's think about +some questions we might want to answer with these data: + +- How delayed were flights that were headed to Los Angeles? +- How do departure delays vary over months? +- Which of the three major NYC airports has a better on time percentage for +departing flights? + + +## Analysis + +### Lab report +To record your analysis in a reproducible format, you can adapt the general Lab +Report template from the `oilabs` package. Watch the video above to learn how. + +### Departure delays + +Let's start by examing the distribution of departure delays of all flights with a +histogram. + +```{r hist-dep-delay} +qplot(x = dep_delay, data = nycflights, geom = "histogram") +``` + +This function says to plot the `dep_delay` variable from the `nycflights` data +frame on the x-axis. It also defines a `geom` (short for geometric object), +which describes the type of plot you will produce. + +Histograms are generally a very good way to see the shape of a single +distribution of numerical data, but that shape can change depending on how the +data is split between the different bins. You can easily define the binwidth you +want to use: + +```{r hist-dep-delay-bins} +qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 15) +qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 150) +``` + +1. Look carefully at these three histograms. How do they compare? Are features +revealed in one that are obscured in another? + +If we want to focus only on departure delays of flights headed to Los Angeles, +we need to first `filter` the data for flights with that destination (`dest == "LAX"`) +and then make a histogram of the departure delays of only those flights. 
+ +```{r lax-flights-hist} +lax_flights <- nycflights %>% + filter(dest == "LAX") +qplot(x = dep_delay, data = lax_flights, geom = "histogram") +``` + +Let's decipher these two commands (OK, so it might look like three lines, but +the first two physical lines of code are actually part of the same command. It's +common to add a break to a new line after `%>%` to help readability). + +- Command 1: Take the `nycflights` data frame, `filter` for flights headed to LAX, and +save the result as a new data frame called `lax_flights`. + + `==` means "if it's equal to". + + `LAX` is in quotation marks since it is a character string. +- Command 2: Basically the same `qplot` call from earlier for making a histogram, +except that it uses the smaller data frame for flights headed to LAX instead of all +flights. + +
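If it helps, the same filtering step can be written without the pipe. This optional
sketch simply applies the `x %>% f(y)` is `f(x, y)` rule from the previous lab:

```{r lax-flights-no-pipe}
lax_flights <- filter(nycflights, dest == "LAX")   # equivalent to the piped version above
```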
+**Logical operators: ** Filtering for certain observations (e.g. flights from a +particular airport) is often of interest in data frames where we might want to +examine observations with certain characteristics separately from the rest of +the data. To do so we use the `filter` function and a series of +**logical operators**. The most commonly used logical operators for data +analysis are as follows: + +- `==` means "equal to" +- `!=` means "not equal to" +- `>` or `<` means "greater than" or "less than" +- `>=` or `<=` means "greater than or equal to" or "less than or equal to" +
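As a small illustration (not one of the exercises), these operators can be combined
inside `filter()`; the object name below is just an example:

```{r filter-operators-example}
# flights that were not headed to LAX and that left more than an hour late
not_lax_very_late <- nycflights %>%
  filter(dest != "LAX", dep_delay > 60)
```

The commands that follow return to the `lax_flights` data frame created above.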
We can also obtain numerical summaries for these flights:

```{r lax-flights-summ}
lax_flights %>%
  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
```

Note that in the `summarise` function we created a list of three different
numerical summaries that we were interested in. The names of these elements are
user defined, like `mean_dd`, `median_dd`, and `n`, and you could customize these names
as you like (just don't use spaces in your names). Calculating these summary
statistics also requires that you know the function calls. Note that `n()` reports
the sample size.
**Summary statistics:** Some useful function calls for summary statistics for a
single numerical variable are as follows:

- `mean`
- `median`
- `sd`
- `var`
- `IQR`
- `min`
- `max`

Note that each of these functions takes a single vector as an argument and
returns a single value.
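Any of these can be dropped into `summarise()` in the same way. This is an optional
sketch, and the summary names are made up:

```{r lax-flights-more-summ}
lax_flights %>%
  summarise(sd_dd = sd(dep_delay), iqr_dd = IQR(dep_delay), max_dd = max(dep_delay))
```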
We can also filter based on multiple criteria. Suppose we are interested in
flights headed to San Francisco (SFO) in February:

```{r}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
```

Note that we can separate the conditions using commas if we want flights that
are both headed to SFO **and** in February. If we are interested in either
flights headed to SFO **or** in February we can use the `|` instead of the comma.

1. Create a new data frame that includes flights headed to SFO in February,
   and save this data frame as `sfo_feb_flights`. How many flights
   meet these criteria?

1. Describe the distribution of the **arrival** delays of these flights using a
   histogram and appropriate summary statistics. **Hint:** The summary
   statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary
statistics for various groups in your data frame. For example, we can modify the
above command using the `group_by` function to get the same summary stats for
each origin airport:

```{r summary-custom-list-origin}
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
```

Here, we first grouped the data by `origin`, and then calculated the summary
statistics.

1. Calculate the median and interquartile range for `arr_delay`s of flights in
   the `sfo_feb_flights` data frame, grouped by carrier. Which carrier
   has the most variable arrival delays?

### Departure delays over months

Which month would you expect to have the highest average delay departing from an
NYC airport?

Let's think about how we would answer this question:

- First, calculate monthly averages for departure delays. With the new language
we are learning, we need to
  + `group_by` months, then
  + `summarise` mean departure delays.
- Then, we need to `arrange` these average delays in `desc`ending order

```{r mean-dep-delay-months}
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
```

1. Suppose you really dislike departure delays, and you want to schedule
   your travel in a month that minimizes your potential departure delay leaving
   NYC. One option is to choose the month with the lowest mean departure delay.
   Another option is to choose the month with the lowest median departure delay.
   What are the pros and cons of these two choices?

### On time departure rate for NYC airports

Suppose you will be flying out of NYC and want to know which of the
three major NYC airports has the best on time departure rate of departing flights.
Suppose also that for you a flight that is delayed for less than 5 minutes is
basically "on time". You consider any flight delayed for 5 minutes or more to be
"delayed".

In order to determine which airport has the best on time departure rate,
we need to

- first classify each flight as "on time" or "delayed",
- then group flights by origin airport,
- then calculate on time departure rates for each origin airport,
- and finally arrange the airports in descending order for on time departure
percentage.

Let's start with classifying each flight as "on time" or "delayed" by
creating a new variable with the `mutate` function.
+ +```{r dep-type} +nycflights <- nycflights %>% + mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) +``` + +The first argument in the `mutate` function is the name of the new variable +we want to create, in this case `dep_type`. Then if `dep_delay < 5` we classify +the flight as `"on time"` and `"delayed"` if not, i.e. if the flight is delayed +for 5 or more minutes. + +Note that we are also overwriting the `nycflights` data frame with the new +version of this data frame that includes the new `dep_type` variable. + +We can handle all the remaining steps in one code chunk: + +```{r} +nycflights %>% + group_by(origin) %>% + summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>% + arrange(desc(ot_dep_rate)) +``` + +1. If you were selecting an airport simply based on on time departure + percentage, which NYC airport would you choose to fly out of? + +We can also visualize the distribution of on on time departure rate across +the three airports using a segmented bar plot. + +```{r} +qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar") +``` + +* * * + +## More Practice + +1. Mutate the data frame so that it includes a new variable that contains the + average speed, `avg_speed` traveled by the plane for each flight (in mph). + **Hint:** Average speed can be calculated as distance divided by + number of hours of travel, and note that `air_time` is given in minutes. + +1. Make a scatterplot of `avg_speed` vs. `distance`. Describe the relationship + between average speed and distance. + **Hint:** Use `geom = "point"`. + +1. Replicate the following plot. **Hint:** The data frame plotted only + contains flights from American Airlines, Delta Airlines, and United + Airlines, and the points are `color`ed by `carrier`. Once you replicate + the plot, determine (roughly) what the cutoff point is for departure + delays where you can still expect to get to your destination on time. + +```{r echo=FALSE, eval=TRUE, fig.width=7, fig.height=4} +dl_aa_ua <- nycflights %>% + filter(carrier == "AA" | carrier == "DL" | carrier == "UA") +qplot(x = dep_delay, y = arr_delay, data = dl_aa_ua, color = carrier) +``` \ No newline at end of file diff --git a/02_intro_to_data/intro_to_data.html b/02_intro_to_data/intro_to_data.html new file mode 100644 index 0000000..1ec9381 --- /dev/null +++ b/02_intro_to_data/intro_to_data.html @@ -0,0 +1,476 @@ + + + + + + + + + + + + + +Introduction to data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airport in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

+
+

Getting started

+
+

Load packages

+

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.

+

Let’s load the packages.

+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+

Creating a reproducible lab report

+

Remember that we will be using R Markdown to create reproducible lab reports. See the following video describing how to get started with creating these reports for this lab, and all future labs:

+

Basic R Markdown with an OpenIntro Lab

+
+
+

The data

+

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes available transportation data, such as the flights data we will be working with in this lab.

+

We begin by loading the nycflights data frame. Type the following in your console to load the data:

+
data(nycflights)
+

The data set nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs. For this data set, each observation is a single flight.

+

To view the names of the variables, type the command

+
names(nycflights)
+

This returns the names of the variables in this data frame. The codebook (description of the variables) can be accessed by pulling up the help file:

+
?nycflights
+

One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.

+
    +
  • carrier: Two letter carrier abbreviation. +
      +
    • 9E: Endeavor Air Inc.
    • +
    • AA: American Airlines Inc.
    • +
    • AS: Alaska Airlines Inc.
    • +
    • B6: JetBlue Airways
    • +
    • DL: Delta Air Lines Inc.
    • +
    • EV: ExpressJet Airlines Inc.
    • +
    • F9: Frontier Airlines Inc.
    • +
    • FL: AirTran Airways Corporation
    • +
    • HA: Hawaiian Airlines Inc.
    • +
    • MQ: Envoy Air
    • +
    • OO: SkyWest Airlines Inc.
    • +
    • UA: United Air Lines Inc.
    • +
    • US: US Airways Inc.
    • +
    • VX: Virgin America
    • +
    • WN: Southwest Airlines Co.
    • +
    • YV: Mesa Airlines Inc.
    • +
  • +
+

A very useful function for taking a quick peek at your data frame and viewing its dimensions and data types is str, which stands for structure.

+
str(nycflights)
+

The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:

+
    +
  • How delayed were flights that were headed to Los Angeles?
  • +
  • How do departure delays vary over months?
  • +
  • Which of the three major NYC airports has a better on time percentage for departing flights?
  • +
+
+
+
+

Analysis

+
+

Lab report

+

To record your analysis in a reproducible format, you can adapt the general Lab Report template from the oilabs package. Watch the video above to learn how.

+
+
+

Departure delays

+

Let’s start by examing the distribution of departure delays of all flights with a histogram.

+
qplot(x = dep_delay, data = nycflights, geom = "histogram")
+

This function says to plot the dep_delay variable from the nycflights data frame on the x-axis. It also defines a geom (short for geometric object), which describes the type of plot you will produce.

+

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:

+
qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 15)
+qplot(x = dep_delay, data = nycflights, geom = "histogram", binwidth = 150)
+
  1. Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

If we want to focus only on departure delays of flights headed to Los Angeles, we need to first filter the data for flights with that destination (dest == "LAX") and then make a histogram of the departure delays of only those flights.

+
lax_flights <- nycflights %>%
+  filter(dest == "LAX")
+qplot(x = dep_delay, data = lax_flights, geom = "histogram")
+

Let’s decipher these two commands (OK, so it might look like three lines, but the first two physical lines of code are actually part of the same command. It’s common to add a break to a new line after %>% to help readability).

+
  • Command 1: Take the nycflights data frame, filter for flights headed to LAX, and save the result as a new data frame called lax_flights.
    • == means “if it’s equal to”.
    • LAX is in quotation marks since it is a character string.
  • Command 2: Basically the same qplot call from earlier for making a histogram, except that it uses the smaller data frame for flights headed to LAX instead of all flights.

Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so we use the filter function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:

  • == means “equal to”
  • != means “not equal to”
  • > or < means “greater than” or “less than”
  • >= or <= means “greater than or equal to” or “less than or equal to”
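
For example, the sketch below keeps only the flights that left an hour or more late; the long_delays name is made up here purely for illustration, and dep_delay is recorded in minutes:

long_delays <- nycflights %>%
  filter(dep_delay >= 60)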

We can also obtain numerical summaries for the LAX flights:

+
lax_flights %>%
+  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
+

Note that in the summarise function we created a list of three different numerical summaries that we were interested in. The names of these elements are user defined, like mean_dd, median_dd, and n, and you can customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Note that n() reports the sample size.

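For instance, you could compute a different set of summaries under names of your own choosing; avg_delay and sd_delay below are made up purely for illustration:

lax_flights %>%
  summarise(avg_delay = mean(dep_delay), sd_delay = sd(dep_delay), n_flights = n())
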
+
+

Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:

+
  • mean
  • median
  • sd
  • var
  • IQR
  • min
  • max

Note that each of these functions takes a single vector as an argument and returns a single value.

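For instance, you can also call any of them directly on a single column outside of summarise; a quick sketch:

mean(lax_flights$dep_delay)
IQR(lax_flights$dep_delay)
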
+
+

We can also filter based on multiple criteria. Suppose we are interested in flights headed to San Francisco (SFO) in February:

+
sfo_feb_flights <- nycflights %>%
+  filter(dest == "SFO", month == 2)
+

Note that we can separate the conditions using commas if we want flights that are both headed to SFO and in February. If we are interested in either flights headed to SFO or in February we can use the | instead of the comma.

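As a sketch of the “or” version (the sfo_or_feb_flights name is made up for illustration), the following keeps flights that are headed to SFO, or that fly in February, or both:

sfo_or_feb_flights <- nycflights %>%
  filter(dest == "SFO" | month == 2)
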
  1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

  2. Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can modify the above command using the group_by function to get the same summary stats for each origin airport:

+
sfo_feb_flights %>%
+  group_by(origin) %>%
+  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
+

Here, we first grouped the data by origin, and then calculated the summary statistics.

+
  1. Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

Departure delays over months

+

Which month would you expect to have the highest average delay departing from an NYC airport?

+

Let’s think about how we would answer this question:

+
  • First, calculate monthly averages for departure delays. With the new language we are learning, we need to
    • group_by months, then
    • summarise mean departure delays.
  • Then, we need to arrange these average delays in descending order.
nycflights %>%
+  group_by(month) %>%
+  summarise(mean_dd = mean(dep_delay)) %>%
+  arrange(desc(mean_dd))
+
  1. Suppose you really dislike departure delays, and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

On time departure rate for NYC airports

+

Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Suppose also that for you a flight that is delayed for less than 5 minutes is basically “on time”. You consider any flight delayed for 5 minutes or more to be “delayed”.

+

In order to determine which airport has the best on time departure rate, we need to

+
  • first classify each flight as “on time” or “delayed”,
  • then group flights by origin airport,
  • then calculate on time departure rates for each origin airport,
  • and finally arrange the airports in descending order for on time departure percentage.

Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable with the mutate function.

+
nycflights <- nycflights %>%
+  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
+

The first argument in the mutate function is the name of the new variable we want to create, in this case dep_type. Then, if dep_delay < 5, we classify the flight as "on time", and as "delayed" otherwise, i.e. if the flight is delayed for 5 or more minutes.

+

Note that we are also overwriting the nycflights data frame with the new version of this data frame that includes the new dep_type variable.

+

We can handle all the remaining steps in one code chunk:

+
nycflights %>%
+  group_by(origin) %>%
+  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
+  arrange(desc(ot_dep_rate))
+
  1. If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

We can also visualize the distribution of on time departure rates across the three airports using a segmented bar plot.

+
qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar")
+
+
+
+
+

More Practice

+
  1. Mutate the data frame so that it includes a new variable, avg_speed, that contains the average speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

  2. Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom = "point".

  3. Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

+ + + + + + + + diff --git a/normal_distribution/more/Body.csv b/03_normal_distribution/more/Body.csv similarity index 100% rename from normal_distribution/more/Body.csv rename to 03_normal_distribution/more/Body.csv diff --git a/normal_distribution/more/bdims.RData b/03_normal_distribution/more/bdims.RData similarity index 100% rename from normal_distribution/more/bdims.RData rename to 03_normal_distribution/more/bdims.RData diff --git a/normal_distribution/more/description-of-bdims.txt b/03_normal_distribution/more/description-of-bdims.txt similarity index 100% rename from normal_distribution/more/description-of-bdims.txt rename to 03_normal_distribution/more/description-of-bdims.txt diff --git a/normal_distribution/more/histQQmatch.pdf b/03_normal_distribution/more/histQQmatch.pdf similarity index 100% rename from normal_distribution/more/histQQmatch.pdf rename to 03_normal_distribution/more/histQQmatch.pdf diff --git a/03_normal_distribution/more/histQQmatch.png b/03_normal_distribution/more/histQQmatch.png new file mode 100644 index 0000000..65b7718 Binary files /dev/null and b/03_normal_distribution/more/histQQmatch.png differ diff --git a/03_normal_distribution/more/histQQmatchgg.png b/03_normal_distribution/more/histQQmatchgg.png new file mode 100644 index 0000000..a28b3fd Binary files /dev/null and b/03_normal_distribution/more/histQQmatchgg.png differ diff --git a/normal_distribution/more/qqnormsim.R b/03_normal_distribution/more/qqnormsim.R similarity index 100% rename from normal_distribution/more/qqnormsim.R rename to 03_normal_distribution/more/qqnormsim.R diff --git a/normal_distribution/normal_distribution.Rmd b/03_normal_distribution/normal_distribution.Rmd similarity index 53% rename from normal_distribution/normal_distribution.Rmd rename to 03_normal_distribution/normal_distribution.Rmd index e772911..46d95bb 100644 --- a/normal_distribution/normal_distribution.Rmd +++ b/03_normal_distribution/normal_distribution.Rmd @@ -2,11 +2,17 @@ title: "The normal distribution" output: html_document: - theme: cerulean - highlight: pygments css: ../lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true --- +```{r echo = FALSE} +knitr::opts_chunk$set(eval = FALSE) +``` + In this lab we'll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here @@ -17,36 +23,41 @@ learn how to generate random numbers from a normal distribution. This week we'll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered -healthy young adults. - -```{r load-data, eval=FALSE} -download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData") -load("bdims.RData") +healthy young adults. Let's take a quick peek at the first few rows of the data. + +```{r load-data} +library(mosaic) +library(dplyr) +library(ggplot2) +library(oilabs) +data(bdims) +head(bdims) ``` -Let's take a quick peek at the first few rows of the data. +You'll see that for every observation we have 25 measurements, many of which are +either diameters or girths. You can learn about what the variable names mean by +bringing up the help page. -```{r head-data, eval=FALSE} -head(bdims) +```{r help-bdims} +?bdims ``` -You'll see that for every observation we have 25 measurements, many of which are -either diameters or girths. 
A key to the variable names can be found at -[http://www.openintro.org/stat/data/bdims.php](http://www.openintro.org/stat/data/bdims.php), -but we'll be focusing on just three columns to get started: weight in kg (`wgt`), -height in cm (`hgt`), and `sex` (`1` indicates male, `0` indicates female). +We'll be focusing on just three columns to get started: weight in kg (`wgt`), +height in cm (`hgt`), and `sex` (`m` indicates male, `f` indicates female). Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women. -```{r male-female, eval=FALSE} -mdims <- subset(bdims, sex == 1) -fdims <- subset(bdims, sex == 0) +```{r male-female} +mdims <- bdims %>% + filter(sex == "m") +fdims <- bdims %>% + filter(sex == "f") ``` -1. Make a histogram of men's heights and a histogram of women's heights. How - would you compare the various aspects of the two distributions? +1. Make a plot (or plots) to visualize the distributions of men's and women's heights. + How do their centers, shapes, and spreads compare? ## The normal distribution @@ -60,43 +71,38 @@ This normal curve should have the same mean and standard deviation as the data. We'll be working with women's heights, so let's store them as a separate object and then calculate some statistics that will be referenced later. -```{r female-hgt-mean-sd, eval=FALSE} -fhgtmean <- mean(fdims$hgt) -fhgtsd <- sd(fdims$hgt) + +```{r female-hgt-mean-sd} +fhgtmean <- mean(~hgt, data = fdims) +fhgtsd <- sd(~hgt, data = fdims) ``` Next we make a density histogram to use as the backdrop and use the `lines` function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the *heights* of the bars add up to the total number of observations, -in a density histogram the *areas* of the bars add up to 1. The area of each bar +in a density histogram the *areas* of the bars add up to 1. The area of each bar can be calculated as simply the height *times* the width of the bar. Using a -density histogram allows us to properly overlay a normal distribution curve over -the histogram since the curve is a normal probability density function. -Frequency and density histograms both display the same exact shape; they only -differ in their y-axis. You can verify this by comparing the frequency histogram -you constructed earlier and the density histogram created by the commands below. - -```{r hist-height, eval=FALSE} -hist(fdims$hgt, probability = TRUE) -x <- 140:190 -y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd) -lines(x = x, y = y, col = "blue") +density histogram allows us to properly overlay a normal distribution curve over +the histogram since the curve is a normal probability density function that also +has area under the curve of 1. Frequency and density histograms both display the +same exact shape; they only differ in their y-axis. You can verify this by +comparing the frequency histogram you constructed earlier and the density +histogram created by the commands below. + +```{r hist-height} +qplot(x = hgt, data = fdims, geom = "blank") + + geom_histogram(aes(y = ..density..)) + + stat_function(fun = dnorm, args = c(mean = fhgtmean, sd = fhgtsd), col = "tomato") ``` -After plotting the density histogram with the first command, we create the x- -and y-coordinates for the normal curve. We chose the `x` range as 140 to 190 in -order to span the entire range of `fheight`. 
To create `y`, we use `dnorm` to -calculate the density of each of those x-values in a distribution that is normal -with mean `fhgtmean` and standard deviation `fhgtsd`. The final command draws a -curve on the existing plot (the density histogram) by connecting each of the -points specified by `x` and `y`. The argument `col` simply sets the color for +After initializing a blank plot with the first command, the `ggplot2` package +allows us to add additional layers. The first layer is a density histogram. The +second layer is a statistical function -- the density of the normal curve, `dnorm`. +We specify that we want the curve to have the same mean and standard deviation +as the column of female heights. The argument `col` simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black. -The top of the curve is cut off because the limits of the x- and y-axes are set -to best fit the histogram. To adjust the y-axis you can add a third argument to -the histogram function: `ylim = c(0, 0.06)`. - 2. Based on the this plot, does it appear that the data follow a nearly normal distribution? @@ -109,47 +115,54 @@ close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for "quantile-quantile". -```{r qq, eval=FALSE} -qqnorm(fdims$hgt) -qqline(fdims$hgt) +```{r qq} +qplot(sample = hgt, data = fdims, geom = "qq") ``` -A data set that is nearly normal will result in a probability plot where the -points closely follow the line. Any deviations from normality leads to -deviations of these points from the line. The plot for female heights shows -points that tend to follow the line but with some errant points towards the -tails. We're left with the same problem that we encountered with the histogram -above: how close is close enough? +The x-axis values correspond to the quantiles of a theoretically normal curve +with mean 0 and standard deviation 1 (i.e., the standard normal distribution). The +y-axis values correspond to the quantiles of the original unstandardized sample +data. However, even if we were to standardize the sample data values, the Q-Q +plot would look identical. A data set that is nearly normal will result in a +probability plot where the points closely follow a diagonal line. Any deviations from +normality leads to deviations of these points from that line. + +The plot for female heights shows points that tend to follow the line but with +some errant points towards the tails. We're left with the same problem that we +encountered with the histogram above: how close is close enough? A useful way to address this question is to rephrase it as: what do probability plots look like for data that I *know* came from a normal distribution? We can answer this by simulating data from a normal distribution using `rnorm`. -```{r sim-norm, eval=FALSE} -sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd) +```{r sim-norm} +sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd) ``` The first argument indicates how many numbers you'd like to generate, which we specify to be the same number of heights in the `fdims` data set using the -`length` function. The last two arguments determine the mean and standard +`nrow()` function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. 
We can take a look at the shape of our simulated data set, `sim_norm`, as well as its normal probability plot. 3. Make a normal probability plot of `sim_norm`. Do all of the points fall on the line? How does this plot compare to the probability plot for the real - data? + data? (Since `sim_norm` is not a dataframe, it can be put directly into the + `sample` argument and the `data` argument can be dropped.) Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following -function. It may be helpful to click the zoom button in the plot window. +function. It shows the Q-Q plot corresponding to the original data in the top +left corner, and the Q-Q plots of 8 different simulated normal data. It may be +helpful to click the zoom button in the plot window. -```{r qqnormsim, eval=FALSE} -qqnormsim(fdims$hgt) +```{r qqnormsim} +qqnormsim(sample = hgt, data = fdims) ``` -4. Does the normal probability plot for `fdims$hgt` look similar to the plots - created for the simulated data? That is, do plots provide evidence that the +4. Does the normal probability plot for female heights look similar to the plots + created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal? 5. Using the same technique, determine whether or not female weights appear to @@ -172,13 +185,13 @@ exercise.) If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, -this is done in one step with the function `pnorm`. +this is done in one step with the function `pnorm()`. -```{r pnorm, eval=FALSE} +```{r pnorm} 1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd) ``` -Note that the function `pnorm` gives the area under the normal curve below a +Note that the function `pnorm()` gives the area under the normal curve below a given value, `q`, with a given mean and standard deviation. Since we're interested in the probability that someone is taller than 182 cm, we have to take one minus that probability. @@ -188,8 +201,10 @@ probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size. -```{r probability, eval=FALSE} -sum(fdims$hgt > 182) / length(fdims$hgt) +```{r probability} +fdims %>% + filter(hgt > 182) %>% + summarise(percent = n() / nrow(fdims)) ``` Although the probabilities are not exactly the same, they are reasonably close. @@ -197,16 +212,16 @@ The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be. 6. Write out two probability questions that you would like to answer; one - regarding female heights and one regarding female weights. Calculate the + regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods? * * * -## On Your Own +## More Practice -- Now let's consider some of the other variables in the body dimensions data +7. Now let's consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. 
All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the @@ -225,17 +240,94 @@ theoretical probabilities will be. **d.** The histogram for female chest depth (`che.de`) belongs to normal probability plot letter ____. -- Note that normal probability plots C and D have a slight stepwise pattern. +8. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case? -- As you can see, normal probability plots can be used both to assess +9. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (`kne.di`). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings. -![histQQmatch](more/histQQmatch.png) +```{r hists-and-qqs, echo=FALSE, eval=FALSE} +sdata <- fdims %>% + mutate(sdata = (bii.di - mean(bii.di))/sd(bii.di)) %>% + select(sdata) +p1 <- ggplot(sdata, aes(x = sdata)) + + geom_histogram() + + ggtitle("Histogram for female bii.di") +p4 <- qplot(sample = sdata, data = sdata, stat = "qq") + + ggtitle("Normal QQ plot B") +sdata <- fdims %>% + mutate(sdata = (elb.di - mean(elb.di))/sd(elb.di)) %>% + select(sdata) +p3 <- ggplot(sdata, aes(x = sdata)) + + geom_histogram() + + ggtitle("Histogram for female elb.di") +p6 <- qplot(sample = sdata, data = sdata, stat = "qq") + + ggtitle("Normal QQ plot C") +sdata <- bdims %>% + mutate(sdata = (age - mean(age))/sd(age)) %>% + select(sdata) +p5 <- ggplot(sdata, aes(x = sdata)) + + geom_histogram() + + ggtitle("Histogram for general age") +p8 <- qplot(sample = sdata, data = sdata, stat = "qq") + + ggtitle("Normal QQ plot D") +sdata <- fdims %>% + mutate(sdata = (che.de - mean(che.de))/sd(che.de)) %>% + select(sdata) +p7 <- ggplot(sdata, aes(x = sdata)) + + geom_histogram() + + ggtitle("Histogram for general age") +p2 <- qplot(sample = sdata, data = sdata, stat = "qq") + + ggtitle("Normal QQ plot A") + +multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) { + library(grid) + + # Make a list from the ... arguments and plotlist + plots <- c(list(...), plotlist) + + numPlots = length(plots) + + # If layout is NULL, then use 'cols' to determine layout + if (is.null(layout)) { + # Make the panel + # ncol: Number of columns of plots + # nrow: Number of rows needed, calculated from # of cols + layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), + ncol = cols, nrow = ceiling(numPlots/cols)) + } + + if (numPlots==1) { + print(plots[[1]]) + + } else { + # Set up the page + grid.newpage() + pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout)))) + + # Make each plot, in the correct location + for (i in 1:numPlots) { + # Get the i,j matrix positions of the regions that contain this subplot + matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE)) + + print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, + layout.pos.col = matchidx$col)) + } + } +} + +png("more/histQQmatch.png", height = 1600, width = 1200, res = 150) +multiplot(p1, p2, p3, p4, p5, p6, p7, p8, + layout = matrix(1:8, ncol = 2, byrow = TRUE)) +dev.off() +``` + + +![](more/histQQmatchgg.png)
This is a product of OpenIntro that is released under a diff --git a/03_normal_distribution/normal_distribution.html b/03_normal_distribution/normal_distribution.html new file mode 100644 index 0000000..1a378c8 --- /dev/null +++ b/03_normal_distribution/normal_distribution.html @@ -0,0 +1,372 @@ + + + + + + + + + + + + + +The normal distribution + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
The normal distribution

In this lab we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

+
+

The Data

+

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults. Let’s take a quick peek at the first few rows of the data.

+
library(mosaic)
+library(dplyr)
+library(ggplot2)
+library(oilabs)
+data(bdims)
+head(bdims)
+

You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. You can learn about what the variable names mean by bringing up the help page.

+
?bdims
+

We’ll be focusing on just three columns to get started: weight in kg (wgt), height in cm (hgt), and sex (m indicates male, f indicates female).

+

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

+
mdims <- bdims %>%
+  filter(sex == "m")
+fdims <- bdims %>%
+  filter(sex == "f")
+
  1. Make a plot (or plots) to visualize the distributions of men’s and women’s heights. How do their centers, shapes, and spreads compare?
+
+

The normal distribution

+

In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.

+

To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.

+
fhgtmean <- mean(~hgt, data = fdims)
+fhgtsd <- sd(~hgt, data = fdims)
+

Next we make a density histogram to use as the backdrop and overlay a normal probability curve on top of it as an additional layer. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.

+
qplot(x = hgt, data = fdims, geom = "blank") +
+  geom_histogram(aes(y = ..density..)) +
+  stat_function(fun = dnorm, args = c(mean = fhgtmean, sd = fhgtsd), col = "tomato")
+

After initializing a blank plot with the first command, the ggplot2 package allows us to add additional layers. The first layer is a density histogram. The second layer is a statistical function – the density of the normal curve, dnorm. We specify that we want the curve to have the same mean and standard deviation as the column of female heights. The argument col simply sets the color for the line to be drawn. If we left it out, the line would be drawn in black.

+
  1. Based on this plot, does it appear that the data follow a nearly normal distribution?
+
+

Evaluating the normal distribution

+

Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.

+
qplot(sample = hgt, data = fdims, geom = "qq")
+

The x-axis values correspond to the quantiles of a theoretical normal curve with mean 0 and standard deviation 1 (i.e., the standard normal distribution). The y-axis values correspond to the quantiles of the original unstandardized sample data. However, even if we were to standardize the sample data values, the Q-Q plot would look identical. A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line. Any deviations from normality lead to deviations of these points from that line.

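If you want to check that claim yourself, here is a sketch (the fdims_std and hgt_std names are made up for illustration); the resulting Q-Q plot should have the same shape as the one above, just with the y-axis rescaled to Z scores:

fdims_std <- fdims %>%
  mutate(hgt_std = (hgt - fhgtmean) / fhgtsd)
qplot(sample = hgt_std, data = fdims_std, geom = "qq")
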
+

The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. We’re left with the same problem that we encountered with the histogram above: how close is close enough?

+

A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.

+
sim_norm <- rnorm(n = nrow(fdims), mean = fhgtmean, sd = fhgtsd)
+

The first argument indicates how many numbers you’d like to generate, which we specify to be the same number of heights in the fdims data set using the nrow() function. The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated. We can take a look at the shape of our simulated data set, sim_norm, as well as its normal probability plot.

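For example, a quick sketch of its shape; since sim_norm is a plain vector rather than a data frame, it can be passed to qplot directly:

qplot(x = sim_norm, geom = "histogram")
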
+
  1. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It shows the Q-Q plot corresponding to the original data in the top left corner, along with the Q-Q plots of eight additional data sets simulated from a normal distribution. It may be helpful to click the zoom button in the plot window.

+
qqnormsim(sample = hgt, data = fdims)
+
  1. Does the normal probability plot for female heights look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?

  2. Using the same technique, determine whether or not female weights appear to come from a normal distribution.
+
+

Normal probabilities

+

Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should we care?

+

It turns out that statisticians know a lot about the normal distribution. Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen young adult female is taller than 6 feet (about 182 cm)?” (The study that published this data set is clear to point out that the sample was not random and therefore inference to a general population is not suggested. We do so here only as an exercise.)

+

If we assume that female heights are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().

+
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
+

Note that the function pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.

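Equivalently, you could compute the Z score by hand and use pnorm() with its default standard normal distribution (mean 0, standard deviation 1); a quick sketch:

z <- (182 - fhgtmean) / fhgtsd
1 - pnorm(z)
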
+

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 182 then divide this number by the total sample size.

+
fdims %>% 
+  filter(hgt > 182) %>%
+  summarise(percent = n() / nrow(fdims))
+

Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.

+
  1. Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
+
+
+

More Practice

+
  1. Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.

    a. The histogram for female biiliac (pelvic) diameter (bii.di) belongs to normal probability plot letter ____.

    b. The histogram for female elbow diameter (elb.di) belongs to normal probability plot letter ____.

    c. The histogram for general age (age) belongs to normal probability plot letter ____.

    d. The histogram for female chest depth (che.de) belongs to normal probability plot letter ____.

  2. Note that normal probability plots C and D have a slight stepwise pattern. Why do you think this is the case?

  3. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for female knee diameter (kne.di). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
[Figure: more/histQQmatchgg.png — histograms and normal probability plots to be matched in the exercise above]

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

+ + + + + + + + diff --git a/probability/more/calc_streak.R b/04_probability/more/calc_streak.R similarity index 100% rename from probability/more/calc_streak.R rename to 04_probability/more/calc_streak.R diff --git a/probability/more/kobe-readme.txt b/04_probability/more/kobe-readme.txt similarity index 100% rename from probability/more/kobe-readme.txt rename to 04_probability/more/kobe-readme.txt diff --git a/probability/more/kobe.RData b/04_probability/more/kobe.RData similarity index 100% rename from probability/more/kobe.RData rename to 04_probability/more/kobe.RData diff --git a/probability/more/kobe.csv b/04_probability/more/kobe.csv similarity index 100% rename from probability/more/kobe.csv rename to 04_probability/more/kobe.csv diff --git a/probability/more/kobe_data.xls b/04_probability/more/kobe_data.xls similarity index 100% rename from probability/more/kobe_data.xls rename to 04_probability/more/kobe_data.xls diff --git a/probability/probability.Rmd b/04_probability/probability.Rmd similarity index 66% rename from probability/probability.Rmd rename to 04_probability/probability.Rmd index 0f8e958..cb83570 100644 --- a/probability/probability.Rmd +++ b/04_probability/probability.Rmd @@ -2,19 +2,28 @@ title: "Probability" output: html_document: - theme: cerulean - highlight: pygments css: ../lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true --- -## Hot Hands +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +## The Hot Hand Basketball players who make several baskets in succession are described as having a *hot hand*. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the -next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence +next. However, [a 1985 paper](http://www.sciencedirect.com/science/article/pii/0010028585900106) by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent -events ([http://psych.cornell.edu/sites/default/files/Gilo.Vallone.Tversky.pdf](http://psych.cornell.edu/sites/default/files/Gilo.Vallone.Tversky.pdf)). This paper started a great controversy that continues to this day, as you can +events. This paper started a great controversy that continues to this day, as you can see by Googling *hot hand basketball*. We do not expect to resolve this controversy today. However, in this lab we'll @@ -23,38 +32,41 @@ to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real. -## Saving your code +## Getting Started -Click on File -> New -> R Script. This will open a blank document above the -console. As you go along you can copy and paste your code here and save it. This -is a good way to keep track of your code and be able to reuse it later. To run -your code from this document you can either copy and paste it into the console, -highlight the code and hit the Run button, or highlight the code and hit -command+enter or a mac or control+enter on a PC. +### Load packages -You'll also want to save this script (code document). To do so click on the disk -icon. The first time you hit save, RStudio will ask for a file name; you can -name it anything you like. 
Once you hit save you'll see the file appear under -the Files tab in the lower right panel. You can reopen this file anytime by -simply clicking on it. +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for OpenIntro labs, `oilabs`. -## Getting Started +Let's load the packages. + +```{r load-packages, message=FALSE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report -Our investigation will focus on the performance of one player: Kobe Bryant of -the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 -NBA finals earned him the title *Most Valuable Player* and many spectators -commented on how he appeared to show a hot hand. Let's load some data from those -games and look at the first several rows. +To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. -```{r load-data, eval=FALSE} -download.file("http://www.openintro.org/stat/data/kobe.RData", destfile = "kobe.RData") -load("kobe.RData") -head(kobe) +### Data + +Our investigation will focus on the performance of one player: [Kobe Bryant](https://en.wikipedia.org/wiki/Kobe_Bryant) of +the Los Angeles Lakers. His performance against the Orlando Magic in the [2009 +NBA Finals](https://en.wikipedia.org/wiki/2009_NBA_Finals) earned him the title *Most Valuable Player* and many spectators +commented on how he appeared to show a hot hand. Let's load some necessary files +that we will need for this lab. + +```{r load-data} +data(kobe_basket) ``` -In this data frame, every row records a shot taken by Kobe Bryant. If he hit the -shot (made a basket), a hit, `H`, is recorded in the column named `basket`, -otherwise a miss, `M`, is recorded. +This data frame contains 133 observations and 6 variables, where every +row records a shot taken by Kobe Bryant. The `shot` variable in this dataset +indicates whether the shot was a hit (`H`) or a miss (`M`). Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can @@ -67,11 +79,7 @@ his nine shot attempts in the first quarter: \[ \textrm{H M | M | H H M | M | M | M} \] -To verify this use the following command: - -```{r first9, eval=FALSE} -kobe$basket[1:9] -``` +You can verify this by viewing the first 8 rows of the data in the data viewer. Within the nine shot attempts, there are six streaks, which are separated by a "|" above. Their lengths are one, zero, two, zero, zero, zero (in order of @@ -80,26 +88,28 @@ occurrence). 1. What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0? -The custom function `calc_streak`, which was loaded in with the data, may be -used to calculate the lengths of all shooting streaks and then look at the -distribution. +Counting streak lengths manually for all 133 shots would get tedious, so we'll +use the custom function `calc_streak` to calculate them, and store the results +in a data frame called `kobe_streak` as the `length` variable. -```{r calc-streak-kobe, eval=FALSE} -kobe_streak <- calc_streak(kobe$basket) -barplot(table(kobe_streak)) +```{r calc-streak-kobe} +kobe_streak <- calc_streak(kobe_basket$shot) ``` -Note that instead of making a histogram, we chose to make a bar plot from a -table of the streak data. 
A bar plot is preferable here since our variable is -discrete -- counts -- instead of continuous. +We can then take a look at the distribution of these streak lengths. + +```{r plot-streak-kobe} +qplot(data = kobe_streak, x = length, geom = "bar") +``` 2. Describe the distribution of Kobe's streak lengths from the 2009 NBA finals. - What was his typical streak length? How long was his longest streak of baskets? + What was his typical streak length? How long was his longest streak of + baskets? Make sure to include the accompanying plot in your answer. ## Compared to What? We've shown that Kobe had some long shooting streaks, but are they long enough -to support the belief that he had hot hands? What can we compare them to? +to support the belief that he had a hot hand? What can we compare them to? To answer these questions, let's return to the idea of *independence*. Two processes are independent if the outcome of one process doesn't effect the outcome @@ -135,8 +145,8 @@ same probability of hitting every shot regardless of his past shots: 45%. Now that we've phrased the situation in terms of independent shots, let's return to the question: how do we tell if Kobe's shooting streaks are long enough to -indicate that he has hot hands? We can compare his streak lengths to someone -without hot hands: an independent shooter. +indicate that he has a hot hand? We can compare his streak lengths to someone +without a hot hand: an independent shooter. ## Simulations in R @@ -146,12 +156,12 @@ ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following. -```{r head-tail, eval=FALSE} -outcomes <- c("heads", "tails") -sample(outcomes, size = 1, replace = TRUE) +```{r head-tail} +coin_outcomes <- c("heads", "tails") +sample(coin_outcomes, size = 1, replace = TRUE) ``` -The vector `outcomes` can be thought of as a hat with two slips of paper in it: +The vector `coin_outcomes` can be thought of as a hat with two slips of paper in it: one slip says `heads` and the other says `tails`. The function `sample` draws one slip from the hat and tells us if it was a head or a tail. @@ -165,25 +175,26 @@ governs how many samples to draw (the `replace = TRUE` argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called `sim_fair_coin`. -```{r sim-fair-coin, eval=FALSE} -sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE) +```{r sim-fair-coin} +sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE) ``` To view the results of this simulation, type the name of the object and then use `table` to count up the number of heads and tails. -```{r table-sim-fair-coin, eval=FALSE} +```{r table-sim-fair-coin} sim_fair_coin table(sim_fair_coin) ``` -Since there are only two elements in `outcomes`, the probability that we "flip" +Since there are only two elements in `coin_outcomes`, the probability that we "flip" a coin and it lands heads is 0.5. Say we're trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called `prob`, which provides a vector of two probability weights. 
-```{r sim-unfair-coin, eval=FALSE} -sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.2, 0.8)) +```{r sim-unfair-coin} +sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE, + prob = c(0.2, 0.8)) ``` `prob=c(0.2, 0.8)` indicates that for the two elements in the `outcomes` vector, @@ -194,7 +205,25 @@ think of the outcome space as a bag of 10 chips, where 2 chips are labeled chip that says "head"" is 20%, and "tail" is 80%. 3. In your simulation of flipping the unfair coin 100 times, how many flips - came up heads? + came up heads? Include the code for sampling the unfair coin in your response. + Since the markdown file will run the code, and generate a new sample each time + you *Knit* it, you should also "set a seed" **before** you sample. Read more + about setting a seed below. + +
+**A note on setting a seed:** Setting a seed will cause R to select the same +sample each time you knit your document. This will make sure your results don't +change each time you knit, and it will also ensure reproducibility of your work +(by setting the same seed it will be possible to reproduce your results). You can +set a seed like this: +```{r set-seed} +set.seed(35797) # make sure to change the seed +``` +The number above is completely arbitraty. If you need inspiration, you can use your +ID, birthday, or just a random string of numbers. The important thing is that you +use each seed only once. Remember to do this **before** you sample in the exercise +above. +
In a sense, we've shrunken the size of the slip of paper that says "heads", making it less likely to be drawn and we've increased the size of the slip of @@ -206,7 +235,7 @@ an equal probability of being drawn. If you want to learn more about `sample` or any other function, recall that you can always check out its help file. -```{r help-sample, eval=FALSE,tidy = FALSE} +```{r help-sample,tidy = FALSE} ?sample ``` @@ -216,9 +245,9 @@ Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type, -```{r sim-basket, eval=FALSE} -outcomes <- c("H", "M") -sim_basket <- sample(outcomes, size = 1, replace = TRUE) +```{r sim-basket} +shot_outcomes <- c("H", "M") +sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE) ``` To make a valid comparison between Kobe and our simulated independent shooter, @@ -235,13 +264,7 @@ R overwrites the old object with the new one, so always make sure that you don't need the information in an old vector before reassigning its name. With the results of the simulation saved as `sim_basket`, we have the data -necessary to compare Kobe to our independent shooter. We can look at Kobe's data -alongside our simulated data. - -```{r compare-basket, eval=FALSE} -kobe$basket -sim_basket -``` +necessary to compare Kobe to our independent shooter. Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter @@ -250,26 +273,31 @@ a hot hand. * * * -## On your own +## More Practice ### Comparing Kobe Bryant to the Independent Shooter -Using `calc_streak`, compute the streak lengths of `sim_basket`. +5. Using `calc_streak`, compute the streak lengths of `sim_basket`, and + save the results in a data frame called `sim_streak`. -- Describe the distribution of streak lengths. What is the typical streak +6. Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? - How long is the player's longest streak of baskets in 133 shots? + How long is the player's longest streak of baskets in 133 shots? Make sure + to include a plot in your answer. -- If you were to run the simulation of the independent shooter a second time, +7. If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning. -- How does Kobe Bryant's distribution of streak lengths compare to the +8. How does Kobe Bryant's distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe's shooting patterns? Explain. + +
+
This is a product of OpenIntro that is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). diff --git a/04_probability/probability.html b/04_probability/probability.html new file mode 100644 index 0000000..e5de7d6 --- /dev/null +++ b/04_probability/probability.html @@ -0,0 +1,395 @@ + + + + + + + + + + + + + +Probability + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Probability

The Hot Hand

+

Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.

+

We do not expect to resolve this controversy today. However, in this lab we’ll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.

+
+
+

Getting Started

+
+

Load packages

+

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.

+

Let’s load the packages.

+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+

Creating a reproducible lab report

+

To create your new lab report, start by opening a new R Markdown document, choose From Template, and then select Lab Report from the oilabs package.

+
+
+

Data

+

Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA Finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let’s load some necessary files that we will need for this lab.

+
data(kobe_basket)
+

This data frame contains 133 observations and 6 variables, where every row records a shot taken by Kobe Bryant. The shot variable in this dataset indicates whether the shot was a hit (H) or a miss (M).

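If you’d like to see the variables for yourself, one option (a quick sketch using the dplyr function glimpse) is:

glimpse(kobe_basket)
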
+

Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.

+

For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:

+

\[ \textrm{H M | M | H H M | M | M | M} \]

+

You can verify this by viewing the first 8 rows of the data in the data viewer.

+

Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).

+
  1. What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?

Counting streak lengths manually for all 133 shots would get tedious, so we’ll use the custom function calc_streak to calculate them, and store the results in a data frame called kobe_streak as the length variable.

+
kobe_streak <- calc_streak(kobe_basket$shot)
+

We can then take a look at the distribution of these streak lengths.

+
qplot(data = kobe_streak, x = length, geom = "bar")
+
  1. Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets? Make sure to include the accompanying plot in your answer.
+
+
+

Compared to What?

+

We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had a hot hand? What can we compare them to?

+

To answer these questions, let’s return to the idea of independence. Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.

+

A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.

+

Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,

+

\[ P(\textrm{shot 1 = H}) = 0.45 \]

+

If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, let’s say, 60%,

+

\[ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.60 \]

+

As a result of these increased probabilities, you’d expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.

+

\[ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.45 \]


In other words, making the first shot did nothing to affect the probability that he'd make his second shot. If Kobe's shots are independent, then he'd have the same probability of hitting every shot regardless of his past shots: 45%.
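
To see the contrast concretely (using the illustrative 45% and 60% figures above), consider the chance that Kobe makes his first two shots in a row:

\[ P(\textrm{shots 1 and 2 both H}) = 0.45 \times 0.45 \approx 0.20 \quad \textrm{(independent shooter)} \]

\[ P(\textrm{shots 1 and 2 both H}) = 0.45 \times 0.60 = 0.27 \quad \textrm{(hot hand model)} \]

Under the hot hand model, runs of made baskets become noticeably more likely, which is exactly why streak lengths are a useful thing to compare.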


Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has a hot hand? We can compare his streak lengths to someone without a hot hand: an independent shooter.


## Simulations in R


While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.

```{r}
coin_outcomes <- c("heads", "tails")
sample(coin_outcomes, size = 1, replace = TRUE)
```

The vector `coin_outcomes` can be thought of as a hat with two slips of paper in it: one slip says `heads` and the other says `tails`. The function `sample` draws one slip from the hat and tells us if it was a head or a tail.


Run the second command listed above several times. Just like when flipping a coin, sometimes you’ll get a heads, sometimes you’ll get a tails, but in the long run, you’d expect to get roughly equal numbers of each.
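
If you'd rather not re-run the line by hand, one way to repeat the single draw a handful of times at once (just a convenience, not part of the lab) is base R's `replicate`:

```{r}
replicate(5, sample(coin_outcomes, size = 1, replace = TRUE))   # five separate single flips
```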


If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the `size` argument, which governs how many samples to draw (the `replace = TRUE` argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called `sim_fair_coin`.

```{r}
sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
```

To view the results of this simulation, type the name of the object and then use `table` to count up the number of heads and tails.

```{r}
sim_fair_coin
table(sim_fair_coin)
```

Since there are only two elements in `coin_outcomes`, the probability that we "flip" a coin and it lands heads is 0.5. Say we're trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called `prob`, which provides a vector of two probability weights.

```{r}
sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE,
                          prob = c(0.2, 0.8))
```

`prob = c(0.2, 0.8)` indicates that for the two elements in the outcomes vector, we want to select the first one, `heads`, with probability 0.2 and the second one, `tails`, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled "head" and 8 chips "tail". Therefore at each draw, the probability of drawing a chip that says "head" is 20%, and "tail" is 80%.
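
As a rough sanity check on how `prob` behaves (a side illustration, separate from the exercise below), you can draw a much larger sample and look at the proportions, which should land close to 0.2 and 0.8:

```{r}
big_flips <- sample(coin_outcomes, size = 10000, replace = TRUE,
                    prob = c(0.2, 0.8))
prop.table(table(big_flips))   # proportions should be roughly 0.2 heads, 0.8 tails
```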

1.  In your simulation of flipping the unfair coin 100 times, how many flips came up heads? Include the code for sampling the unfair coin in your response. Since the markdown file will run the code, and generate a new sample each time you Knit it, you should also "set a seed" before you sample. Read more about setting a seed below.

**A note on setting a seed:** Setting a seed will cause R to select the same sample each time you knit your document. This will make sure your results don't change each time you knit, and it will also ensure reproducibility of your work (by setting the same seed it will be possible to reproduce your results). You can set a seed like this:

```{r}
set.seed(35797)                 # make sure to change the seed
```

The number above is completely arbitrary. If you need inspiration, you can use your ID, birthday, or just a random string of numbers. The important thing is that you use each seed only once. Remember to do this before you sample in the exercise above.
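
To see what the seed actually buys you (a quick illustration, not something you need to include in your report), run the same draw twice with the same seed; the two results should be identical:

```{r}
set.seed(35797)
sample(coin_outcomes, size = 5, replace = TRUE)

set.seed(35797)
sample(coin_outcomes, size = 5, replace = TRUE)   # matches the draw above exactly
```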


In a sense, we've shrunk the size of the slip of paper that says "heads", making it less likely to be drawn, and we've increased the size of the slip of paper saying "tails", making it more likely to be drawn. When we simulated the fair coin, both slips of paper were the same size. This happens by default if you don't provide a `prob` argument; all elements in the outcomes vector have an equal probability of being drawn.


If you want to learn more about `sample` or any other function, recall that you can always check out its help file.

```{r}
?sample
```

## Simulating the Independent Shooter


Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type,

```{r}
shot_outcomes <- c("H", "M")
sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)
```

To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.

1.  What change needs to be made to the `sample` function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called `sim_basket`.

Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.


With the results of the simulation saved as `sim_basket`, we have the data necessary to compare Kobe to our independent shooter.


Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a hot hand.
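
As a quick sanity check (assuming you created `sim_basket` in the exercise above), both objects should indeed involve 133 shots:

```{r}
nrow(kobe_basket)    # 133 shots taken by Kobe
length(sim_basket)   # should also be 133 simulated shots
```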


* * *

## More Practice


### Comparing Kobe Bryant to the Independent Shooter

1.  Using `calc_streak`, compute the streak lengths of `sim_basket`, and save the results in a data frame called `sim_streak`.

1.  Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player's longest streak of baskets in 133 shots? Make sure to include a plot in your answer.

1.  If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.

1.  How does Kobe Bryant's distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe's shooting patterns? Explain.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

+ + + + + + + + diff --git a/confidence_intervals/more/AmesHousing.csv b/05_sampling_distributions/more/AmesHousing.csv similarity index 100% rename from confidence_intervals/more/AmesHousing.csv rename to 05_sampling_distributions/more/AmesHousing.csv diff --git a/confidence_intervals/more/AmesHousing.xls b/05_sampling_distributions/more/AmesHousing.xls similarity index 100% rename from confidence_intervals/more/AmesHousing.xls rename to 05_sampling_distributions/more/AmesHousing.xls diff --git a/confidence_intervals/more/ames-readme.txt b/05_sampling_distributions/more/ames-readme.txt similarity index 100% rename from confidence_intervals/more/ames-readme.txt rename to 05_sampling_distributions/more/ames-readme.txt diff --git a/confidence_intervals/more/ames.csv b/05_sampling_distributions/more/ames.csv similarity index 100% rename from confidence_intervals/more/ames.csv rename to 05_sampling_distributions/more/ames.csv diff --git a/confidence_intervals/more/ames_dataprep.R b/05_sampling_distributions/more/ames_dataprep.R similarity index 100% rename from confidence_intervals/more/ames_dataprep.R rename to 05_sampling_distributions/more/ames_dataprep.R diff --git a/05_sampling_distributions/sampling_distributions.Rmd b/05_sampling_distributions/sampling_distributions.Rmd new file mode 100644 index 0000000..5adcdb6 --- /dev/null +++ b/05_sampling_distributions/sampling_distributions.Rmd @@ -0,0 +1,363 @@ +--- +title: "Foundations for statistical inference - Sampling distributions" +runtime: shiny +output: + html_document: + css: lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true +--- + +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +data(ames) +``` + +In this lab, we investigate the ways in which the statistics from a random +sample of data can serve as point estimates for population parameters. We're +interested in formulating a *sampling distribution* of our estimate in order +to learn about the properties of the estimate, such as its distribution. + +
+**Setting a seed:** We will take some random samples and build sampling distributions +in this lab, which means you should set a seed on top of your lab. If this concept +is new to you, review the lab concerning probability. +
+ +## Getting Started + +### Load packages + +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for OpenIntro labs, `oilabs`. + +Let's load the packages. + +```{r load-packages, message=FALSE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report + +To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. + +### The data + +We consider real estate data from the city of Ames, Iowa. The details of +every real estate transaction in Ames is recorded by the City Assessor's +office. Our particular focus for this lab will be all residential home sales +in Ames between 2006 and 2010. This collection represents our population of +interest. In this lab we would like to learn about these home sales by taking +smaller samples from the full population. Let's load the data. + +```{r load-data} +data(ames) +``` + +We see that there are quite a few variables in the data set, enough to do a +very in-depth analysis. For this lab, we'll restrict our attention to just +two of the variables: the above ground living area of the house in square feet +(`area`) and the sale price (`price`). + +We can explore the distribution of areas of homes in the population of home +sales visually and with summary statistics. Let's first create a visualization, +a histogram: + +```{r area-hist} +qplot(data = ames, x = area, binwidth = 250, geom = "histogram") +``` + +Let's also obtain some summary statistics. Note that we can do this using the +`summarise` function. We can calculate as many statistics as we want using this +function, and just string along the results. Some of the functions below should +be self explanatory (like `mean`, `median`, `sd`, `IQR`, `min`, and `max`). A +new function here is the `quantile` function which we can use to calculate +values corresponding to specific percentile cutoffs in the distribution. For +example `quantile(x, 0.25)` will yield the cutoff value for the 25th percentile (Q1) +in the distribution of x. Finding these values are useful for describing the +distribution, as we can use them for descriptions like *"the middle 50% of the +homes have areas between such and such square feet"*. + +```{r area-stats} +ames %>% + summarise(mu = mean(area), pop_med = median(area), + sigma = sd(area), pop_iqr = IQR(area), + pop_min = min(area), pop_max = max(area), + pop_q1 = quantile(area, 0.25), # first quartile, 25th percentile + pop_q3 = quantile(area, 0.75)) # third quartile, 75th percentile +``` + +1. Describe this population distribution using a visualization and these summary + statistics. You don't have to use all of the summary statistics in your + description, you will need to decide which ones are relevant based on the + shape of the distribution. Make sure to include the plot and the summary + statistics output in your report along with your narrative. + +## The unknown sampling distribution + +In this lab we have access to the entire population, but this is rarely the +case in real life. Gathering information on an entire population is often +extremely costly or impossible. Because of this, we often take a sample of +the population and use that to understand the properties of the population. 
+ +If we were interested in estimating the mean living area in Ames based on a +sample, we can use the `sample_n` command to survey the population. + +```{r samp1} +samp1 <- ames %>% + sample_n(50) +``` + +This command collects a simple random sample of size 50 from the `ames` dataset, +and assigns the result to `samp1`. This is like going into the City +Assessor's database and pulling up the files on 50 random home sales. Working +with these 50 files would be considerably simpler than working with all 2930 +home sales. + +1. Describe the distribution of area in this sample. How does it compare to the + distribution of the population? **Hint:** the `sample_n` function takes a random + sample of observations (i.e. rows) from the dataset, you can still refer to + the variables in the dataset with the same names. Code you used in the + previous exercise will also be helpful for visualizing and summarizing the sample, + however be careful to not label values `mu` and `sigma` anymore since these + are sample statistics, not population parameters. You can customize the labels + of any of the statistics to indicate that these come from the sample. + +If we're interested in estimating the average living area in homes in Ames +using the sample, our best single guess is the sample mean. + +```{r mean-samp1} +samp1 %>% + summarise(x_bar = mean(area)) +``` + +Depending on which 50 homes you selected, your estimate could be a bit above +or a bit below the true population mean of `r round(mean(ames$area),2)` square feet. In general, +though, the sample mean turns out to be a pretty good estimate of the average +living area, and we were able to get it by sampling less than 3\% of the +population. + +1. Would you expect the mean of your sample to match the mean of another team's + sample? Why, or why not? If the answer is no, would you expect the means to + just be somewhat different or very different? Ask a neighboring team to confirm + your answer. + +1. Take a second sample, also of size 50, and call it `samp2`. How does the + mean of `samp2` compare with the mean of `samp1`? Suppose we took two + more samples, one of size 100 and one of size 1000. Which would you think + would provide a more accurate estimate of the population mean? + +Not surprisingly, every time we take another random sample, we get a different +sample mean. It's useful to get a sense of just how much variability we +should expect when estimating the population mean this way. The distribution +of sample means, called the *sampling distribution (of the mean)*, can help us understand +this variability. In this lab, because we have access to the population, we +can build up the sampling distribution for the sample mean by repeating the +above steps many times. Here we will generate 15,000 samples and compute the +sample mean of each. Note that we specify that +`replace = TRUE` since sampling distributions are constructed by sampling +with replacement. + +```{r loop} +sample_means50 <- ames %>% + rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>% + summarise(x_bar = mean(area)) + +qplot(data = sample_means50, x = x_bar) +``` + +Here we use R to take 15,000 different samples of size 50 from the population, calculate +the mean of each sample, and store each result in a vector called +`sample_means50`. Next, we review how this set of code works. + +1. How many elements are there in `sample_means50`? Describe the sampling + distribution, and be sure to specifically note its center. 
Make sure to include + a plot of the distribution in your answer. + +## Interlude: Sampling distributions + +The idea behind the `rep_sample_n` function is *repetition*. Earlier we took +a single sample of size `n` (50) from the population of all houses in Ames. With +this new function we are able to repeat this sampling procedure `rep` times in order +to build a distribution of a series of sample statistics, which is called the +**sampling distribution**. + +Note that in practice one rarely gets to build true sampling distributions, +because we rarely have access to data from the entire population. + +Without the `rep_sample_n` function, this would be painful. We would have to +manually run the following code 15,000 times +```{r sample-code, eval=FALSE} +ames %>% + sample_n(size = 50) %>% + summarise(x_bar = mean(area)) +``` +as well as store the resulting sample means each time in a separate vector. + +Note that for each of the 15,000 times we computed a mean, we did so from a +**different** sample! + +1. To make sure you understand how sampling distributions are built, and exactly + what the `rep_sample_n` function does, try modifying the code to create a + sampling distribution of **25 sample means** from **samples of size 10**, + and put them in a data frame named `sample_means_small`. Print the output. + How many observations are there in this object called `sample_means_small`? + What does each observation represent? + +## Sample size and the sampling distribution + +Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to +compute a sampling distribution, specifically, the sampling distribution of the +mean home area for samples of 50 houses. + +```{r hist} +qplot(data = sample_means50, x = x_bar, geom = "histogram") +``` + +The sampling distribution that we computed tells us much about estimating +the average living area in homes in Ames. Because the sample mean is an +unbiased estimator, the sampling distribution is centered at the true average +living area of the population, and the spread of the distribution +indicates how much variability is incurred by sampling only 50 home sales. + +In the remainder of this section we will work on getting a sense of the effect that +sample size has on our sampling distribution. + +1. Use the app below to create sampling distributions of means of `area`s from + samples of size 10, 50, and 100. Use 5,000 simulations. What does each + observation in the sampling distribution represent? How does the mean, standard + error, and shape of the sampling distribution change as the sample size + increases? How (if at all) do these values change if you increase the number + of simulations? (You do not need to include plots in your answer.) 
+ +```{r shiny, echo=FALSE, eval=TRUE} +shinyApp( + ui <- fluidPage( + + # Sidebar with a slider input for number of bins + sidebarLayout( + sidebarPanel( + + selectInput("selected_var", + "Variable:", + choices = list("area", "price"), + selected = "area"), + + numericInput("n_samp", + "Sample size:", + min = 1, + max = nrow(ames), + value = 30), + + numericInput("n_sim", + "Number of samples:", + min = 1, + max = 30000, + value = 15000) + + ), + + # Show a plot of the generated distribution + mainPanel( + plotOutput("sampling_plot"), + verbatimTextOutput("sampling_mean"), + verbatimTextOutput("sampling_se") + ) + ) + ), + + # Define server logic required to draw a histogram + server <- function(input, output) { + + # create sampling distribution + sampling_dist <- reactive({ + ames[[input$selected_var]] %>% + sample(size = input$n_samp * input$n_sim, replace = TRUE) %>% + matrix(ncol = input$n_samp) %>% + rowMeans() %>% + data.frame(x_bar = .) + #ames %>% + # rep_sample_n(size = input$n_samp, reps = input$n_sim, replace = TRUE) %>% + # summarise_(x_bar = mean(input$selected_var)) + }) + + # plot sampling distribution + output$sampling_plot <- renderPlot({ + x_min <- quantile(ames[[input$selected_var]], 0.1) + x_max <- quantile(ames[[input$selected_var]], 0.9) + + ggplot(sampling_dist(), aes(x = x_bar)) + + geom_histogram() + + xlim(x_min, x_max) + + ylim(0, input$n_sim * 0.35) + + ggtitle(paste0("Sampling distribution of mean ", + input$selected_var, " (n = ", input$n_samp, ")")) + + xlab(paste("mean", input$selected_var)) + + theme(plot.title = element_text(face = "bold", size = 16)) + }) + + # mean of sampling distribution + output$sampling_mean <- renderText({ + paste0("mean of sampling distribution = ", round(mean(sampling_dist()$x_bar), 2)) + }) + + # mean of sampling distribution + output$sampling_se <- renderText({ + paste0("SE of sampling distribution = ", round(sd(sampling_dist()$x_bar), 2)) + }) + }, + + options = list(height = 500) +) +``` + + +* * * + +## More Practice + +So far, we have only focused on estimating the mean living area in homes in +Ames. Now you'll try to estimate the mean home price. + +Note that while you might be able to answer some of these questions using the app +you are expected to write the required code and produce the necessary plots and +summary statistics. You are welcome to use the app for exploration. + +1. Take a sample of size 15 from the population and calculate the mean `price` + of the homes in this sample. Using this sample, what is your best point estimate + of the population mean of prices of homes? + +1. Since you have access to the population, simulate the sampling + distribution of $\overline{price}$ for samples of size 15 by taking 2000 + samples from the population of size 15 and computing 2000 sample means. + Store these means + in a vector called `sample_means15`. Plot the data, then describe the + shape of this sampling distribution. Based on this sampling distribution, + what would you guess the mean home price of the population to be? Finally, + calculate and report the population mean. + +1. Change your sample size from 15 to 150, then compute the sampling + distribution using the same method as above, and store these means in a + new vector called `sample_means150`. Describe the shape of this sampling + distribution, and compare it to the sampling distribution for a sample + size of 15. Based on this sampling distribution, what would you guess to + be the mean sale price of homes in Ames? + +1. 
Of the sampling distributions from 2 and 3, which has a smaller spread? If + we're concerned with making estimates that are more often close to the + true value, would we prefer a sampling distribution with a large or small spread? + + +
+This is a product of OpenIntro that is released under a [Creative Commons +Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). +This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. +
\ No newline at end of file diff --git a/06_confidence_intervals/confidence_intervals.Rmd b/06_confidence_intervals/confidence_intervals.Rmd new file mode 100644 index 0000000..3c38cf4 --- /dev/null +++ b/06_confidence_intervals/confidence_intervals.Rmd @@ -0,0 +1,239 @@ +--- +title: 'Foundations for statistical inference - Confidence intervals' +output: + html_document: + css: ../lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true +--- + +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +If you have access to data on an entire population, say the size of every +house in Ames, Iowa, it's straightforward to answer questions like, "How big +is the typical house in Ames?" and "How much variation is there in sizes of +houses?". If you have access to only a sample of the population, as is often +the case, the task becomes more complicated. What is your best guess for the +typical size if you only know the sizes of several dozen houses? This sort of +situation requires that you use your sample to make inference on what your +population looks like. + +
+**Setting a seed:** We will take some random samples and build sampling distributions +in this lab, which means you should set a seed on top of your lab. If this concept +is new to you, review the lab concerning probability. +
+ +## Getting Started + +### Load packages + +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for the OpenIntro labs, `oilabs`. + +Let's load the packages. + +```{r load-packages, message=FALSE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report + +To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. + +### The data + +We consider real estate data from the city of Ames, Iowa. This is the same +dataset used in the previous lab. The details of +every real estate transaction in Ames is recorded by the City Assessor's +office. Our particular focus for this lab will be all residential home sales +in Ames between 2006 and 2010. This collection represents our population of +interest. In this lab we would like to learn about these home sales by taking +smaller samples from the full population. Let's load the data. + +```{r load-data} +data(ames) +``` + +In this lab we'll start with a simple random sample of size 60 from the +population. + +```{r sample} +n <- 60 +samp <- sample_n(ames, n) +``` + +Note that +the data set has information on many housing variables, but for the first +portion of the lab we'll focus on the size of the house, represented by the +variable `area`. + +1. Describe the distribution of house area in your sample. What would you say is the + "typical" size within your sample? Also state precisely what you interpreted + "typical" to mean. + +1. Would you expect another student's distribution to be identical to yours? + Would you expect it to be similar? Why or why not? + +## Confidence intervals + +Return for a moment to the question that first motivated this lab: based on +this sample, what can we infer about the population? Based only on this single +sample, the best estimate of the average living area of houses sold in Ames +would be the sample mean, usually denoted as $\bar{x}$ (here we're calling it +`x_bar`). That serves as a good **point estimate** but it would be useful +to also communicate how uncertain we are of that estimate. This uncertainty +can be quantified using a **confidence interval**. + +A confidence interval for a population mean is of the following form +\[ \bar{x} + z^\star \frac{s}{\sqrt{n}} \] + +You should by now be comfortable with calculating the mean and standard deviation of +a sample in R. And we know that the sample size is 60. So the only remaining building +block is finding the appropriate critical value for a given confidence level. We can +use the `qnorm` function for this task, which will give the critical value associated +with a given percentile under the normal distribution. Remember that confidence levels +and percentiles are not equivalent. For example, a 95% confidence level refers to the +middle 95% of the distribution, and the critical value associated with this area will +correspond to the 97.5th percentile. + +We can find the critical value for a 95% confidence interal using +```{r z_star_95} +z_star_95 <- qnorm(0.975) +z_star_95 +``` +which is roughly equal to the value critical value 1.96 that you're likely +familiar with by now. 
+ +Let's finally calculate the confidence interval: +```{r ci} +samp %>% + summarise(x_bar = mean(area), + se = sd(area) / sqrt(n), + me = z_star_95 * se, + lower = x_bar - me, + upper = x_bar + me) +``` + +To recap: even though we don't know what the full population looks like, we're 95% +confident that the true average size of houses in Ames lies between the values `lower` +and `upper`. There are a few conditions that must be met for this interval to be valid. + +1. For the confidence interval to be valid, the sample mean must be normally + distributed and have standard error $s / \sqrt{n}$. What conditions must be + met for this to be true? + +## Confidence levels + +1. What does "95% confidence" mean? + +In this case we have the rare luxury of knowing the true population mean since we +have data on the entire population. Let's calculate this value so that +we can determine if our confidence intervals actually capture it. We'll store it in a +data frame called `params` (short for population parameters), and name it `mu`. + +```{r pop-mean} +params <- ames %>% + summarise(mu = mean(area)) +``` + +1. Does your confidence interval capture the true average size of houses in + Ames? If you are working on this lab in a classroom, does your neighbor's + interval capture this value? + +1. Each student should have gotten a slightly different confidence interval. What + proportion of those intervals would you expect to capture the true population + mean? Why? + +Using R, we're going to collect many samples to learn more about how sample +means and confidence intervals vary from one sample to another. + +Here is the rough outline: + +- Obtain a random sample. +- Calculate the sample's mean and standard deviation, and use these to calculate +and store the lower and upper bounds of the confidence intervals. +- Repeat these steps 50 times. + +We can accomplish this using the `rep_sample_n` function. The following lines of +code takes 50 random samples of size `n` from population (and remember we defined +$n = 60$ earlier), and computes the upper and lower bounds of the confidence intervals based on these samples. + +```{r calculate-50-cis} +ci <- ames %>% + rep_sample_n(size = n, reps = 50, replace = TRUE) %>% + summarise(x_bar = mean(area), + se = sd(area) / sqrt(n), + me = z_star_95 * se, + lower = x_bar - me, + upper = x_bar + me) +``` + +Let's view the first five intervals: + +```{r first-five-intervals} +ci %>% + slice(1:5) +``` + +Next we'll create a plot similar to Figure 4.8 on page 175 of [OpenIntro Statistics, 3rd +Edition](https://www.openintro.org/os). The first step will be to create a new variable in +the `ci` data frame that indicates whether the interval does or does not capture the +true population mean. Note that capturing this value would mean the lower bound of the +confidence interval is below the value and upper bound of the confidence interval is +above the value. Remember that we create new variables using the `mutate` function. + +```{r capture-mu} +ci <- ci %>% + mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no")) +``` + +The `ifelse` function is new. It takes three arguments: first is a logical statement, +second is the value we want if the logical statement yields a true result, and the +third is the value we want if the logical statement yields a false result. + +We now have all the information we need to create the plot. 
+Note that the `geom_errorbar()` function +only understands `y` values, and thus we have used the `coord_flip()` function +to flip the coordinates of the entire plot back to the more familiar vertical +orientation. + +```{r plot-ci} +qplot(data = ci, x = replicate, y = x_bar, color = capture_mu) + + geom_errorbar(aes(ymin = lower, ymax = upper)) + + geom_hline(data = params, aes(yintercept = mu), color = "darkgray") + # draw vertical line + coord_flip() +``` + +1. What proportion of your confidence intervals include the true population mean? Is + this proportion exactly equal to the confidence level? If not, explain why. Make + sure to include your plot in your answer. + +* * * + +## More Practice + +1. Pick a confidence level of your choosing, provided it is not 95%. What is + the appropriate critical value? + +1. Calculate 50 confidence intervals at the confidence level you chose in the + previous question, and plot all intervals on one plot, and calculate the proportion + of intervals that include the true population mean. How does this percentage compare + to the confidence level selected for the intervals? Make + sure to include your plot in your answer. + +
+This is a product of OpenIntro that is released under a [Creative Commons +Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). +This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. +
\ No newline at end of file diff --git a/06_confidence_intervals/confidence_intervals.html b/06_confidence_intervals/confidence_intervals.html new file mode 100644 index 0000000..02f6c44 --- /dev/null +++ b/06_confidence_intervals/confidence_intervals.html @@ -0,0 +1,389 @@ + + + + + + + + + + + + + +Foundations for statistical inference - Confidence intervals + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + + + diff --git a/sampling_distributions/more/AmesHousing.csv b/06_confidence_intervals/more/AmesHousing.csv similarity index 100% rename from sampling_distributions/more/AmesHousing.csv rename to 06_confidence_intervals/more/AmesHousing.csv diff --git a/sampling_distributions/more/AmesHousing.xls b/06_confidence_intervals/more/AmesHousing.xls similarity index 100% rename from sampling_distributions/more/AmesHousing.xls rename to 06_confidence_intervals/more/AmesHousing.xls diff --git a/sampling_distributions/more/ames-readme.txt b/06_confidence_intervals/more/ames-readme.txt similarity index 100% rename from sampling_distributions/more/ames-readme.txt rename to 06_confidence_intervals/more/ames-readme.txt diff --git a/confidence_intervals/more/ames.RData b/06_confidence_intervals/more/ames.RData similarity index 100% rename from confidence_intervals/more/ames.RData rename to 06_confidence_intervals/more/ames.RData diff --git a/sampling_distributions/more/ames.csv b/06_confidence_intervals/more/ames.csv similarity index 100% rename from sampling_distributions/more/ames.csv rename to 06_confidence_intervals/more/ames.csv diff --git a/sampling_distributions/more/ames_dataprep.R b/06_confidence_intervals/more/ames_dataprep.R similarity index 100% rename from sampling_distributions/more/ames_dataprep.R rename to 06_confidence_intervals/more/ames_dataprep.R diff --git a/07_inf_for_numerical_data/inf_for_numerical_data.Rmd b/07_inf_for_numerical_data/inf_for_numerical_data.Rmd new file mode 100644 index 0000000..e137245 --- /dev/null +++ b/07_inf_for_numerical_data/inf_for_numerical_data.Rmd @@ -0,0 +1,192 @@ +--- +title: 'Inference for numerical data' +output: + html_document: + css: ../lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true +--- + +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +## Getting Started + +### Load packages + +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for OpenIntro labs, `oilabs`. + +Let's load the packages. + +```{r load-packages, message=FALSE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report + +To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. + +### The data + +In 2004, the state of North Carolina released a large data set containing +information on births recorded in this state. This data set is useful to +researchers studying the relation between habits and practices of expectant +mothers and the birth of their children. We will work with a random sample of +observations from this data set. + +Load the `nc` data set into our workspace. + +```{r load-data} +data(nc) +``` + +We have observations on 13 different variables, some categorical and some +numerical. The meaning of each variable can be found by bringing up the help file: + +```{r help-nc} +?nc +``` + + +1. What are the cases in this data set? How many cases are there in our sample? + +Remember that you can answer this question by viewing the data in the data viewer or +by using the following command: + +```{r str} +glimpse(nc) +``` + +## Exploratory data analysis + +We will first start with analyzing the weight gained by mothers throughout the +pregnancy: `gained`. 
+ +Using visualization and summary statistics, describe the distribution of weight +gained by mothers during pregnancy. The `favstats` function from `mosaic` can be useful. + +```{r summary} +library(mosaic) +favstats(~gained, data = nc) +``` + +1. How many mothers are we missing weight gain data from? + +Next, consider the possible relationship between a mother's smoking habit and the +weight of her baby. Plotting the data is a useful first step because it helps +us quickly visualize trends, identify strong associations, and develop research +questions. + +2. Make a side-by-side boxplot of `habit` and `weight`. What does the plot +highlight about the relationship between these two variables? + +The box plots show how the medians of the two distributions compare, but we can +also compare the means of the distributions using the following to +first group the data by the `habit` variable, and then calculate the mean +`weight` in these groups using the `mean` function. + +```{r by-means} +nc %>% + group_by(habit) %>% + summarise(mean_weight = mean(weight)) +``` + +There is an observed difference, but is this difference statistically +significant? In order to answer this question we will conduct a hypothesis test +. + +## Inference + +3. Are all conditions necessary for inference satisfied? Comment on each. You can +compute the group sizes with the `summarize` command above by defining a new variable +with the definition `n()`. + +4. Write the hypotheses for testing if the average weights of babies born to +smoking and non-smoking mothers are different. + +Next, we introduce a new function, `inference`, that we will use for conducting +hypothesis tests and constructing confidence intervals. + +```{r inf-weight-habit-ht, tidy=FALSE} +inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, + alternative = "twosided", method = "theoretical") +``` + +Let's pause for a moment to go through the arguments of this custom function. +The first argument is `y`, which is the response variable that we are +interested in: `weight`. The second argument is the explanatory variable, +`x`, which is the variable that splits the data into two groups, smokers and +non-smokers: `habit`. The third argument, `data`, is the data frame these +variables are stored in. Next is `statistic`, which is the sample statistic +we're using, or similarly, the population parameter we're estimating. In future labs +we'll also work with "median" and "proportion". Next we decide on the `type` of inference +we want: a hypothesis test (`"ht"`) or a confidence interval (`"ci"`). When performing a +hypothesis test, we also need to supply the `null` value, which in this case is `0`, +since the null hypothesis sets the two population means equal to each other. +The `alternative` hypothesis can be `"less"`, `"greater"`, or `"twosided"`. +Lastly, the `method` of inference can be `"theoretical"` or `"simulation"` based. + +For more information on the inference function see the help file with `?inference`. + +5. Change the `type` argument to `"ci"` to construct and record a confidence +interval for the difference between the weights of babies born to nonsmoking and +smoking mothers, and interpret this interval in context of the data. Note that by +default you'll get a 95% confidence interval. If you want to change the +confidence level, add a new argument (`conf_level`) which takes on a value +between 0 and 1. 
Also note that when doing a confidence interval arguments like +`null` and `alternative` are not useful, so make sure to remove them. + +By default the function reports an interval for ($\mu_{nonsmoker} - \mu_{smoker}$) +. We can easily change this order by using the `order` argument: + +```{r inf-weight-habit-ci, tidy=FALSE} +inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci", + method = "theoretical", order = c("smoker","nonsmoker")) +``` + +* * * + +## More Practice + +6. Calculate a 95% confidence interval for the average length of pregnancies +(`weeks`) and interpret it in context. Note that since you're doing inference +on a single population parameter, there is no explanatory variable, so you can +omit the `x` variable from the function. + +7. Calculate a new confidence interval for the same parameter at the 90% +confidence level. You can change the confidence level by adding a new argument +to the function: `conf_level = 0.90`. Comment on the width of this interval versus +the one obtained in the previous exercise. + +8. Conduct a hypothesis test evaluating whether the average weight gained by +younger mothers is different than the average weight gained by mature mothers. + +9. Now, a non-inference task: Determine the age cutoff for younger and mature +mothers. Use a method of your choice, and explain how your method works. + +10. Pick a pair of variables: one numerical (response) and one categorical (explanatory). +Come up with a research question evaluating the relationship between these variables. +Formulate the question in a way that it can be answered using a hypothesis test +and/or a confidence interval. Answer your question using the `inference` +function, report the statistical results, and also provide an explanation in +plain language. Be sure to check all assumptions,state your $\alpha$ level, and conclude +in context. (Note: Picking your own variables, coming up with a research question, +and analyzing the data to answer this question is basically what you'll need to do for +your project as well.) + +
+This is a product of OpenIntro that is released under a [Creative Commons +Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). +This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab +written by the faculty and TAs of UCLA Statistics. +
\ No newline at end of file diff --git a/07_inf_for_numerical_data/inf_for_numerical_data.html b/07_inf_for_numerical_data/inf_for_numerical_data.html new file mode 100644 index 0000000..84c77bf --- /dev/null +++ b/07_inf_for_numerical_data/inf_for_numerical_data.html @@ -0,0 +1,361 @@ + + + + + + + + + + + + + +Inference for numerical data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Getting Started

+
+

Load packages

+

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.

+

Let’s load the packages.

+
library(dplyr)
+library(ggplot2)
+library(oilabs)
+
+
+

Creating a reproducible lab report

+

To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the oilabs package.

+
+
+

The data

+

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

+

Load the nc data set into our workspace.

+
data(nc)
+

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

+
?nc
+
    +
  1. What are the cases in this data set? How many cases are there in our sample?
  2. +
+

Remember that you can answer this question by viewing the data in the data viewer or by using the following command:

+
glimpse(nc)
+
+
+
+

Exploratory data analysis

+

We will first start with analyzing the weight gained by mothers throughout the pregnancy: gained.

+

Using visualization and summary statistics, describe the distribution of weight gained by mothers during pregnancy. The favstats function from mosaic can be useful.

+
library(mosaic)
+favstats(~gained, data = nc)
+
    +
  1. How many mothers are we missing weight gain data from?
  2. +
+

Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

+
    +
  1. Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
  2. +
+

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the habit variable, and then calculate the mean weight in these groups using the mean function.

+
nc %>%
+  group_by(habit) %>%
+  summarise(mean_weight = mean(weight))
+

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test .

+
+
+

Inference

+
    +
  1. Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

  2. +
  3. Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

  4. +
+

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.

+
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, 
+          alternative = "twosided", method = "theoretical")
+

Let’s pause for a moment to go through the arguments of this custom function. The first argument is y, which is the response variable that we are interested in: weight. The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, smokers and non-smokers: habit. The third argument, data, is the data frame these variables are stored in. Next is statistic, which is the sample statistic we’re using, or similarly, the population parameter we’re estimating. In future labs we’ll also work with “median” and “proportion”. Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci"). When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other. The alternative hypothesis can be "less", "greater", or "twosided". Lastly, the method of inference can be "theoretical" or "simulation" based.

+

For more information on the inference function see the help file with ?inference.

+
    +
  1. Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to nonsmoking and smoking mothers, and interpret this interval in context of the data. Note that by default you’ll get a 95% confidence interval. If you want to change the confidence level, add a new argument (conf_level) which takes on a value between 0 and 1. Also note that when doing a confidence interval arguments like null and alternative are not useful, so make sure to remove them.
  2. +
+

By default the function reports an interval for (\(\mu_{nonsmoker} - \mu_{smoker}\)) . We can easily change this order by using the order argument:

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci", 
          method = "theoretical", order = c("smoker","nonsmoker"))

More Practice

  1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context. Note that since you’re doing inference on a single population parameter, there is no explanatory variable, so you can omit the x variable from the function. (A possible call is sketched after this list.)

  2. Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding a new argument to the function: conf_level = 0.90. Comment on the width of this interval versus the one obtained in the previous exercise.

  3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

  4. Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

  5. Pick a pair of variables: one numerical (response) and one categorical (explanatory). Come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the inference function, report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context. (Note: Picking your own variables, coming up with a research question, and analyzing the data to answer this question is basically what you’ll need to do for your project as well.)
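
For the first exercise, one possible call (a sketch; it assumes the pregnancy length variable is named weeks, as the exercise suggests, and simply omits the x argument) is:

inference(y = weeks, data = nc, statistic = "mean", type = "ci",
          method = "theoretical")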

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.

+ +
+ + + + + + + + diff --git a/inf_for_numerical_data/more/1000births.xlsx b/07_inf_for_numerical_data/more/1000births.xlsx similarity index 100% rename from inf_for_numerical_data/more/1000births.xlsx rename to 07_inf_for_numerical_data/more/1000births.xlsx diff --git a/inf_for_numerical_data/more/2000births.txt b/07_inf_for_numerical_data/more/2000births.txt similarity index 100% rename from inf_for_numerical_data/more/2000births.txt rename to 07_inf_for_numerical_data/more/2000births.txt diff --git a/inf_for_numerical_data/more/data-processing-code.R b/07_inf_for_numerical_data/more/data-processing-code.R similarity index 100% rename from inf_for_numerical_data/more/data-processing-code.R rename to 07_inf_for_numerical_data/more/data-processing-code.R diff --git a/inf_for_numerical_data/more/nc.RData b/07_inf_for_numerical_data/more/nc.RData similarity index 100% rename from inf_for_numerical_data/more/nc.RData rename to 07_inf_for_numerical_data/more/nc.RData diff --git a/inf_for_numerical_data/more/nc.csv b/07_inf_for_numerical_data/more/nc.csv similarity index 100% rename from inf_for_numerical_data/more/nc.csv rename to 07_inf_for_numerical_data/more/nc.csv diff --git a/inf_for_numerical_data/more/ncbirths.csv b/07_inf_for_numerical_data/more/ncbirths.csv similarity index 100% rename from inf_for_numerical_data/more/ncbirths.csv rename to 07_inf_for_numerical_data/more/ncbirths.csv diff --git a/08_inf_for_categorical_data/inf_for_categorical_data.Rmd b/08_inf_for_categorical_data/inf_for_categorical_data.Rmd new file mode 100644 index 0000000..e672b6f --- /dev/null +++ b/08_inf_for_categorical_data/inf_for_categorical_data.Rmd @@ -0,0 +1,288 @@ +--- +title: "Inference for categorical data" +runtime: shiny +output: + html_document: + css: www/lab.css + highlight: pygments + theme: cerulean + toc: true + toc_float: true +--- + +```{r global_options, include=FALSE} +knitr::opts_chunk$set(eval = FALSE) +``` + +In August of 2012, news outlets ranging from the [Washington Post](http://www.washingtonpost.com/national/on-faith/poll-shows-atheism-on-the-rise-in-the-us/2012/08/13/90020fd6-e57d-11e1-9739-eef99c5fb285_story.html) to the [Huffington Post](http://www.huffingtonpost.com/2012/08/14/atheism-rise-religiosity-decline-in-america_n_1777031.html) ran a story about the rise of atheism in America. The source for the story was a poll that asked people, "Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person, or a convinced atheist?" This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what's at play when making inference about population proportions using categorical data. + +## Getting Started + +### Load packages + +In this lab we will explore the data using the `dplyr` package and visualize it +using the `ggplot2` package for data visualization. The data can be found in the +companion package for OpenIntro labs, `oilabs`. + +Let's load the packages. + +```{r load-packages, message=FALSE, eval=TRUE} +library(dplyr) +library(ggplot2) +library(oilabs) +``` + +### Creating a reproducible lab report + +To create your new lab report, start by opening a new R Markdown document... From Template... then select Lab Report from the `oilabs` package. 
+ +### The survey + +The press release for the poll, conducted by WIN-Gallup International, can be accessed [here](http://www.wingia.com/web/files/richeditor/filemanager/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf). + +Take a moment to review the report then address the following questions. + +1. In the first paragraph, several key findings are reported. Do these + percentages appear to be *sample statistics* (derived from the data + sample) or *population parameters*? Explain your reasoning. + +1. The title of the report is "Global Index of Religiosity and Atheism". To + generalize the report's findings to the global human population, what must + we assume about the sampling method? Does that seem like a reasonable + assumption? + +### The data + +Turn your attention to Table 6 (pages 15 and 16), which reports the +sample size and response percentages for all 57 countries. While this is +a useful format to summarize the data, we will base our analysis on the +original data set of individual responses to the survey. Load this data +set into R with the following command. + +```{r head-data} +data(atheism) +``` + +1. What does each row of Table 6 correspond to? What does each row of + `atheism` correspond to? + +To investigate the link between these two ways of organizing this data, take a +look at the estimated proportion of atheists in the United States. Towards +the bottom of Table 6, we see that this is 5%. We should be able to come to +the same number using the `atheism` data. + +1. Using the command below, create a new dataframe called `us12` that contains + only the rows in `atheism` associated with respondents to the 2012 survey + from the United States. Next, calculate the proportion of atheist + responses. Does it agree with the percentage in Table 6? If not, why? + +```{r us-atheism} +us12 <- atheism %>% + filter(nationality == "United States", year == "2012") +``` + +## Inference on proportions + +As was hinted earlier, Table 6 provides *statistics*, that is, +calculations made from the sample of 51,927 people. What we'd like, though, is +insight into the population *parameters*. You answer the question, "What +proportion of people in your sample reported being atheists?" with a +statistic; while the question "What proportion of people on earth would report +being atheists" is answered with an estimate of the parameter. + +The inferential tools for estimating population proportion are analogous to +those used for means in the last chapter: the confidence interval and the +hypothesis test. + +1. Write out the conditions for inference to construct a 95% confidence + interval for the proportion of atheists in the United States in 2012. + Are you confident all conditions are met? + +If the conditions for inference are reasonable, we can calculate +the standard error and construct the interval in R. + +```{r us-atheism-ci, tidy = FALSE} +us12 %>% + summarize(N = n(), atheist = sum(response == "atheist")) %>% + mutate(p_hat = atheist / N, + se = sqrt(p_hat * (1 - p_hat) / N), + me = qnorm(0.975) * se, + lower = p_hat - me, + upper = p_hat + me) +``` + +Note that since the goal is to construct an interval estimate for a +proportion, it's necessary to specify what constitutes a "success", which here +is a response of `"atheist"`. + +Although formal confidence intervals and hypothesis tests don't show up in the +report, suggestions of inference appear at the bottom of page 7: "In general, +the error margin for surveys of this kind is $\pm$ 3--5% at 95% confidence". + +1. 
Based on the R output, what is the margin of error for the estimate of the + proportion of the proportion of atheists in US in 2012? + +1. Calculate confidence intervals for the + proportion of atheists in 2012 in two other countries of your choice, and + report the associated margins of error. Be sure to note whether the + conditions for inference are met, and interpet the interval in context of the data. + It may be helpful to create new data sets for each of the two countries first, and + then use these data sets to construct the confidence + intervals. + +## How does the proportion affect the margin of error? + +Imagine you've set out to survey 1000 people on two questions: are you female? +and are you left-handed? Since both of these sample proportions were +calculated from the same sample size, they should have the same margin of +error, right? Wrong! While the margin of error does change with sample size, +it is also affected by the proportion. + +Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This +is then used in the formula for the margin of error for a 95% confidence +interval: +$$ +ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. +$$ +Since the +population proportion $p$ is in this $ME$ formula, it should make sense that +the margin of error is in some way dependent on the population proportion. We +can visualize this relationship by creating a plot of $ME$ vs. $p$. + +Since sample size is irrelevant to this discussion, let's just set it to +some value ($n = 1000$) and use this value in the following calculations: + +```{r n-for-me-plot} +n <- 1000 +``` + +The first step is to make a variable `p` that is a sequence from 0 to 1 with +each number incremented by 0.01. We can then create a variable of the margin of +error (`me`) associated with each of these values of `p` using the familiar +approximate formula ($ME = 2 \times SE$). + +```{r p-me} +p <- seq(from = 0, to = 1, by = 0.01) +me <- 2 * sqrt(p * (1 - p)/n) +``` + +Lastly, we plot the two variables against each other to reveal their relationship. +To do so, we need to first put these variables in a data frame that we can +call in the `qplot` function. + +```{r me-plot} +dd <- data.frame(p = p, me = me) +qplot(x = p, y = me, data = dd, + ylab = "Margin of Error", + xlab = "Population Proportion") + + geom_line() +``` + +1. Describe the relationship between `p` and `me`. Include the margin of + error vs. population proportion plot you constructed in your answer. For + a given sample size, for which value of `p` is margin of error maximized? + +## Success-failure condition + +We have emphasized that you must always check conditions before making +inference. For inference on proportions, the sample proportion can be assumed +to be nearly normal if it is based upon a random sample of independent +observations and if both $np \geq 10$ and $n(1 - p) \geq 10$. This rule of +thumb is easy enough to follow, but it makes one wonder: what's so special +about the number 10? + +The short answer is: nothing. You could argue that we would be fine with 9 or +that we really should be using 11. What is the "best" value for such a rule of +thumb is, at least to some degree, arbitrary. However, when $np$ and $n(1-p)$ +reaches 10 the sampling distribution is sufficiently normal to use confidence +intervals and hypothesis tests that are based on that approximation. + +We can investigate the interplay between $n$ and $p$ and the shape of the +sampling distribution by using simulations. 
Play around with the following +app to investigate how the shape, center, and spread of the distribution of +$\hat{p}$ changes as $n$ and $p$ changes. + +```{r sf-app, echo=FALSE, eval=TRUE} +inputPanel( + numericInput("n", label = "Sample size:", value = 300), + + sliderInput("p", label = "Population proportion:", + min = 0, max = 1, value = 0.1, step = 0.01), + + numericInput("x_min", label = "Min for x-axis:", value = 0, min = 0, max = 1), + numericInput("x_max", label = "Max for x-axis:", value = 1, min = 0, max = 1) +) + +renderPlot({ + pp <- data.frame(p_hat = rep(0, 5000)) + for(i in 1:5000){ + samp <- sample(c(TRUE, FALSE), input$n, replace = TRUE, + prob = c(input$p, 1 - input$p)) + pp$p_hat[i] <- sum(samp == TRUE) / input$n + } + bw <- diff(range(pp$p_hat)) / 30 + ggplot(data = pp, aes(x = p_hat)) + + geom_histogram(binwidth = bw) + + xlim(input$x_min, input$x_max) + + ggtitle(paste0("Distribution of p_hats, drawn from p = ", input$p, ", n = ", input$n)) +}) +``` + +1. Describe the sampling distribution of sample proportions at $n = 300$ and + $p = 0.1$. Be sure to note the center, spread, and shape. + +1. Keep $n$ constant and change $p$. How does the shape, center, and spread + of the sampling distribution vary as $p$ changes. You might want to adjust + min and max for the $x$-axis for a better view of the distribution. + +1. Now also change $n$. How does $n$ appear to affect the distribution of $\hat{p}$? + +1. If you refer to Table 6, you'll find that Australia has a sample + proportion of 0.1 in a sample size of 1040, and that Ecuador has a sample + proportion of 0.02 on 400 subjects. Let's suppose for this exercise that + these point estimates are actually the truth. Construct their sampling + distributions by using these values as inputs in the app. Do you think it + is sensible to proceed with inference and report margin of errors, as the + report does? + +* * * + +## More Practice + +The question of atheism was asked by WIN-Gallup International in a similar +survey that was conducted in 2005. (We assume here that sample sizes have +remained the same.) Table 4 on page 13 of the report summarizes survey results +from 2005 and 2012 for 39 countries. + + +1. Is there convincing evidence that Spain has seen a change in its atheism index + between 2005 and 2012? As always, write out the hypotheses for any tests you + conduct and outline the status of the conditions for inference. If you find a + significant difference, also quantify this difference with a confidence interval. \ + *Hint:* Use the difference of two proportions methodology (i.e. find the + observed difference, compute the standard error, compute the z-score, etc.) + +1. Is there convincing evidence that the US has seen a change in its atheism index + between 2005 and 2012? As always, write out the hypotheses for any tests you + conduct and outline the status of the conditions for inference. If you find a + significant difference, also quantify this difference with a confidence interval. + +1. If in fact there has been no change in the atheism index in the countries + listed in Table 4, in how many of those countries would you expect to + detect a change (at a significance level of 0.05) simply by chance?\ + *Hint:* Review the definition of the Type 1 error. + +1. Suppose you're hired by the local government to estimate the proportion of + residents that attend a religious service on a weekly basis. According to + the guidelines, the estimate must have a margin of error no greater than + 1% with 95% confidence. 
You have no idea what to expect for $p$. How many + people would you have to sample to ensure that you are within the + guidelines?\ + *Hint:* Refer to your plot of the relationship between $p$ and margin of + error. This question does not require using the dataset. + +
+This is a product of OpenIntro that is released under a [Creative Commons +Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0). +This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. +
diff --git a/inf_for_categorical_data/inf_for_categorical_data.html b/08_inf_for_categorical_data/inf_for_categorical_data.html similarity index 98% rename from inf_for_categorical_data/inf_for_categorical_data.html rename to 08_inf_for_categorical_data/inf_for_categorical_data.html index 29fe1f8..b451e90 100644 --- a/inf_for_categorical_data/inf_for_categorical_data.html +++ b/08_inf_for_categorical_data/inf_for_categorical_data.html @@ -47,7 +47,7 @@ - + @@ -76,30 +76,31 @@

Inference for categorical data

In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?” This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what’s at play when making inference about population proportions using categorical data.

The survey

-

To access the press release for the poll, conducted by WIN-Gallup International, click on the following link:

-

http://www.wingia.com/web/files/richeditor/filemanager/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf

-

Take a moment to review the report then address the following questions.

+

Take a moment to review the press release for the poll conducted by WIN-Gallup International then address the following questions.

  1. In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

  2. The title of the report is “Global Index of Religiosity and Atheism.” To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

The data

-

Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into R with the following command.

-download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
-load("atheism.RData")
+

Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load the necessary packages and the data set into R with the following command.

+library(oilabs)
+library(mosaic)
+data(atheism)
  1. What does each row of Table 6 correspond to? What does each row of atheism correspond to?

To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We should be able to come to the same number using the atheism data.

  1. Using the command below, create a new data frame called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

-us12 <- subset(atheism, nationality == "United States" & year == "2012")
+us12 <- 
+  atheism %>% 
+  filter(nationality == "United States" & year == "2012")

Inference on proportions

@@ -108,27 +109,29 @@

Inference on proportions

  1. Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
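
One possible way to gather the counts needed for this check (a sketch; it assumes the dplyr package is loaded and uses the us12 data frame created above):

# Sample size and number of "atheist" responses in us12
us12 %>%
  summarize(n = n(), atheists = sum(response == "atheist"))
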
-

If the conditions for inference are reasonable, we can either calculate the standard error and construct the interval by hand, or allow the inference function to do it for us.

-inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
-          success = "atheist")

Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to specify what constitutes a “success”, which here is a response of "atheist".

-

Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is \(\pm\) 3-5% at 95% confidence”.

+

If the conditions for inference are reasonable (check this!), we can calculate the standard error and construct the confidence interval.

+p <- tally(~response, data=us12, format='proportion')["atheist"]
+n <- nrow(us12)
+se <- sqrt(p * (1-p) / n)
+p + qnorm(c(0.025, 0.975)) * se

Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to specify what constitutes a “success”, which here is a response of "atheist". Secondly, the qnorm function helps us find the width (in terms of the number of standard deviations from the mean) that our confidence interval needs to be in order to achieve a 95% confidence level. Note that since the normal distribution is symmetric, by cutting off the smallest 2.5% and the largest 2.5%, we’re left with the middle 95%. By changing the argument to qnorm we can find intervals that correspond to different confidence levels.
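
For example, following the qnorm logic described above, a 90% confidence interval would cut off 5% in each tail (a sketch reusing the p and se computed in the chunk above):

# Hypothetical illustration: a 90% confidence interval for the same proportion
p + qnorm(c(0.05, 0.95)) * se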

+

Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is \(\pm\) 3-5% at 95% confidence.”

  1. Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

- 2. Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
+ 2. Calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets to construct the confidence intervals.

How does the proportion affect the margin of error?

Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.

-

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \(ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n}\). Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).

-

The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01. We can then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (\(ME = 2 \times SE\)). Lastly, we plot the two vectors against each other to reveal their relationship.

+

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \(ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n}\). Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\) for all values of \(p\) between 0 and 1.

+

The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01. We can then create a vector of the margin of error (ME) associated with each of these values of p using the familiar approximate formula (\(ME = 1.96 \times SE\)). Lastly, we plot the two vectors against each other to reveal their relationship.

n <- 1000
-p <- seq(0, 1, 0.01)
-me <- 2 * sqrt(p * (1 - p)/n)
-plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
+p <- seq(from = 0, to = 1, by = 0.01)
+ME <- 2 * sqrt(p * (1 - p)/n)
+xyplot(ME ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
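
As a quick worked check of the formula (illustration only), with n = 1000 the margin of error is largest at p = 0.5:

2 * sqrt(0.5 * (1 - 0.5) / 1000)   # about 0.032, i.e. roughly 3 percentage points
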
  1. Describe the relationship between p and ME.
@@ -136,24 +139,23 @@

Success-failure condition

The textbook emphasizes that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes one wonder: what’s so special about the number 10?

The short answer is: nothing. You could argue that we would be fine with 9 or that we really should be using 11. The “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.

We can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of 0.1. For each of the 5000 samples we compute \(\hat{p}\) and then plot a histogram to visualize their distribution.

+

Here, we first resample \(n\) times from the list of responses, with the probability of drawing “atheist” equal to \(p = 0.1\). Then we tally the proportion of those responses that are “atheist”. We do this 5000 times.

p <- 0.1
 n <- 1040
-p_hats <- rep(0, 5000)
+responses <- c("atheist", "non_atheist")
 
-for(i in 1:5000){
-  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
-  p_hats[i] <- sum(samp == "atheist")/n
-}
+p_hats <- 
+  do(5000) * 
+  responses %>%
+  resample(size = n, prob = c(p, 1-p)) %>%
+  tally(format = "proportion")
 
-hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
-

These commands build up the sampling distribution of \(\hat{p}\) using the familiar for loop. You can read the sampling procedure for the first line of code inside the for loop as, “take a sample of size \(n\) with replacement from the choices of atheist and non-atheist with probabilities \(p\) and \(1 - p\), respectively.” The second line in the loop says, “calculate the proportion of atheists in this sample and record this value.” The loop allows us to repeat this process 5,000 times to build a good representation of the sampling distribution.

+histogram(~atheist, data=p_hats, main = "p = 0.1, n = 1040")

These commands build up the sampling distribution of \(\hat{p}\) using the familiar do loop. You can read the sampling procedure for the inner bit of code as, “take a sample of size \(n\) with replacement from the choices of atheist and non-atheist with probabilities \(p\) and \(1 - p\), respectively.” The tally command says, “calculate the proportion of atheists in this sample and record this value.” The loop allows us to repeat this process 5,000 times to build a good representation of the sampling distribution.
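
If you would like numerical summaries of the simulated \(\hat{p}\) values to go with the histogram, one option (a sketch; it assumes p_hats contains an atheist column, as in the histogram call above) is:

favstats(~atheist, data = p_hats)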

  1. Describe the sampling distribution of sample proportions at \(n = 1040\) and \(p = 0.1\). Be sure to note the center, spread, and shape.
    Hint: Remember that R has functions such as mean to calculate summary statistics.

- 2. Repeat the above simulation three more times but with modified sample sizes and proportions: for \(n = 400\) and \(p = 0.1\), \(n = 1040\) and \(p = 0.02\), and \(n = 400\) and \(p = 0.02\). Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does \(n\) appear to affect the distribution of \(\hat{p}\)? How does \(p\) affect the sampling distribution?
-
-Once you’re done, you can reset the layout of the plotting window by using the par(mfrow = c(1, 1)) command or clicking on “Clear All” above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.
+ 2. Repeat the above simulation three more times but with modified sample sizes and proportions: for \(n = 400\) and \(p = 0.1\), \(n = 1040\) and \(p = 0.02\), and \(n = 400\) and \(p = 0.02\). Plot all four histograms. Describe the three new sampling distributions. Based on these limited plots, how does \(n\) appear to affect the distribution of \(\hat{p}\)? How does \(p\) affect the sampling distribution?

  3. If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?


@@ -161,7 +163,7 @@

Success-failure condition

On your own

The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.