10-inference.Rmd
23 additions & 23 deletions
@@ -61,9 +61,9 @@ every single undergraduate in North America whether or not they own an iPhone. I
61
61
directly computing population parameters is often time-consuming and costly, and sometimes impossible.
62
62
63
63
A more practical approach would be to collect measurements for a **sample**: a subset of
64
-
individuals collected from the population. We can then compute a **sample statistic**—a numerical
64
+
individuals collected from the population. We can then compute a **sample estimate**—a numerical
65
65
characteristic of the sample—that estimates the population parameter. For example, suppose we randomly selected 100 undergraduate students across North America (the sample) and computed the proportion of those
66
-
students who own an iPhone (the sample statistic). In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population.
66
+
students who own an iPhone (the sample estimate). In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population.
The price per night of all Airbnb rentals in Vancouver, BC is \$`r round(population_parameters$pop_mean,2)`, on average.
250
+
The price per night of all Airbnb rentals in Vancouver, BC is \$`r round(population_parameters$pop_mean,2)`, on average. This value is our population parameter since we are calculating it using the population data.
251
251
252
-
Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night? We could answer this question by taking a random sample of as many Airbnb listings as we had time to, let's say we could do this for 40 listings. What would such a sample look like?
252
+
Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as we had time to; let's say we could do this for 40 listings. What would such a sample look like?
253
253
254
-
Let's take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using `rep_sample_n`. After doing this we
255
-
create a histogram to visualize the
254
+
Let's take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using `rep_sample_n`. After doing this we create a histogram to visualize the
256
255
distribution of observations in the sample,
257
256
and calculate the mean of our sample. This number is a **point estimate** for the mean of the full population.
258
257
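The single-sample step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the chapter's own code: the `population` data frame and its simulated `price` column are hypothetical stand-ins for the Airbnb listings, and base R's `sample()` is used in place of `rep_sample_n`.

```r
# Hypothetical stand-in for the population of listings
# (simulated prices, NOT the real Airbnb data).
set.seed(2020)
population <- data.frame(
  price = round(rlnorm(5000, meanlog = 4.8, sdlog = 0.6), 2)
)

# Population parameter: computable here only because we hold the
# whole (simulated) population.
pop_mean <- mean(population$price)

# Draw one random sample of 40 listings, without replacement.
sample_rows <- sample(nrow(population), size = 40)
one_sample <- population[sample_rows, , drop = FALSE]

# Point estimate of the population mean: the sample mean.
point_estimate <- mean(one_sample$price)

c(population_mean = pop_mean, point_estimate = point_estimate)
```

Rerunning the sampling lines with a different seed would give a different point estimate, which is the sample-to-sample variation the following hunk discusses.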
@@ -276,7 +275,7 @@ Note that in practice, we usually cannot compute the accuracy of the estimate, s
276
275
parameter; if we did, we wouldn't need to estimate it!
277
276
278
277
Also recall from the previous section that the point estimate can vary; if
279
-
we took another random sample from the population, then the value of our statistic may change.
278
+
we took another random sample from the population, then the value of our estimate may change.
280
279
So then did we just get lucky with our point estimate above?
281
280
How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population,
282
281
we can take many samples and plot the **sampling distribution** of sample means for samples of size 40 to get a sense
@@ -433,14 +432,15 @@ in a more reliable point estimate of the population parameter.
433
432
--->
434
433
435
434
### Summary
436
-
1. A *statistic* is a value computed using a sample from a population; a *point estimate* is a statistic that is a single value (e.g. a mean or proportion)
437
-
2. The *sampling distribution* of a statistic is the distribution of the statistic for all possible samples of a fixed size from the same population.
435
+
1. A *point estimate* is a single value computed using a sample from a population (e.g. a mean or proportion)
436
+
2. The *sampling distribution* of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population.
438
437
3. The sample means and proportions calculated from samples are centered around the population mean and proportion, respectively.
439
438
4. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
440
439
5. The shape of the sampling distribution is usually bell-shaped with one peak and centred at the population mean or proportion.
441
440
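Summary points 3 to 5 can be checked with a small simulation. The sketch below assumes a simulated `population` (a hypothetical stand-in, not the Airbnb data) and approximates each sampling distribution with 2000 repeated samples rather than all possible samples; the helper `sample_mean` and the chosen sample sizes are illustrative.

```r
# Hypothetical population of prices (simulated for illustration only).
set.seed(2020)
population <- data.frame(price = rlnorm(5000, meanlog = 4.8, sdlog = 0.6))

# Mean of one random sample of size n, drawn without replacement.
sample_mean <- function(n) mean(sample(population$price, size = n))

# Approximate the sampling distribution of the mean for several
# sample sizes, using 2000 repeated samples per size.
sizes <- c(20, 40, 160)
sampling_dists <- lapply(sizes, function(n) replicate(2000, sample_mean(n)))
names(sampling_dists) <- paste0("n_", sizes)

# Centre of each approximate sampling distribution; these should sit
# close to the population mean.
centres <- sapply(sampling_dists, mean)

# Spread of each approximate sampling distribution; this should
# shrink as the sample size grows.
spreads <- sapply(sampling_dists, sd)

rbind(centres, spreads)
mean(population$price)
```

Passing any element of `sampling_dists` to `hist()` gives a quick visual check of the roughly bell-shaped form described in point 5.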
442
441
*Why all this emphasis on sampling distributions?*
443
-
Usually, we don't have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate's value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single *point estimate* for the population parameter alone may not be enough. Using simulations, we can see patterns of the sample statistic's sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can ``predict" what the sampling distribution would look like for a sample, we could construct a range of values we think the population parameter's value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; however, in this book, we will use the bootstrap method to do this, as we will see in the next section.
442
+
443
+
Usually, we don't have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate's value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single *point estimate* for the population parameter alone may not be enough. Using simulations, we can see what the sample estimate's sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can "predict" what the sampling distribution would look like for a sample, we could construct a range of values within which we think the population parameter's value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; in this book, we will use the bootstrap method, as we will see in the next section.
Let's continue working with our Airbnb data. Once again, let's say we are interested
530
-
in estimating the population mean price per night of all Airbnb listings in
531
-
Vancouver, Canada from a single sample we collected of size 40.
529
+
Let's continue working with our Airbnb data. Once again, let's say we are interested in estimating the population mean price per night of all Airbnb listings in
530
+
Vancouver, Canada using a single sample we collected of size 40.
532
531
533
532
To simulate doing this in R, we will use `rep_sample_n` to take a random sample from our population. In real life we wouldn't do this step in R; we would instead simply load into R the data that we, or our collaborators, collected.
534
533
535
-
After we have our sample, we will visualize it's distribution and calculate our point estimate, the sample mean.
534
+
After we have our sample, we will visualize its distribution and calculate our point estimate, the sample mean.
536
535
537
536
```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histogram of price per night ($) for one sample of size 40", out.width = "60%"}
Let's compare our bootstrap distribution with the true sampling distribution (taking many samples from the population).
631
630
632
-
```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of distribution of the bootstrap sample means and sampling distribution", out.height="60%"}
631
+
```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of distribution of the bootstrap sample means and sampling distribution", out.height="50%"}
To finish our estimation of the population parameter, we would report the point estimate and our confidence interval's lower and upper bounds. Here the sample mean price-per-night of 40 Airbnb listings was \$`r round(mean(sample_1$price),2)`, and we are 95\% "confident" that the true population mean price-per-night for all Airbnb listings in Vancouver is between \$`r round(bounds[1],2)`, \$`r round(bounds[2],2)`).
811
+
To finish our estimation of the population parameter, we would report the point estimate and our confidence interval's lower and upper bounds. Here the sample mean price-per-night of 40 Airbnb listings was \$`r round(mean(sample_1$price),2)`, and we are 95\% "confident" that the true population mean price-per-night for all Airbnb listings in Vancouver is between \$`r round(bounds[1],2)` and \$`r round(bounds[2],2)`.
813
812
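One common way to obtain a pair of values like `bounds` is the percentile method applied to a bootstrap distribution. The sketch below uses base R and assumes a simulated `sample_1`, not the chapter's actual sample; how the chapter itself computes `bounds` is not shown in this diff, so the 2.5th and 97.5th percentile choice here is an assumption.

```r
# Hypothetical single sample of 40 prices (simulated for illustration).
set.seed(2020)
sample_1 <- data.frame(price = rlnorm(40, meanlog = 4.8, sdlog = 0.6))

# Bootstrap: resample the sample itself WITH replacement, keeping the
# original sample size, and record the mean of each resample.
boot_means <- replicate(
  5000,
  mean(sample(sample_1$price, size = nrow(sample_1), replace = TRUE))
)

# Percentile method (assumed here): the middle 95% of the bootstrap
# distribution gives the lower and upper bounds.
bounds <- quantile(boot_means, probs = c(0.025, 0.975))

# Report the point estimate together with the interval.
point_estimate <- mean(sample_1$price)
round(c(estimate = point_estimate, lower = bounds[[1]], upper = bounds[[2]]), 2)
```

The interval is built entirely from the one sample, which is why it can be computed in the realistic case where the population is unavailable.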
814
813
Notice that our interval does indeed contain the true
815
814
population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
@@ -820,6 +819,7 @@ This chapter is only the beginning of the journey into statistical inference. We
820
819
## Additional readings
821
820
822
821
For more about statistical inference and bootstrapping, refer to
823
-
- Chapters 7 - 8 of [Modern Dive](https://moderndive.com/) Statistical
824
-
Inference via Data Science by Chester Ismay and Albert Y. Kim
822
+
823
+
- Chapters 7 - 8 of [Modern Dive: Statistical
824
+
Inference via Data Science](https://moderndive.com/) by Chester Ismay and Albert Y. Kim
825
825
- Chapters 4 - 7 of [OpenIntro Statistics - Fourth Edition](https://www.openintro.org/) by David M. Diez, Christopher D. Barr and Mine Cetinkaya-Rundel