
Commit 3558bda

Melissa Lee authored and committed
changed statistic to estimate, updated population vs sample plot, minor grammar fixes
1 parent 524e56f commit 3558bda

File tree

3 files changed: +44, -42 lines changed


10-inference.Rmd

Lines changed: 23 additions & 23 deletions
@@ -61,9 +61,9 @@ every single undergraduate in North America whether or not they own an iPhone. I
 directly computing population parameters is often time-consuming and costly, and sometimes impossible.
 
 A more practical approach would be to collect measurements for a **sample**: a subset of
-individuals collected from the population. We can then compute a **sample statistic**—a numerical
+individuals collected from the population. We can then compute a **sample estimate**—a numerical
 characteristic of the sample—that estimates the population parameter. For example, suppose we randomly selected 100 undergraduate students across North America (the sample) and computed the proportion of those
-students who own an iPhone (the sample statistic). In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population.
+students who own an iPhone (the sample estimate). In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population.
 
 ```{r 11-population-vs-sample, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Population versus sample", fig.retina = 2}
 knitr::include_graphics("img/population_vs_sample.svg")
@@ -151,13 +151,13 @@ choc_sample_2 <- summarize(samples_2, n = sum(flavour == "chocolate"),
 choc_sample_2
 ```
 
-Notice that we get a different value for our statistic this time. The
+Notice that we get a different value for our estimate this time. The
 proportion of chocolate Timbits in this sample is `r round(choc_sample_2$prop, 2)`.
 If we were to do this again, another random sample could also give a
-different result. Statistics vary from sample to sample
+different result. Estimates vary from sample to sample
 due to **sampling variability**.
 
-But just how much should we expect the statistics of our random
+But just how much should we expect the estimates of our random
 samples to vary? In order to understand this, we will simulate taking more samples
 of size 40 from our population of Timbits, and calculate the
 proportion of chocolate Timbits in each sample. We can then
@@ -247,12 +247,11 @@ population_parameters <- airbnb %>%
   summarize(pop_mean = mean(price))
 population_parameters
 ```
-The price per night of all Airbnb rentals in Vancouver, BC is \$`r round(population_parameters$pop_mean,2)`, on average.
+The price per night of all Airbnb rentals in Vancouver, BC is \$`r round(population_parameters$pop_mean,2)`, on average. This value is our population parameter since we are calculating it using the population data.
 
-Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night? We could answer this question by taking a random sample of as many Airbnb listings as we had time to, let's say we could do this for 40 listings. What would such a sample look like?
+Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as we had time to; let's say we could do this for 40 listings. What would such a sample look like?
 
-Let's take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using `rep_sample_n`. After doing this we
-create a histogram to visualize the
+Let's take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using `rep_sample_n`. After doing this we create a histogram to visualize the
 distribution of observations in the sample,
 and calculate the mean of our sample. This number is a **point estimate** for the mean of the full population.
 
@@ -276,7 +275,7 @@ Note that in practice, we usually cannot compute the accuracy of the estimate, s
 parameter; if we did, we wouldn't need to estimate it!
 
 Also recall from the previous section that the point estimate can vary; if
-we took another random sample from the population, then the value of our statistic may change.
+we took another random sample from the population, then the value of our estimate may change.
 So then did we just get lucky with our point estimate above?
 How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population,
 we can take many samples and plot the **sampling distribution** of sample means for samples of size 40 to get a sense
@@ -433,14 +432,15 @@ in a more reliable point estimate of the population parameter.
 --->
 
 ### Summary
-1. A *statistic* is a value computed using a sample from a population; a *point estimate* is a statistic that is a single value (e.g. a mean or proportion)
-2. The *sampling distribution* of a statistic is the distribution of the statistic for all possible samples of a fixed size from the same population.
+1. A *point estimate* is a single value computed using a sample from a population (e.g. a mean or proportion)
+2. The *sampling distribution* of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population.
 3. The sample means and proportions calculated from samples are centered around the population mean and proportion, respectively.
 4. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
 5. The shape of the sampling distribution is usually bell-shaped with one peak and centred at the population mean or proportion.
 
 *Why all this emphasis on sampling distributions?*
-Usually, we don't have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate's value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single *point estimate* for the population parameter alone may not be enough. Using simulations, we can see patterns of the sample statistic's sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can ``predict" what the sampling distribution would look like for a sample, we could construct a range of values we think the population parameter's value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; however, in this book, we will use the bootstrap method to do this, as we will see in the next section.
+
+Usually, we don't have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate's value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single *point estimate* for the population parameter alone may not be enough. Using simulations, we can see patterns of what the sample estimate's sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can "predict" what the sampling distribution would look like for a sample, we could construct a range of values in which we think the population parameter's value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; however, in this book, we will use the bootstrap method to do this, as we will see in the next section.
 
 ## Bootstrapping
 ### Overview
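The sampling-distribution properties summarized above (centred at the population value, spread shrinking with sample size) can be sketched in base R, independently of the chapter's `infer`-based code. The exponential "population" below is a hypothetical stand-in for the Airbnb price data, not the dataset used in the chapter.

```r
# Sketch: sampling variability of the sample mean, in base R.
# A hypothetical right-skewed population of 10,000 prices (mean ~ $150)
# stands in for the Airbnb data used in the chapter.
set.seed(5678)
population <- rexp(10000, rate = 1 / 150)

# Take 2000 random samples of size 40 and compute each sample's mean.
sample_means_40 <- replicate(2000, mean(sample(population, size = 40)))

# The sampling distribution is centred near the population mean...
mean(population)
mean(sample_means_40)

# ...and its spread decreases as the sample size increases.
sample_means_160 <- replicate(2000, mean(sample(population, size = 160)))
sd(sample_means_40)
sd(sample_means_160)
```

With a histogram of `sample_means_40` (e.g. `hist(sample_means_40)`), the bell shape described in point 5 of the summary also becomes visible.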
@@ -502,7 +502,7 @@ grid.arrange(sample_distribution_10 + xlim(min(airbnb$price), 600),
 ```
 
 In the previous section, we took many samples of the same size *from our population* to get
-a sense for the variability of a sample statistic. But if our sample is big enough that it looks like our population,
+a sense for the variability of a sample estimate. But if our sample is big enough that it looks like our population,
 we can pretend that our sample *is* the population, and take more samples (with replacement) of the same size
 from it instead! This very clever technique is called **the bootstrap**.
 Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution,
@@ -522,17 +522,16 @@ For a sample of size $n$, the process we will go through is as follows:
 6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution)
 7. Calculate the plausible range of values around our observed point estimate
 
-```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Bootstrap process", fig.retina = 2, out.width="60%"}
+```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process", fig.retina = 2, out.width="60%"}
 knitr::include_graphics("img/intro-bootstrap.svg")
 ```
 ### Bootstrapping in R
-Let's continue working with our Airbnb data. Once again, let's say we are interested
-in estimating the population mean price per night of all Airbnb listings in
-Vancouver, Canada from a single sample we collected of size 40.
+Let's continue working with our Airbnb data. Once again, let's say we are interested in estimating the population mean price per night of all Airbnb listings in
+Vancouver, Canada using a single sample we collected of size 40.
 
 To simulate doing this in R, we will use `rep_sample_n` to take a random sample from our population. In real life we wouldn't do this step in R; we would instead simply load into R the data that we, or our collaborators, collected.
 
-After we have our sample, we will visualize it's distribution and calculate our point estimate, the sample mean.
+After we have our sample, we will visualize its distribution and calculate our point estimate, the sample mean.
 
 ```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histogram of price per night ($) for one sample of size 40", out.width = "60%"}
 one_sample <- airbnb %>%
@@ -629,7 +628,7 @@ boot_est_dist <- ggplot(boot15000_means, aes(x = mean)) +
 
 Let's compare our bootstrap distribution with the true sampling distribution (taking many samples from the population).
 
-```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of distribution of the bootstrap sample means and sampling distribution", out.height="60%"}
+```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of distribution of the bootstrap sample means and sampling distribution", out.height="50%"}
 
 samples <- rep_sample_n(airbnb, size = 40, reps = 15000)
@@ -809,7 +808,7 @@ boot_est_dist +
            label = paste("97.5th percentile =", round(bounds[2], 2)))
 ```
 
-To finish our estimation of the population parameter, we would report the point estimate and our confidence interval's lower and upper bounds. Here the sample mean price-per-night of 40 Airbnb listings was \$`r round(mean(sample_1$price),2)`, and we are 95\% "confident" that the true population mean price-per-night for all Airbnb listings in Vancouver is between \$`r round(bounds[1],2)`, \$`r round(bounds[2],2)`).
+To finish our estimation of the population parameter, we would report the point estimate and our confidence interval's lower and upper bounds. Here the sample mean price-per-night of 40 Airbnb listings was \$`r round(mean(sample_1$price),2)`, and we are 95\% "confident" that the true population mean price-per-night for all Airbnb listings in Vancouver is between \$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
 
 Notice that our interval does indeed contain the true
 population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
@@ -820,6 +819,7 @@ This chapter is only the beginning of the journey into statistical inference. We
 
 ## Additional readings
 
 For more about statistical inference and bootstrapping, refer to
-- Chapters 7 - 8 of [Modern Dive](https://moderndive.com/) Statistical
-Inference via Data Science by Chester Ismay and Albert Y. Kim
+
+- Chapters 7 - 8 of [Modern Dive: Statistical
+Inference via Data Science](https://moderndive.com/) by Chester Ismay and Albert Y. Kim
- Chapters 4 - 7 of [OpenIntro Statistics - Fourth Edition](https://www.openintro.org/) by David M. Diez, Christopher D. Barr and Mine Cetinkaya-Rundel
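The seven bootstrap steps edited in this commit can be sketched in base R without `infer` or the tidyverse. The `prices` vector below is a hypothetical stand-in for the chapter's sample of 40 Airbnb listing prices, and `quantile` implements the percentile-interval step.

```r
# Percentile bootstrap sketch in base R (no infer/tidyverse needed).
# `prices` is a hypothetical sample of 40 listing prices, standing in
# for the Airbnb sample drawn with rep_sample_n in the chapter.
set.seed(1234)
prices <- round(runif(40, min = 50, max = 300), 2)

n <- length(prices)
point_estimate <- mean(prices)  # the sample mean (our point estimate)

# Steps 1-6: resample with replacement, same size n, many times,
# recording the mean of each bootstrap sample (the bootstrap distribution).
reps <- 10000
boot_means <- replicate(reps, mean(sample(prices, size = n, replace = TRUE)))

# Step 7: the middle 95% of the bootstrap distribution gives a plausible
# range around the point estimate (a 95% percentile bootstrap interval).
ci <- quantile(boot_means, probs = c(0.025, 0.975))

point_estimate
ci
```

As in the chapter, one would report both the point estimate and the interval's lower and upper bounds; the interval here approximates the sampling variability using only the single observed sample.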
