Commit 09d7de2

mostly fixed fig placement and whitespace in inference
1 parent a03765b commit 09d7de2

1 file changed: 28 additions, 31 deletions

inference.Rmd

Lines changed: 28 additions & 31 deletions
@@ -208,8 +208,9 @@ Here we see that the proportion of entire home/apartment listings in this
 random sample is `r round(airbnb_sample_1$prop,2)`. Wow—that's close to our
 true population value! But remember, we computed the proportion using a random sample of size 40.
 This has two consequences. First, this value is only an *estimate*, i.e., our best guess
-of our population parameter using this sample. Given that it is a single value that we are estimating, we often
-refer to it as a **point estimate**. And second, since the sample was random,
+of our population parameter using this sample.
+Given that we are estimating a single value here, we often
+refer to it as a **point estimate**. Second, since the sample was random,
 if we were to take *another* random sample of size 40 and compute the proportion for that sample,
 we would not get the same answer:
 
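
The sample-to-sample variation described in this hunk can be sketched in base R. This is only an illustration, not the chapter's code: the `population` vector, its 75/25 split of room types, the helper `sample_proportion()`, and the seed are all assumptions made up for the example.

```r
# Hypothetical population of 4,000 listings. The room types and the 75/25
# split are invented for illustration; this is not the Airbnb data set.
set.seed(1)
population <- sample(
  c("Entire home/apt", "Private room"),
  size = 4000, replace = TRUE, prob = c(0.75, 0.25)
)

# Point estimate: proportion of entire home/apartment listings in one
# random sample of size n, drawn without replacement from the population.
sample_proportion <- function(pop, n = 40) {
  one_sample <- sample(pop, size = n)
  mean(one_sample == "Entire home/apt")
}

# Two independent random samples of size 40 give two point estimates.
estimate_1 <- sample_proportion(population)
estimate_2 <- sample_proportion(population)

c(estimate_1, estimate_2)
```

Each call draws a fresh sample, so the two estimates will generally differ, although they can coincide by chance.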
@@ -289,11 +290,10 @@ We have created this particular example
 such that we *do* have access to the full population, which lets us visualize the
 sampling distribution directly for learning purposes.
 
-```{r 11-example-proportions7, echo = TRUE, message = FALSE, warning = FALSE,fig.cap = "Sampling distribution of the sample proportion for sample size 40.", fig.height = 3.3, fig.width = 4.2}
+```{r 11-example-proportions7, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Sampling distribution of the sample proportion for sample size 40.", fig.height = 3.3, fig.width = 4.2}
 sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
-  ylab("Count") +
-  xlab("Sample proportions") +
+  labs(x = "Sample proportions", y = "Count") +
   theme(text = element_text(size = 12))
 
 sampling_distribution
@@ -338,11 +338,10 @@ We can visualize the population distribution of the price per night with a histo
 options(pillar.sigfig = 5)
 ```
 
-```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
 population_distribution <- ggplot(airbnb, aes(x = price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 population_distribution
@@ -384,11 +383,10 @@ We can create a histogram to visualize the distribution of observations in the
 sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
 of our sample.
 
-```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
 sample_distribution <- ggplot(one_sample, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 sample_distribution
@@ -427,7 +425,7 @@ samples
 Now we can calculate the sample mean for each replicate and plot the sampling
 distribution of sample means for samples of size 40.
 
-```{r 11-example-means4, echo = TRUE, message = FALSE, warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means4, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
 sample_estimates <- samples |>
   group_by(replicate) |>
   summarize(sample_mean = mean(price))
@@ -436,8 +434,7 @@ sample_estimates
 
 sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)") +
+  labs(x = "Sample mean price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 sampling_distribution_40
@@ -517,8 +514,7 @@ sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
 ## Sampling distribution n = 20
 sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Sample mean price per night\n(Canadian dollars)") +
+  labs(x = "Sample mean price per night\n(Canadian dollars)", y = "Count") +
   ggtitle("n = 20")
 
 ## Sampling distribution n = 50
@@ -623,7 +619,7 @@ mean is roughly bell-shaped. \index{sampling distribution!effect of sample size}
 > In general, the sampling distribution&mdash;for both means and proportions&mdash;only
 > becomes bell-shaped *once the sample size is large enough*.
 > How large is "large enough?" Unfortunately, it depends entirely on the problem at hand. But
-> as a rule of thumb for many problems in practice, having a sample size of at least 20 will suffice.
+> as a rule of thumb, often a sample size of at least 20 will suffice.
 
 <!--- > **Note:** If random samples of size $n$ are taken from a population, the sample mean $\bar{x}$ will be approximately Normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ as long as the sample size $n$ is large enough. $\mu$ is the population mean, $\sigma$ is the population standard deviation, $\bar{x}$ is the sample mean, and $n$ is the sample size.
 > If samples are selected from a finite population as we are doing in this chapter, we should apply a finite population correction. We multiply $\frac{\sigma}{\sqrt{n}}$ by $\sqrt{\frac{N - n}{N - 1}}$ where $N$ is the population size and $n$ is the sample size. If our sample size, $n$, is small relative to the population size, this finite correction factor is less important.
@@ -671,7 +667,7 @@ see that the sample’s distribution looks like that of the population for a
 large enough sample.
 
 
-```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 7, fig.cap = "Comparison of samples of different sizes from the population."}
+```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 6.8, fig.cap = "Comparison of samples of different sizes from the population."}
 sample_10 <- airbnb |>
   rep_sample_n(10)
 sample_distribution_10 <- ggplot(sample_10, aes(price)) +
@@ -746,6 +742,8 @@ called **the bootstrap**. Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
 
+\newpage
+
 > **Note:** We must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!
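
The note in this hunk can be checked with a short base R sketch. It assumes a made-up vector `prices` standing in for a sample, and it uses `sample()` and `replicate()` rather than the `rep_sample_n()` workflow the chapter itself uses.

```r
# A small, made-up sample of nightly prices (not the Airbnb data).
prices <- c(85, 120, 99, 250, 140, 75, 310, 105)
n <- length(prices)

set.seed(2)

# Without replacement, a sample of size n is only a reordering of the
# original values, so it carries no new information about variability.
reshuffled <- sample(prices, size = n, replace = FALSE)
same_values <- identical(sort(reshuffled), sort(prices))
same_mean <- isTRUE(all.equal(mean(reshuffled), mean(prices)))

# With replacement, some values can repeat and others can be left out,
# so each bootstrap sample can have a different mean.
boot_sample <- sample(prices, size = n, replace = TRUE)
boot_means <- replicate(
  1000,
  mean(sample(prices, size = n, replace = TRUE))
)

c(same_values = same_values, same_mean = same_mean)
summary(boot_means)
```

The spread of `boot_means` is what the bootstrap distribution captures; the reshuffled sample, by contrast, always reproduces the original mean.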
@@ -762,11 +760,12 @@ For a sample of size $n$, you would do the following:
 6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
-```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
+```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/intro-bootstrap.jpeg")
 ```
 
 ### Bootstrapping in R
+
 Let’s continue working with our Airbnb example to illustrate how we might create
 and use a bootstrap distribution using just a single sample from the population.
 Once again, suppose we are
@@ -780,13 +779,12 @@ one_sample <- one_sample |>
   ungroup() |> select(-replicate)
 ```
 
-```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histogram of price per night (Canadian dollars) for one sample of size 40.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Histogram of price per night (Canadian dollars) for one sample of size 40.", fig.height = 3.5, fig.width = 4.5}
 one_sample
 
 one_sample_dist <- ggplot(one_sample, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 one_sample_dist
@@ -807,13 +805,12 @@ we change the argument for `replace` from its default value of `FALSE` to `TRUE`
 \index{bootstrap!in R}
 \index{rep\_sample\_n!bootstrap}
 
-```{r 11-bootstrapping3, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Bootstrap distribution.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping3, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap = "Bootstrap distribution.", fig.height = 3.5, fig.width = 4.5}
 boot1 <- one_sample |>
   rep_sample_n(size = 40, replace = TRUE, reps = 1)
 boot1_dist <- ggplot(boot1, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 boot1_dist
@@ -847,14 +844,13 @@ tail(boot20000)
 ```
 
 Let's take a look at histograms of the first six replicates of our bootstrap samples.
-```{r 11-bootstrapping-six-bootstrap-samples, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histograms of first six replicates of bootstrap samples."}
+```{r 11-bootstrapping-six-bootstrap-samples, echo = TRUE, fig.pos = "H", out.extra="", message = FALSE, warning = FALSE, fig.cap = "Histograms of first six replicates of bootstrap samples."}
 six_bootstrap_samples <- boot20000 |>
   filter(replicate <= 6)
 
 ggplot(six_bootstrap_samples, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  xlab("Price per night (Canadian dollars)") +
-  ylab("Count") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   facet_wrap(~replicate) +
   theme(text = element_text(size = 12))
 ```
@@ -875,7 +871,7 @@ generate a bootstrap distribution of our point estimates. The bootstrap
 distribution (Figure \@ref(fig:11-bootstrapping5)) suggests how we might expect
 our point estimate to behave if we took another sample.
 
-```{r 11-bootstrapping5, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Distribution of the bootstrap sample means.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping5, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of the bootstrap sample means.", fig.height = 3.5, fig.width = 4.5}
 boot20000_means <- boot20000 |>
   group_by(replicate) |>
   summarize(mean = mean(price))
@@ -885,8 +881,7 @@ tail(boot20000_means)
 
 boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Sample mean price per night \n (Canadian dollars)") +
+  labs(x = "Sample mean price per night \n (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 boot_est_dist
@@ -1117,6 +1112,8 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval.
 3. Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval.
 
+\newpage
+
 To do this in R, we can use the `quantile()` function:
 \index{quantile}
 \index{pull}
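
The hunk ends before the chapter's own code, so the following is only a rough base R sketch of the percentile steps listed above. The vector `one_sample_prices` is simulated stand-in data, and `boot_means` is built with `replicate()` instead of the chapter's data frame of bootstrap means; both are assumptions for illustration.

```r
set.seed(3)

# Stand-in for a single sample of 40 nightly prices (simulated, not Airbnb data).
one_sample_prices <- round(rexp(40, rate = 1 / 150))

# Bootstrap distribution of the sample mean: resample with replacement,
# keeping the resample size equal to the original sample size.
boot_means <- replicate(
  2000,
  mean(sample(one_sample_prices, size = 40, replace = TRUE))
)

# 95% percentile interval: the 2.5% and 97.5% quantiles of the
# bootstrap distribution serve as the lower and upper bounds.
bounds <- quantile(boot_means, probs = c(0.025, 0.975))
bounds

# Share of bootstrap means that fall inside the interval (about 0.95).
coverage <- mean(boot_means >= bounds[[1]] & boot_means <= bounds[[2]])
coverage
```

`quantile()` returns a named numeric vector, so `bounds[[1]]` and `bounds[[2]]` pull out the lower and upper bounds without their names.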
