Fix most figure placement and whitespace issues in inference.Rmd

ttimbers · ttimbers · commit 09d7de26d222 · 2022-01-10T18:48:06.000-08:00
diff --git a/inference.Rmd b/inference.Rmd
@@ -208,8 +208,9 @@ Here we see that the proportion of entire home/apartment listings in this
 random sample is `r round(airbnb_sample_1$prop,2)`. Wow&mdash;that's close to our
 true population value! But remember, we computed the proportion using a random sample of size 40.
 This has two consequences. First, this value is only an *estimate*, i.e., our best guess 
-of our population parameter using this sample. Given that it is a single value that we are estimating, we often
-refer to it as a **point estimate**.  And second, since the sample was random,
+of our population parameter using this sample. 
+Given that we are estimating a single value here, we often
+refer to it as a **point estimate**.  Second, since the sample was random,
 if we were to take *another* random sample of size 40 and compute the proportion for that sample,
 we would not get the same answer:
 
@@ -289,11 +290,10 @@ We have created this particular example
 such that we *do* have access to the full population, which lets us visualize the 
 sampling distribution directly for learning purposes.
 
-```{r 11-example-proportions7, echo = TRUE, message = FALSE, warning = FALSE,fig.cap = "Sampling distribution of the sample proportion for sample size 40.", fig.height = 3.3, fig.width = 4.2}
+```{r 11-example-proportions7, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Sampling distribution of the sample proportion for sample size 40.", fig.height = 3.3, fig.width = 4.2}
 sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
-  ylab("Count") +
-  xlab("Sample proportions") +
+  labs(x = "Sample proportions", y = "Count") +
   theme(text = element_text(size = 12))
 
 sampling_distribution
@@ -338,11 +338,10 @@ We can visualize the population distribution of the price per night with a histo
 options(pillar.sigfig = 5)
 ```
 
-```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
 population_distribution <- ggplot(airbnb, aes(x = price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") + 
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 population_distribution
@@ -384,11 +383,10 @@ We can create a histogram to visualize the distribution of observations in the
 sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
 of our sample.
 
-```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
 sample_distribution <- ggplot(one_sample, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") + 
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 sample_distribution
@@ -427,7 +425,7 @@ samples
 Now we can calculate the sample mean for each replicate and plot the sampling
 distribution of sample means for samples of size 40.
 
-```{r 11-example-means4, echo = TRUE, message = FALSE, warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-example-means4, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
 sample_estimates <- samples |>
   group_by(replicate) |>
   summarize(sample_mean = mean(price))
@@ -436,8 +434,7 @@ sample_estimates
 
 sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") + 
-  xlab("Sample mean price per night (Canadian dollars)") +
+  labs(x = "Sample mean price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 sampling_distribution_40
@@ -517,8 +514,7 @@ sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
 ## Sampling distribution n = 20
 sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Sample mean price per night\n(Canadian dollars)") +
+  labs(x = "Sample mean price per night\n(Canadian dollars)", y = "Count") +
   ggtitle("n = 20") 
 
 ## Sampling distribution n = 50
@@ -623,7 +619,7 @@ mean is roughly bell-shaped. \index{sampling distribution!effect of sample size}
 > In general, the sampling distribution&mdash;for both means and proportions&mdash;only 
 > becomes bell-shaped *once the sample size is large enough*.
 > How large is "large enough?" Unfortunately, it depends entirely on the problem at hand. But 
-> as a rule of thumb for many problems in practice, having a sample size of at least 20 will suffice.
+> as a rule of thumb, often a sample size of at least 20 will suffice.
 
 <!--- > **Note:** If random samples of size $n$ are taken from a population, the sample mean $\bar{x}$ will be approximately Normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ as long as the sample size $n$ is large enough. $\mu$ is the population mean, $\sigma$ is the population standard deviation, $\bar{x}$ is the sample mean, and $n$ is the sample size. 
 > If samples are selected from a finite population as we are doing in this chapter, we should apply a finite population correction. We multiply $\frac{\sigma}{\sqrt{n}}$ by $\sqrt{\frac{N - n}{N - 1}}$ where $N$ is the population size and $n$ is the sample size. If our sample size, $n$, is small relative to the population size, this finite correction factor is less important. 
@@ -671,7 +667,7 @@ see that the sample’s distribution looks like that of the population for a
 large enough sample.
 
 
-```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 7, fig.cap = "Comparison of samples of different sizes from the population."}
+```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 6.8, fig.cap = "Comparison of samples of different sizes from the population."}
 sample_10 <- airbnb |>
   rep_sample_n(10)
 sample_distribution_10 <- ggplot(sample_10, aes(price)) +
@@ -746,6 +742,8 @@ called **the bootstrap**.  Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
 
+\newpage
+
 > **Note:** We must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!
@@ -762,11 +760,12 @@ For a sample of size $n$, you would do the following:
 6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
-```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
+```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/intro-bootstrap.jpeg")
 ```
 
 ### Bootstrapping in R 
+
 Let’s continue working with our Airbnb example to illustrate how we might create
 and use a bootstrap distribution using just a single sample from the population. 
 Once again, suppose we are
@@ -780,13 +779,12 @@ one_sample <- one_sample |>
   ungroup() |> select(-replicate)
 ```
 
-```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histogram of price per night (Canadian dollars) for one sample of size 40.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Histogram of price per night (Canadian dollars) for one sample of size 40.", fig.height = 3.5, fig.width = 4.5}
 one_sample
 
 one_sample_dist <- ggplot(one_sample, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") + 
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 one_sample_dist
@@ -807,13 +805,12 @@ we change the argument for `replace` from its default value of `FALSE` to `TRUE`
 \index{bootstrap!in R}
 \index{rep\_sample\_n!bootstrap}
 
-```{r 11-bootstrapping3, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Bootstrap distribution.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping3, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap = "Bootstrap distribution.", fig.height = 3.5, fig.width = 4.5}
 boot1 <- one_sample |>
   rep_sample_n(size = 40, replace = TRUE, reps = 1)
 boot1_dist <- ggplot(boot1, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") + 
-  xlab("Price per night (Canadian dollars)") +
+  labs(x = "Price per night (Canadian dollars)", y =  "Count") + 
   theme(text = element_text(size = 12))
 
 boot1_dist
@@ -847,14 +844,13 @@ tail(boot20000)
 ```
 
 Let's take a look at histograms of the first six replicates of our bootstrap samples.
-```{r 11-bootstrapping-six-bootstrap-samples, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Histograms of first six replicates of bootstrap samples."}
+```{r 11-bootstrapping-six-bootstrap-samples, echo = TRUE, fig.pos = "H", out.extra="", message = FALSE, warning = FALSE, fig.cap = "Histograms of first six replicates of bootstrap samples."}
 six_bootstrap_samples <- boot20000 |>
   filter(replicate <= 6)
 
 ggplot(six_bootstrap_samples, aes(price)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  xlab("Price per night (Canadian dollars)") +
-  ylab("Count") + 
+  labs(x = "Price per night (Canadian dollars)", y = "Count") +
   facet_wrap(~replicate) +
   theme(text = element_text(size = 12))
 ```
@@ -875,7 +871,7 @@ generate a bootstrap distribution of our point estimates. The bootstrap
 distribution (Figure \@ref(fig:11-bootstrapping5)) suggests how we might expect
 our point estimate to behave if we took another sample.
 
-```{r 11-bootstrapping5, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Distribution of the bootstrap sample means.", fig.height = 3.5, fig.width = 4.5}
+```{r 11-bootstrapping5, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of the bootstrap sample means.", fig.height = 3.5, fig.width = 4.5}
 boot20000_means <- boot20000 |>
   group_by(replicate) |>
   summarize(mean = mean(price))
@@ -885,8 +881,7 @@ tail(boot20000_means)
 
 boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
-  ylab("Count") +
-  xlab("Sample mean price per night \n (Canadian dollars)") +
+  labs(x = "Sample mean price per night \n (Canadian dollars)", y = "Count") +
   theme(text = element_text(size = 12))
 
 boot_est_dist
@@ -1117,6 +1112,8 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval.
 3. Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval.
 
+\newpage
+
 To do this in R, we can use the `quantile()` function:
 \index{quantile}
 \index{pull}
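
Note on the final hunk: the percentile steps listed there (take the 2.5% and 97.5% percentiles of the bootstrap distribution) can be sketched in base R as below. This is only an illustration under stated assumptions, not the chapter's own code: the data are simulated, and the names `one_sample_prices`, `boot_means`, and `bounds` are hypothetical. The chapter itself builds its bootstrap samples with `rep_sample_n()`; plain `sample()` and `replicate()` are swapped in here so the sketch runs without extra packages.

```r
# Minimal sketch of a 95% percentile bootstrap interval in base R.
# Assumption: simulated stand-in data, NOT the Airbnb sample from the chapter.
set.seed(2022)
one_sample_prices <- rexp(40, rate = 1 / 150)  # hypothetical nightly prices

# Resample the observed sample with replacement and record each bootstrap mean.
boot_means <- replicate(
  2000,
  mean(sample(one_sample_prices,
              size = length(one_sample_prices),
              replace = TRUE))
)

# The 2.5% and 97.5% percentiles of the bootstrap means bound the interval.
bounds <- quantile(boot_means, probs = c(0.025, 0.975))
bounds
```

The number of replicates (2000) is an arbitrary choice for the sketch; the chapter uses 20,000 bootstrap replicates.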