@@ -392,18 +392,18 @@ sample_distribution <- ggplot(one_sample, aes(price)) +
sample_distribution

estimates <- one_sample |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

estimates
```

The average value of the sample of size 40
- is \$`r round(estimates$sample_mean, 2)`. This
+ is \$`r round(estimates$mean_price, 2)`. This
number is a point estimate for the mean of the full population.
Recall that the population mean was
\$`r round(population_parameters$pop_mean,2)`. So our estimate was fairly close to
the population parameter: the mean was about
- `r round(100*abs(estimates$sample_mean - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`%
+ `r round(100*abs(estimates$mean_price - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`%
off. Note that we usually cannot compute the estimate's accuracy in practice
since we do not have access to the population parameter; if we did, we wouldn't
need to estimate it!
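
As a rough illustration of the percent-error calculation in the paragraph above, here is a minimal sketch in base R. The helper name `percent_error` and the numbers passed to it are hypothetical and are not taken from the Airbnb data; the formula simply mirrors the inline expression used in the text.

```r
# Hypothetical helper (not part of the original chapter): how far a point
# estimate is from the population parameter, as a percentage of the parameter.
percent_error <- function(estimate, parameter) {
  100 * abs(estimate - parameter) / parameter
}

# Made-up values, for illustration only.
example_error <- percent_error(estimate = 110, parameter = 100)
round(example_error, 1)
```

Because the absolute difference is used, an estimate that is too low by some amount gives the same percent error as one that is too high by the same amount.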
@@ -428,11 +428,11 @@ distribution of sample means for samples of size 40.
```{r 11-example-means4, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
sample_estimates <- samples |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

sample_estimates

- sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +
+ sampling_distribution_40 <- ggplot(sample_estimates, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  labs(x = "Sample mean price per night (dollars)", y = "Count") +
  theme(text = element_text(size = 12))
@@ -442,12 +442,12 @@ sampling_distribution_40
In Figure \@ref(fig:11-example-means4), the sampling distribution of the mean
has one peak and is \index{sampling distribution!shape} bell-shaped. Most of the estimates are between
- about \$`r round(quantile(sample_estimates$sample_mean)[2], -1)` and
- \$`r round(quantile(sample_estimates$sample_mean)[4], -1)`; but there are
+ about \$`r round(quantile(sample_estimates$mean_price)[2], -1)` and
+ \$`r round(quantile(sample_estimates$mean_price)[4], -1)`; but there are
a good fraction of cases outside this range (i.e., where the point estimate was
not close to the population parameter). So it does indeed look like we were
quite lucky when we estimated the population mean with only
- `r round(100*abs(estimates$sample_mean - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`% error.
+ `r round(100*abs(estimates$mean_price - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`% error.
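
The two bounds quoted above come from indexing the result of `quantile()`. As a small sketch, using made-up values in place of the sample means: with its default settings, `quantile()` returns the 0%, 25%, 50%, 75%, and 100% quantiles, so elements 2 and 4 are the first and third quartiles, and `round(x, -1)` rounds to the nearest ten.

```r
# Made-up stand-in for sample_estimates$mean_price (illustration only).
example_means <- 1:101

q <- quantile(example_means)
names(q)                   # the probability each element corresponds to

lower <- round(q[2], -1)   # first quartile, rounded to the nearest 10
upper <- round(q[4], -1)   # third quartile, rounded to the nearest 10

# Share of values that fall between the (unrounded) quartiles.
share_between <- mean(example_means >= q[2] & example_means <= q[4])
share_between
```

By construction, roughly half of the values lie between the first and third quartiles, so the other half fall outside that range.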

Let's visualize the population distribution, distribution of the sample, and
the sampling distribution on one plot to compare them in Figure
@@ -465,9 +465,9 @@ sample, which will keep the average from being too extreme.
<!-- -
```{r 11-example-means4.5}
sample_estimates |>
-   summarize(mean_of_sample_means = mean(sample_mean))
+   summarize(mean_of_sample_means = mean(mean_price))
```
- Notice that the mean of the sample means is \$`r round(mean(sample_estimates$sample_mean),2)`. Recall that the population mean
+ Notice that the mean of the sample means is \$`r round(mean(sample_estimates$mean_price),2)`. Recall that the population mean
was \$`r round(mean(airbnb$price),2)`.
-->

@@ -497,44 +497,44 @@ distribution with a red vertical line.
## Sampling n = 20, 50, 100, 500
sample_estimates_20 <- rep_sample_n(airbnb, size = 20, reps = 20000) |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

sample_estimates_50 <- rep_sample_n(airbnb, size = 50, reps = 20000) |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

sample_estimates_100 <- rep_sample_n(airbnb, size = 100, reps = 20000) |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

## Sampling distribution n = 20
- sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
+ sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  labs(x = "Sample mean price per night (dollars)", y = "Count") +
  ggtitle("n = 20")

## Sampling distribution n = 50
- sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = sample_mean)) +
+ sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  ylab("Count") +
  xlab("Sample mean price per night (dollars)") +
  ggtitle("n = 50") +
  xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))

## Sampling distribution n = 100
- sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = sample_mean)) +
+ sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  ylab("Count") +
  xlab("Sample mean price per night (dollars)") +
  ggtitle("n = 100") +
  xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))

## Sampling distribution n = 500
- sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = sample_mean)) +
+ sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  ylab("Count") +
  xlab("Sample mean price per night (dollars)") +
@@ -544,57 +544,57 @@ sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = sample_mean))

```{r 11-example-means7, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Comparison of sampling distributions, with mean highlighted as a vertical red line."}
annotated_sampling_dist_20 <- sampling_distribution_20 +
-   geom_vline(xintercept = mean(sample_estimates$sample_mean), col = "red") +
+   geom_vline(xintercept = mean(sample_estimates$mean_price), col = "red") +
  xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20)) +
  ggtitle("n = 20") +
  annotate("text",
    x = max_x(sampling_distribution_20),
    y = max_count(sampling_distribution_20),
    hjust = 1,
    vjust = 1,
-     label = paste("mean = ", round(mean(sample_estimates$sample_mean), 1))
+     label = paste("mean = ", round(mean(sample_estimates$mean_price), 1))
  )+ theme(text = element_text(size = 12), axis.title=element_text(size=12))
#+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_20), hjust = 1, vjust = 3,
- # label = paste("sd = ", round(sd(sample_estimates$sample_mean), 1)))
+ # label = paste("sd = ", round(sd(sample_estimates$mean_price), 1)))

annotated_sampling_dist_50 <- sampling_distribution_50 +
-   geom_vline(xintercept = mean(sample_estimates_50$sample_mean), col = "red") +
+   geom_vline(xintercept = mean(sample_estimates_50$mean_price), col = "red") +
  ## x limits set the same as n = 20 graph, y is this graph
  annotate("text",
    x = max_x(sampling_distribution_20),
    y = max_count(sampling_distribution_50),
    hjust = 1,
    vjust = 1,
-     label = paste("mean = ", round(mean(sample_estimates_50$sample_mean), 1))
+     label = paste("mean = ", round(mean(sample_estimates_50$mean_price), 1))
  )+ theme(text = element_text(size = 12), axis.title=element_text(size=12)) #+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_50), hjust = 1, vjust = 3,
- # label = paste("sd = ", round(sd(sample_estimates_50$sample_mean), 1)))
+ # label = paste("sd = ", round(sd(sample_estimates_50$mean_price), 1)))

annotated_sampling_dist_100 <- sampling_distribution_100 +
-   geom_vline(xintercept = mean(sample_estimates_100$sample_mean), col = "red") +
+   geom_vline(xintercept = mean(sample_estimates_100$mean_price), col = "red") +
  annotate("text",
    x = max_x(sampling_distribution_20),
    y = max_count(sampling_distribution_100),
    hjust = 1,
    vjust = 1,
-     label = paste("mean = ", round(mean(sample_estimates_100$sample_mean), 1))
+     label = paste("mean = ", round(mean(sample_estimates_100$mean_price), 1))
  ) + theme(text = element_text(size = 12), axis.title=element_text(size=12)) #+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_100), hjust = 1, vjust = 3,
- # label = paste("sd = ", round(sd(sample_estimates_100$sample_mean), 1)))
+ # label = paste("sd = ", round(sd(sample_estimates_100$mean_price), 1)))

annotated_sampling_dist_500 <- sampling_distribution_500 +
-   geom_vline(xintercept = mean(sample_estimates_500$sample_mean), col = "red") +
+   geom_vline(xintercept = mean(sample_estimates_500$mean_price), col = "red") +
  annotate("text",
    x = max_x(sampling_distribution_20),
    y = max_count(sampling_distribution_500),
    hjust = 1,
    vjust = 1,
-     label = paste("mean = ", round(mean(sample_estimates_500$sample_mean), 1))
+     label = paste("mean = ", round(mean(sample_estimates_500$mean_price), 1))
  ) + theme(text = element_text(size = 12), axis.title=element_text(size=12))
#+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_500), hjust = 1, vjust = 3,
- # label = paste("sd = ", round(sd(sample_estimates_500$sample_mean), 1)))
+ # label = paste("sd = ", round(sd(sample_estimates_500$mean_price), 1)))

grid.arrange(annotated_sampling_dist_20,
  annotated_sampling_dist_50,
@@ -771,7 +771,7 @@ and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample size of 40.
- Recall our point estimate was \$`r round(estimates$sample_mean, 2)`. The
+ Recall our point estimate was \$`r round(estimates$mean_price, 2)`. The
histogram of prices in the sample is displayed in Figure \@ref(fig:11-bootstrapping1).

```{r, echo = F, message = F, warning = F}
@@ -791,7 +791,7 @@ one_sample_dist
```

The histogram for the sample is skewed, with a few observations out to the right. The
- mean of the sample is \$`r round(estimates$sample_mean, 2)`.
+ mean of the sample is \$`r round(estimates$mean_price, 2)`.
Remember, in practice, we usually only have this one sample from the population. So
this sample and estimate are the only data we can work with.
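
Since this one sample is all we have, one way to get a sense of how much the estimate might vary is to resample from the sample itself, which is the bootstrap idea this section introduces. The following is only a sketch in base R under stated assumptions: `example_sample` is simulated and merely stands in for the 40 sampled prices, and the number of resamples (1000) is an arbitrary choice.

```r
set.seed(1)

# Simulated, right-skewed stand-in for the 40 sampled prices (an assumption,
# not the Airbnb data).
example_sample <- rexp(40, rate = 1 / 150)

# Draw resamples of the same size, with replacement, from the one sample
# and record the mean of each resample.
n_resamples <- 1000
boot_means <- replicate(
  n_resamples,
  mean(sample(example_sample, size = length(example_sample), replace = TRUE))
)

summary(boot_means)
```

Sampling with replacement is what lets each resample differ from the original sample even though it has the same size.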
@@ -895,21 +895,21 @@ samples <- rep_sample_n(airbnb, size = 40, reps = 20000)

sample_estimates <- samples |>
  group_by(replicate) |>
-   summarize(sample_mean = mean(price))
+   summarize(mean_price = mean(price))

- sampling_dist <- ggplot(sample_estimates, aes(x = sample_mean)) +
+ sampling_dist <- ggplot(sample_estimates, aes(x = mean_price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  ylab("Count") +
  xlab("Sample mean price per night (dollars)")

annotated_sampling_dist <- sampling_dist +
  xlim(min_x(sampling_dist), max_x(sampling_dist)) +
-   geom_vline(xintercept = mean(sample_estimates$sample_mean), col = "red") +
+   geom_vline(xintercept = mean(sample_estimates$mean_price), col = "red") +
  annotate("text",
    x = max_x(sampling_dist), y = max_count(sampling_dist),
    hjust = 1,
    vjust = 1,
-     label = paste("mean = ", round(mean(sample_estimates$sample_mean), 1)))
+     label = paste("mean = ", round(mean(sample_estimates$mean_price), 1)))

boot_est_dist_limits <- boot_est_dist +
  xlim(min_x(sampling_dist), max_x(sampling_dist))