::: {.infobox .download data-latex="{download}"}
You can download the corresponding R code here.
:::
This chapter will provide you with a basic intuition about statistical inference. As marketing researchers, we are usually faced with "imperfect" data in the sense that we cannot collect all the data we would like. Imagine you are interested in the average amount of time WU students spend listening to music every month. Ideally, we could force all WU students to fill out our survey. Realistically, we will only be able to observe a small fraction of students (maybe 500 out of the 25,000 students in the population).
Assume there are 25,000 students at WU. The rnorm() function will be used to generate 25,000 observations from a normal distribution with a mean of 50 and a standard deviation of 10. Although you might not be used to working with this type of simulated (i.e., synthetic) data, it is useful when explaining statistical concepts because the properties of the data are known. In this case, for example, we know the true mean ($\mu$ = 50) and the true standard deviation ($\sigma$ = 10).
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rnorm(n = 25000, mean = 50, sd = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(hours), bins = 50, fill = 'white', color = 'black') +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours') +
theme_bw() +
geom_vline(xintercept = mean(hours), size = 1) +
geom_vline(xintercept = mean(hours)+2*sd(hours), colour = "red", size = 1) +
geom_vline(xintercept = mean(hours)-2*sd(hours), colour = "red", size = 1) +
geom_segment(aes(x = mean(hours), y = 1100, yend = 1100, xend = (mean(hours) - 2*sd(hours))), lineend = "butt", linejoin = "round",
size = 0.5, arrow = arrow(length = unit(0.2, "inches"))) +
geom_segment(aes(x = mean(hours), y = 1100, yend = 1100, xend = (mean(hours) + 2*sd(hours))), lineend = "butt", linejoin = "round",
size = 0.5, arrow = arrow(length = unit(0.2, "inches"))) +
annotate("text", x = mean(hours) + 28, y = 1100, label = "Mean + 2 * SD" )+
annotate("text", x = mean(hours) - 28, y = 1100, label = "Mean - 2 * SD")

::: {.infobox_orange .hint data-latex="{hint}"}
Notice the set.seed() function we used in the code above. By specifying the seed, we can make sure that the results will be the same as here on the website when you execute the code on your computer. Otherwise, you would end up with a slightly different data set since the observations are generated randomly from the normal distribution.
:::
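To see what set.seed() does in practice, you can compare two draws made with the same seed (the seed value 42 below is an arbitrary choice for illustration):

```r
# Setting the same seed before each draw yields identical "random" numbers
set.seed(42)
draw_1 <- rnorm(5, mean = 50, sd = 10)
set.seed(42)
draw_2 <- rnorm(5, mean = 50, sd = 10)
identical(draw_1, draw_2)
## [1] TRUE
```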
In this case, we refer to all WU students as the population. In general, the population is the entire group we are interested in. This group does not necessarily have to consist of people, but could also be companies, stores, animals, etc. The parameters of the distribution of population values (in our case: "hours") are called population parameters. As already mentioned, we do not usually know population parameters but use inferential statistics to infer them from our sample, i.e., we measure statistics from a sample (e.g., the sample mean $\bar{x}$) to estimate population parameters (e.g., the population mean $\mu$).
| Variable | Sample statistic | Population parameter |
|---|---|---|
| Size | $n$ | $N$ |
| Mean | $\bar{x}$ | $\mu$ |
| Variance | $s^2$ | $\sigma^2$ |
| Standard deviation | $s$ | $\sigma$ |
| Standard error | $SE = \frac{s}{\sqrt{n}}$ | $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ |
Using this notation, the goal of inferential statistics is to estimate the population parameters (e.g., $\mu$, $\sigma$) from the sample statistics (e.g., $\bar{x}$, $s$).
In the first step towards a realistic research setting, let us take one sample from this population and calculate the mean listening time. We can simply sample the row numbers of students and then subset the hours vector with the sampled row numbers. The sample() function will be used to draw a sample of size 100 from the population of 25,000 students, and one student can only be drawn once (i.e., replace = FALSE). The following plot shows the distribution of listening times for our sample.
student_sample <- sample(1:25000, size = 100, replace = FALSE)
sample_1 <- hours[student_sample]
ggplot(data.frame(sample_1)) +
geom_histogram(aes(x = sample_1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(sample_1)), size=1) +
ggtitle(TeX(sprintf("Distribution of listening times ($\\bar{x}$ = %.2f)",round(mean(sample_1),2))))

Observe that in this first draw the sample mean ($\bar{x}$) is close to, but not exactly equal to, the population mean ($\mu$ = 50).
It becomes clear that the mean is slightly different for each sample. This is referred to as sampling variation and it is completely fine to get a slightly different mean every time we take a sample. We just need to find a way of expressing the uncertainty associated with the fact that we only have data from one sample, because in a realistic setting you are most likely only going to have access to a single sample.
So in order to make sure that the first draw was not just pure luck and the sample mean is in fact a good estimate of the population mean, let us take a large number of samples from this population and calculate the mean for each of them.
As you can see, on average the sample mean ("mean of sample means") is extremely close to the population mean, despite only sampling 100 of the 25,000 students at a time.
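The repeated-sampling exercise can be sketched as follows (the population is re-created so the snippet is self-contained; the number of repetitions, 10,000, is an arbitrary choice):

```r
# Re-create the simulated population of listening times
set.seed(321)
hours <- rnorm(n = 25000, mean = 50, sd = 10)

# Draw many samples of 100 students each and store the sample means
sample_means <- replicate(10000, mean(sample(hours, size = 100, replace = FALSE)))

mean(sample_means)  # "mean of sample means": very close to ...
mean(hours)         # ... the population mean
```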
Due to the variation in the sample means shown in our simulation, it is never possible to say exactly what the population mean is based on a single sample. However, even with a single sample we can infer a range of values within which the population mean is likely contained. In order to do so, notice that the sample means are approximately normally distributed. Another interesting fact is that the mean of sample means (i.e., 49.94) is roughly equal to the population mean (i.e., 49.93). This tells us already that generally the sample mean is a good approximation of the population mean. However, in order to make statements about the expected range of possible values, we would need to know the standard deviation of the sampling distribution. The formal representation of the standard deviation of the sample means, known as the standard error of the mean, is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation of the population and $n$ is the sample size.
The first thing to notice here is that an increase in the number of observations per sample ($n$) reduces the standard error, since $n$ appears in the denominator: the larger the sample, the more closely the sample means cluster around the population mean.
The following plots show the relationship between the sample size and the standard error in a slightly different way. The plots show the range of sample means resulting from the repeated sampling process for different sample sizes. Notice that the more students are contained in the individual samples, the less uncertainty there is when estimating the population mean from a sample (i.e., the possible values are more closely centered around the mean). So when the sample size is small, the sample mean can be expected to be very different the next time we take a sample. When the sample size is large, we can expect the sample means to be more similar every time we take a sample.
As you can see, the standard deviation of the sample means ($\sigma_{\bar{x}}$) decreases as the sample size increases.
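You can check the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ numerically by simulating sample means for different sample sizes (the sample sizes below are arbitrary illustrations):

```r
# Re-create the simulated population of listening times
set.seed(321)
hours <- rnorm(25000, mean = 50, sd = 10)

# Standard deviation of simulated sample means vs. the theoretical value
for (n in c(10, 100, 1000)) {
  sample_means <- replicate(5000, mean(sample(hours, size = n)))
  cat("n =", n,
      "| simulated:", round(sd(sample_means), 3),
      "| sigma/sqrt(n):", round(sd(hours) / sqrt(n), 3), "\n")
}
```

The simulated values line up closely with the theoretical prediction (the small remaining gap for large $n$ stems from sampling without replacement from a finite population).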
A second factor determining the standard deviation of the distribution of sample means ($\sigma_{\bar{x}}$) is the standard deviation of the population ($\sigma$): the more the individual observations vary, the more the sample means vary as well.
In the first plot (panel A), we assume a much smaller population standard deviation, so the sample means are more closely centered around the population mean; in the second plot (panel B), the larger population standard deviation leads to a wider sampling distribution.
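The role of the population standard deviation can be illustrated the same way (the values $\sigma$ = 1 and $\sigma$ = 10 are arbitrary choices for a narrow and a wide population):

```r
# Two simulated populations that differ only in their standard deviation
set.seed(321)
pop_narrow <- rnorm(25000, mean = 50, sd = 1)
pop_wide   <- rnorm(25000, mean = 50, sd = 10)

# With n = 100, sigma/sqrt(n) predicts approx. 0.1 and 1, respectively
se_narrow <- sd(replicate(5000, mean(sample(pop_narrow, size = 100))))
se_wide   <- sd(replicate(5000, mean(sample(pop_wide,   size = 100))))
se_narrow
se_wide
```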
The attentive reader might have noticed that the population above was generated using a normal distribution function. It would be very restrictive if we could only analyze populations whose values are normally distributed. Furthermore, in reality we are unable to check whether the population values are normally distributed, since we never observe the entire population. However, it turns out that the results generalize to many other distributions. This is described by the Central Limit Theorem.
The central limit theorem states that if (1) the population distribution has a mean (there are examples of distributions that don't have a mean, such as the Cauchy distribution, but we will ignore these here), and (2) we take a large enough sample, then the sampling distribution of the sample mean is approximately normally distributed. What exactly "large enough" means depends on the setting, but the interactive element at the end of this chapter illustrates how the sample size influences how accurately we can estimate the population parameters from the sample statistics.
To illustrate this, let's repeat the analysis above with a population from a gamma distribution. In the previous example, we assumed a normal distribution so it was more likely for a given student to spend around 50 hours per week listening to music. The following example depicts the case in which most students spend a similar amount of time listening to music, but there are a few students who very rarely listen to music, and some music enthusiasts with a very high level of listening time. In the following code, we will use the rgamma() function to generate 25,000 random observations from the gamma distribution. The gamma distribution is specified by shape and scale parameters instead of the mean and standard deviation of the normal distribution. Here is a histogram of the listening times in the population:
set.seed(321)
hours <- rgamma(n = 25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(x = hours), bins = 30, fill='white', color='black') +
geom_vline(xintercept = mean(hours), size = 1) + theme_bw() +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')

The vertical black line represents the population mean ($\mu$).
As in the previous example, the mean is slightly different every time we take a sample due to sampling variation. Also note that the distribution of listening times no longer follows a normal distribution as a result of the fact that we now assume a gamma distribution for the population with a positive skew (i.e., lower values more likely, higher values less likely).
Let's see what happens to the distribution of sample means if we take an increasing number of samples, each drawn from the same gamma population:
Two things are worth noting: (1) The more (hypothetical) samples we take, the more the sampling distribution approaches a normal distribution. (2) The mean of the sampling distribution of the sample mean ($\bar{x}$) is again very close to the population mean ($\mu$), even though the population itself is not normally distributed.
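Both observations can be reproduced with a short simulation (the gamma population is re-created so the snippet is self-contained; 10,000 repetitions is an arbitrary choice):

```r
# Re-create the skewed (gamma) population of listening times
set.seed(321)
hours <- rgamma(n = 25000, shape = 2, scale = 10)

# Sampling distribution of the mean for samples of 100 students
sample_means <- replicate(10000, mean(sample(hours, size = 100)))

mean(sample_means)  # close to the population mean despite the skew
mean(hours)
hist(sample_means, breaks = 50)  # approximately bell-shaped
```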
In summary, it is important to distinguish two types of variation: (1) within each individual sample that we may take in real life, the standard deviation ($s$) measures how much the individual observations vary around the sample mean; (2) across (hypothetical) repeated samples, the standard error ($\sigma_{\bar{x}}$) measures how much the sample means vary around the population mean.
So far we have assumed to know the population standard deviation ($\sigma$). In reality, however, $\sigma$ is unknown, so we estimate it using the standard deviation of our sample ($s$) and compute the estimated standard error as $SE = \frac{s}{\sqrt{n}}$.
Note that $SE$ is only an estimate of the true standard error $\sigma_{\bar{x}}$ and will itself vary from sample to sample.
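As a small sketch of this difference, the snippet below compares the estimated standard error with the one we could compute if $\sigma$ were known (which is only possible here because the population is simulated):

```r
# Re-create the simulated population and draw one (hypothetical) sample
set.seed(321)
hours <- rnorm(25000, mean = 50, sd = 10)
smpl  <- sample(hours, size = 100)

se_true      <- sd(hours) / sqrt(100)  # requires sigma, unknown in practice
se_estimated <- sd(smpl)  / sqrt(100)  # uses the sample sd (s) instead
se_true
se_estimated
```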
We will not go into detail about the importance of random samples, but basically the correctness of your estimate depends crucially on having a sample at hand that actually represents the population. Unfortunately, we will usually not notice if the sample is non-random. Our statistics are still a good approximation of "a" population parameter, namely the one for the population that we actually sampled, but not the one we are interested in. To illustrate this, uncheck the "Random Sample" box below. The new sample will be drawn only from the students with the highest listening times, so the resulting statistics describe this selective subgroup rather than the population we are actually interested in.
When we try to estimate parameters of populations (e.g., the population mean $\mu$) from a sample, the sample statistic (e.g., the sample mean $\bar{x}$) is only a point estimate. Confidence intervals additionally express the uncertainty around this estimate by providing a range of values within which the population parameter is likely contained.
Let us consider one random sample of 100 students from our population above.
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
set.seed(6789)
sample_size <- 100
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
hours_s <- hours[student_sample]
plot2 <- ggplot(data.frame(hours_s)) +
geom_histogram(aes(x = hours_s), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(hours_s)), size=1) +
ggtitle(TeX(sprintf("Random sample; $n$ = %d; $\\bar{x}$ = %.2f; $s$ = %.2f",sample_size,round(mean(hours_s),2),round(sd(hours_s),2))))
plot2

From the central limit theorem we know that the sampling distribution of the sample mean is approximately normal, and we know that for the normal distribution, 95% of the values lie within about 2 standard deviations of the mean. Actually, it is not exactly 2 standard deviations. To get the exact number, we can use the quantile function for the normal distribution, qnorm():
qnorm(0.975)
## [1] 1.959964
We use 0.975 (and not 0.95) to account for the probability at each end of the distribution (i.e., 2.5% at the lower end and 2.5% at the upper end). We can see that 95% of the values are roughly within 1.96 standard deviations from the mean. Since we know the sample mean ($\bar{x}$) and can estimate the standard error ($SE = \frac{s}{\sqrt{n}}$), we can compute the 95% confidence interval for the mean as

$$CI = \bar{x} \pm z_{1-\frac{\alpha}{2}} * SE$$

Here, $z_{1-\frac{\alpha}{2}}$ is the critical value from the standard normal distribution (1.96 for a 95% confidence level, i.e., $\alpha$ = 0.05).
Plugging in the values from our sample, we get:
sample_mean <- mean(hours_s)
se <- sd(hours_s)/sqrt(sample_size)
ci_lower <- sample_mean - qnorm(0.975)*se
ci_upper <- sample_mean + qnorm(0.975)*se
ci_lower
## [1] 17.67089
ci_upper
## [1] 23.1592
such that if we collected 100 samples and computed the mean and confidence interval for each of them, then in roughly 95 of these cases the interval would contain the true population mean.
::: {.infobox_orange .hint data-latex="{hint}"} Note the correct interpretation of the confidence interval: If we’d collected 100 samples, calculated the mean and then calculated a confidence interval for that mean, then, for 95 of these samples, the confidence intervals we constructed would contain the true value of the mean in the population. :::
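This interpretation can be verified by simulation: draw many samples, compute a 95% confidence interval for each, and count how often the interval contains the true population mean (10,000 repetitions is an arbitrary choice):

```r
# Re-create the gamma population; mu is known only because we simulated it
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
mu <- mean(hours)
n  <- 100

covered <- replicate(10000, {
  smpl <- sample(hours, size = n)
  se   <- sd(smpl) / sqrt(n)
  ci_lower <- mean(smpl) - qnorm(0.975) * se
  ci_upper <- mean(smpl) + qnorm(0.975) * se
  ci_lower <= mu & mu <= ci_upper
})
mean(covered)  # share of intervals containing mu: close to 0.95
```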
This is illustrated in the plot below that shows the mean of the first 100 samples and their confidence intervals:
::: {.infobox_red .caution data-latex="{caution}"}
Note that this does not mean that for a specific sample there is a 95% probability that the true population mean lies within its confidence interval. For any given interval, the true mean is either inside it or not; the 95% refers to the share of intervals that contain the true mean under repeated sampling.
:::
(LC5.1) What is the correct interpretation of a confidence interval for a significance level of $\alpha$ = 0.05 (i.e., a 95% confidence level)?
- If we take 100 samples and calculate mean and confidence interval for each one of them, then the true population mean would be included in 95% of these intervals.
- If we take 100 samples and calculate mean and confidence interval for each one of them, then the true population mean would be included in 5% of these intervals.
- If we take 100 samples and calculate mean and confidence interval for each one of them, then the true population mean would be included in 100% of these intervals.
- For a given sample, there is a 95% chance that the true population mean lies within the confidence interval.
(LC5.2) Which statements regarding standard error are TRUE?
- There is no connection between the standard deviation and the standard error.
- The standard error is a function of the sample size and the standard deviation.
- The standard error of the mean decreases as the sample size increases.
- The standard error of the mean increases as the standard deviation increases.
- None of the above
(LC5.3) What is the correct definition of the standard error ($SE$)?

- ${s \over \sqrt{n}}$
- ${s * \sqrt{n}}$
- ${\sqrt{s^2} \over \sqrt{n}}$
- ${\sqrt{s} \over n}$
- None of the above
(LC5.4) Which of the following do you need to compute a confidence interval around a sample mean?
- The critical value of the test statistic given a certain level of confidence
- A continuous variable (i.e., at least measured at the interval level)
- The sample mean
- The standard error
- None of the above
(LC5.5) What is the correct definition of the confidence interval?

- $CI=\bar{x} \pm \frac{z_{1-\frac{\alpha}{2}}}{\sigma_{\bar{x}}}$
- $CI=\bar{x} * z_{1-\frac{\alpha}{2}}*\sigma_{\bar{x}}$
- $CI= z_{1-\frac{\alpha}{2}}*\sigma_{\bar{x}}-\bar{x}$
- $CI=\bar{x} \pm z_{1-\frac{\alpha}{2}}*\sigma_{\bar{x}}$
- None of the above
As a marketing manager at Spotify you wish to find the average listening time of your users. Based on a random sample of 180 users you found that the mean listening time for the sample is 7.34 hours per week and the standard deviation is 6.87 hours.
(LC5.6) What is the 95% confidence interval for the mean listening time (the corresponding z-value for the 95% CI is 1.96)?
- [6.34;8.34]
- [7.15;7.55]
- [6.25;8.15]
- [6.54;8.54]
- None of the above