MRDA2018/04-basic_statistics.Rmd at master · IMSMWU/MRDA2018 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---
title: "04-summary statistics"
output:
  html_document:
    toc: yes
  html_notebook: default
  pdf_document:
    toc: yes

---
# Summarizing data

## Summary statistics

```{r echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
library(knitr)
options(scipen = 999)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
```

This section discusses how to produce and analyse basic summary statistics. We make a distinction between categorical and continuous variables, for which different statistics are permissible.
<br>
<br>

OK to compute....	 | Nominal	 | Ordinal	 | Interval	 | Ratio
------------- | ------------- | ------------- | --- | ---
frequency distribution  | Yes  | Yes  | Yes  | Yes
median and percentiles  | No  | Yes  | Yes  | Yes
mean, standard deviation, standard error of the mean | No  | No  | Yes  | Yes
ratio, or coefficient of variation  | No  | No  | No  | Yes

As an example data set, we will be using the MRDA course survey data. Let's load and inspect the data first.

```{r message=FALSE, warning=FALSE}
test_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/survey2017.dat",
                        sep = "\t", header = TRUE)
head(test_data)
```

### Categorical variables

Categorical variables contain a finite number of categories or distinct groups and are also known as qualitative variables. There are different types of categorical variables:

* **Nominal variables**: variables that have two or more categories but no logical order (e.g., music genres). A dichotomous variables is simply a nominal variable that only has two categories (e.g., gender).
* **Ordinal variables**: variables that have two or more categories that can also be ordered or ranked (e.g., income groups).

For this example, we are interested in the following two variables

* "overall_knowledge": measures the self-reported prior knowledge of marketing students in statistics before taking the marketing research class on a 5-point scale with the categories "none", "basic", "intermediate","advanced", and "proficient".
* "gender": the gender of the students (1 = male, 2 = female)

In a first step, we convert the variables to factor variables using the ```factor()``` function to assign appropriate labels according to the scale points:

```{r}
test_data$overall_knowledge_cat <- factor(test_data$overall_knowledge, levels = c(1:5), labels = c("none", "basic", "intermediate","advanced","proficient"))
test_data$gender_cat <- factor(test_data$gender, levels = c(1:2), labels = c("male", "female"))
```

The ```table()``` function creates a frequency table. Let's start with the number of occurrences of the categories associated with the prior knowledge and gender variables separately:

```{r}
table(test_data[,c("overall_knowledge_cat")]) #absolute frequencies
table(test_data[,c("gender_cat")]) #absolute frequencies
```

It is obvious that there are more female than male students. For variables with more categories, it might be less obvious and we might compute the median category using the ```median()``` function, or the ```summary()``` function, which produces further statistics.

```{r}
median((test_data[,c("overall_knowledge")]))
summary((test_data[,c("overall_knowledge")]))
```

Often, we are interested in the relative frequencies, which can be obtained by using the ```prop.table()``` function.

```{r}
prop.table(table(test_data[,c("overall_knowledge_cat")])) #relative frequencies
prop.table(table(test_data[,c("gender_cat")])) #relative frequencies
```

Now let's investigate if the prior knowledge differs by gender. To do this, we simply apply the ```table()``` function to both variables:

```{r}
table(test_data[,c("overall_knowledge_cat", "gender_cat")]) #absolute frequencies
```

Again, it might be more meaningful to look at the relative frequencies using ```prop.table()```:

```{r}
prop.table(table(test_data[,c("overall_knowledge_cat", "gender_cat")])) #relative frequencies
```

Note that the above output shows the overall relative frequencies when male and female respondents are considered together. In this context, it might be even more meaningful to look at the conditional relative frequencies. This can be achieved by adding a ```,2``` to the ```prop.table()``` command, which tells R to compute the relative frequencies by the columns (which is in our case the gender variable):

```{r}
prop.table(table(test_data[,c("overall_knowledge_cat", "gender_cat")]), 2) #conditional relative frequencies
```

### Continuous variables

#### Descriptive statistics

Continuous variables are numeric variables that can take on any value on a measurement scale (i.e., there is an infinite number of values between any two values). There are different types of continuous variables:

* **Interval variables**: while the zero point is arbitrary, equal intervals on the scale represent equal differences in the property being measured. E.g., on a temperature scale measured in Celsius the difference between a temperature of 15 degrees and  25 degrees is the same difference as between 25 degrees and 35 degrees but the zero point is arbitrary.
* **Ratio variables**: has all the properties of an interval variable, but also has an absolute zero point. When the variable equals 0.0, it means that there is none of that variable (e.g., number of products sold, willingness-to-pay, mileage a car gets).

Computing descriptive statistics in R is easy and there are many functions from different packages that let you calculate summary statistics (including the ```summary()``` function from the ```base``` package). In this tutorial, we will use the ```describe()``` function from the ```psych``` package:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(psych)
psych::describe(test_data[,c("duration", "overall_100")])
```

In the above command, we used the ```psych::``` prefix to avoid confusion and to make sure that R uses the ```describe()``` function from the ```psych``` package since there are many other packages that also contain a ```desribe()``` function. Note that you could also compute these statistics separately by using the respective functions (e.g., ```mean()```, ```sd()```, ```median()```, ```min()```, ```max()```, etc.).

The ```psych``` package also contains the ```describeBy()``` function, which lets you compute the summary statistics by sub-group separately. For example, we could easily compute the summary statistics by gender as follows:

```{r message=FALSE, warning=FALSE}
describeBy(test_data[,c("duration","overall_100")], test_data$gender_cat)
```

Note that you could just as well use other packages to compute the descriptive statistics. For example, you could have used the ```stat.desc()``` function from the ```pastecs``` package:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(pastecs)
stat.desc(test_data[,c("duration", "overall_100")])
```

Computing statistics by group is also possible by using the wrapper function ```by()```. Within the function, you first specify the data on which you would like to perform the grouping ```test_data[,c("duration", "overall_100")]```, followed by the grouping variable ```test_data$gender_cat``` and the function that you would like to execute (e.g., ```stat.desc()```):

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(pastecs)
by(test_data[,c("duration", "overall_100")],test_data$gender_cat,stat.desc)
```

These examples are meant to exemplify that there are often many different ways to reach a goal in R. Which one you choose depends on what type of information you seek (the results provide slightly different information) and on personal preferences.

#### Creating subsets

From the above statistics it is clear that the data set contains some severe outliers on some variables. For example, the maximum duration is `r round(max(test_data$duration)/60/60/24,1)` days. You might want to investigate these cases and delete them if they would turn out to indeed induce a bias in your analyses. For normally distributed data, any absolute standardized deviations larger than 3 standard deviations from the mean are suspicious. Let's check if potential ourliers exist in the data:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
test_data %>% mutate(duration_std = as.vector(scale(duration))) %>% filter(abs(duration_std) > 3)
```

Indeed, there appears to be one potential outlier, which we may wish to exclude before we start fitting models to the data. You could easily create a subset of the original data, which you would then use for estimation using the ```filter()``` function from the ```dplyr()``` package. For example, the following code creates a subset that excludes all cases with a standardized duration of more than 3:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(dplyr)
estimation_sample <- test_data %>% mutate(duration_std = as.vector(scale(duration))) %>% filter(abs(duration_std) < 3)
psych::describe(estimation_sample[,c("duration", "overall_100")])
```