

# Regression and model validation
Week 2,
Anton Jyrkiäinen

### Part 1

The data used is from a study which investigates the relationship between teaching, learning approaches and exam results of the students.

The data is available [here](http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt) and the metadata [here](http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt).

First we import the data that we wrangled and saved locally in the first part of the exercise. The data is assigned to the variable ```lrn14``` using ```read.table()```.

```{r}
lrn14 <- read.table("C:/Users/Anton/Documents/IODS-project/learning2014.txt")
```
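
As a side note on why no ```header = TRUE``` is needed above: a minimal sketch, assuming the file was saved with ```write.table()``` and default settings. The small data frame ```demo``` and the temporary file are made up for illustration and are not part of the exercise data.

```r
# Hypothetical two-row data frame, written to a temporary file.
demo <- data.frame(gender = c("F", "M"), Points = c(25, 18))
tmp <- tempfile(fileext = ".txt")

# write.table() writes row names by default, so the header line has one
# field fewer than the data lines.
write.table(demo, file = tmp)

# read.table() detects such a header automatically and restores the
# column names.
back <- read.table(tmp)
str(back)
```

Building the path with ```file.path()``` relative to the project folder, instead of an absolute path, would make the chunk easier to run on another machine.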

We also need to load a few dependencies for later use:
```{r}
library(GGally)
library(ggplot2)
library(dplyr)
```

By using ```str()``` we can have a look at the structure of the data:

```{r}
str(lrn14)
```

The output gives us meta information about the data as a whole and about each column: the number of rows and columns, the names of the variables, their data types and their first values. There are seven variables: gender, age, global attitude toward statistics, exam points and three variables, "deep" (deep questions), "surf" (surface questions) and "stra" (strategic questions), each of which is the mean of a student's answers to a set of related questions. These three, in the same order, stand for "intention to maximize understanding, with a true commitment to learning", "memorizing without understanding, with a serious lack of personal engagement in the learning process" and "the ways students organize their studying".

The questions were answered on a scale from 1 to 5 (1 = disagree, 5 = agree). The data covers answers from 183 participants, 67% of whom were female, with a mean age of 25.6 years. Observations with 0 exam points were excluded from the data in the wrangling part of the exercise.
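
The two wrangling steps mentioned above (averaging questions into a combination variable and dropping observations with 0 points) can be sketched roughly as follows. The question columns ```D1```, ```D2``` and ```D3``` and all the values are hypothetical; the real wrangling script uses the question names from the metadata.

```r
# Hypothetical raw answers on the 1-5 scale; names and values are made up.
raw <- data.frame(
  D1 = c(4, 5, 3, 2),
  D2 = c(5, 4, 3, 3),
  D3 = c(3, 5, 4, 1),
  Points = c(25, 0, 18, 30)
)

# A combination variable is the row-wise mean of its questions.
raw$deep <- rowMeans(raw[, c("D1", "D2", "D3")])

# Keep only the observations with more than 0 exam points.
kept <- raw[raw$Points > 0, ]
kept
```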

Using ```dim()``` we get the dimensions of the dataset, that is, the number of rows and columns:
```{r}
dim(lrn14)
```

### Part 2

The ```ggpairs()``` function provides us with a matrix of plots consisting of density plots, bar plots, histograms, scatter plots, correlation coefficients and box plots. Together these plots cover all pairwise combinations of the variables.

Using the dataset with ```ggpairs()``` we generate a graphical overview:

```{r}
p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
```


And with ```summary()``` we get a summary of the variables:

```{r}
summary(lrn14)
```

Looking at the density plots gives us a rough idea of how each variable is distributed. All the variables seem to be close to normally distributed except for "age" and "deep". Age ranges from 17 to 55, so it makes sense for its density to be concentrated at the young end, since the participants of the survey are students. The "deep" variable is interestingly skewed towards agreeing with the questions: from the summary we see that the mean of the "deep" answers is 3.68.

### Part 3

A regression model is fitted with ```lm()```. In the formula we use "Points" as the dependent variable and "Attitude", "stra" and "deep" as explanatory variables. Printing the model shows the estimated coefficients; a summary of the model is generated with ```summary()``` further below:

```{r}
lm1 <- lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
lm1
```

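The intercept is the expected value of y ("Points") when all the x ("Attitude", "stra", "deep") are 0. The other values are the slopes of the regression model: each one is the expected change in "Points" when that variable increases by one unit and the other variables are held constant.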
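
This reading of the intercept and the slopes can be checked with ```predict()```. The sketch below uses simulated data with made-up variables ```x1```, ```x2``` and ```y```, not the ```lrn14``` data.

```r
# Simulated data: the true coefficients (2, 3, -1) are arbitrary choices.
set.seed(1)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 1 * sim$x2 + rnorm(n)

fit_sim <- lm(y ~ x1 + x2, data = sim)
b <- coef(fit_sim)

# The prediction with all explanatory variables at 0 equals the intercept.
at_zero <- predict(fit_sim, newdata = data.frame(x1 = 0, x2 = 0))
all.equal(unname(at_zero), unname(b["(Intercept)"]))

# Raising x1 by one unit with x2 fixed changes the prediction by the x1 slope.
at_one <- predict(fit_sim, newdata = data.frame(x1 = 1, x2 = 0))
all.equal(unname(at_one - at_zero), unname(b["x1"]))
```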

```{r}
summary(lm1)
```

The residuals are the differences between the observed values of the dependent variable and the values predicted by the model. Each data point has a residual. The model assumes that the errors are normally distributed, so the residuals should look approximately normal, and in a least-squares fit with an intercept their sum is always 0. Residuals can reveal patterns in the data that the model leaves unexplained.
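
A small sketch of the definition above, again on simulated data (the variables ```x``` and ```y``` are made up): the residuals returned by ```residuals()``` are the observed values minus the fitted values, and they sum to 0 up to floating point error.

```r
# Simulated data, not the lrn14 data.
set.seed(2)
sim <- data.frame(x = rnorm(50))
sim$y <- 1 + 2 * sim$x + rnorm(50)

fit_sim <- lm(y ~ x, data = sim)

# residual = observed value - fitted value
manual_res <- sim$y - fitted(fit_sim)
all.equal(unname(manual_res), unname(residuals(fit_sim)))

# With an intercept in the model the residuals sum to (numerically) zero.
sum(residuals(fit_sim))
```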

The coefficient table shows the estimate, standard error, t-value and p-value of each coefficient. The standard error is an estimate of the standard deviation of the coefficient estimate. The t-value is the estimate divided by its standard error. The p-value is the probability, assuming the null hypothesis that the true coefficient is 0, of observing a t-value at least as far from 0 as the one presented. The p-value thus indicates whether there is evidence against the null hypothesis: a low p-value indicates strong evidence against it, and a high p-value indicates weak or no evidence against it.
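
The t-values and p-values in the coefficient table can be reproduced by hand. This is a minimal sketch on simulated data (made-up variables ```x``` and ```y```), assuming the usual two-sided t-test with the residual degrees of freedom of the model.

```r
# Simulated data, not the lrn14 data.
set.seed(3)
sim <- data.frame(x = rnorm(60))
sim$y <- 1 + 0.5 * sim$x + rnorm(60)

fit_sim <- lm(y ~ x, data = sim)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(fit_sim))

# t-value = estimate / standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit_sim), lower.tail = FALSE)

all.equal(t_manual, tab[, "t value"])
all.equal(p_manual, tab[, "Pr(>|t|)"])
```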

The output also gives us "Signif. codes", which make it easier to interpret the results at a glance. In this case "Attitude" received a p-value under 0.01, which is strong evidence against the null hypothesis. In other words, it is unlikely that we would see this result if "Attitude" had no linear relationship with "Points". "stra" had a p-value between 0.05 and 0.1, which is only weak evidence against the null hypothesis. The variable "deep" does not have a statistically significant relationship with the target variable.

Next we fit the model again without variable "deep":

```{r}
lm1 <- lm(formula = Points ~ Attitude + stra, data = lrn14)
lm1
```

```{r}
summary(lm1)
```

Without the variable "deep" we get similar results, with slightly higher p-values for "Attitude" and "stra" that still fall within the same significance codes as before. The R-squared and the F-statistic also stay roughly the same.

### Part 4

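Residual standard error is an estimate of the standard deviation of the residuals: the square root of the residual sum of squares divided by the residual degrees of freedom.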
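
As a rough check of this definition, the value reported by ```summary()``` can be recomputed from the residuals. The data below is simulated with made-up variables ```x``` and ```y```.

```r
# Simulated data, not the lrn14 data.
set.seed(5)
sim <- data.frame(x = rnorm(40))
sim$y <- 3 + 1.5 * sim$x + rnorm(40, sd = 2)

fit_sim <- lm(y ~ x, data = sim)

# Residual sum of squares divided by the residual degrees of freedom
rss <- sum(residuals(fit_sim)^2)
rse_manual <- sqrt(rss / df.residual(fit_sim))

all.equal(rse_manual, summary(fit_sim)$sigma)

# sd() divides by n - 1 instead of the residual degrees of freedom,
# so it gives a slightly smaller value.
sd(residuals(fit_sim))
```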

R-squared is the proportion of the variance of the dependent variable that the model explains, so it tells us how close the data is to the fitted regression line. The higher the R-squared, the better the model fits the data. If the R-squared was 100%, all the observed values would lie exactly on the fitted regression line. The difference between multiple and adjusted R-squared is that the adjusted R-squared takes into account the fact that R-squared automatically increases (or at least never decreases) when more explanatory variables are added to the model. The adjusted R-squared is 19.51%, which indicates that the data points are quite scattered around the fitted regression line, so predictions produced with this model would be somewhat inaccurate.
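
Both quantities can be recomputed from the sums of squares. The sketch below uses simulated data (the variables ```x1```, ```x2```, ```noise``` and ```y``` are made up) and also adds a pure noise variable to show that the multiple R-squared does not go down, while the adjusted R-squared is penalised for the extra variable.

```r
# Simulated data, not the lrn14 data; `noise` is unrelated to y by construction.
set.seed(4)
n <- 80
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n, sd = 2)

fit_sim <- lm(y ~ x1 + x2, data = sim)
s <- summary(fit_sim)

rss <- sum(residuals(fit_sim)^2)       # residual sum of squares
tss <- sum((sim$y - mean(sim$y))^2)    # total sum of squares
p <- length(coef(fit_sim)) - 1         # number of explanatory variables

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(manual = r2, summary = s$r.squared)
c(manual = adj_r2, summary = s$adj.r.squared)

# Adding an unrelated variable: compare both measures with the model above.
s_noise <- summary(lm(y ~ x1 + x2 + noise, data = sim))
c(r2_change = s_noise$r.squared - s$r.squared,
  adj_r2_change = s_noise$adj.r.squared - s$adj.r.squared)
```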

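The F-statistic compares the variation explained by the model with the unexplained variation: it is the explained variance per explanatory variable divided by the residual variance. Its null hypothesis is that all the slope coefficients are 0. The p-value of the F-statistic indicates that at least one of the explanatory variables has a statistically significant relationship with "Points". If the p-value were not significant, we could not reject the null hypothesis.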
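
A sketch of how the F-statistic and its p-value relate to the sums of squares, on simulated data with made-up variables ```x1```, ```x2``` and ```y```. The ```fstatistic``` element of the summary holds the value together with its two degrees of freedom.

```r
# Simulated data, not the lrn14 data.
set.seed(6)
n <- 80
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n, sd = 2)

fit_sim <- lm(y ~ x1 + x2, data = sim)
f <- summary(fit_sim)$fstatistic       # named vector: value, numdf, dendf

rss <- sum(residuals(fit_sim)^2)       # unexplained variation
tss <- sum((sim$y - mean(sim$y))^2)    # total variation
p <- length(coef(fit_sim)) - 1         # number of explanatory variables

# Explained variance per explanatory variable over residual variance
f_manual <- ((tss - rss) / p) / (rss / (n - p - 1))
all.equal(f_manual, f[["value"]])

# Upper tail of the F distribution gives the p-value printed by summary()
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_value
```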

### Part 5

Using ```plot()``` we generate three diagnostic plots: "Residuals vs Fitted values", "Normal QQ-plot" and "Residuals vs Leverage". We give the model as the first argument and use the ```which``` argument to choose the right plots.

Residuals vs fitted values:

```{r}
plot(lm1, which = c(1))
```

We can see that the variability of the residuals is about equal across the fitted values, which is good. There are no signs here that would indicate a problem with the model.

Normal Q-Q:

```{r}
plot(lm1, which = c(2))
```

If the residuals are normally distributed, the plot should result in an approximately straight line. In the plot above the line formed by the points is fairly straight, so there should be no problem here. There are some extreme values, but they are about as extreme as would be expected. We can see the same kind of extremes at the low and the high end, which is consistent with a normal distribution.
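
The same kind of plot can be built by hand from the standardized residuals, which may help in reading it. This is a sketch on simulated data with normally distributed errors (made-up variables ```x``` and ```y```), so the points are expected to lie close to the reference line.

```r
# Simulated data with normal errors, not the lrn14 data.
set.seed(7)
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 2 * sim$x + rnorm(100)

fit_sim <- lm(y ~ x, data = sim)

# Standardized residuals against the theoretical normal quantiles
std_res <- rstandard(fit_sim)
qq <- qqnorm(std_res, main = "Normal Q-Q (manual)")
qqline(std_res)

# A correlation close to 1 between the two sets of quantiles corresponds
# to an approximately straight line.
cor(qq$x, qq$y)
```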

Residuals vs leverage:

```{r}
plot(lm1, which = c(5))
```

This plot shows us whether there are extreme values that are influential. Even though there are some extreme values, they are not influential, since they more or less follow the trend of the other values. Excluding an influential point would radically change the results, which does not seem to be the case here.
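
One way to check the last statement numerically is to look at the leverages and Cook's distances and to refit the model without the most influential observation. The sketch below does this on simulated data (made-up variables ```x``` and ```y```); it is an illustration of the idea, not a result for ```lrn14```.

```r
# Simulated data, not the lrn14 data.
set.seed(8)
sim <- data.frame(x = rnorm(60))
sim$y <- 1 + 2 * sim$x + rnorm(60)

fit_sim <- lm(y ~ x, data = sim)

lev <- hatvalues(fit_sim)        # leverage of each observation
cook <- cooks.distance(fit_sim)  # influence of each observation

# The leverages sum to the number of estimated coefficients.
all.equal(sum(lev), length(coef(fit_sim)))

# Refit without the observation with the largest Cook's distance and
# compare the coefficients of the two fits.
worst <- which.max(cook)
fit_drop <- lm(y ~ x, data = sim[-worst, ])
rbind(all = coef(fit_sim), without_worst = coef(fit_drop))
```

If the two rows of coefficients are close to each other, the point is not influential in the sense described above.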