---
pagetitle: "Feature Engineering A-Z | Correlated Overview"
image: social-cards/correlated.png
---
# Correlated Overview {#sec-correlated}
::: {style="visibility: hidden; height: 0px;"}
## Correlated Overview
:::
Correlation occurs when two or more variables contain similar information.
We typically talk about correlation in the context of predictors.
This can be a problem for some machine learning models, as they don't perform well with correlated predictors.
There are many different ways to calculate the degree of correlation, but those details don't matter much right now.
The important thing is that it can happen.
Below we see some examples.
```{r}
#| label: fig-correlation-examples
#| echo: false
#| message: false
#| fig-cap: |
#|   Uncorrelated and correlated features.
#| fig-alt: |
#|   Four scatter charts in a 2x2 grid showing increasing correlation levels.
#|   Top-left "zero correlation": random scatter with no pattern. Top-right
#|   "mild correlation": slight diagonal trend with high variance. Bottom-left
#|   "medium correlation": clear diagonal trend, moderate scatter. Bottom-right
#|   "strong correlation": tight diagonal line with minimal scatter.
library(ggplot2)
library(dplyr)
library(patchwork)

set.seed(1234)

g1 <- tibble(predictor_1 = runif(200)) |>
  mutate(predictor_2 = runif(200)) |>
  ggplot(aes(predictor_1, predictor_2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "zero correlation")

g2 <- tibble(predictor_1 = runif(200)) |>
  mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.5)) |>
  ggplot(aes(predictor_1, predictor_2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "mild correlation")

g3 <- tibble(predictor_1 = runif(200)) |>
  mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.1)) |>
  ggplot(aes(predictor_1, predictor_2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "medium correlation")

g4 <- tibble(predictor_1 = runif(200)) |>
  mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.05)) |>
  ggplot(aes(predictor_1, predictor_2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "strong correlation")

(g1 + g2) / (g3 + g4)
```
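To put numbers on the panels above, we can compute the Pearson correlation coefficient with base R's `cor()`. A minimal sketch, regenerating the same kinds of simulated pairs (the variable `x` and the noise levels here are illustrative choices, not part of the figure code):

```{r}
set.seed(1234)
x <- runif(200)

cor(x, runif(200))                # zero correlation: near 0
cor(x, x + rnorm(200, sd = 0.5))  # mild correlation
cor(x, x + rnorm(200, sd = 0.1))  # medium correlation
cor(x, x + rnorm(200, sd = 0.05)) # strong correlation: close to 1
```

The less noise we add on top of `x`, the closer the coefficient gets to 1.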
Correlated features are a problem for our models because they share information that could be useful.
Imagine we are working with strongly correlated variables, and suppose that `predictor_1` is highly predictive in our model.
Since `predictor_1` and `predictor_2` are correlated, we can conclude that `predictor_2` would also be highly predictive.
The problem arises once one of these predictors is used: the other no longer adds predictive value, since they share their information.
Another way to think about it is that we could replace these two predictors with a single predictor with minimal loss of information.
This is one of the reasons why we sometimes want to do dimension reduction, as seen in @sec-too-many.
We will see how we can use the correlation structure to figure out which variables to eliminate; this is covered in @sec-correlated-filter.
It is a more specialized version of the methods we cover in @sec-too-many-filter, as we look at the correlation between predictors, rather than their relationship to the outcome, to determine which variables to remove.
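As a rough preview of that idea, here is a minimal base-R sketch: compute the absolute correlation matrix and flag any predictor that is highly correlated with an earlier one. The 0.9 cutoff and the three simulated predictors are arbitrary choices for illustration; the filter chapter covers more careful approaches:

```{r}
set.seed(1234)
dat <- data.frame(predictor_1 = runif(200))
dat$predictor_2 <- dat$predictor_1 + rnorm(200, sd = 0.05)
dat$predictor_3 <- runif(200)

# Absolute pairwise correlations; zero out the upper triangle and diagonal
# so each pair is only counted once
cor_mat <- abs(cor(dat))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0

# Flag any predictor involved in a correlation above the cutoff
to_drop <- names(which(apply(cor_mat, 2, max) > 0.9))
dat_filtered <- dat[, setdiff(names(dat), to_drop), drop = FALSE]
names(dat_filtered)
```

Of the strongly correlated pair, one predictor is dropped and the other kept, along with the uncorrelated `predictor_3`.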
Another set of methods that works well here is anything PCA-related, which is split into two chapters: one explaining [PCA](too-many-pca) and one going over [PCA variants](too-many-pca-variants).
The data coming out of PCA will be uncorrelated.
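We can verify this numerically with base R's `prcomp()`. A small sketch, assuming two strongly correlated simulated predictors:

```{r}
set.seed(1234)
predictor_1 <- runif(200)
predictor_2 <- predictor_1 + rnorm(200, sd = 0.1)

# Fit PCA on the centered and scaled predictors
pca <- prcomp(cbind(predictor_1, predictor_2), scale. = TRUE)

# The principal component scores are uncorrelated by construction:
# the off-diagonal entries are zero up to floating-point error
round(cor(pca$x), digits = 10)
```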