# Extensions of Correlation: Nonlinear and Conditional {#sec-correlation-extensions}
```{r, echo = FALSE, message = FALSE, warning = FALSE}
# Setup for chapter examples
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(gt)
library(patchwork)
library(knitr)
library(mgcv) # For smooth conditional relationships
library(gratia) # For GAM visualization
```
## Introduction: Why Standard Correlation Falls Short
Pearson and Spearman correlations are powerful descriptive tools, but they capture only *marginal* associations. In real data, relationships often depend on context: a strong positive association for one subgroup may disappear or reverse in another. Additionally, standard correlations assume that relationships are *linear* (or at least monotonic), which can miss complex functional forms that drive patterns in nonlinear systems.
This chapter extends the correlation toolkit in two directions:
1. **Nonlinear associations**: Methods that detect and describe relationships that curve, oscillate, or follow other complex patterns
2. **Conditional associations**: Techniques that reveal how associations *change* across levels of other variables or contextual conditions
## Part I: Nonlinear Association Methods
### Visual Detection of Nonlinearity
Before applying formal measures, it helps to start with visual inspection. A scatterplot with a smooth trend line can reveal whether linearity is a reasonable assumption.
```{r, message = FALSE, warning = FALSE}
# Create dataset with various nonlinear relationships
set.seed(42)
n <- 200
nonlin_data <- tibble(
  Linear = seq(0, 10, length.out = n),
  Quadratic = seq(0, 10, length.out = n),
  Exponential = seq(0, 10, length.out = n),
  Sine = seq(0, 4 * pi, length.out = n),
  # Independent noise terms (named e_* to distinguish them from predictors)
  e_lin = rnorm(n, 0, 1),
  e_quad = rnorm(n, 0, 1),
  e_exp = rnorm(n, 0, 1),
  e_sin = rnorm(n, 0, 1)
) |>
  mutate(
    y_lin = Linear + e_lin,
    y_quad = 10 - Quadratic^2 / 8 + e_quad,
    y_exp = exp(Exponential / 3) + e_exp * 2,
    y_sin = sin(Sine) + e_sin
  )
# Visualize multiple relationship types
p1 <- nonlin_data |>
ggplot(aes(x = Linear, y = y_lin)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
labs(
title = "Linear Relationship",
x = "X", y = "Y",
subtitle = "Red: linear fit | Blue: smooth fit"
) +
theme_minimal()
p2 <- nonlin_data |>
ggplot(aes(x = Quadratic, y = y_quad)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
labs(
title = "Quadratic Relationship",
x = "X", y = "Y"
) +
theme_minimal()
p3 <- nonlin_data |>
ggplot(aes(x = Exponential, y = y_exp)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
labs(
title = "Exponential Relationship",
x = "X", y = "Y"
) +
theme_minimal()
p4 <- nonlin_data |>
ggplot(aes(x = Sine, y = y_sin)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
labs(
title = "Periodic Relationship",
x = "X", y = "Y"
) +
theme_minimal()
(p1 + p2) / (p3 + p4)
```
**Interpretation**: When the blue smooth curve (locally weighted scatterplot smoothing, LOESS) deviates substantially from the red linear fit, nonlinearity is present. This visual diagnostic often provides helpful context before quantitative analysis.
### Quantifying Nonlinearity: The $R^2$ Contrast
A simple descriptive measure of nonlinearity is to compare the $R^2$ from a **linear model** to the $R^2$ from a **smooth nonparametric model** (such as LOESS or a spline). The difference indicates how much variance can be explained by nonlinear structure that the linear model misses.
$$\text{Nonlinearity Index} = R^2_{\text{smooth}} - R^2_{\text{linear}}$$
```{r}
# Function to compute nonlinearity index
nonlinearity_index <- function(x, y) {
df <- tibble(x = x, y = y) |>
na.omit()
# Linear model
lm_fit <- lm(y ~ x, data = df)
r2_linear <- summary(lm_fit)$r.squared
# LOESS smooth fit
loess_fit <- loess(y ~ x, data = df)
r2_smooth <- 1 - (sum(loess_fit$residuals^2) / sum((df$y - mean(df$y))^2))
tibble(
R2_linear = r2_linear,
R2_smooth = r2_smooth,
Nonlinearity_Index = r2_smooth - r2_linear
)
}
# Apply to our synthetic data
comparison <- bind_rows(
nonlinearity_index(nonlin_data$Linear, nonlin_data$y_lin) |>
mutate(Relationship = "Linear"),
nonlinearity_index(nonlin_data$Quadratic, nonlin_data$y_quad) |>
mutate(Relationship = "Quadratic"),
nonlinearity_index(nonlin_data$Exponential, nonlin_data$y_exp) |>
mutate(Relationship = "Exponential"),
nonlinearity_index(nonlin_data$Sine, nonlin_data$y_sin) |>
mutate(Relationship = "Periodic")
)
comparison |>
gt() |>
fmt_number(columns = c(R2_linear, R2_smooth, Nonlinearity_Index), decimals = 3)
```
**Interpretation**: A high nonlinearity index can suggest that a linear model substantially underestimates the association strength. However, this measure is best **paired with visual inspection**, as sometimes the smooth fit overfits noise.
### Distance Correlation: A Revisit with Nonlinearity in Focus
Recall from Chapter [-@sec-association-measures] that **distance correlation** detects both linear and nonlinear associations. Let's examine how it performs on nonlinear data:
```{r}
# Distance correlation comparison
library(energy)
comparison_dcor <- bind_rows(
tibble(
Relationship = "Linear",
Pearson = cor(nonlin_data$Linear, nonlin_data$y_lin, method = "pearson"),
Spearman = cor(nonlin_data$Linear, nonlin_data$y_lin, method = "spearman"),
DistanceCorr = dcor(nonlin_data$Linear, nonlin_data$y_lin)
),
tibble(
Relationship = "Quadratic",
Pearson = cor(nonlin_data$Quadratic, nonlin_data$y_quad, method = "pearson"),
Spearman = cor(nonlin_data$Quadratic, nonlin_data$y_quad, method = "spearman"),
DistanceCorr = dcor(nonlin_data$Quadratic, nonlin_data$y_quad)
),
tibble(
Relationship = "Exponential",
Pearson = cor(nonlin_data$Exponential, nonlin_data$y_exp, method = "pearson"),
Spearman = cor(nonlin_data$Exponential, nonlin_data$y_exp, method = "spearman"),
DistanceCorr = dcor(nonlin_data$Exponential, nonlin_data$y_exp)
),
tibble(
Relationship = "Periodic",
Pearson = cor(nonlin_data$Sine, nonlin_data$y_sin, method = "pearson"),
Spearman = cor(nonlin_data$Sine, nonlin_data$y_sin, method = "spearman"),
DistanceCorr = dcor(nonlin_data$Sine, nonlin_data$y_sin)
)
)
comparison_dcor |>
gt() |>
fmt_number(columns = c(Pearson, Spearman, DistanceCorr), decimals = 3)
```
**Observation**: Distance correlation captures all relationship types more uniformly than Pearson or Spearman, which may completely miss periodic patterns.
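**Mutual information** is another measure that is near zero only under independence, and it appears again in the summary table at the end of this chapter. A minimal sketch, assuming the `infotheo` package is available (its `discretize()` bins continuous variables before estimation, and the estimate is sensitive to that binning choice):

```{r}
library(infotheo)
# Mutual information for the linear and periodic pairs.
# Like distance correlation, MI can pick up the periodic pattern
# that Pearson and Spearman miss.
mi_linear <- mutinformation(
  discretize(nonlin_data$Linear),
  discretize(nonlin_data$y_lin)
)
mi_periodic <- mutinformation(
  discretize(nonlin_data$Sine),
  discretize(nonlin_data$y_sin)
)
tibble(
  Relationship = c("Linear", "Periodic"),
  MutualInformation = c(mi_linear, mi_periodic)
)
```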
### Generalized Additive Models (GAMs) for Smooth Associations
**Generalized Additive Models** extend linear regression by allowing smooth functions instead of linear terms. For description, GAMs serve two purposes:
1. Visualize the functional form of a relationship
2. Quantify association strength via model diagnostics
```{r}
# Fit GAM to exponential data
gam_fit <- gam(y_exp ~ s(Exponential), data = nonlin_data)
# Extract and visualize smooth term
plot_data <- tibble(
Exponential = seq(min(nonlin_data$Exponential),
max(nonlin_data$Exponential),
length.out = 100)
)
# Predict from GAM
preds <- predict(gam_fit, newdata = plot_data, se.fit = TRUE)
plot_data <- plot_data |>
mutate(
fit = preds$fit,
se = preds$se.fit,
upper = fit + 1.96 * se,
lower = fit - 1.96 * se
)
# Visualize with original data
ggplot(nonlin_data, aes(x = Exponential, y = y_exp)) +
geom_point(alpha = 0.4, size = 2) +
geom_ribbon(data = plot_data, aes(y = fit, ymin = lower, ymax = upper),
fill = "lightblue", color = NA, alpha = 0.5) +
geom_line(data = plot_data, aes(y = fit), color = "darkblue", linewidth = 1) +
labs(
title = "GAM Smooth Term: Exponential Relationship",
x = "X", y = "Y",
subtitle = "Blue ribbon: 95% confidence band"
) +
theme_minimal()
# Model summary
tibble(
Metric = c("Deviance Explained", "GCV Score"),
Value = c(
paste0(round(summary(gam_fit)$dev.expl * 100, 1), "%"),
round(gam_fit$gcv.ubre, 3)
)
) |>
gt()
```
**Interpretation**:
- **Deviance Explained** shows how much variance the GAM accounts for (analogous to $R^2$)
- **GCV Score** (Generalized Cross-Validation) balances fit quality and complexity (lower is better)
- The smooth visualization reveals the precise functional form of the relationship
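A related diagnostic is the **estimated degrees of freedom** (EDF) of a smooth term: an EDF near 1 indicates the penalized smooth has shrunk to an essentially linear effect, while larger values indicate genuine curvature. A minimal sketch (an informal rule of thumb, not a formal test), reusing the `gam_fit` object from above:

```{r}
# EDF per smooth term: ~1 suggests an effectively linear fit,
# larger values suggest nonlinearity
s_table <- summary(gam_fit)$s.table
tibble(
  Term = rownames(s_table),
  EDF = s_table[, "edf"]
)
```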
### Additive Models and Interaction Effects
Beyond univariate smooths, GAMs allow **tensor product interactions** (`ti(x, z)` in mgcv), which capture how the effect of one variable on the response changes with the level of another:
```{r}
# Create data with interaction: y depends on x and z together
set.seed(42)
interaction_data <- expand_grid(
x = seq(0, 5, length.out = 30),
z = seq(0, 5, length.out = 30)
) |>
mutate(
y = sin(x) * z + rnorm(n(), 0, 0.3)
)
# Fit GAM with tensor product interaction
gam_interact <- gam(y ~ ti(x) + ti(z) + ti(x, z),
data = interaction_data)
# Model summary (the ti(x, z) term captures the interaction)
summary(gam_interact)
# Contour plot of fitted surface
pred_grid <- expand_grid(
x = seq(0, 5, length.out = 50),
z = seq(0, 5, length.out = 50)
)
pred_grid$y_hat <- predict(gam_interact, newdata = pred_grid)
ggplot(pred_grid, aes(x = x, y = z, fill = y_hat)) +
geom_tile() +
scale_fill_viridis_c() +
labs(
title = "Fitted Surface: Interaction Effect",
x = "X", y = "Z", fill = "Predicted Y"
) +
theme_minimal()
```
## Part II: Conditional Associations
### The Concept of Conditional Correlation
A **conditional correlation** estimates the association between two variables while "holding constant" (controlling for) a third variable. Mathematically, it measures the correlation between the **residuals** of $X$ and $Y$ after removing the linear effects of $Z$.
$$r_{XY|Z} = \frac{\text{Cov}(X - \hat{X}_Z, Y - \hat{Y}_Z)}{\sqrt{\text{Var}(X - \hat{X}_Z) \cdot \text{Var}(Y - \hat{Y}_Z)}}$$
where $\hat{X}_Z$ and $\hat{Y}_Z$ are predictions from regressions of $X$ and $Y$ on $Z$.
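For a single conditioning variable, the same quantity has a closed form in terms of the three pairwise correlations:

$$r_{XY|Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$

A quick sketch checking that the residual definition and the closed form agree, using `mtcars` with $Z$ = `hp`:

```{r}
# Closed-form partial correlation vs. the residual-based definition
r_xy <- cor(mtcars$mpg, mtcars$wt)
r_xz <- cor(mtcars$mpg, mtcars$hp)
r_yz <- cor(mtcars$wt, mtcars$hp)
closed_form <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
residual_based <- cor(
  residuals(lm(mpg ~ hp, data = mtcars)),
  residuals(lm(wt ~ hp, data = mtcars))
)
all.equal(closed_form, residual_based) # the two definitions agree
```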
### Partial Correlation
**Partial correlation** is the most common way to estimate conditional association. It measures the linear association between two variables after regressing out the linear effects of control variables.
```{r}
# Example: relationship between body mass (wt) and fuel economy (mpg),
# controlling for engine power (hp)
mtcars_subset <- mtcars |>
select(mpg, wt, hp, qsec) |>
na.omit()
# Bivariate correlations
bivariate_cors <- tibble(
  Pair = c("mpg-wt", "mpg-hp", "wt-hp"),
  Correlation = c(
    cor(mtcars_subset$mpg, mtcars_subset$wt),
    cor(mtcars_subset$mpg, mtcars_subset$hp),
    cor(mtcars_subset$wt, mtcars_subset$hp)
  )
)
bivariate_cors |>
  gt() |>
  fmt_number(columns = Correlation, decimals = 3)
# Partial correlation: mpg ~ wt | hp
# Method: residuals approach
fit_mpg_hp <- lm(mpg ~ hp, data = mtcars_subset)
fit_wt_hp <- lm(wt ~ hp, data = mtcars_subset)
partial_cor_mpg_wt_given_hp <- cor(
residuals(fit_mpg_hp),
residuals(fit_wt_hp)
)
# Partial correlation: mpg ~ hp | wt
fit_mpg_wt <- lm(mpg ~ wt, data = mtcars_subset)
fit_hp_wt <- lm(hp ~ wt, data = mtcars_subset)
partial_cor_mpg_hp_given_wt <- cor(
residuals(fit_mpg_wt),
residuals(fit_hp_wt)
)
partial_summary <- tibble(
Association = c(
"mpg-wt (bivariate)",
"mpg-wt (controlling for hp)",
"mpg-hp (bivariate)",
"mpg-hp (controlling for wt)"
),
Correlation = c(
cor(mtcars_subset$mpg, mtcars_subset$wt),
partial_cor_mpg_wt_given_hp,
cor(mtcars_subset$mpg, mtcars_subset$hp),
partial_cor_mpg_hp_given_wt
)
)
partial_summary |>
gt() |>
fmt_number(columns = Correlation, decimals = 3)
```
**Key Insight**: Notice how partial correlations can differ markedly from bivariate correlations. The relationship between `wt` and `mpg` remains strong even after controlling for `hp`, suggesting that weight is associated with fuel economy beyond what engine power accounts for.
### Semi-Partial (Part) Correlation
**Semi-partial correlation** is related but slightly different. It measures the association between $X$ and $Y$ after removing the effect of $Z$ from *only one* variable (usually the predictor).
$$r_{Y(X.Z)} = \frac{\text{Cov}(X - \hat{X}_Z, Y)}{\sqrt{\text{Var}(X - \hat{X}_Z) \cdot \text{Var}(Y)}}$$
```{r}
# Semi-partial: association between wt and mpg, removing hp effects from wt only
residuals_wt_hp <- residuals(lm(wt ~ hp, data = mtcars_subset))
semi_partial_mpg_wt_given_hp <- cor(residuals_wt_hp, mtcars_subset$mpg)
comparison_correlations <- tibble(
Type = c("Bivariate", "Partial (both adjusted)", "Semi-partial (X adjusted)"),
mpg_wt = c(
cor(mtcars_subset$mpg, mtcars_subset$wt),
partial_cor_mpg_wt_given_hp,
semi_partial_mpg_wt_given_hp
)
)
comparison_correlations |>
gt() |>
fmt_number(columns = mpg_wt, decimals = 3)
```
### Stratified Analysis: Examining Associations Within Groups
Sometimes the appropriate "control" is not a continuous covariate but a categorical stratification. **Stratified analysis** computes associations separately within each group, revealing how patterns differ.
```{r}
# Stratify mtcars by number of cylinders
mtcars_strat <- mtcars |>
mutate(cyl = factor(cyl, labels = c("4-cyl", "6-cyl", "8-cyl")))
# Correlation of mpg and wt within each stratum
strat_cors <- mtcars_strat |>
group_by(cyl) |>
summarise(
N = n(),
Correlation = cor(mpg, wt),
.groups = "drop"
)
strat_cors |>
gt() |>
fmt_number(columns = Correlation, decimals = 3)
# Visualize stratified associations
mtcars_strat |>
ggplot(aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
scale_color_brewer(palette = "Set1") +
labs(
title = "mpg vs. wt by Engine Type",
x = "Weight (1000 lbs)", y = "Fuel Economy (mpg)",
color = "Engine Type"
) +
theme_minimal()
```
**Observation**: The relationship between weight and fuel economy is negative within every stratum, though it is somewhat stronger (more negative) in the 4-cylinder group.
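The stratified point estimates can be given rough uncertainty bands via the standard Fisher $z$ approximation (a sketch; with only 7 to 14 cars per stratum the intervals will be wide):

```{r}
# Approximate 95% CI per stratum: z = atanh(r), SE(z) = 1 / sqrt(n - 3)
mtcars_strat |>
  group_by(cyl) |>
  summarise(N = n(), r = cor(mpg, wt), .groups = "drop") |>
  mutate(
    lower = tanh(atanh(r) - 1.96 / sqrt(N - 3)),
    upper = tanh(atanh(r) + 1.96 / sqrt(N - 3))
  )
```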
### Correlation Networks: Detecting Conditional Independence
With many variables, **graphical models** can identify which associations are direct versus mediated through other variables. The **graphical lasso** (or other sparse methods) estimates a sparse precision matrix (edges that remain after removing the effects of all other variables).
While full graphical model estimation is beyond this descriptive chapter, we can identify **strong conditional associations** by computing partial correlations across all pairs, visualizing only the strongest connections, and comparing to bivariate correlations to identify mediated effects.
```{r}
# Simple conditional independence example using partial correlations
# (corpcor::cor2pcor() gives the same result; below we invert the
#  correlation matrix directly to show the mechanics)
library(corpcor)
# Create a subset of continuous variables
mtcars_cont <- mtcars |>
select(mpg, wt, hp, disp, qsec) |>
na.omit()
# Compute correlations
df_corr <- cor(mtcars_cont)
# Compute partial correlations (adjusting for all other variables)
# Exact: partial correlations are the negated off-diagonal entries of
# the standardized inverse correlation (precision) matrix
df_precision <- solve(df_corr)
df_partial <- -cov2cor(df_precision)
diag(df_partial) <- 1
# Compare formats
comparison_corr_types <- tibble(
Pair = c("mpg-wt", "mpg-hp", "wt-hp"),
Bivariate = c(df_corr["mpg", "wt"], df_corr["mpg", "hp"], df_corr["wt", "hp"]),
Partial = c(df_partial["mpg", "wt"], df_partial["mpg", "hp"], df_partial["wt", "hp"])
)
comparison_corr_types |>
gt() |>
fmt_number(columns = c(Bivariate, Partial), decimals = 3)
```
## Part III: Combining Nonlinear and Conditional Analysis
Real-world analysis often requires both nonlinear and conditional thinking. We can combine these ideas in a **semiparametric model** that allows smooth associations while controlling for other variables.
```{r}
# Example: how does the effect of weight on mpg vary with engine power?
# s(wt, by = hp) is a varying-coefficient term: a smooth function of wt
# whose contribution scales with hp
gam_semi <- gam(mpg ~ s(wt, by = hp) + s(hp),
                data = mtcars)
# Visualize the varying slope effect
hp_values <- quantile(mtcars$hp, probs = c(0.25, 0.5, 0.75))
hp_levels <- paste0("hp = ", hp_values)
mtcars_col <- mtcars |>
mutate(
hp_label = case_when(
hp <= hp_values[1] ~ hp_levels[1],
hp <= hp_values[2] ~ hp_levels[2],
TRUE ~ hp_levels[3]
),
hp_label = factor(hp_label, levels = hp_levels)
)
pred_smooth <- map_df(hp_values, function(hp_val) {
pred_grid <- tibble(
wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
hp = hp_val
)
pred_grid$mpg_pred <- predict(gam_semi, newdata = pred_grid)
pred_grid$hp_label <- paste0("hp = ", hp_val)
pred_grid
})
ggplot(mtcars_col, aes(x = wt, y = mpg, color = hp_label)) +
geom_point(alpha = 0.6, size = 2) +
geom_line(
data = pred_smooth,
aes(y = mpg_pred),
linewidth = 1
) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Weight-MPG Relationship Varies by Engine Power",
x = "Weight (1000 lbs)",
y = "Fuel Economy (mpg)",
color = "Engine Power"
) +
theme_minimal()
```
**Interpretation**: The smooth trends indicate a consistently negative relationship between vehicle weight and fuel economy across all engine power levels. However, this relationship becomes steeper as engine power increases: for a given increase in weight, high-horsepower vehicles experience a larger reduction in mpg than low-horsepower vehicles. This suggests that heavier, more powerful cars are disproportionately penalized in terms of fuel efficiency.
## Summary: When to Use Each Method
| Situation | Recommended Method |
|---------------------------|---------------------------------------------|
| Exploring whether association is linear | Nonlinearity index, visual inspection with LOESS |
| Describing nonlinear relationship shape | GAM smooth terms, scatterplot with LOESS |
| Testing for independence (any form) | Distance correlation, mutual information |
| Adjusting for one continuous confounder | Partial correlation, residual method |
| Examining association in subgroups | Stratified analysis, conditional plots |
| Multiple confounders simultaneously | GAM with adjustment terms, graphical models |
| Context-dependent nonlinear effects | Semiparametric models (GAM with interactions) |
## Summary and Key Takeaways
- **Linear correlations can miss real patterns**: It helps to visualize associations with smooth trend lines to detect nonlinearity
- **Nonlinearity indices** compare simple vs. complex model fits to quantify how much structure is missed by linearity
- **Distance correlation** and **GAMs** are flexible tools that capture diverse functional forms
- **Partial correlations** separate direct from mediated effects, revealing which associations are conditionally independent
- **Stratified analysis** reveals how associations vary across subpopulations
- **Combining approaches** (nonlinear + conditional) yields richer understanding than either alone
- **Visualization precedes quantification**: Scatterplots with smooth fits can usefully accompany numeric results
## Looking Ahead
With nonlinear and conditional association methods now in hand, the next chapter moves to **network representations** of association patterns. Networks elegantly visualize how multiple associations connect and interact, transforming correlation matrices into interpretable visual structures. This bridges descriptive statistics and exploratory analytics, setting the stage for interactive visualization and machine learning approaches in later chapters.