{{< include _chunk-timing.qmd >}}
# Data Reduction: Principal Component Analysis {#sec-pca}
This chapter provides an overview of principal component analysis as a useful technique for data reduction.
## Getting Started {#sec-pcaGettingStarted}
### Load Packages {#sec-pcaLoadPackages}
```{r}
library("psych")
library("nFactors")
library("tidyverse")
```
### Load Data {#sec-pcaLoadData}
```{r}
#| eval: false
#| include: false
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_nextGenStats_weekly.RData", fsep = ""))
```
```{r}
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
```
We created the `player_stats_weekly.RData` object in @sec-calculatePlayerAge.
### Prepare Data {#sec-pcaPrepareData}
#### Merge Data {#sec-pcaPrepareDataMerge}
```{r}
# Join weekly player stats with Next Gen Stats; drop columns that would be duplicated
dataMerged <- full_join(
  player_stats_weekly,
  nfl_nextGenStats_weekly %>%
    select(-any_of(c("player_display_name","completions","attempts","receptions","targets"))),
  by = c("player_id" = "player_gsis_id","season","season_type","week")
)
```
#### Specify Variables {#sec-pcaPrepareDataSpecifyVars}
```{r}
pcaVars <- c(
"completions","attempts","passing_yards","passing_tds","passing_interceptions",
"sacks_suffered","sack_yards_lost","sack_fumbles","sack_fumbles_lost",
"passing_air_yards","passing_yards_after_catch","passing_first_downs",
"passing_epa","passing_cpoe","passing_2pt_conversions","pacr","pass_40_yds",
"pass_inc","pass_comp_pct","fumbles","two_pts",
"avg_time_to_throw","avg_completed_air_yards","avg_intended_air_yards",
"avg_air_yards_differential","aggressiveness","max_completed_air_distance",
"avg_air_yards_to_sticks","passer_rating", #,"completion_percentage"
"expected_completion_percentage","completion_percentage_above_expectation",
"avg_air_distance","max_air_distance")
```
#### Standardize Variables {#sec-pcaPrepareDataStandardize}
```{r}
dataForPCA <- dataMerged
dataForPCA[pcaVars] <- scale(dataForPCA[pcaVars])
```
## Overview of Principal Component Analysis {#sec-pcaOverview}
Principal component analysis (PCA) is used if you want to reduce your data matrix.\index{principal component analysis}
PCA composites represent the variances of an observed measure in as economical a fashion as possible, with no latent underlying variables.\index{principal component analysis}
The goal of PCA is to identify a smaller number of components that explain as much variance in a set of variables as possible.\index{principal component analysis}\index{data!reduction}
It is an atheoretical way to decompose a matrix.\index{principal component analysis}
PCA involves decomposition of a data matrix into a set of eigenvectors, which are transformations of the old variables.\index{principal component analysis}\index{eigenvector}
The eigenvectors attempt to simplify the data in the matrix.\index{parsimony}\index{principal component analysis}\index{eigenvector}
PCA takes the data matrix and identifies the weighted sum of all variables that does the best job at explaining variance: these are the principal components, also called eigenvectors.\index{principal component analysis}\index{eigenvector}
Principal components reflect optimally weighted sums.\index{principal component analysis}
PCA can decompose the data matrix into as many components as there are variables; together, the full set of components always accounts for all of the variance.\index{principal component analysis}
After the PCA is performed, you can look at the results and discard the components which likely reflect error variance.\index{principal component analysis}
Judgments about which components to retain are based on empirical criteria in conjunction with theory to select a parsimonious number of components that account for the majority of variance.\index{principal component analysis}
The eigenvalue reflects the amount of variance explained by the component (eigenvector).\index{principal component analysis}\index{eigenvalue}
When using a varimax (orthogonal) rotation, an eigenvalue for a component is calculated as the sum of squared standardized component loadings on that component.\index{principal component analysis}\index{orthogonal rotation}\index{eigenvalue}
When using oblique rotation, however, the variables explain more variance than is attributable to their component loadings because the components are correlated.\index{principal component analysis}\index{oblique rotation}
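The link between loadings and eigenvalues can be checked with a minimal base R sketch.
The simulated data and the object names (e.g., `simData`, `unrotatedLoadings`) are assumptions for illustration and are not part of the quarterback data; `stats::varimax()` is used here as a stand-in for an orthogonal rotation.

```{r}
# Illustrative sketch with simulated data (not the quarterback data)
set.seed(52242)
simData <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)
simData[, 2] <- simData[, 1] + rnorm(200, sd = 0.5) # correlate variables 1 and 2
simData[, 4] <- simData[, 3] + rnorm(200, sd = 0.5) # correlate variables 3 and 4

simEigen <- eigen(cor(simData))

# Unrotated loadings: each eigenvector scaled by the square root of its eigenvalue
unrotatedLoadings <- simEigen$vectors %*% diag(sqrt(simEigen$values))

# The sum of squared loadings on each component equals that component's eigenvalue
colSums(unrotatedLoadings^2)
simEigen$values

# Varimax (orthogonal) rotation of the first two components
rotatedLoadings <- unclass(stats::varimax(unrotatedLoadings[, 1:2])$loadings)

# Rotation redistributes variance across the retained components,
# but the total variance they explain stays the same
sum(rotatedLoadings^2)
sum(simEigen$values[1:2])
```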
PCA pulls the first principal component out (i.e., the eigenvector that explains the most variance) and makes a new data matrix (i.e., a new correlation matrix) from what remains.\index{principal component analysis}\index{eigenvector}
Then the PCA pulls out the component that explains the next most variance (i.e., the eigenvector with the next largest eigenvalue), and it repeats this process until the number of components equals the number of variables.\index{principal component analysis}\index{eigenvector}\index{eigenvalue}
For instance, if there are six variables, it will iteratively extract an additional component up to six components.\index{principal component analysis}
You can extract as many eigenvectors as there are variables.\index{principal component analysis}\index{eigenvector}
If you extract all six components, the data matrix left over will be the same as the correlation matrix in @fig-correlationMatrix2.\index{principal component analysis}
That is, the remaining variables (as part of the leftover data matrix) will be entirely uncorrelated with each other, because six components explain 100% of the variance from six variables.\index{principal component analysis}
In other words, you can explain (6) variables with (6) new things!\index{principal component analysis}
::: {#fig-correlationMatrix2}
{fig-alt="Example Correlation Matrix 2. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."}
Example Correlation Matrix 2. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
However, it does no good if you have to use all (6) components because there is no data reduction from the original number of variables.\index{principal component analysis}\index{data!reduction}
When the goal is data reduction (as in PCA), the hope is that the first few components will explain most of the variance, so we can explain the variability in the data with fewer components than there are variables.\index{principal component analysis}\index{data!reduction}
The sum of all eigenvalues is equal to the number of variables in the analysis.\index{principal component analysis}\index{eigenvalue}
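As a minimal sketch of these properties, the following chunk decomposes the correlation matrix of six simulated variables.
The data and object names (`sixVars`, `sixVarsEigen`, etc.) are assumptions for illustration only.

```{r}
# Illustrative sketch with six simulated variables (not the quarterback data)
set.seed(52242)
sharedSignal <- rnorm(300)
sixVars <- sapply(1:6, function(i) sharedSignal + rnorm(300, sd = 1))
sixVarsCor <- cor(sixVars)

sixVarsEigen <- eigen(sixVarsCor)

# For a correlation matrix, the eigenvalues sum to the number of variables (6)
sum(sixVarsEigen$values)

# Cumulative proportion of variance explained by the first 1, 2, ..., 6 components
propVariance <- sixVarsEigen$values / sum(sixVarsEigen$values)
cumsum(propVariance)

# Using all six components reproduces the correlation matrix (100% of the variance),
# but no data reduction has occurred
reproducedCor <- sixVarsEigen$vectors %*%
  diag(sixVarsEigen$values) %*%
  t(sixVarsEigen$vectors)
max(abs(sixVarsCor - reproducedCor))
```

The cumulative proportions show how quickly the first few components approach the total variance, which is the quantity of interest when the goal is data reduction.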
PCA does not have the same assumptions as [factor analysis](#sec-factorAnalysis), which assumes that measures are partly from common variance and error.\index{principal component analysis}\index{factor analysis}\index{measurement error}
But if you estimate (6) eigenvectors and only keep (2), the model is a two-component model and whatever is left becomes error.\index{principal component analysis}\index{measurement error}\index{eigenvector}
Therefore, PCA does not have the same assumptions as [factor analysis](#sec-factorAnalysis), but it often ends up in the same place.\index{principal component analysis}\index{factor analysis}
## Decisions in Principal Component Analysis {#sec-pcaDecisions}
There are four primary decisions to make in PCA:
1. what variables to include in the model and how to scale them
1. whether and how to rotate components
1. how many components to retain
1. how to interpret and use the components
As in [factor analysis](#sec-factorAnalysis), the answer you get can differ greatly depending on the decisions you make.
We provide guidance on each of these decisions below and in @sec-factorAnalysisDecisions.
### 1. Variables to Include and their Scaling {#sec-variablesToInclude-pca}
As in [factor analysis](#sec-factorAnalysis), the first decision when conducting a PCA is which variables to include and the scaling of those variables.
Guidance on which variables to include is in @sec-variablesToInclude-factorAnalysis.
In addition, before performing a PCA, it is important to ensure that the variables included in the PCA are on the same scale.
PCA seeks to identify components that explain variance in the data, so if the variables are not on the same scale, some variables may contribute considerably more variance than others.
A common way of ensuring that variables are on the same scale is to standardize them using, for example, [*z* scores](#sec-zScores).
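To illustrate why scaling matters, here is a minimal sketch using simulated data.
The variable names (`yards`, `touchdowns`) and their distributions are hypothetical, and `stats::prcomp()` is used only to keep the example self-contained.

```{r}
# Illustrative sketch with simulated data; variable names are hypothetical
set.seed(52242)
toyStats <- data.frame(
  yards = rnorm(100, mean = 250, sd = 75),   # large-variance variable
  touchdowns = rnorm(100, mean = 2, sd = 1)  # small-variance variable
)

# z scores: subtract the mean and divide by the standard deviation
toyStatsZ <- as.data.frame(scale(toyStats))
round(colMeans(toyStatsZ), 10) # means are 0
apply(toyStatsZ, 2, sd)        # standard deviations are 1

# Without standardizing, the first component is driven almost entirely by yards
pcaRaw <- prcomp(toyStats, center = TRUE, scale. = FALSE)
pcaRaw$rotation[, "PC1"]

# After standardizing, both variables carry weights of equal magnitude
pcaZ <- prcomp(toyStatsZ, center = TRUE, scale. = FALSE)
pcaZ$rotation[, "PC1"]
```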
### 2. Component Rotation {#sec-pcaRotation}
Similar considerations as in [factor analysis](#sec-factorAnalysis) can be used to determine whether and how to rotate components in PCA.
The considerations for determining whether and how to rotate factors in [factor analysis](#sec-factorAnalysis) are described in @sec-factorAnalysisRotation.
### 3. Determining the Number of Components to Retain {#sec-pcaNumComponents}
Similar criteria as in [factor analysis](#sec-factorAnalysis) can be used to determine the number of components to retain in PCA.
The criteria for determining the number of factors to retain in [factor analysis](#sec-factorAnalysis) are described in @sec-factorAnalysisNumFactors.
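As a rough sketch of two such criteria, the chunk below applies the eigenvalue-greater-than-one rule and a simplified parallel analysis to simulated data.
The data, the object names, and the choice to compare against the mean eigenvalues of 100 random data sets are assumptions for illustration; this is not a substitute for `psych::fa.parallel()`.

```{r}
# Illustrative sketch with simulated data: two correlated blocks of three variables
set.seed(52242)
nObs <- 500
blockA <- rnorm(nObs)
blockB <- rnorm(nObs)
simVars <- cbind(
  sapply(1:3, function(i) blockA + rnorm(nObs, sd = 0.7)),
  sapply(1:3, function(i) blockB + rnorm(nObs, sd = 0.7))
)

observedEigenvalues <- eigen(cor(simVars))$values

# Eigenvalue-greater-than-one rule: count components with an eigenvalue above 1
numKaiser <- sum(observedEigenvalues > 1)
numKaiser

# Simplified parallel analysis: compare the observed eigenvalues to the mean
# eigenvalues from random, uncorrelated data with the same dimensions
randomEigenvalues <- replicate(100, {
  randomData <- matrix(rnorm(nObs * ncol(simVars)), nrow = nObs)
  eigen(cor(randomData))$values
})
meanRandomEigenvalues <- rowMeans(randomEigenvalues)

# Retain the components that precede the first one failing to exceed its
# random counterpart
exceedsRandom <- observedEigenvalues > meanRandomEigenvalues
numParallel <- which(!exceedsRandom)[1] - 1
numParallel
```

Because the simulated variables were built from two independent sources, both criteria should point to the same small number of components here; with real data, the criteria often disagree.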
### 4. Interpreting and Using PCA Components {#sec-interpretingAndUsingComponentsPCA}
The next step is interpreting the PCA components.\index{principal component analysis}
Use theory to interpret and label the components.\index{principal component analysis}
## PCA Versus Factor Analysis {#sec-pcaVsFactorAnalysis}
Both [factor analysis](#sec-factorAnalysis) and PCA can be used for data reduction.
The key distinction between [factor analysis](#sec-factorAnalysis) and PCA is depicted in @fig-pcaVsFactorAnalysis.\index{factor analysis}\index{principal component analysis}
::: {#fig-pcaVsFactorAnalysis}
{fig-alt="Distinction Between Factor Analysis and Principal Component Analysis. From @PetersenPrinciplesPsychAssessment."}
Distinction Between Factor Analysis and Principal Component Analysis. From @PetersenPrinciplesPsychAssessment.
:::
There are several differences between [factor analysis](#sec-factorAnalysis) and PCA.\index{factor analysis}\index{principal component analysis}
[Factor analysis](#sec-factorAnalysis) has greater sophistication than PCA, but greater sophistication often comes with more assumptions.\index{factor analysis}\index{principal component analysis}
[Factor analysis](#sec-factorAnalysis) does not always work; the data may not always fit to a [factor analysis](#sec-factorAnalysis) model.\index{factor analysis}\index{principal component analysis}
However, PCA can decompose any data matrix; it always works.\index{principal component analysis}
PCA is okay if you are not interested in the factor structure.\index{principal component analysis}
PCA uses all variance of variables and assumes variables have no error, so it does not account for measurement error.\index{principal component analysis}\index{measurement error}
PCA is good if you just want to form a linear composite and perform data reduction.\index{principal component analysis}\index{linear composite}\index{construct!formative}\index{construct!reflective}
However, if you are interested in the factor structure, use [factor analysis](#sec-factorAnalysis), which estimates a latent variable that accounts for the common variance and discards error variance.\index{factor analysis}
[Factor analysis](#sec-factorAnalysis) better handles error than PCA—[factor analysis](#sec-factorAnalysis) assumes that what is in the variable is the combination of common construct variance and error.\index{principal component analysis}\index{factor analysis}
By contrast, PCA uses the total variance (not just the common variance) and assumes that the measures have no measurement error.\index{principal component analysis}\index{measurement error}
[Factor analysis](#sec-factorAnalysis) is useful for the identification of latent constructs—i.e., underlying dimensions or factors that explain (cause) observed scores.\index{factor analysis}
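The difference in how the two approaches treat variance can be sketched with simulated data generated from one latent factor plus error.
The data, object names, and the use of `stats::factanal()` (maximum likelihood factor analysis) are assumptions for illustration.

```{r}
# Illustrative sketch with simulated data: one latent factor plus measurement error
set.seed(52242)
latentFactor <- rnorm(500)
faVars <- sapply(1:6, function(i) 0.7 * latentFactor + rnorm(500, sd = 0.7))
colnames(faVars) <- paste0("x", 1:6)

# PCA: first component extracted from the full correlation matrix (total variance)
faVarsEigen <- eigen(cor(faVars))
pcaLoadings <- abs(faVarsEigen$vectors[, 1] * sqrt(faVarsEigen$values[1]))

# Factor analysis: one factor estimated from the common variance only;
# each variable also receives a uniqueness (variance not shared with the factor)
faFit <- stats::factanal(faVars, factors = 1)
faLoadings <- abs(unclass(faFit$loadings)[, 1])

# Compare the loadings, and inspect the uniquenesses estimated by factor analysis
round(cbind(pcaLoadings, faLoadings, uniqueness = faFit$uniquenesses), 2)
```

In this simulation, the PCA loadings come out somewhat larger than the factor loadings because the component absorbs error variance that factor analysis sets aside as uniqueness.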
## Example of Principal Component Analysis {#sec-pcaExample}
We generated the scree plot in @fig-pcaScreePlot1 using the `psych::fa.parallel()` function of the `psych` package [@R-psych].
The number of components to keep would depend on which criteria one uses.
Based on the rule to keep components whose eigenvalues are greater than one and based on the parallel test, we would keep nine components.
However, based on the Cattell scree test (the "elbow" of the scree plot minus one) [@Cattell1966], we would keep three components.
If using the optimal coordinates, we would keep four components; if using the acceleration factor, we would keep one component.
Therefore, interpretability of the components would be important for deciding how many components to keep.
```{r}
#| label: fig-pcaScreePlot1
#| fig-cap: "Scree Plot: With Comparisons to Simulated and Resampled Data."
#| fig-alt: "Scree Plot: With Comparisons to Simulated and Resampled Data."
psych::fa.parallel(
x = dataForPCA[pcaVars],
fa = "pc"
)
```
We generated the scree plot in @fig-pcaScreePlot2 using the `nFactors::nScree()` and `nFactors::plotnScree()` functions of the `nFactors` package [@R-nFactors].
```{r}
#| label: fig-pcaScreePlot2
#| fig-cap: "Scree Plot with Parallel Analysis."
#| fig-alt: "Scree Plot with Parallel Analysis."
screeDataPCA <- nFactors::nScree(
x = cor(
dataForPCA[pcaVars],
use = "pairwise.complete.obs"),
model = "components")
nFactors::plotnScree(screeDataPCA)
```
We generated the very simple structure (VSS) plots in Figures [-@fig-pcaVSSPlot1] and [-@fig-pcaVSSPlot2] using the `psych::vss()` and `psych::nfactors()` functions of the `psych` package [@R-psych].
The optimal number of components based on the VSS criterion is three components.
```{r}
#| label: fig-pcaVSSPlot1
#| fig-cap: "Very Simple Structure Plot."
#| fig-alt: "Very Simple Structure Plot."
psych::vss(
dataForPCA[pcaVars],
rotate = "oblimin",
fm = "pc")
```
```{r}
#| label: fig-pcaVSSPlot2
#| fig-cap: "Model Indices by Number of Components."
#| fig-alt: "Model Indices by Number of Components."
psych::nfactors(
dataForPCA[pcaVars],
rotate = "oblimin",
fm = "pc")
```
```{r}
pca1ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 1,
rotate = "oblimin")
pca2ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 2,
rotate = "oblimin")
pca3ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 3,
rotate = "oblimin")
pca4ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 4,
rotate = "oblimin")
pca5ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 5,
rotate = "oblimin")
pca6ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 6,
rotate = "oblimin")
pca7ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 7,
rotate = "oblimin")
pca8ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 8,
rotate = "oblimin")
pca9ComponentOblique <- psych::principal(
dataForPCA[pcaVars],
nfactors = 9,
rotate = "oblimin")
```
```{r}
pca1ComponentOblique
pca2ComponentOblique
pca3ComponentOblique
pca4ComponentOblique
pca5ComponentOblique
pca6ComponentOblique
pca7ComponentOblique
pca8ComponentOblique
pca9ComponentOblique
```
Based on the component solutions, the three-component solution maps onto our three-factor solution from [factor analysis](#sec-factorAnalysis).
The three-component solution explains more than half of the variance in the variables.
The fourth component in a four-component solution is not particularly interpretable, explains relatively little variance, and seems to be related to sacks: sacks suffered, sack yards lost, sack fumbles, and sack fumbles lost.
However, two of the sack-related variables (sacks suffered and sack yards lost) load more strongly onto the first component, suggesting that these sack-related variables are better captured by another component.
Moreover, sacks might be considered a scoring methodological component rather than a particular concept of interest.
For these reasons, we prefer the three-component solution and choose to retain three components.
```{r}
dataMergedWithPCA <- cbind(dataMerged, pca3ComponentOblique$scores)
```
Here are the variables that had a standardized component loading of 0.4 or greater on each component:
```{r}
component1vars <- c(
"completions","attempts","passing_yards","passing_tds","passing_interceptions",
"sacks_suffered","sack_yards_lost","sack_fumbles","sack_fumbles_lost",
"passing_air_yards","passing_yards_after_catch","passing_first_downs",
#"passing_epa","passing_cpoe","passing_2pt_conversions","pacr","pass_40_yds",
"pass_inc","fumbles")#,"two_pts","pass_comp_pct",
#"avg_time_to_throw","avg_completed_air_yards","avg_intended_air_yards",
#"avg_air_yards_differential","aggressiveness","max_completed_air_distance",
#"avg_air_yards_to_sticks","passer_rating", #,"completion_percentage"
#"expected_completion_percentage","completion_percentage_above_expectation",
#"avg_air_distance","max_air_distance")
component2vars <- c(
#"completions","attempts","passing_yards","passing_tds","passing_interceptions",
#"sacks_suffered","sack_yards_lost","sack_fumbles","sack_fumbles_lost",
"passing_air_yards",#"passing_yards_after_catch","passing_first_downs",
"pacr",#"passing_epa","passing_cpoe","passing_2pt_conversions","pass_40_yds",
#"pass_inc","pass_comp_pct","fumbles","two_pts",
"avg_completed_air_yards","avg_intended_air_yards",#"avg_time_to_throw",
"aggressiveness","max_completed_air_distance",#"avg_air_yards_differential",
"avg_air_yards_to_sticks",#"passer_rating", #,"completion_percentage"
"expected_completion_percentage",#"completion_percentage_above_expectation",
"avg_air_distance","max_air_distance")
component3vars <- c(
"passing_tds",#"completions","attempts","passing_yards","passing_interceptions",
#"sacks_suffered","sack_yards_lost","sack_fumbles","sack_fumbles_lost",
#"passing_air_yards","passing_yards_after_catch","passing_first_downs",
"passing_epa","passing_cpoe","pass_40_yds",#"passing_2pt_conversions","pacr",
"pass_comp_pct",#"fumbles","two_pts","pass_inc",
"avg_air_yards_differential","max_completed_air_distance",
"avg_time_to_throw","avg_completed_air_yards","avg_intended_air_yards",
#"aggressiveness",
"passer_rating",#"avg_air_yards_to_sticks", #,"completion_percentage"
"completion_percentage_above_expectation")#,"expected_completion_percentage",
#"avg_air_distance","max_air_distance")
```
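Rather than typing such lists by hand, the variables that meet a loading threshold could be pulled from a loading matrix programmatically.
The sketch below uses a small hypothetical loading matrix (`toyLoadings`), not the actual results, and assumes that the threshold applies to the absolute value of the loading so that "reversed" variables are kept.

```{r}
# Illustrative sketch: a hypothetical loading matrix (not the actual results)
toyLoadings <- matrix(
  c( 0.85,  0.10,
     0.72, -0.05,
    -0.45,  0.30,
     0.20,  0.66,
     0.05, -0.58),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(
    c("var1", "var2", "var3", "var4", "var5"),
    c("TC1", "TC2")
  )
)

# For each component, list the variables whose absolute loading is at least 0.4;
# the absolute value keeps "reversed" (negatively loading) variables
salientVars <- lapply(
  colnames(toyLoadings),
  function(component) {
    rownames(toyLoadings)[abs(toyLoadings[, component]) >= 0.4]
  }
)
names(salientVars) <- colnames(toyLoadings)
salientVars
```

With a fitted `psych::principal()` object, the loading matrix could be obtained with, for example, `unclass(pca3ComponentOblique$loadings)`.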
The variables that loaded most strongly onto component 1 appear to reflect Quarterback usage: completions, incompletions, passing attempts, passing yards, passing touchdowns, interceptions thrown, fumbles, times sacked, sack yards lost ("reversed"—i.e., negatively associated with the component), sack fumbles, sack fumbles lost, passing air yards (total horizontal distance the ball travels on all pass attempts), passing yards after the catch, and first downs gained by passing.
Quarterbacks who tend to throw more tend to have higher levels on those variables.
Thus, we label component 1 as "Usage", which reflects total Quarterback involvement, regardless of efficiency or outcome.
The variables that loaded most strongly onto component 2 appear to reflect Quarterback aggressiveness: passing air yards, passing air conversion ratio (reversed; ratio of passing yards to passing air yards), average air yards on completed passes, average air yards on all attempted passes, aggressiveness (percentage of passing attempts thrown into tight windows, where there is a defender within one yard or less of the receiver at the time of the completion or incompletion), expected completion percentage (reversed; based on air distance, receiver separation, Quarterback/Wide Receiver movement, pass location, whether there was pressure on the Quarterback when throwing, the throw angle and trajectory, receiver and defender positioning at the catch point, and defensive coverage scheme), average amount of air yards ahead of or behind the first down marker on passing attempts, average air distance (the true three-dimensional distance the ball travels in the air), maximum air distance, and maximum air distance on completed passes.
Quarterbacks who throw the ball farther and into tighter windows tend to have higher values on those variables.
Thus, we label component 2 as "Aggressiveness", which reflects throwing longer, more difficult passes with a tight window.
The variables that loaded most strongly onto component 3 appear to reflect Quarterback performance: passing touchdowns, passing expected points added, passing completion percentage above expectation, passes completed of 40 yards or more, pass completion percentage, air yards differential (intended air yards $-$ completed air yards; attempting deeper passes than he on average completes), maximum completed air distance, average time to throw, average completed air yards, average intended air yards, and passer rating.
Quarterbacks who perform better tend to have higher values on those variables.
Thus, we label component 3 as "Performance".
Below are component scores from the PCA for the first six player-weeks with non-missing scores:
```{r}
pca3ComponentOblique$scores %>%
na.omit() %>%
head()
```
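Component scores are weighted sums of the standardized variables.
The following minimal sketch shows this with `stats::prcomp()` on simulated data; the data and object names are assumptions for illustration, and the scores above come from `psych::principal()` with an oblique rotation, which uses different weights.

```{r}
# Illustrative sketch with simulated data (not the quarterback data)
set.seed(52242)
scoreData <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)
scoreData[, 2] <- scoreData[, 1] + rnorm(50, sd = 0.5) # correlate variables 1 and 2

scorePCA <- prcomp(scoreData, center = TRUE, scale. = TRUE)

# Weighted sum: standardized variables multiplied by the component weights
manualScores <- scale(scoreData) %*% scorePCA$rotation

# The manual weighted sums match the scores returned by prcomp()
all.equal(manualScores, scorePCA$x, check.attributes = FALSE)

# Scores on different unrotated components are uncorrelated
round(cor(scorePCA$x), 10)
```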
Here are the players and weeks that showed the highest levels of Quarterback "Usage":
```{r}
dataMergedWithPCA %>%
arrange(-TC1) %>%
select(player_display_name, season, week, TC1, all_of(component1vars)) %>%
na.omit() %>%
head()
```
Here are the players and weeks that showed the lowest levels of Quarterback "Usage":
```{r}
dataMergedWithPCA %>%
arrange(TC1) %>%
select(player_display_name, season, week, TC1, all_of(component1vars)) %>%
na.omit() %>%
head()
```
Here are the players and weeks that showed the highest levels of Quarterback "Aggressiveness":
```{r}
dataMergedWithPCA %>%
arrange(-TC2) %>%
select(player_display_name, season, week, TC2, all_of(component2vars)) %>%
na.omit() %>%
head()
```
Here are the players and weeks that showed the lowest levels of Quarterback "Aggressiveness":
```{r}
dataMergedWithPCA %>%
arrange(TC2) %>%
select(player_display_name, season, week, TC2, all_of(component2vars)) %>%
na.omit() %>%
head()
```
Here are the players and weeks that showed the highest levels of Quarterback "Performance":
```{r}
dataMergedWithPCA %>%
arrange(-TC3) %>%
select(player_display_name, season, week, TC3, all_of(component3vars)) %>%
na.omit() %>%
head()
```
Here are the players and weeks that showed the lowest levels of Quarterback "Performance":
```{r}
dataMergedWithPCA %>%
arrange(TC3) %>%
select(player_display_name, season, week, TC3, all_of(component3vars)) %>%
na.omit() %>%
head()
```
## Conclusion {#sec-pcaConclusion}
Principal component analysis (PCA) is a technique used for data reduction: reducing a large set of variables down to a smaller set of components that capture most of the variance in the larger set.
There are many [decisions to make in PCA](#sec-pcaDecisions).
These decisions can have important impacts on the resulting solution.
Thus, it can be helpful for theory and interpretability to guide decision-making when conducting PCA.
There are several differences between [factor analysis](#sec-factorAnalysis) and PCA.
Unlike [factor analysis](#sec-factorAnalysis), which estimates the latent factors as the common variance among the variables that load onto that factor and discards the remaining variance as "error", PCA uses all variance of variables and assumes variables have no error.
Thus, PCA does not account for measurement error.
Using PCA, we were able to identify three PCA components that accounted for considerable variance in the variables we examined, pertaining to Quarterbacks: 1) usage; 2) aggressiveness; 3) performance.
We were then able to determine which players were highest and lowest on each of these components.
::: {.content-visible when-format="html"}
## Session Info {#sec-pcaSessionInfo}
```{r}
sessionInfo()
```
:::