{{< include _chunk-timing.qmd >}}
# Evaluation of Prediction/Forecasting Accuracy {#sec-predictionAccuracy}
::: {.content-visible when-format="html:js"}
> "*Nothing ruins fantasy more than reality.*" – [Renee Miller, Ph.D.](#fig-fantasy-vs-reality) [@Yahoo2024]
::: {#fig-fantasy-vs-reality}
{{< video https://www.youtube.com/watch?v=gmpLFWs5ae0 start="1430" >}}
:::
:::
::: {.content-visible unless-format="html:js"}
> "*Nothing ruins fantasy more than reality.*" – [Renee Miller, Ph.D.](https://www.youtube.com/watch?v=gmpLFWs5ae0&t=1430) [@Yahoo2024]
:::
This chapter provides an overview of ways to evaluate the accuracy of predictions.
In addition, we evaluate the accuracy of fantasy football projections.
## Getting Started {#sec-predictionAccuracyGettingStarted}
### Load Packages {#sec-predictionAccuracyLoadPackages}
```{r}
library("petersenlab")
library("pROC")
library("magrittr")
library("viridis")
library("viridisLite")
library("msir")
library("tidymodels")
library("tidyverse")
```
### Load Data {#sec-predictionAccuracyLoadData}
```{r}
#| eval: false
#| include: false
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_seasonal.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/players_projections_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/players_projections_seasonal.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_playerIDs.RData", fsep = ""))
```
```{r}
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
load(file = "./data/players_projections_weekly.RData")
load(file = "./data/players_projections_seasonal.RData")
load(file = "./data/nfl_playerIDs.RData")
```
We created the `player_stats_weekly.RData` and `player_stats_seasonal.RData` objects in @sec-calculatePlayerAge.
The `players_projections_weekly` and `players_projections_seasonal` objects were derived from projected points objects that were created in @sec-fantasyFootballProjections.
### Specify Options {#sec-predictionAccuracySpecifyOptions}
```{r}
options(scipen = 999) # prevent scientific notation
```
### Prepare Data {#sec-predictionAccuracyPrepareData}
#### Seasonal Projections {#sec-predictionAccuracyPrepareDataSeasonal}
To evaluate the accuracy of projections, we must first merge projections with actual performance.
Below, we merge seasonal projections with actual performance.
```{r}
player_stats_seasonal_subset <- player_stats_seasonal %>%
filter(!is.na(player_id))
nfl_playerIDs_subset <- nfl_playerIDs %>%
filter(!is.na(gsis_id)) %>%
distinct(gsis_id, .keep_all = TRUE) %>% # keep only rows that do not have duplicate values of gsis_id
select(-all_of(c("team", "position", "height", "weight", "age")))
players_projectedPoints_seasonal_combined$season <- as.integer(players_projectedPoints_seasonal_combined$season)
players_projections_seasonal_average_merged$season <- as.integer(players_projections_seasonal_average_merged$season)
player_stats_seasonal_subset_IDs <- left_join(
player_stats_seasonal_subset,
nfl_playerIDs_subset,
by = c("player_id" = "gsis_id")
) %>%
filter(!is.na(mfl_id))
projectionsWithActuals_seasonal <- full_join(
player_stats_seasonal_subset_IDs,
players_projectedPoints_seasonal_combined,
by = c("mfl_id" = "id", "season"),
suffix = c("", "_proj"),
)
crowdAveragedProjectionsWithActuals_seasonal <- full_join(
player_stats_seasonal_subset_IDs,
players_projections_seasonal_average_merged %>% filter(avg_type == "average"),
by = c("mfl_id" = "id", "season"),
suffix = c("", "_proj"),
)
projectionsWithActuals_seasonal <- projectionsWithActuals_seasonal %>%
unite(
"player_id_season",
player_id,
season,
remove = FALSE
)
crowdAveragedProjectionsWithActuals_seasonal <- crowdAveragedProjectionsWithActuals_seasonal %>%
unite(
"player_id_season",
player_id,
season,
remove = FALSE
)
```
Players in the `projectionsWithActuals_seasonal` object are (supposed to be) uniquely identified by `player_id`-`season`-`data_src`:
```{r}
projectionsWithActuals_seasonal %>%
filter(!is.na(player_id)) %>%
group_by(player_id, season, data_src) %>%
filter(n() > 1) %>%
head()
```
Players in the `crowdAveragedProjectionsWithActuals_seasonal` object are (supposed to be) uniquely identified by `player_id`-`season`-`avg_type`:
```{r}
crowdAveragedProjectionsWithActuals_seasonal %>%
filter(!is.na(player_id)) %>%
group_by(player_id, season, avg_type) %>%
filter(n() > 1) %>%
head()
```
We save the data object for use in other chapters:
```{r}
save(
projectionsWithActuals_seasonal, crowdAveragedProjectionsWithActuals_seasonal,
file = "./data/projectionsWithActuals_seasonal.RData"
)
```
```{r}
playersWithHighProjectedOrActualPoints <- projectionsWithActuals_seasonal %>%
filter(raw_points > 100 | fantasyPoints > 100) %>%
select(player_id_season) %>%
pull()
projectionsWithActuals_seasonal_qb <- projectionsWithActuals_seasonal %>%
filter(position_group == "QB")
projectionsWithActuals_seasonal_rb <- projectionsWithActuals_seasonal %>%
filter(position_group == "RB")
projectionsWithActuals_seasonal_wr <- projectionsWithActuals_seasonal %>%
filter(position_group == "WR")
projectionsWithActuals_seasonal_te <- projectionsWithActuals_seasonal %>%
filter(position_group == "TE")
projectionsWithActuals_seasonal_k <- projectionsWithActuals_seasonal %>%
filter(position == "K")
projectionsWithActuals_seasonal_dl <- projectionsWithActuals_seasonal %>%
filter(position_group == "DL")
projectionsWithActuals_seasonal_lb <- projectionsWithActuals_seasonal %>%
filter(position_group == "LB")
projectionsWithActuals_seasonal_db <- projectionsWithActuals_seasonal %>%
filter(position_group == "DB")
crowdAveragedProjectionsWithActuals_seasonal_qb <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "QB")
crowdAveragedProjectionsWithActuals_seasonal_rb <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "RB")
crowdAveragedProjectionsWithActuals_seasonal_wr <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "WR")
crowdAveragedProjectionsWithActuals_seasonal_te <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "TE")
crowdAveragedProjectionsWithActuals_seasonal_k <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position == "K")
crowdAveragedProjectionsWithActuals_seasonal_dl <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "DL")
crowdAveragedProjectionsWithActuals_seasonal_lb <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "LB")
crowdAveragedProjectionsWithActuals_seasonal_db <- crowdAveragedProjectionsWithActuals_seasonal %>%
filter(position_group == "DB")
```
#### Weekly Projections {#sec-predictionAccuracyPrepareDataWeekly}
Below, we merge weekly projections with actual performance.
```{r}
player_stats_weekly_subset <- player_stats_weekly %>%
filter(!is.na(player_id))
nfl_playerIDs_subset <- nfl_playerIDs %>%
filter(!is.na(gsis_id)) %>%
distinct(gsis_id, .keep_all = TRUE) %>% # keep only rows that do not have duplicate values of gsis_id
select(-all_of(c("team", "position", "height", "weight", "age")))
players_projectedPoints_weekly_combined$season <- as.integer(players_projectedPoints_weekly_combined$season)
players_projections_weekly_average_merged$season <- as.integer(players_projections_weekly_average_merged$season)
player_stats_weekly_subset_IDs <- left_join(
player_stats_weekly_subset,
nfl_playerIDs_subset,
by = c("player_id" = "gsis_id")
) %>%
filter(!is.na(mfl_id))
projectionsWithActuals_weekly <- full_join(
player_stats_weekly_subset_IDs,
players_projectedPoints_weekly_combined,
by = c("mfl_id" = "id", "season", "week"),
suffix = c("", "_proj"),
)
crowdAveragedProjectionsWithActuals_weekly <- full_join(
player_stats_weekly_subset_IDs,
players_projections_weekly_average_merged,
by = c("mfl_id" = "id", "season", "week"),
suffix = c("", "_proj"),
)
projectionsWithActuals_weekly <- projectionsWithActuals_weekly %>%
unite(
"player_id_season_week",
player_id,
season,
week,
remove = FALSE
)
crowdAveragedProjectionsWithActuals_weekly <- crowdAveragedProjectionsWithActuals_weekly %>%
unite(
"player_id_season_week",
player_id,
season,
week,
remove = FALSE
)
```
Players in the `projectionsWithActuals_weekly` object are (supposed to be) uniquely identified by `player_id`-`season`-`week`-`data_src`:
```{r}
projectionsWithActuals_weekly %>%
filter(!is.na(player_id)) %>%
group_by(player_id, season, week, data_src) %>%
filter(n() > 1) %>%
head()
```
Players in the `crowdAveragedProjectionsWithActuals_weekly` object are (supposed to be) uniquely identified by `player_id`-`season`-`week`-`avg_type`:
```{r}
crowdAveragedProjectionsWithActuals_weekly %>%
filter(!is.na(player_id)) %>%
group_by(player_id, season, week, avg_type) %>%
filter(n() > 1) %>%
head()
```
We save the data object for use in other chapters:
```{r}
save(
projectionsWithActuals_weekly, crowdAveragedProjectionsWithActuals_weekly,
file = "./data/projectionsWithActuals_weekly.RData"
)
```
```{r}
playersWithHighProjectedOrActualPoints_weekly <- projectionsWithActuals_weekly %>%
filter(raw_points > 6 | fantasyPoints > 6) %>%
select(player_id_season_week) %>%
pull()
projectionsWithActuals_weekly_qb <- projectionsWithActuals_weekly %>%
filter(position_group == "QB")
projectionsWithActuals_weekly_rb <- projectionsWithActuals_weekly %>%
filter(position_group == "RB")
projectionsWithActuals_weekly_wr <- projectionsWithActuals_weekly %>%
filter(position_group == "WR")
projectionsWithActuals_weekly_te <- projectionsWithActuals_weekly %>%
filter(position_group == "TE")
projectionsWithActuals_weekly_k <- projectionsWithActuals_weekly %>%
filter(position == "K")
projectionsWithActuals_weekly_dl <- projectionsWithActuals_weekly %>%
filter(position_group == "DL")
projectionsWithActuals_weekly_lb <- projectionsWithActuals_weekly %>%
filter(position_group == "LB")
projectionsWithActuals_weekly_db <- projectionsWithActuals_weekly %>%
filter(position_group == "DB")
crowdAveragedProjectionsWithActuals_weekly_qb <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "QB")
crowdAveragedProjectionsWithActuals_weekly_rb <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "RB")
crowdAveragedProjectionsWithActuals_weekly_wr <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "WR")
crowdAveragedProjectionsWithActuals_weekly_te <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "TE")
crowdAveragedProjectionsWithActuals_weekly_k <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position == "K")
crowdAveragedProjectionsWithActuals_weekly_dl <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "DL")
crowdAveragedProjectionsWithActuals_weekly_lb <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "LB")
crowdAveragedProjectionsWithActuals_weekly_db <- crowdAveragedProjectionsWithActuals_weekly %>%
filter(position_group == "DB")
```
## Overview {#sec-predictionAccuracyOverview}
Predictions can come in different types.
Some predictions involve categorical data, whereas other predictions involve continuous data.
When dealing with a dichotomous ([nominal data](#sec-nominal) that are binary) [predictor](#sec-correlationalStudy) and [outcome](#sec-correlationalStudy) variable (or continuous data that have been dichotomized using a cutoff), we can evaluate predictions using a 2x2 table known as a [confusion matrix](#sec-confusionMatrix) (see @fig-confusionMatrix), or with logistic regression models.
When dealing with a continuous [outcome variable](#sec-correlationalStudy) (e.g., [ordinal](#sec-ordinal), [interval](#sec-interval), or [ratio](#sec-ratio) data), we can evaluate predictions using [multiple regression](#sec-multipleRegression) or similar variants such as structural equation modeling and mixed models.
In fantasy football, we most commonly predict continuous [outcome variables](#sec-correlationalStudy) (e.g., fantasy points, rushing yards).
Nevertheless, it is also important to understand principles in the prediction of categorical [outcome variables](#sec-correlationalStudy).
In any domain, it is important to evaluate the accuracy of predictions, so we can know how (in)accurate we are, and we can strive to continually improve our predictions.
Fantasy performance—and human behavior more generally—is incredibly challenging to predict.
Indeed, many things in the world, in particular long-term trends, are unpredictable [@Kahneman2011].
In fantasy football, there is considerable luck/chance/randomness.
There are relatively few games (i.e., 17 in a season), and there is a sizeable injury risk for each player in a given game.
These and other factors limit how accurate fantasy football predictions can be.
Domains with high uncertainty and unpredictability are considered "low-validity environments" [@Kahneman2011, p. 223].
But, first, let's learn about the various ways we can evaluate the accuracy of predictions.
## Types of Accuracy {#sec-accuracyTypes}
There are two primary dimensions of accuracy: (1) [discrimination](#sec-discrimination) and (2) [calibration](#sec-calibration).
[Discrimination](#sec-discrimination) and [calibration](#sec-calibration) are distinct forms of accuracy.
Just because predictions are high in one form of accuracy does not mean that they will be high in the other form of accuracy.
As described by @Lindhiem2020, predictions can follow any of the following configurations (and anywhere in between):
- high [discrimination](#sec-discrimination), high [calibration](#sec-calibration)
- high [discrimination](#sec-discrimination), low [calibration](#sec-calibration)
- low [discrimination](#sec-discrimination), high [calibration](#sec-calibration)
- low [discrimination](#sec-discrimination), low [calibration](#sec-calibration)
Some general indices of accuracy combine discrimination and calibration, as described in @sec-generalAccuracy.
In addition, accuracy indices can be [threshold-dependent](#sec-thresholdDependentAccuracy) or [-independent](#sec-thresholdIndependentAccuracy) and can be scale-dependent or -independent.
[Threshold-dependent accuracy indices](#sec-thresholdDependentAccuracy) differ based on the cutoff (i.e., threshold), whereas [threshold-independent accuracy indices](#sec-thresholdIndependentAccuracy) do not.
Thus, raising or lowering the cutoff will change [threshold-dependent](#sec-thresholdDependentAccuracy) accuracy indices.
Scale-dependent accuracy indices depend on the metric/scale of the data, whereas scale-independent accuracy indices do not.
Thus, scale-dependent accuracy indices cannot be directly compared when using measures of differing scales, whereas scale-independent accuracy indices can be compared across data of differing scales.
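To make the distinction concrete, here is a minimal sketch (with made-up projected and actual point values) showing that a scale-dependent index such as mean absolute error changes when the outcome is rescaled, whereas a scale-independent index such as the Pearson correlation does not:

```{r}
# Hypothetical predicted and actual seasonal fantasy points for five players
predicted <- c(250, 180, 120, 90, 60)
actual <- c(230, 200, 100, 95, 40)

# Scale-dependent index: mean absolute error (expressed in fantasy points)
mae_points <- mean(abs(predicted - actual))
mae_points

# Rescale both variables (e.g., season totals -> points per game over 17 games)
mae_perGame <- mean(abs(predicted/17 - actual/17))
mae_perGame # the MAE changes with the scale of the data

# Scale-independent index: Pearson correlation
cor(predicted, actual)
cor(predicted/17, actual/17) # the correlation is unchanged by rescaling
```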
### Discrimination {#sec-discrimination}
When dealing with a categorical outcome, discrimination is the ability to separate events from non-events.
When dealing with a continuous outcome, discrimination is the strength of the association between the predictor and the outcome.
Aspects of discrimination at a particular cutoff (e.g., sensitivity, specificity, area under the ROC curve) are described in @sec-thresholdDependentAccuracy.
### Calibration {#sec-calibration}
When dealing with a categorical outcome, calibration is the degree to which a probabilistic estimate of an event reflects the true underlying probability of the event.
When dealing with a continuous outcome, calibration is the degree to which the predicted values are close in value to the outcome values.
The importance of examining calibration (in addition to discrimination) is described by @Lindhiem2020.
Calibration is relevant to all kinds of predictions, including weather forecasts.
For instance, on the days that the meteorologist says there is a 60% chance of rain, it should rain about 60% of the time.
Calibration is also important for fantasy football predictions.
When projections state that a group of players is each expected to score 200 points, their projections would be miscalibrated if those players scored only 150 points on average.
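As a minimal sketch of that idea (the player values below are hypothetical), calibration for continuous projections can be checked by comparing the mean projection to the mean outcome:

```{r}
# Hypothetical group of players, each projected to score 200 points
projected <- rep(200, 5)
actual <- c(160, 145, 155, 140, 150) # in truth, they averaged only 150 points

# A simple calibration check: the mean discrepancy between projection and outcome
mean(projected - actual) # systematic overprediction of 50 points
```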
There are four general patterns of miscalibration: overextremity, underextremity, overprediction, and underprediction (see @fig-miscalibration).
*Overextremity* exists when the predicted probabilities are too close to the extremes (zero or one).
*Underextremity* exists when the predicted probabilities are too far away from the extremes.
*Overprediction* exists when the predicted probabilities are consistently greater than the observed probabilities.
*Underprediction* exists when the predicted probabilities are consistently less than the observed probabilities.
For a more thorough description of these types of miscalibration, see @Lindhiem2020.
Indices for evaluating calibration are described in @sec-calibrationIndices.
### General Accuracy {#sec-generalAccuracy}
General accuracy indices combine estimates of [discrimination](#sec-discrimination) and [calibration](#sec-calibration).
## Prediction of Categorical Outcomes {#sec-predictionCategorical}
To evaluate the accuracy of our predictions for categorical outcome variables (e.g., binary, dichotomous, or [nominal](#sec-nominal) data), we can use either [threshold-dependent](#sec-thresholdDependentAccuracy) or [threshold-independent](#sec-thresholdIndependentAccuracy) accuracy indices.
## Prediction of Continuous Outcomes {#sec-predictionContinuous}
To evaluate the accuracy of our predictions for continuous outcome variables (e.g., [ordinal](#sec-ordinal), [interval](#sec-interval), or [ratio](#sec-ratio) data), the outcome variable does not have cutoffs, so we would use [threshold-independent accuracy indices](#sec-thresholdIndependentAccuracy).
## Threshold-Dependent Accuracy Indices {#sec-thresholdDependentAccuracy}
### Decision Outcomes {#sec-decisionOutcomes}
To consider how we can evaluate the accuracy of predictions for a categorical outcome, consider an example adapted from @Meehl1955.
The military conducts a test of its prospective members to screen out applicants who would likely fail basic training.
To evaluate the accuracy of our predictions using the test, we can examine a [confusion matrix](#sec-confusionMatrix).
A [confusion matrix](#sec-confusionMatrix) is a matrix that presents the predicted outcome on one dimension and the actual outcome (truth) on the other dimension.
If the predictions and outcomes are dichotomous, the [confusion matrix](#sec-confusionMatrix) is a 2x2 matrix with two rows and two columns that represent four possible predicted-actual combinations (decision outcomes), as in @fig-confusionMatrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
::: {#fig-confusionMatrix}
A Two-by-Two Confusion Matrix. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
When discussing the four decision outcomes, "true" means an accurate judgment, whereas "false" means an inaccurate judgment.
"Positive" means that the judgment was that the person has the characteristic of interest, whereas "negative" means that the judgment was that the person does not have the characteristic of interest.
A *true positive* is a correct judgment (or prediction) where the judgment was that the person has (or will have) the characteristic of interest, and, in truth, they actually have (or will have) the characteristic.
A *true negative* is a correct judgment (or prediction) where the judgment was that the person does not have (or will not have) the characteristic of interest, and, in truth, they actually do not have (or will not have) the characteristic.
A *false positive* is an incorrect judgment (or prediction) where the judgment was that the person has (or will have) the characteristic of interest, and, in truth, they actually do not have (or will not have) the characteristic.
A *false negative* is an incorrect judgment (or prediction) where the judgment was that the person does not have (or will not have) the characteristic of interest, and, in truth, they actually do have (or will have) the characteristic.
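In R, a confusion matrix can be tabulated from vectors of predicted and actual classifications with `table()`; the ten applicants below are hypothetical:

```{r}
# Hypothetical decisions and true adjustment status for 10 applicants
predicted <- c("reject", "reject", "retain", "retain", "retain",
               "reject", "retain", "retain", "retain", "retain")
actual    <- c("poor",   "good",   "good",   "good",   "poor",
               "poor",   "good",   "good",   "good",   "good")

# Rows: predicted decision; columns: actual adjustment
table(Predicted = predicted, Actual = actual)
```

Here, the reject/poor cell holds true positives, retain/good holds true negatives, reject/good holds false positives, and retain/poor holds false negatives.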
An example of a [confusion matrix](#sec-confusionMatrix) is in @fig-confusionMatrix1.
::: {#fig-confusionMatrix1}
Example of a Two-by-Two Confusion Matrix. TP = true positives; TN = true negatives; FP = false positives; FN = false negatives. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
With the information in the [confusion matrix](#sec-confusionMatrix), we can calculate the marginal sums and the proportion of people in each cell (in parentheses), as depicted in @fig-confusionMatrix2.
::: {#fig-confusionMatrix2}
Example of a Two-by-Two Confusion Matrix With Marginal Sums. TP = true positives; TN = true negatives; FP = false positives; FN = false negatives. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
That is, we can sum across the rows and columns to identify how many people actually showed poor adjustment ($n = 100$) versus good adjustment ($n = 1,900$), and how many people were selected to reject ($n = 508$) versus retain ($n = 1,492$).
If we sum the column of predicted marginal sums ($508 + 1,492$) or the row of actual marginal sums ($100 + 1,900$), we get the total number of people ($N = 2,000$).
Based on the marginal sums, we can compute the [marginal probabilities](#sec-baseRate), as depicted in @fig-confusionMatrix3.
::: {#fig-confusionMatrix3}
Example of a Two-by-Two Confusion Matrix With Marginal Sums and Marginal Probabilities. TP = true positives; TN = true negatives; FP = false positives; FN = false negatives; BR = base rate; SR = selection ratio. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
The [marginal probability](#sec-baseRate) of the person having the characteristic of interest (i.e., showing poor adjustment) is called the [*base rate*](#sec-baseRate) (BR).
That is, the [base rate](#sec-baseRate) is the proportion of people who have the characteristic.
It is calculated by dividing the number of people with poor adjustment ($n = 100$) by the total number of people ($N = 2,000$): $BR = \frac{FN + TP}{N}$.
Here, the [base rate](#sec-baseRate) reflects the prevalence of poor adjustment.
In this case, the [base rate](#sec-baseRate) is .05, so there is a 5% chance that an applicant will be poorly adjusted.
The [marginal probability](#sec-baseRate) of good adjustment is equal to 1 minus the [base rate](#sec-baseRate) of poor adjustment.
The [marginal probability](#sec-baseRate) of predicting that a person has the characteristic (i.e., rejecting a person) is called the *selection ratio* (SR).
The selection ratio is the proportion of people who will be selected (in this case, rejected rather than retained); i.e., the proportion of people who are identified as having the characteristic.
The selection ratio is calculated by dividing the number of people selected to reject ($n = 508$) by the total number of people ($N = 2,000$): $SR = \frac{TP + FP}{N}$.
In this case, the selection ratio is .25, so 25% of people are rejected.
The [marginal probability](#sec-baseRate) of not selecting someone to reject (i.e., the [marginal probability](#sec-baseRate) of retaining) is equal to 1 minus the selection ratio.
The selection ratio might be something that the test dictates according to its cutoff score.
Or, the selection ratio might be imposed by external factors that place limits on how many people you can assign a positive test value.
For instance, when deciding whether to treat a client, the selection ratio may depend on how many therapists are available and how many cases can be treated.
### Percent Accuracy {#sec-percentAccuracyOverview}
Based on the [confusion matrix](#sec-confusionMatrix), we can calculate the prediction accuracy based on the percent accuracy of the predictions.
The percent accuracy is the number of correct predictions divided by the total number of predictions, and multiplied by 100.
In the context of a [confusion matrix](#sec-confusionMatrix), this is calculated as: $100\% \times \frac{\text{TP} + \text{TN}}{N}$.
In this case, our percent accuracy was 78%—that is, 78% of our predictions were accurate, and 22% of our predictions were inaccurate.
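As a sketch of the calculation, the cell counts below are not taken from the figure directly; they are the counts implied algebraically by the marginals reported above ($N = 2,000$, $BR = .05$, $SR = .254$) together with the 78% accuracy:

```{r}
# Cell counts implied by the reported marginals and 78% accuracy (N = 2000)
TP <- 84
FP <- 424
FN <- 16
TN <- 1476
N <- TP + FP + FN + TN

# Percent accuracy: correct predictions divided by all predictions, times 100
percentAccuracy <- 100 * (TP + TN) / N
percentAccuracy # 78
```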
### Percent Accuracy by Chance {#sec-accuracyByChance}
78% sounds pretty accurate.
And it is much higher than 50%, so we are doing a pretty good job, right?
Well, it is important to compare our accuracy to what accuracy we would expect to get by chance alone, if predictions were made by a random process rather than using a test's scores.
Our selection ratio was 25.4%.
How accurate would we be if we randomly selected 25.4% of people to reject?
To determine what accuracy we could get by chance alone given the selection ratio and the base rate, we can calculate the chance probability of true positives and the chance probability of true negatives.
The probability of a given cell in the [confusion matrix](#sec-confusionMatrix) is a [joint probability](#sec-jointProbability)—the probability of two events occurring simultaneously.
To calculate a [joint probability](#sec-jointProbability), we multiply the probability of each event.
So, to get the chance expectancies of true positives, we would multiply the respective [marginal probabilities](#sec-baseRate), as in @eq-truePositivesByChanceExample:
$$
\begin{aligned}
P(TP) &= P(\text{Poor adjustment}) \times P(\text{Reject})\\
&= BR \times SR \\
&= .05 \times .254 \\
&= .0127
\end{aligned}
$$ {#eq-truePositivesByChanceExample}
To get the chance expectancies of true negatives, we would multiply the respective [marginal probabilities](#sec-baseRate), as in @eq-trueNegativesByChanceExample:
$$
\begin{aligned}
P(TN) &= P(\text{Good adjustment}) \times P(\text{Retain})\\
&= (1 - BR) \times (1 - SR) \\
&= .95 \times .746 \\
&= .7087
\end{aligned}
$$ {#eq-trueNegativesByChanceExample}
To get the percent accuracy by chance, we sum the chance expectancies for the correct predictions (TP and TN): $.0127 + .7087 = .7214$.
Thus, the percent accuracy you can get by chance alone is 72%.
This is because most of our predictions are to retain people, and the [base rate](#sec-baseRate) of poor adjustment is quite low (.05).
Our measure with 78% accuracy provides only a 6% increment in correct predictions.
Thus, you cannot judge how good your judgment or prediction is until you know how you would do by random chance.
The chance expectancies for each cell of the [confusion matrix](#sec-confusionMatrix) are in @fig-confusionMatrix4.
::: {#fig-confusionMatrix4}
Chance Expectancies in Two-by-Two Confusion Matrix. BR = base rate; SR = selection ratio. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
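These chance expectancies can be reproduced in a few lines of R (a minimal sketch using the base rate and selection ratio from the running example; the object names are ours):

```{r}
# Chance expectancies for each confusion-matrix cell, given the
# base rate (BR) and selection ratio (SR) from the example
BR <- .05   # base rate of poor adjustment
SR <- .254  # selection ratio (proportion rejected)

chanceTP <- BR * SR              # .0127
chanceFN <- BR * (1 - SR)        # .0373
chanceFP <- (1 - BR) * SR        # .2413
chanceTN <- (1 - BR) * (1 - SR)  # .7087

chanceTP + chanceTN # percent accuracy by chance: .7214
```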
### Predicting from the Base Rate {#sec-predictingFromBaseRate}
Now, let us consider how well you would do if you were to predict from the [base rate](#sec-baseRate).
Predicting from the [base rate](#sec-baseRate) is also called "betting from the [base rate](#sec-baseRate)", and it involves setting the selection ratio by taking advantage of the [base rate](#sec-baseRate) so that you go with the most likely outcome in every prediction.
Because the [base rate](#sec-baseRate) is quite low (.05), we could predict from the [base rate](#sec-baseRate) by selecting no one to reject (i.e., setting the selection ratio at zero).
Our percent accuracy by chance if we predict from the [base rate](#sec-baseRate) would be calculated by multiplying the [marginal probabilities](#sec-baseRate), as we did above, but with a new selection ratio, as in @eq-predictingFromBaseRateExample:
$$
\begin{aligned}
P(TP) &= P(\text{Poor adjustment}) \times P(\text{Reject})\\
&= BR \times SR \\
&= .05 \times 0 \\
&= 0 \\ \\
P(TN) &= P(\text{Good adjustment}) \times P(\text{Retain})\\
&= (1 - BR) \times (1 - SR) \\
&= .95 \times 1 \\
&= .95
\end{aligned}
$$ {#eq-predictingFromBaseRateExample}
We sum the chance expectancies for the correct predictions (TP and TN): $0 + .95 = .95$.
Thus, our percent accuracy by predicting from the [base rate](#sec-baseRate) is 95%.
This is damning for our measure: predicting from the [base rate](#sec-baseRate) yields a much higher percent accuracy (95%) than our measure's accuracy (78%).
That is, we can be much more accurate than our measure simply by predicting from the [base rate](#sec-baseRate) and selecting no one to reject.
Going with the most likely outcome in every prediction (predicting from the [base rate](#sec-baseRate)) can be highly accurate (in terms of percent accuracy) as noted by @Meehl1955, especially when the [base rate](#sec-baseRate) is very low or very high.
This should serve as an important reminder that we need to compare the accuracy of our measures to the accuracy by (1) random chance and (2) predicting from the [base rate](#sec-baseRate).
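To make this comparison concrete, percent accuracy by chance can be written as a function of the base rate and selection ratio and evaluated for both strategies (a sketch; the function name is ours):

```{r}
# Percent accuracy expected by chance for a given base rate (BR) and
# selection ratio (SR): the sum of the chance TP and TN probabilities
percentAccuracyByChance <- function(BR, SR){
  (BR * SR) + ((1 - BR) * (1 - SR))
}

percentAccuracyByChance(BR = .05, SR = .254) # random chance: .7214
percentAccuracyByChance(BR = .05, SR = 0)    # predicting from the base rate: .95
```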
There are several important implications of the impact of [base rates](#sec-baseRate) on prediction accuracy.
One implication is that using the same test in different settings with different [base rates](#sec-baseRate) will markedly change the accuracy of the test.
Oftentimes, using a test will actually *decrease* the predictive accuracy when the [base rate](#sec-baseRate) deviates greatly from .50.
But percent accuracy is not everything.
Percent accuracy treats different kinds of errors as if they are equally important.
However, the value we place on different kinds of errors may be different, as described next.
### Different Kinds of Errors Have Different Costs {#sec-differentErrorsDifferentCosts}
Some errors have a high cost, and some errors have a low cost.
Among the four decision outcomes, there are two types of errors: false positives and false negatives.
The extent to which false positives and false negatives are costly depends on the prediction problem.
So, even though you can often be most accurate by going with the [base rate](#sec-baseRate), it may be advantageous to use a screening instrument despite lower overall accuracy because of the huge difference in costs of false positives versus false negatives in some cases.
Consider the example of a screening instrument for HIV.
False positives would be cases where we said that someone is at high risk of HIV when they are not, whereas false negatives are cases where we said that someone is not at high risk when they actually are.
The costs of false positives include a shortage of blood, some follow-up testing, and potentially some anxiety, but that is about it.
The costs of false negatives may include people contracting HIV (e.g., from contaminated blood donations).
In this case, the costs of false negatives greatly outweigh the costs of false positives, so we use a screening instrument to try to identify the cases at high risk for HIV because of the important consequences of failing to do so, even though using the screening instrument will lower our overall accuracy level.
Another example is when the Central Intelligence Agency (CIA) screened prospective typists during wartime to try to detect spies.
False positives would be cases where the CIA believes that a person is a spy when they are not, and the CIA does not hire them.
False negatives would be cases where the CIA believes that a person is not a spy when they actually are, and the CIA hires them.
In this case, a false positive would be fine, but a false negative would be really bad.
How you weigh the costs of different errors depends considerably on the domain and context.
Possible costs of false positives to society include: unnecessary and costly treatment with side effects, and sending an innocent person to jail (despite the presumption in the United States criminal justice system that a person is innocent until proven guilty).
Possible costs of false negatives to society include: setting a guilty person free, failing to detect a bomb or tumor, and preventing someone from getting treatment who needs it.
The differential costs of different errors also depend on how much flexibility you have in the selection ratio in being able to set a stringent versus loose selection ratio.
Consider a situation in which rejecting applicants during the selection process is costly.
For example, if you must hire 100 people and only 100 people apply for the position, you cannot afford to reject anyone, so you need to hire even high-risk applicants.
However, if you do not need to hire many people, then you can hire more conservatively.
Any time the selection ratio differs from the [base rate](#sec-baseRate), you will make errors.
For example, if you reject 25% of applicants, and the [base rate](#sec-baseRate) of poor adjustment is 5%, then you are making errors of over-rejecting (false positives).
By contrast, if you reject 1% of applicants and the [base rate](#sec-baseRate) of poor adjustment is 5%, then you are making errors of under-rejecting or over-accepting (false negatives).
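As an illustrative sketch (our own, not from the example above), even a test that rank-orders applicants perfectly cannot avoid these errors when the selection ratio and base rate diverge; the minimum possible error rates follow directly from the mismatch:

```{r}
# Minimum possible error rates for a perfectly rank-ordering test:
# if SR > BR, at least (SR - BR) of the population are well-adjusted
# applicants who get rejected (false positives); if SR < BR, at least
# (BR - SR) are poorly adjusted applicants who get retained (false negatives)
minFalsePositiveRate <- function(BR, SR){ max(0, SR - BR) }
minFalseNegativeRate <- function(BR, SR){ max(0, BR - SR) }

minFalsePositiveRate(BR = .05, SR = .25) # over-rejecting: at least .20 FPs
minFalseNegativeRate(BR = .05, SR = .01) # under-rejecting: at least .04 FNs
```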
### Difficulty Predicting Low Base Rate Events {#sec-difficultyPredictingLowBRevents}
A low [base rate](#sec-baseRate) makes it harder to make predictions, and tends to lead to less accurate predictions.
For instance, it is very challenging to predict low [base rate](#sec-baseRate) behaviors, including suicide [@Kessler2020].
For this reason, it is likely much more challenging to predict touchdowns—which occur relatively infrequently—than to predict passing/rushing/receiving yards—which are more frequent and continuously distributed.
Here is the accuracy of the prediction of passing touchdowns versus passing yards among Quarterbacks:
```{r}
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_qb$pass_tds,
actual = projectionsWithActuals_weekly_qb$passing_tds,
dropUndefined = TRUE
)
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_qb$pass_yds,
actual = projectionsWithActuals_weekly_qb$passing_yards,
dropUndefined = TRUE
)
```
```{r}
#| include: false
rSquared_passTds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_qb$pass_tds,
actual = projectionsWithActuals_weekly_qb$passing_tds,
dropUndefined = TRUE
)$rsquared
rSquared_passYds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_qb$pass_yds,
actual = projectionsWithActuals_weekly_qb$passing_yards,
dropUndefined = TRUE
)$rsquared
```
The accuracy for predicting passing yards ($R^2 = `r petersenlab::apa(rSquared_passYds, decimals = 2, leading = FALSE)`$) is higher than the accuracy for predicting passing touchdowns ($R^2 = `r petersenlab::apa(rSquared_passTds, decimals = 2, leading = FALSE)`$).
Here is the accuracy of the prediction of rushing touchdowns versus rushing yards among Running Backs:
```{r}
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_rb$rush_tds,
actual = projectionsWithActuals_weekly_rb$rushing_tds,
dropUndefined = TRUE
)
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_rb$rush_yds,
actual = projectionsWithActuals_weekly_rb$rushing_yards,
dropUndefined = TRUE
)
```
```{r}
#| include: false
rSquared_rushTds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_rb$rush_tds,
actual = projectionsWithActuals_weekly_rb$rushing_tds,
dropUndefined = TRUE
)$rsquared
rSquared_rushYds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_rb$rush_yds,
actual = projectionsWithActuals_weekly_rb$rushing_yards,
dropUndefined = TRUE
)$rsquared
```
The accuracy for predicting rushing yards ($R^2 = `r petersenlab::apa(rSquared_rushYds, decimals = 2, leading = FALSE)`$) is higher than the accuracy for predicting rushing touchdowns ($R^2 = `r petersenlab::apa(rSquared_rushTds, decimals = 2, leading = FALSE)`$).
Here is the accuracy of the prediction of receiving touchdowns versus receiving yards among Wide Receivers:
```{r}
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_wr$rec_tds,
actual = projectionsWithActuals_weekly_wr$receiving_tds,
dropUndefined = TRUE
)
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_wr$rec_yds,
actual = projectionsWithActuals_weekly_wr$receiving_yards,
dropUndefined = TRUE
)
```
```{r}
#| include: false
rSquared_recTds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_wr$rec_tds,
actual = projectionsWithActuals_weekly_wr$receiving_tds,
dropUndefined = TRUE
)$rsquared
rSquared_recYds <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_wr$rec_yds,
actual = projectionsWithActuals_weekly_wr$receiving_yards,
dropUndefined = TRUE
)$rsquared
```
The accuracy for predicting receiving yards ($R^2 = `r petersenlab::apa(rSquared_recYds, decimals = 2, leading = FALSE)`$) is higher than the accuracy for predicting receiving touchdowns ($R^2 = `r petersenlab::apa(rSquared_recTds, decimals = 2, leading = FALSE)`$) among Wide Receivers.
Here is the accuracy of the prediction of receiving touchdowns versus receiving yards among Tight Ends:
```{r}
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_te$rec_tds,
actual = projectionsWithActuals_weekly_te$receiving_tds,
dropUndefined = TRUE
)
petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_te$rec_yds,
actual = projectionsWithActuals_weekly_te$receiving_yards,
dropUndefined = TRUE
)
```
```{r}
#| include: false
rSquared_recTds_te <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_te$rec_tds,
actual = projectionsWithActuals_weekly_te$receiving_tds,
dropUndefined = TRUE
)$rsquared
rSquared_recYds_te <- petersenlab::accuracyOverall(
predicted = projectionsWithActuals_weekly_te$rec_yds,
actual = projectionsWithActuals_weekly_te$receiving_yards,
dropUndefined = TRUE
)$rsquared
```
The accuracy for predicting receiving yards ($R^2 = `r petersenlab::apa(rSquared_recYds_te, decimals = 2, leading = FALSE)`$) is higher than the accuracy for predicting receiving touchdowns ($R^2 = `r petersenlab::apa(rSquared_recTds_te, decimals = 2, leading = FALSE)`$) among Tight Ends.
In sum, the proportion of variance explained by the prediction (i.e., $R^2$) is much higher for passing/rushing/receiving yards than it is for touchdowns.
This is consistent with the notion that prediction accuracy tends to be lower for lower [base rate](#sec-baseRate) events (like touchdowns) compared to higher [base rate](#sec-baseRate) events.
### Sensitivity, Specificity, PPV, and NPV {#sec-sensitivitySpecificityPPVnpv}
As described earlier, percent accuracy is not the only important aspect of accuracy.
Percent accuracy can be misleading because it is highly influenced by [base rates](#sec-baseRate).
You can have a high percent accuracy by predicting from the base rate and saying that no one has the condition (if the [base rate](#sec-baseRate) is low) or that everyone has the condition (if the [base rate](#sec-baseRate) is high).
Thus, it is also important to consider other aspects of accuracy, including sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV).
We want our predictions to be sensitive, so that we detect people who have the characteristic, but also specific, so that we classify as having the characteristic only those who actually have it.
Let us return to the [confusion matrix](#sec-confusionMatrix) in @fig-confusionMatrix3.
If we know the frequency of each of the four predicted-actual combinations of the [confusion matrix](#sec-confusionMatrix) (TP, TN, FP, FN), we can calculate sensitivity, specificity, PPV, and NPV.
Sensitivity is the proportion of those with the characteristic ($\text{TP} + \text{FN}$) that we identified with our measure ($\text{TP}$): $\frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{86}{86 + 14} = .86$.
Specificity is the proportion of those who do not have the characteristic ($\text{TN} + \text{FP}$) that we correctly classify as not having the characteristic ($\text{TN}$): $\frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{1,478}{1,478 + 422} = .78$.
PPV is the proportion of those who we classify as having the characteristic ($\text{TP} + \text{FP}$) who actually have the characteristic ($\text{TP}$): $\frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{86}{86 + 422} = .17$.
NPV is the proportion of those we classify as not having the characteristic ($\text{TN} + \text{FN}$) who actually do not have the characteristic ($\text{TN}$): $\frac{\text{TN}}{\text{TN} + \text{FN}} = \frac{1,478}{1,478 + 14} = .99$.
Sensitivity, specificity, PPV, and NPV are proportions, and their values therefore range from 0 to 1, where higher values reflect greater accuracy.
With sensitivity, specificity, PPV, and NPV, we have a good snapshot of how accurate the measure is at a given cutoff.
In our case, our measure is good at finding whom to reject (high sensitivity), but it is rejecting too many people who do not need to be rejected (lower PPV due to many FPs).
Most people whom we classify as having the characteristic do not actually have the characteristic.
However, over-rejecting could be acceptable depending on our goals, for instance, if we are not concerned about rejecting people unnecessarily (i.e., about the PPV being low).
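These four estimates can be computed directly from the cell frequencies of the confusion matrix used above (a minimal sketch in base R):

```{r}
# Cell frequencies from the confusion matrix in the example
TP <- 86
FP <- 422
FN <- 14
TN <- 1478

sensitivity <- TP / (TP + FN) # .86
specificity <- TN / (TN + FP) # .78
PPV <- TP / (TP + FP)         # .17
NPV <- TN / (TN + FN)         # .99

round(c(SN = sensitivity, SP = specificity, PPV = PPV, NPV = NPV), 2)
```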
#### Some Accuracy Estimates Depend on the Cutoff {#sec-accuracyCutoff}
Sensitivity, specificity, PPV, and NPV differ based on the cutoff (i.e., threshold) for classification.
Consider the following example.
Aliens visit Earth, and they develop a test to determine whether a berry is edible or inedible.
@fig-classificationDistributions depicts the distributions of scores by berry type.
Note how there are clearly two distinct distributions.
However, the distributions overlap to some degree.
Thus, any cutoff will have at least some inaccurate classifications.
The extent of overlap of the distributions reflects the amount of measurement error of the measure with respect to the characteristic of interest.
```{r}
#| label: fig-classificationDistributions
#| fig-cap: "Distribution of Test Scores by Berry Type."
#| fig-alt: "Distribution of Test Scores by Berry Type."
#| code-fold: true
#No Cutoff
sampleSize <- 1000
edibleScores <- rnorm(sampleSize, 50, 15)
inedibleScores <- rnorm(sampleSize, 100, 15)
edibleData <- data.frame(
score = c(
edibleScores,
inedibleScores),
type = c(
rep("edible", sampleSize),
rep("inedible", sampleSize)))
cutoff <- 75
hist_edible <- density(
edibleScores,
from = 0,
to = 150) %$% # exposition pipe magrittr::`%$%`
data.frame(
x = x,
y = y) %>%
mutate(area = x >= cutoff)
hist_edible$type[hist_edible$area == TRUE] <- "edible_FP"
hist_edible$type[hist_edible$area == FALSE] <- "edible_TN"
hist_inedible <- density(
inedibleScores,
from = 0,
to = 150) %$% # exposition pipe magrittr::`%$%`
data.frame(
x = x,
y = y) %>%
mutate(area = x < cutoff)
hist_inedible$type[hist_inedible$area == TRUE] <- "inedible_FN"
hist_inedible$type[hist_inedible$area == FALSE] <- "inedible_TP"
density_data <- bind_rows(
hist_edible,
hist_inedible)
density_data$type <- factor(
density_data$type,
levels = c(
"edible_TN",
"inedible_TP",
"edible_FP",
"inedible_FN"))
ggplot(
data = edibleData,
aes(
x = score,
ymin = 0,
fill = type)) +
geom_density(alpha = .5) +
scale_fill_manual(
name = "Berry Type",
values = c(
viridis::viridis(2)[1],
viridis::viridis(2)[2])) +
scale_y_continuous(name = "Frequency") +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```
@fig-classificationStandardCutoff depicts the distributions of scores by berry type with a cutoff.
The red line indicates the cutoff—the level above which berries are classified by the test as inedible.
There are errors on each side of the cutoff.
Below the cutoff, there are some false negatives (blue): inedible berries that are inaccurately classified as edible.
Above the cutoff, there are some false positives (green): edible berries that are inaccurately classified as inedible.
Costs of false negatives could include sickness or death from eating the inedible berries.
Costs of false positives could include taking longer to find food, finding insufficient food, and starvation.
```{r}
#| label: fig-classificationStandardCutoff
#| fig-cap: "Classifications Based on a Cutoff. Note that some true negatives and true positives are hidden behind the false positives and false negatives."
#| fig-alt: "Classifications Based on a Cutoff. Note that some true negatives and true positives are hidden behind the false positives and false negatives."
#| code-fold: true
#Standard Cutoff
ggplot(
data = density_data,
aes(
x = x,
ymin = 0,
ymax = y,
fill = type)) +
geom_ribbon(alpha = 1) +
scale_fill_manual(
name = "Berry Type",
values = c(
viridis::viridis(4)[4],
viridis::viridis(4)[1],
viridis::viridis(4)[3],
viridis::viridis(4)[2]),
breaks = c("edible_TN","inedible_TP","edible_FP","inedible_FN"),
labels = c("Edible: TN","Inedible: TP","Edible: FP","Inedible: FN")) +
geom_line(aes(y = y)) +
geom_vline(
xintercept = cutoff,
color = "red",
linewidth = 2) +
scale_x_continuous(name = "score") +
scale_y_continuous(name = "Frequency") +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```
Based on our assessment goals, we might use a different selection ratio by changing the cutoff.
@fig-classificationRaiseCutoff depicts the distributions of scores by berry type when we raise the cutoff.
There are now more false negatives (blue) and fewer false positives (green).
If we raise the cutoff (to be more conservative), the number of false negatives increases and the number of false positives decreases.
Consequently, as the cutoff increases, sensitivity and NPV decrease (because we have more false negatives), whereas specificity and PPV increase (because we have fewer false positives).
A higher cutoff could be optimal if the costs of false positives are considered greater than the costs of false negatives.
For instance, a higher cutoff could be preferable if eating an inedible berry causes only mild sickness, whereas discarding too many edible berries would leave the alien colony with insufficient food.
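This trade-off can be checked numerically by computing sensitivity, specificity, PPV, and NPV at both cutoffs. The sketch below re-simulates scores under the same distributions, using new object names (and an arbitrary seed) so that it does not alter the objects used by the figures:

```{r}
set.seed(52242) # arbitrary seed for reproducibility
nBerries <- 1000
edibleSim <- rnorm(nBerries, 50, 15)    # edible berries (negatives)
inedibleSim <- rnorm(nBerries, 100, 15) # inedible berries (positives)

# SN, SP, PPV, and NPV when scores at or above the cutoff are
# classified as inedible
accuracyAtCutoff <- function(cutoffValue){
  TP <- sum(inedibleSim >= cutoffValue)
  FN <- sum(inedibleSim < cutoffValue)
  FP <- sum(edibleSim >= cutoffValue)
  TN <- sum(edibleSim < cutoffValue)
  
  c(SN = TP/(TP + FN), SP = TN/(TN + FP),
    PPV = TP/(TP + FP), NPV = TN/(TN + FN))
}

round(accuracyAtCutoff(75), 2) # standard cutoff
round(accuracyAtCutoff(85), 2) # raised cutoff: lower SN and NPV, higher SP and PPV
```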
```{r}
#| label: fig-classificationRaiseCutoff
#| fig-cap: "Classifications Based on Raising the Cutoff. Note that some true negatives and true positives are hidden behind the false positives and false negatives."
#| fig-alt: "Classifications Based on Raising the Cutoff. Note that some true negatives and true positives are hidden behind the false positives and false negatives."
#| code-fold: true
#Raise the cutoff
cutoff <- 85
hist_edible <- density(
edibleScores,
from = 0,
to = 150) %$% # exposition pipe magrittr::`%$%`
data.frame(
x = x,
y = y) %>%
mutate(area = x >= cutoff)
hist_edible$type[hist_edible$area == TRUE] <- "edible_FP"
hist_edible$type[hist_edible$area == FALSE] <- "edible_TN"
hist_inedible <- density(
inedibleScores,
from = 0,
to = 150) %$% # exposition pipe magrittr::`%$%`
data.frame(
x = x,
y = y) %>%
mutate(area = x < cutoff)
hist_inedible$type[hist_inedible$area == TRUE] <- "inedible_FN"
hist_inedible$type[hist_inedible$area == FALSE] <- "inedible_TP"
density_data <- bind_rows(
hist_edible,
hist_inedible)
density_data$type <- factor(
density_data$type,
levels = c(
"edible_TN",
"inedible_TP",
"edible_FP",
"inedible_FN"))
ggplot(
data = density_data,
aes(
x = x,
ymin = 0,
ymax = y,
fill = type)) +
geom_ribbon(alpha = 1) +
scale_fill_manual(
name = "Berry Type",
values = c(
viridis::viridis(4)[4],
viridis::viridis(4)[1],
viridis::viridis(4)[3],
viridis::viridis(4)[2]),
breaks = c("edible_TN","inedible_TP","edible_FP","inedible_FN"),
labels = c("Edible: TN","Inedible: TP","Edible: FP","Inedible: FN")) +
geom_line(aes(y = y)) +
geom_vline(
xintercept = cutoff,
color = "red",
linewidth = 2) +
scale_x_continuous(name = "score") +
scale_y_continuous(name = "Frequency") +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```
@fig-classificationLowerCutoff depicts the distributions of scores by berry type when we lower the cutoff.
There are now fewer false negatives (blue) and more false positives (green).