{{< include _chunk-timing.qmd >}}
# Base Rates {#sec-baseRates}
This chapter provides an overview of base rates and the important roles they play in prediction.
## Getting Started {#sec-baseRatesGettingStarted}
### Load Packages {#sec-baseRatesLoadPackages}
```{r}
library("petersenlab")
library("psych")
library("tidyverse")
```
### Load Data {#sec-baseRatesLoadData}
```{r}
#| eval: false
#| include: false
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_seasonal.RData", fsep = ""))
```
```{r}
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
```
We created the `player_stats_weekly.RData` and `player_stats_seasonal.RData` objects in @sec-calculatePlayerAge.
## Overview {#sec-baseRatesOverview}
Predicting player performance is a complex task.
Performance is probabilistically influenced by many processes, including processes internal to the player in addition to external processes.
Moreover, people's performance occurs in the context of a dynamic system with nonlinear, probabilistic, and cascading influences that change across time.
The ever-changing system makes behavior challenging to predict.
And, similar to chaos theory, one small change in the system can lead to large differences later on.
There are also important statistical factors to keep in mind when making predictions.
Let's consider a prediction example, assuming the following probabilities:
- The probability of contracting HIV is .3%
- The probability of a positive test for HIV is 1%
- The probability of a positive test if you have HIV is 95%
What is the probability of HIV if you have a positive test?
As we will see, the probability is: $\frac{95\% \times .3\%}{1\%} = 28.5\%$.
So based on the above probabilities, if you have a positive test, the probability that you have HIV is 28.5%.
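As a quick check of this arithmetic in base R (the same calculation is performed later using the `petersenlab` package):

```{r}
# Bayes' theorem by hand, using the probabilities above
p_hiv <- .003           # base rate (prevalence) of HIV
p_pos <- .01            # base rate of a positive test
p_pos_given_hiv <- .95  # probability of a positive test given HIV

p_hiv_given_pos <- p_pos_given_hiv * p_hiv / p_pos
p_hiv_given_pos  # .285
```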
Most people tend to vastly overestimate the likelihood that the person has HIV in this example.
Why?
Because they do not pay enough attention to the base rate (in this example, the base rate of HIV is .3%).
Base rates are important to account for when making predictions, yet people tend to ignore them, a phenomenon called [base rate neglect](#sec-fallaciesBaseRate) (described in @sec-fallaciesBaseRate).
In general, people tend to overestimate the likelihood of low base-rate events [@Kahneman2011].
That is, if the base rate of an event or condition—such as schizophrenia—is low (e.g., ~0.5%), people overestimate the likelihood that a person has schizophrenia when given specific information about the person such as their symptoms and history.
As an example of people overestimating the likelihood of low base-rate events, @Fox1998a asked National Basketball Association (NBA) fans to estimate the probability that each of the 8 teams in the playoffs would win the championship.
The median sum of probability judgments for the eight teams was 240% [@Fox1998a; @Kahneman2011] even though, in reality, the true probabilities must sum to 100%.
The base rate for a given team to win is 12.5% (i.e., 1/8), whereas people were giving each team (on average) a 30% chance (i.e., 240/8) to win.
The finding suggests that people were evaluating each team individually, considering the reasons why that particular team could win, without properly accounting for the total probability (and the base rate).
In addition, people tend to overweight unlikely events in their decisions [@Kahneman2011].
People tend to make judgments based on the [representativeness](#sec-heuristicsRepresentativeness) (i.e., similarity to a prototype, or stereotyping) and [availability](#sec-heuristicsAvailability) heuristics rather than the base rate [@Kahneman2011].
For instance, professional scouts often make judgments about players based in part on their build and look—i.e., whether they look the part—rather than just based on their performance [@Kahneman2011].
Another important phenomenon is that low base-rate events (i.e., unlikely events) are difficult to predict accurately.
Predictions of lower base rate phenomena tend to be lower than predictions of more common phenomena.
For instance, predictions of touchdowns (less common) tend to be less accurate than predictions of yardage (more common).
Teams in the NFL also tend to neglect base rates.
For instance, NFL teams give too much weight to scouting evidence in deciding which players to draft, and fail to integrate such evidence with the prior probabilities of a player's successful future performance [i.e., the base rate of success; @Massey2013].
"...consider how suspiciously often we hear a college prospect described as a 'once-in-a-lifetime player'." [@Massey2013, p. 1482].
In this chapter, we describe [ways to account for base rates](#sec-accountForBaseRates) in judgments and predictions.
## Issues Around Probability {#sec-probability}
### Types of Probabilities {#sec-probabilityTypes}
It is important to distinguish between different types of probabilities: marginal probabilities, joint probabilities, and conditional probabilities.
#### Base Rate (Marginal Probability) {#sec-baseRate}
The *base rate* is a marginal probability, which is the general probability of an event irrespective of other things.
For instance, the base rate of HIV is the probability of developing HIV.
In the U.S., the prevalence rate of HIV is ~0.4% of the adult population [@AIDSVu2022; archived at <https://perma.cc/8GE6-GAPC>].
For instance, we can consider the following marginal probabilities:
$P(C_i)$ is the probability (i.e., base rate) of a classification, $C$, independent of other things.
A base rate is often used as the "*prior probability*" in a Bayesian model.
In our example above, $P(C_i)$ is the base rate (i.e., prevalence) of HIV in the population: $P(\text{HIV}) = .3\%$.
$P(R_i)$ is the probability (base rate) of a response, $R$, independent of other things.
In the example above, $P(R_i)$ is the base rate of a positive test for HIV: $P(\text{positive test}) = 1\%$.
The base rate of a positive test is known as the *positivity rate* or *selection ratio*.
#### Joint Probability {#sec-jointProbability}
A *joint probability* is the probability of two (or more) events occurring simultaneously.
For instance, the probability of events $A$ and $B$ both occurring together is $P(A, B)$.
If the two events are independent, a joint probability can be calculated as the product of the [marginal probability](#sec-baseRate) of each event, as in @eq-jointProbability:
$$
P(A, B) = P(A) \cdot P(B)
$$ {#eq-jointProbability}
Conversely (and rearranging the terms for the calculation of [conditional probability](#sec-conditionalProbability)), a [joint probability](#sec-jointProbability) can also be calculated using the [conditional probability](#sec-conditionalProbability) and [marginal probability](#sec-baseRate), as in @eq-jointProbability2:
$$
P(A, B) = P(A | B) \cdot P(B)
$$ {#eq-jointProbability2}
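A minimal sketch of both calculations in base R (the conditional and marginal probabilities in the second calculation are hypothetical values chosen for illustration):

```{r}
# Independent events: P(A, B) = P(A) * P(B)
# e.g., two fair dice both showing a 6
(1 / 6) * (1 / 6)  # = 1/36

# General case: P(A, B) = P(A | B) * P(B)
p_a_given_b <- .3  # hypothetical conditional probability, P(A | B)
p_b <- .5          # hypothetical marginal probability, P(B)
p_a_given_b * p_b  # joint probability = .15
```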
#### Conditional Probability {#sec-conditionalProbability}
A *conditional probability* is the probability of one event occurring given the occurrence of another event.
Conditional probabilities are written as: $P(A | B)$.
This is read as the probability that event $A$ occurs given that event $B$ occurred.
For instance, we can consider the following conditional probabilities:
$P(C | R)$ is the probability of a classification, $C$, given a response, $R$.
In other words, $P(C | R)$ is the probability of having HIV given a positive test: $P(\text{HIV} | \text{positive test})$.
$P(R | C)$ is the probability of a response, $R$, given a classification, $C$.
In the example above, $P(R | C)$ is the probability of having a positive test given that a person has HIV: $P(\text{positive test} | \text{HIV}) = 95\%$.
A conditional probability can be calculated using the [joint probability](#sec-jointProbability) and [marginal probability](#sec-baseRate) (base rate), as in @eq-conditionalProbability:
$$
P(A | B) = \frac{P(A, B)}{P(B)}
$$ {#eq-conditionalProbability}
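For instance, using hypothetical values in base R:

```{r}
# Conditional probability from a joint and a marginal probability
p_ab <- .15  # hypothetical joint probability, P(A, B)
p_b <- .5    # hypothetical marginal probability, P(B)
p_ab / p_b   # P(A | B) = .3
```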
### Confusion of the Inverse {#sec-inverseFallacy}
A [conditional probability](#sec-conditionalProbability) is not the same thing as its reverse (or inverse) [conditional probability](#sec-conditionalProbability).
Unless the [base rates](#sec-baseRate) of the two events ($C$ and $R$) are the same, $P(C | R) \neq P(R | C)$.
However, people frequently make the mistake of thinking that two inverse [conditional probabilities](#sec-conditionalProbability) are the same.
This mistake is known as the "confusion of the inverse", or the "inverse fallacy", or the "conditional probability fallacy".
The confusion of inverse probabilities is the logical error of representative thinking that leads people to assume that the probability of $C$ given $R$ is the same as the probability of $R$ given $C$, even though this is not generally true.
As a few examples to demonstrate the logical fallacy, if 93% of breast cancers occur in high-risk women, this does not mean that 93% of high-risk women will eventually get breast cancer.
As another example, if 77% of car accidents take place within 15 miles of a driver's home, this does not mean that you will get in an accident 77% of times you drive within 15 miles of your home.
Which car is the most frequently stolen?
It is often the Honda Accord or Honda Civic—probably because they are among the most popular/commonly available cars.
The probability that the car is a Honda Accord given that a car was stolen ($p(\text{Honda Accord } | \text{ Stolen})$) is what the media reports and what the police care about.
However, that is not what buyers and car insurance companies should care about.
Instead, they care about the probability that the car will be stolen given that it is a Honda Accord ($p(\text{Stolen } | \text{ Honda Accord})$).
Applied to fantasy football, the probability that a given player will be injured given that he is a Running Back ($p(\text{Injured } | \text{ RB})$) is not the same as the probability that a given player is a Running Back given that he is injured ($p(\text{RB } | \text{ Injured})$).
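To see why the two conditional probabilities differ, consider some hypothetical counts (the numbers below are invented for illustration, not real injury data):

```{r}
# Hypothetical counts showing that inverse conditional probabilities differ
n_rb <- 150         # hypothetical number of Running Backs
n_injured <- 200    # hypothetical number of injured players (all positions)
n_rb_injured <- 60  # hypothetical number of injured Running Backs

n_rb_injured / n_rb       # P(Injured | RB)  = .4
n_rb_injured / n_injured  # P(RB | Injured) = .3
```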
To calculate the inverse conditional probability, we can leverage Bayesian statistics, which derives it from a conditional probability (i.e., likelihood) and the marginal probabilities (base rates) of both events.
Bayesian statistics is a branch of statistics distinct from frequentist statistics and null hypothesis significance testing.
It is based on Bayes' theorem, which allows updating probability estimates based on prior information (e.g., the base rate) and new data.
### Bayes' Theorem {#sec-bayesTheorem}
#### Standard Formulation {#sec-bayesTheoremStandard}
An alternative way of calculating a [conditional probability](#sec-conditionalProbability) is using the inverse [conditional probability](#sec-conditionalProbability) (instead of the [joint probability](#sec-jointProbability)).
This is known as Bayes' theorem.
Bayes' theorem can help us calculate a [conditional probability](#sec-conditionalProbability) of some classification, $C$, given some response, $R$, if we know the inverse [conditional probability](#sec-conditionalProbability) and the [base rate](#sec-baseRate) (marginal probability) of each.
Bayes' theorem is in @eq-bayes1:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)}
\end{aligned}
$$ {#eq-bayes1}
Or, equivalently (rearranging the terms):
$$
\begin{aligned}
\frac{P(C | R)}{P(R | C)} = \frac{P(C_i)}{P(R_i)}
\end{aligned}
$$ {#eq-bayes2}
Or, equivalently (rearranging the terms):
$$
\begin{aligned}
\frac{P(C | R)}{P(C_i)} = \frac{P(R | C)}{P(R_i)}
\end{aligned}
$$ {#eq-bayes3}
More generally, Bayes' theorem has been described as:
$$
\begin{aligned}
P(H | E) &= \frac{P(E | H) \cdot P(H)}{P(E)} \\
\text{posterior probability} &= \frac{\text{likelihood} \times \text{prior probability}}{\text{model evidence}}
\end{aligned}
$$ {#eq-bayes6}
where $H$ is the hypothesis, and $E$ is the evidence—the new information that was not used in computing the prior probability.
In Bayesian terms, the *posterior probability* is the conditional probability of one event occurring given another event—it is the updated probability after the evidence is considered.
In this case, the posterior probability is the probability of the classification occurring ($C$) given the response ($R$).
The *likelihood* is the inverse conditional probability—the probability of the response ($R$) occurring given the classification ($C$).
The *prior probability* is the marginal probability of the event (i.e., the classification) occurring, before we take into account any new information.
The *model evidence* is the marginal probability of the other event occurring—i.e., the marginal probability of seeing the evidence.
Bayes' theorem provides the foundation for a paradigm of statistics called Bayesian statistics, which (unlike frequentist statistics) does not use *p*-values.
In the HIV example above, we can calculate the [conditional probability](#sec-conditionalProbability) of HIV given a positive test using three terms: the [conditional probability](#sec-conditionalProbability) of a positive test given HIV (i.e., the sensitivity of the test), the [base rate](#sec-baseRate) of HIV, and the [base rate](#sec-baseRate) of a positive test for HIV.
The [conditional probability](#sec-conditionalProbability) of HIV given a positive test is in @eq-hivExample1:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)} \\
P(\text{HIV} | \text{positive test}) &= \frac{P(\text{positive test} | \text{HIV}) \cdot P(\text{HIV})}{P(\text{positive test})} \\
&= \frac{\text{sensitivity of test} \times \text{base rate of HIV}}{\text{base rate of positive test}} \\
&= \frac{95\% \times .3\%}{1\%} = \frac{.95 \times .003}{.01}\\
&= 28.5\%
\end{aligned}
$$ {#eq-hivExample1}
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pAgivenB()` function that estimates the probability of one event, $A$, given another event, $B$.
```{r}
petersenlab::pAgivenB(
pBgivenA = .95,
pA = .003,
pB = .01)
```
Thus, assuming the probabilities in the example above, the [conditional probability](#sec-conditionalProbability) of having HIV if a person has a positive test is 28.5%.
Given a single positive test, it is more likely than not that the person does not have HIV.
Now let's see what happens if the person tests positive a second time.
We would revise our "[prior probability](#sec-baseRate)" for HIV from the general prevalence in the population (0.3%) to be the "posterior probability" of HIV given a first positive test (28.5%).
This is known as *Bayesian updating*.
We would also update the "evidence" to be the [marginal probability](#sec-baseRate) of getting a second positive test.
If we do not know a [marginal probability](#sec-baseRate) (i.e., base rate) of an event (e.g., getting a second positive test), we can calculate a [marginal probability](#sec-baseRate) with the *law of total probability* using [conditional probabilities](#sec-conditionalProbability) and the [marginal probability](#sec-baseRate) of another event (e.g., having HIV).
According to the law of total probability, the probability of getting a positive test is the probability that a person with HIV gets a positive test (i.e., sensitivity) times the base rate of HIV plus the probability that a person without HIV gets a positive test (i.e., false positive rate) times the [base rate](#sec-baseRate) of not having HIV, as in @eq-lawOfTotalProbability:
$$
\begin{aligned}
P(\text{not } C_i) &= 1 - P(C_i) \\
P(R_i) &= P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot P(\text{not } C_i) \\
1\% &= 95\% \times .3\% + P(R | \text{not } C) \times 99.7\% \\
\end{aligned}
$$ {#eq-lawOfTotalProbability}
In this case, we know the [marginal probability](#sec-baseRate) ($P(R_i)$), and we can use that to solve for the unknown [conditional probability](#sec-conditionalProbability) that reflects the false positive rate ($P(R | \text{not } C)$), as in @eq-conditionalProbabilityRevised:
$$
\scriptsize
\begin{aligned}
P(R_i) &= P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot P(\text{not } C_i) && \\
P(R_i) - [P(R | \text{not } C) \cdot P(\text{not } C_i)] &= P(R | C) \cdot P(C_i) && \text{Move } P(R | \text{not } C) \text{ to the left side} \\
- [P(R | \text{not } C) \cdot P(\text{not } C_i)] &= P(R | C) \cdot P(C_i) - P(R_i) && \text{Move } P(R_i) \text{ to the right side} \\
P(R | \text{not } C) \cdot P(\text{not } C_i) &= P(R_i) - [P(R | C) \cdot P(C_i)] && \text{Multiply by } -1 \\
P(R | \text{not } C) &= \frac{P(R_i) - [P(R | C) \cdot P(C_i)]}{P(\text{not } C_i)} && \text{Divide by } P(\text{not } C_i) \\
&= \frac{1\% - [95\% \times .3\%]}{99.7\%} = \frac{.01 - [.95 \times .003]}{.997}\\
&= .7171515\% \\
\end{aligned}
$$ {#eq-conditionalProbabilityRevised}
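The final step of this derivation can be checked directly in base R:

```{r}
# Solving for the false positive rate, P(R | not C), using the values above
false_positive_rate <- (.01 - .95 * .003) / .997
false_positive_rate  # approximately .007171515 (i.e., about .72%)
```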
We can then estimate the marginal probability of the event, substituting in $P(R | \text{not } C)$, using the law of total probability.
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pA()` function that estimates the marginal probability of one event, $A$.
```{r}
petersenlab::pA(
pAgivenB = .95,
pB = .003,
pAgivenNotB = .007171515)
```
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pBgivenNotA()` function that estimates the probability of one event, $B$, given that another event, $A$, did not occur.
```{r}
petersenlab::pBgivenNotA(
pBgivenA = .95,
pA = .003,
pB = .01)
```
With this [conditional probability](#sec-conditionalProbability) ($P(R | \text{not } C)$), the updated [marginal probability](#sec-baseRate) of having HIV ($P(C_i)$), and the updated marginal probability of not having HIV ($P(\text{not } C_i)$), we can now calculate an updated estimate of the [marginal probability](#sec-baseRate) of getting a second positive test.
The probability of getting a second positive test is the probability that a person with HIV gets a second positive test (i.e., sensitivity) times the updated probability of HIV plus the probability that a person without HIV gets a second positive test (i.e., false positive rate) times the updated probability of not having HIV, as in @eq-baseRateUpdated:
$$
\begin{aligned}
P(R_{i}) &= P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot P(\text{not } C_i) \\
&= 95\% \times 28.5\% + .7171515\% \times 71.5\% = .95 \times .285 + .007171515 \times .715 \\
&= 27.58776\%
\end{aligned}
$$ {#eq-baseRateUpdated}
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pB()` function that estimates the marginal probability of one event, $B$.
```{r}
petersenlab::pB(
pBgivenA = .95,
pA = .285,
pBgivenNotA = .007171515)
```
We then substitute the updated [marginal probability](#sec-baseRate) of HIV ($P(C_i)$) and the updated [marginal probability](#sec-baseRate) of getting a second positive test ($P(R_i)$) into Bayes' theorem to get the probability that the person has HIV if they have a second positive test (assuming the errors of each test are independent, i.e., uncorrelated), as in @eq-baseRateUpdated2:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)} \\
P(\text{HIV} | \text{a second positive test}) &= \frac{P(\text{a second positive test} | \text{HIV}) \cdot P(\text{HIV})}{P(\text{a second positive test})} \\
&= \frac{\text{sensitivity of test} \times \text{updated base rate of HIV}}{\text{updated base rate of positive test}} \\
&= \frac{95\% \times 28.5\%}{27.58776\%} \\
&= 98.14\%
\end{aligned}
$$ {#eq-baseRateUpdated2}
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pAgivenB()` function that estimates the probability of one event, $A$, given another event, $B$.
```{r}
petersenlab::pAgivenB(
pBgivenA = .95,
pA = .285,
pB = .2758776)
```
Thus, a second positive test greatly increases the posterior probability that the person has HIV from 28.5% to over 98%.
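The two updating steps above can be sketched as a loop in base R, assuming (as in the text) that the errors of the repeated tests are independent and using the sensitivity and false positive rate derived earlier:

```{r}
# Sequential Bayesian updating across repeated positive tests
sensitivity <- .95
false_positive_rate <- .007171515
prior <- .003  # initial base rate of HIV

for (test_number in 1:2) {
  # Law of total probability: marginal probability of a positive test
  evidence <- sensitivity * prior + false_positive_rate * (1 - prior)
  # Bayes' theorem: the posterior becomes the prior for the next test
  prior <- sensitivity * prior / evidence
  print(prior)  # approximately .285, then approximately .981
}
```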
As seen in the rearranged formula in @eq-bayes2, the ratio of the [conditional probabilities](#sec-conditionalProbability) is equal to the ratio of the [base rates](#sec-baseRate).
Thus, it is important to consider [base rates](#sec-baseRate).
People have a strong tendency to ignore (or give insufficient weight to) [base rates](#sec-baseRate) when making predictions.
The failure to consider the [base rate](#sec-baseRate) when making predictions when given specific information about a case is known as the [base rate fallacy](#sec-fallaciesBaseRate) or as [base rate neglect](#sec-fallaciesBaseRate).
For example, people tend to say that the probability of a rare event is more likely than it actually is given specific information.
As seen in the rearranged formula in @eq-bayes3, the inverse [conditional probabilities](#sec-conditionalProbability) ($P(C | R)$ and $P(R | C)$) are not equal unless the [base rates](#sec-baseRate) of $C$ and $R$ are the same.
If the [base rates](#sec-baseRate) are not equal, we are making at least some prediction errors.
If $P(C_i) > P(R_i)$, our predictions must include some false negatives.
If $P(R_i) > P(C_i)$, our predictions must include some false positives.
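A small counting sketch (with hypothetical numbers) shows why unequal base rates force errors:

```{r}
# Hypothetical: among 100 players, 30 experience the outcome (C),
# but we predict the outcome for only 20 (R)
n_condition <- 30  # number for whom C occurs
n_predicted <- 20  # number for whom we predict C (i.e., R)

# Even if every prediction is correct, some cases must be missed
min_false_negatives <- max(n_condition - n_predicted, 0)
min_false_negatives  # at least 10 false negatives
```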
#### Alternative Formulation {#sec-bayesTheoremAlternative}
Using the law of total probability, we can substitute the calculation of the [marginal probability](#sec-baseRate) ($P(R_i)$) into Bayes' theorem to get an alternative formulation of Bayes' theorem, as in @eq-baseRateUpdated3:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)} \\
&= \frac{P(R | C) \cdot P(C_i)}{P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot P(\text{not } C_i)} \\
&= \frac{P(R | C) \cdot P(C_i)}{P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot [1 - P(C_i)]}
\end{aligned}
$$ {#eq-baseRateUpdated3}
Instead of using [marginal probability](#sec-baseRate) ([base rate](#sec-baseRate)) of $R$, as in the original formulation of Bayes' theorem, it uses the [conditional probability](#sec-conditionalProbability), $P(R|\text{not } C)$.
Thus, it uses three terms: two [conditional probabilities](#sec-conditionalProbability)—$P(R|C)$ and $P(R|\text{not } C)$—and one [marginal probability](#sec-baseRate), $P(C_i)$.
Let us see how the alternative formulation of Bayes' theorem applies to the HIV example above.
We can calculate the probability of HIV given a positive test using three terms: the [conditional probability](#sec-conditionalProbability) that a person with HIV gets a positive test (i.e., [sensitivity](#sec-sensitivity)), the [conditional probability](#sec-conditionalProbability) that a person without HIV gets a positive test (i.e., [false positive rate](#sec-falsePositiveRate)), and the [base rate](#sec-baseRate) of HIV.
Using the $P(R|\text{not } C)$ calculated in @eq-conditionalProbabilityRevised, the [conditional probability](#sec-conditionalProbability) of HIV given a single positive test is in @eq-bayes4:
$$
\small
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot [1 - P(C_i)]} \\
&= \frac{\text{sensitivity of test} \times \text{base rate of HIV}}{\text{sensitivity of test} \times \text{base rate of HIV} + \text{false positive rate of test} \times (1 - \text{base rate of HIV})} \\
&= \frac{95\% \times .3\%}{95\% \times .3\% + .7171515\% \times (1 - .3\%)} = \frac{.95 \times .003}{.95 \times .003 + .007171515 \times (1 - .003)}\\
&= 28.5\%
\end{aligned}
$$ {#eq-bayes4}
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pAgivenB()` function that estimates the probability of one event, $A$, given another event, $B$.
```{r}
pAgivenB(
pBgivenA = .95,
pA = .003,
pBgivenNotA = .007171515)
pAgivenB(
pBgivenA = .95,
pA = .003,
pBgivenNotA = pBgivenNotA(
pBgivenA = .95,
pA = .003,
pB = .01))
```
To calculate the [conditional probability](#sec-conditionalProbability) of HIV given a second positive test, we update our priors because the person has now tested positive for HIV.
We update the [prior probability](#sec-baseRate) of HIV ($P(C_i)$) based on the posterior probability of HIV after a positive test ($P(C | R)$) that we calculated above.
We can calculate the [conditional probability](#sec-conditionalProbability) of HIV given a second positive test using three terms: the [conditional probability](#sec-conditionalProbability) that a person with HIV gets a positive test (i.e., [sensitivity](#sec-sensitivity); which stays the same), the [conditional probability](#sec-conditionalProbability) that a person without HIV gets a positive test (i.e., [false positive rate](#sec-falsePositiveRate); which stays the same), and the updated [marginal probability](#sec-baseRate) of HIV.
The [conditional probability](#sec-conditionalProbability) of HIV given a second positive test is in @eq-baseRateUpdated4:
$$
\scriptsize
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot [1 - P(C_i)]} \\
&= \frac{\text{sensitivity of test} \times \text{updated base rate of HIV}}{\text{sensitivity of test} \times \text{updated base rate of HIV} + \text{false positive rate of test} \times (1 - \text{updated base rate of HIV})} \\
&= \frac{95\% \times 28.5\%}{95\% \times 28.5\% + .7171515\% \times (1 - 28.5\%)} = \frac{.95 \times .285}{.95 \times .285 + .007171515 \times (1 - .285)}\\
&= 98.14\%
\end{aligned}
$$ {#eq-baseRateUpdated4}
The [`petersenlab`](https://cran.r-project.org/web/packages/petersenlab/index.html) package [@R-petersenlab] contains the `petersenlab::pAgivenB()` function that estimates the probability of one event, $A$, given another event, $B$.
```{r}
pAgivenB(
pBgivenA = .95,
pA = .285,
pBgivenNotA = .007171515)
pAgivenB(
pBgivenA = .95,
pA = .285,
pBgivenNotA = pBgivenNotA(
pBgivenA = .95,
pA = .003,
pB = .01))
```
#### Interim Summary
In sum, the [marginal probability](#sec-baseRate), including the [prior probability](#sec-baseRate) or [base rate](#sec-baseRate), should be weighed heavily in predictions unless there are sufficient data to indicate otherwise, i.e., to update the posterior probability based on new evidence.
People tend to ignore base rates (i.e., [base rate neglect](#sec-fallaciesBaseRate)).
In general, people tend to a) overestimate the likelihood of low base-rate events and b) overweight low base-rate events in their decisions [@Kahneman2011].
Bayes' theorem specifies how prior beliefs (i.e., [base rate](#sec-baseRate) information) should be integrated with the [predictive accuracy](#sec-predictiveValidity) of the evidence to make predictions.
It thus provides a powerful tool to [anchor](#sec-heuristicsAnchoringAdjustment) predictions to the [base rate](#sec-baseRate) unless sufficient evidence changes the posterior probability (by updating the evidence and [prior probability](#sec-baseRate)).
In general, you should [anchor](#sec-heuristicsAnchoringAdjustment) your predictions to the [base rate](#sec-baseRate) and [adjust](#sec-heuristicsAnchoringAdjustment) from there.
It is also important to question the validity of the evidence [@Kahneman2011].
As noted by @Kahneman2011, if you have doubts about the quality of the evidence for a particular prediction question, keep your predictions close to the [base rate](#sec-baseRate), and modify them only slightly (if at all) based on the new information.
## Cab Example {#sec-cabExample}
Below is an example:
> *A cab was involved in a hit-and-run accident at night.
> Two cab companies, the Green and the Blue, operate in the city.
> You are given the following data:*
>
> - *85% of the cabs in the city are Green and 15% are Blue.*
> - *A witness identified the cab as Blue.
> The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.*
>
> *What is the probability that the cab involved in the accident was Blue rather than Green?*
>
> --- Kahneman [-@Kahneman2011, p. 166]
Thus, we know the following:
$$
\begin{aligned}
P(\text{Blue}) &= .15 && \text{prior probability of a Blue cab}\\
P(\text{Green}) &= .85 && \text{prior probability of a Green cab}\\
P(\text{Correct}|\text{Blue}) &= .80 && \text{probability the witness correctly identifies a Blue cab}\\
P(\text{Correct}|\text{Green}) &= .80 && \text{probability the witness correctly identifies a Green cab}\\
P(\text{Incorrect}|\text{Blue}) &= .20 && \text{probability the witness incorrectly identifies a Blue cab}\\
P(\text{Incorrect}|\text{Green}) &= .20 && \text{probability the witness incorrectly identifies a Green cab}\\
\end{aligned}
$$ {#eq-cabExampleGivenInfo}
We want to know the probability that the cab involved in the accident was Blue, given that the witness identified it as Blue ($P(\text{Blue}|\text{Identified as Blue})$).
To estimate this probability, we can apply [Bayes' theorem](#sec-bayesTheorem) to estimate the posterior probability:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)}\\
P(\text{Blue}|\text{Identified as Blue}) &= \frac{P(\text{Identified as Blue}|\text{Blue}) \cdot P(\text{Blue})}{P(\text{Identified as Blue})}
\end{aligned}
$$ {#eq-cabExampleBayesTheorem}
We can compute the term in the denominator ($P(\text{Identified as Blue})$) using the law of total probability (described in @sec-bayesTheorem).
$$
\begin{aligned}
P(R_i) &= P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot P(\text{not } C_i)\\
P(R_i) &= P(\text{Identified as Blue}|\text{Blue}) \cdot P(\text{Blue}) + P(\text{Identified as Blue}|\text{Green}) \cdot P(\text{Green})\\
0.29 &= (.80 \times .15) + (.20 \times .85) \\
\end{aligned}
$$ {#eq-cabExampleLawOfTotalProbability}
```{r}
petersenlab::pA(
pAgivenB = .80,
pB = .15,
pAgivenNotB = .20)
```
We can now substitute this value into the denominator of Bayes' theorem to estimate the posterior probability:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)}\\
P(\text{Blue}|\text{Identified as Blue}) &= \frac{P(\text{Identified as Blue}|\text{Blue}) \cdot P(\text{Blue})}{P(\text{Identified as Blue})}\\
0.414 &= \frac{0.80 \times 0.15}{0.29}
\end{aligned}
$$ {#eq-cabExamplePosterior}
```{r}
petersenlab::pAgivenB(
pBgivenA = .80,
pA = .15,
pB = .29)
```
Thus, there was a 41.4% probability that the cab involved in the accident was Blue rather than Green.
However, when faced with this problem, people tend to [ignore the base rate](#sec-fallaciesBaseRate) and go with the witness [@Kahneman2011].
According to @Kahneman2011, the most frequent response to this question is that there is an 80% probability that the cab was Blue.
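As a quick check, the same posterior can be computed in one step using the [alternative formulation of Bayes' theorem](#sec-bayesTheoremAlternative), which expands the denominator rather than computing $P(\text{Identified as Blue})$ separately:

```{r}
# Numerator: witness says Blue AND cab is Blue; denominator adds the
# path where the witness says Blue but the cab is Green.
(.80 * .15) / ((.80 * .15) + (.20 * .85))  # ≈ 0.414
```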
## Nate Silver Examples {#sec-nateSilverExamples}
@Silver2012 provides several examples that leverage the [alternative formulation of Bayes' theorem](#sec-bayesTheoremAlternative) provided in @eq-baseRateUpdated3 and summarized below:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R | C) \cdot P(C_i) + P(R | \text{not } C) \cdot [1 - P(C_i)]}
\end{aligned}
$$ {#eq-baseRateAlternative}
In each example, the formula uses three elements to calculate the probability that the hypothesis is true:
1. the [conditional probability](#sec-conditionalProbability) of observing the evidence, $R$, given that the hypothesis, $C$, is true (i.e., $P(R|C)$; the [true positive rate](#sec-sensitivity))
1. the [conditional probability](#sec-conditionalProbability) of observing the evidence, $R$, given that the hypothesis, $C$, is false (i.e., $P(R | \text{not } C)$; the [false positive rate](#sec-falsePositiveRate))
1. the [marginal probability](#sec-baseRate) (base rate) of the event occurring (i.e., the prior probability of the hypothesis, $C$, being true; $P(C_i)$)
Thus, the formula uses the [base rate](#sec-baseRate), the [true positive rate](#sec-sensitivity) (sensitivity), and the [false positive rate](#sec-falsePositiveRate).
The ratio of the [true positive rate](#sec-sensitivity) to the [false positive rate](#sec-falsePositiveRate) is called the [positive likelihood ratio](#sec-positiveLikelihoodRatio), and is used in [Bayesian updating](#sec-bayesianUpdating).
### Example 1: Is Your Partner Cheating on You? {#sec-nateSilverExample1}
Example 1: You came home and found a strange pair of underwear in your underwear drawer.
What is the probability that your partner is cheating on you?
- the [prior probability](#sec-baseRate) that your partner is cheating on you: 4%
- the [conditional probability](#sec-conditionalProbability) of underwear appearing given that your partner is cheating on you: 50%
- the [conditional probability](#sec-conditionalProbability) of underwear appearing given that your partner is *not* cheating on you: 5%
```{r}
pAgivenB(
pBgivenA = .50,
pA = .04,
pBgivenNotA = .05)
```
### Example 2: Does a Person Have Breast Cancer? {#sec-nateSilverExample2}
Example 2: What is the probability that a woman in her 40s has breast cancer if she tested positive on a mammogram?
- the [prior probability](#sec-baseRate) that she has breast cancer: 1.4%
- the [conditional probability](#sec-conditionalProbability) that she has a positive test given that she has breast cancer: 75%
- the [conditional probability](#sec-conditionalProbability) that she has a positive test given that she does *not* have breast cancer: 10%
```{r}
pAgivenB(
pBgivenA = .75,
pA = .014,
pBgivenNotA = .10)
```
### Example 3: Was it a Terrorist Attack? {#sec-nateSilverExample3}
#### Example 3A: The First Plane Hit the World Trade Center {#sec-nateSilverExample3A}
Example 3A: Consider the information we had on 9/11 when the first plane hit the World Trade Center.
What is the probability that a terror attack occurred given that the first plane hit the World Trade Center?
- the [prior probability](#sec-baseRate) that terrorists crash a plane into a Manhattan skyscraper: 0.005%
- the [conditional probability](#sec-conditionalProbability) that a plane crashes into a Manhattan skyscraper if terrorists are attacking Manhattan skyscrapers: 100%
- the [conditional probability](#sec-conditionalProbability) that a plane crashes into a Manhattan skyscraper if terrorists are *not* attacking Manhattan skyscrapers (i.e., it is an accident): 0.008%
```{r}
pAgivenB(
pBgivenA = 1,
pA = .00005,
pBgivenNotA = .00008)
```
#### Example 3B: The Second Plane Hit the World Trade Center {#sec-nateSilverExample3B}
Example 3B: Now, consider that a second plane just hit the World Trade Center.
What is the probability that a terror attack occurred given that a second plane hit the World Trade Center?
- the revised [prior probability](#sec-baseRate) that terrorists crash a plane into a Manhattan skyscraper (from [Example 3A](#sec-nateSilverExample3A)): 38.46272%
- the [conditional probability](#sec-conditionalProbability) that a plane crashes into a Manhattan skyscraper if terrorists are attacking Manhattan skyscrapers: 100%
- the [conditional probability](#sec-conditionalProbability) that a plane crashes into a Manhattan skyscraper if terrorists are *not* attacking Manhattan skyscrapers (i.e., it is an accident): 0.008%
```{r}
pAgivenB(
pBgivenA = 1,
pA = .3846272,
pBgivenNotA = .00008)
```
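The two updates in Examples 3A and 3B can be chained: each posterior becomes the prior for the next observation. A minimal base-R sketch (`update_prior()` is our own helper, equivalent in form to `petersenlab::pAgivenB()`):

```{r}
# Each posterior becomes the prior for the next observed plane strike.
# update_prior() is our own helper, equivalent in form to pAgivenB().
update_prior <- function(prior, pBgivenA = 1, pBgivenNotA = .00008) {
  (pBgivenA * prior) / (pBgivenA * prior + pBgivenNotA * (1 - prior))
}

priors <- Reduce(
  function(prior, strike) update_prior(prior),
  x = 1:2,        # two observed plane strikes
  init = .00005,  # initial prior probability of a terror attack
  accumulate = TRUE)

priors  # prior; posterior after plane 1 (~.385); posterior after plane 2 (>.999)
```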
## Base Rates Applied to Fantasy Football {#sec-baseRateFantasyFootball}
Base rates are also relevant to fantasy football.
Unlike yardage (e.g., passing yards, rushing yards, receiving yards), touchdowns occur *relatively* infrequently.
Whereas a solid Wide Receiver may log 100+ receptions and 1,200+ yards in a season, they may have "only" 8–14 receiving touchdowns in a given season.
As noted in @sec-difficultyPredictingLowBRevents, lower base-rate events—including touchdowns—are harder to predict accurately.
As noted by @Harris2012: "NFL statistical projections are basically impossible to get right. (Take it from someone who helps create them for a living.) Yes, we can do a passable job with yardage totals for players who don't suffer unexpected injuries or depth-chart pratfalls. But so much of fantasy football hinges on touchdowns, and touchdowns are impossibly difficult to predict from season to season (let alone week to week)." (archived at <https://perma.cc/4QNH-J2LD>).
Thus, it is important not to lend too much credence to predictions of touchdowns.
Focus on other things that may be more predictable (and that may be indirectly prognostic of touchdowns) such as yards, carries/targets, receptions, depth of targets, red zone carries/targets, short distance carries/targets, etc.
In @sec-difficultyPredictingLowBRevents, we evaluate the accuracy of predicting touchdowns versus yardage.
When dealing with numeric predictions (rather than categorical outcomes), the equivalent of the base rate is the average value.
For instance, the "base rate" of fantasy points for a given position is the average number of fantasy points for that position.
We could subdivide even further to identify, for instance, the "base rate" of fantasy points for the Wide Receiver at the top of the depth chart on a team.
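As a sketch of this idea, positional "base rates" are just group means. The toy data below are made up for illustration (they are not real player statistics); the column names mirror the `player_stats_seasonal` object used later in this chapter:

```{r}
# Illustrative only: positional "base rates" as group means.
# toyStats is made-up data; column names mirror player_stats_seasonal.
library(dplyr)

toyStats <- tibble::tibble(
  position = c("QB", "QB", "WR", "WR", "WR"),
  fantasyPoints = c(280, 310, 150, 210, 180))

baseRates <- toyStats %>%
  group_by(position) %>%
  summarise(basePoints = mean(fantasyPoints), .groups = "drop")

baseRates
```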
## Base Rate of Rookie Performance {#sec-baseRateRookiePerformance}
We examine the base rates of rookie performance compared to performance of non-rookies.
Rookie performance can be challenging to predict because rookies have not yet played in the NFL.
Thus, it is important to consider the prior probabilities (base rates) of success among rookies in the NFL.
Rookies who play tend to be better players who were drafted in early rounds of the National Football League (NFL) draft.
Thus, to more fairly compare rookies and non-rookies, we examine only players who were drafted in the first two rounds of the NFL Draft.
```{r}
rookiesDraftedEarly <- player_stats_seasonal %>%
filter(years_of_experience == 0) %>%
filter(draft_round %in% c(1,2)) %>%
arrange(player_id, season) %>%
group_by(player_id) %>%
slice_head(n = 1) %>%
ungroup() %>%
mutate(
rookie = 1,
rookieLabel = "Rookie"
) %>%
arrange(player_display_name)
nonRookiesDraftedEarly <- player_stats_seasonal %>%
filter(years_of_experience > 0) %>%
filter(draft_round %in% c(1,2)) %>%
mutate(
rookie = 0,
rookieLabel = "Non-rookie"
) %>%
arrange(player_display_name)
playersDraftedEarly <- bind_rows(
rookiesDraftedEarly,
nonRookiesDraftedEarly
) %>%
mutate(rookieLabel = factor(rookieLabel, levels = c("Rookie","Non-rookie")))
```
```{r}
#| label: fig-densityPlotRookieVsNonRookieByPosition
#| fig-cap: "Density Plot of Fantasy Points for Rookies Versus Non-Rookies by Position Among Players Who Were Drafted in the First Two Rounds of the National Football League Draft."
#| fig-alt: "Density Plot of Fantasy Points for Rookies Versus Non-Rookies by Position Among Players Who Were Drafted in the First Two Rounds of the National Football League Draft."
ggplot2::ggplot(
data = playersDraftedEarly %>% filter(position %in% c("QB","RB","WR","TE")),
aes(
x = fantasyPoints)) +
geom_density(
aes(
fill = rookieLabel),
alpha = 0.5) +
facet_wrap( ~ position) +
labs(
x = "Fantasy Points",
y = "Density",
title = "Density Plot of Fantasy Points for Rookies Versus Non-Rookies by Position"
) +
theme_bw() +
theme(
axis.title.y = element_text(angle = 0, vjust = 0.5), # horizontal y-axis title
legend.title = element_blank()) # remove legend title
```
### Quarterbacks {#sec-baseRateRookiePerformanceQBs}
```{r}
nonRookiesDraftedEarly %>%
filter(position == "QB") %>%
select(fantasyPoints) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "QB") %>%
select(fantasyPoints) %>%
psych::describe()
```
```{r}
#| label: fig-histogramRookieQBs
#| fig-cap: "Histogram of Fantasy Points Among Rookie Quarterbacks Who Were Drafted in the First Two Rounds of the National Football League Draft."
#| fig-alt: "Histogram of Fantasy Points Among Rookie Quarterbacks Who Were Drafted in the First Two Rounds of the National Football League Draft."
rookiesDraftedEarly %>%
filter(position == "QB") %>%
pull(fantasyPoints) %>%
hist(xlab = "Fantasy Points", main = "")
```
Among Quarterbacks who played a full season:
```{r}
nonRookiesDraftedEarly %>%
filter(position == "QB") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "QB") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
```
### Running Backs {#sec-baseRateRookiePerformanceRBs}
```{r}
nonRookiesDraftedEarly %>%
filter(position == "RB") %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "RB") %>%
select(rushing_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "RB") %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "RB") %>%
select(rushing_tds) %>%
psych::describe()
```
```{r}
#| label: fig-histogramRookieRBs
#| fig-cap: "Histogram of Fantasy Points Among Rookie Running Backs Who Were Drafted in the First Two Rounds of the National Football League Draft."
#| fig-alt: "Histogram of Fantasy Points Among Rookie Running Backs Who Were Drafted in the First Two Rounds of the National Football League Draft."
rookiesDraftedEarly %>%
filter(position == "RB") %>%
pull(fantasyPoints) %>%
hist(xlab = "Fantasy Points", main = "")
```
Among Running Backs who played a full season:
```{r}
nonRookiesDraftedEarly %>%
filter(position == "RB") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "RB") %>%
filter(games >= 16) %>%
select(rushing_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "RB") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "RB") %>%
filter(games >= 16) %>%
select(rushing_tds) %>%
psych::describe()
```
### Wide Receivers {#sec-baseRateRookiePerformanceWRs}
```{r}
nonRookiesDraftedEarly %>%
filter(position == "WR") %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "WR") %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "WR") %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "WR") %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
#| label: fig-histogramRookieWRs
#| fig-cap: "Histogram of Fantasy Points Among Rookie Wide Receivers Who Were Drafted in the First Two Rounds of the National Football League Draft."
#| fig-alt: "Histogram of Fantasy Points Among Rookie Wide Receivers Who Were Drafted in the First Two Rounds of the National Football League Draft."
rookiesDraftedEarly %>%
filter(position == "WR") %>%
pull(fantasyPoints) %>%
hist(xlab = "Fantasy Points", main = "")
```
Among Wide Receivers who played a full season:
```{r}
nonRookiesDraftedEarly %>%
filter(position == "WR") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "WR") %>%
filter(games >= 16) %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "WR") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "WR") %>%
filter(games >= 16) %>%
select(receiving_tds) %>%
psych::describe()
```
### Tight Ends {#sec-baseRateRookiePerformanceTEs}
```{r}
nonRookiesDraftedEarly %>%
filter(position == "TE") %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "TE") %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "TE") %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "TE") %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
#| label: fig-histogramRookieTEs
#| fig-cap: "Histogram of Fantasy Points Among Rookie Tight Ends Who Were Drafted in the First Two Rounds of the National Football League Draft."
#| fig-alt: "Histogram of Fantasy Points Among Rookie Tight Ends Who Were Drafted in the First Two Rounds of the National Football League Draft."
rookiesDraftedEarly %>%
filter(position == "TE") %>%
pull(fantasyPoints) %>%
hist(xlab = "Fantasy Points", main = "")
```
Among Tight Ends who played a full season:
```{r}
nonRookiesDraftedEarly %>%
filter(position == "TE") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
nonRookiesDraftedEarly %>%
filter(position == "TE") %>%
filter(games >= 16) %>%
select(receiving_tds) %>%
psych::describe()
```
```{r}
rookiesDraftedEarly %>%
filter(position == "TE") %>%
filter(games >= 16) %>%
select(fantasyPoints) %>%
psych::describe()
rookiesDraftedEarly %>%
filter(position == "TE") %>%
filter(games >= 16) %>%
select(receiving_tds) %>%
psych::describe()
```
## How to Account for Base Rates {#sec-accountForBaseRates}
There are various ways to account for [base rates](#sec-baseRate), including the use of [actuarial formulas](#sec-accountForBaseRatesActuarial) and the use of [Bayesian updating](#sec-bayesianUpdating).
### Actuarial Formula {#sec-accountForBaseRatesActuarial}
One approach to account for [base rates](#sec-baseRate) is to use [actuarial formulas](#sec-actuarialPrediction) (rather than [human judgment](#sec-humanJudgment)) to make the predictions.
[Actuarial formulas](#sec-actuarialPrediction) based on [multiple regression](#sec-multipleRegression) or [machine learning](#sec-machineLearning) can account for the [base rate](#sec-baseRate) of the event.
### Bayesian Updating {#sec-bayesianUpdating}
Another approach to account for [base rates](#sec-baseRate) is to leverage Bayes' theorem, using Bayesian updating and the [probability nomogram](#sec-probabilityNomogram).
Bayesian updating is a form of [anchoring and adjustment](#sec-heuristicsAnchoringAdjustment); however, unlike the [anchoring and adjustment heuristic](#sec-heuristicsAnchoringAdjustment), it is a systematic approach to [anchoring and adjustment](#sec-heuristicsAnchoringAdjustment) that anchors one's predictions to the base rate, and then adjusts according to new information.
That is, we start with a [pretest probability](#sec-baseRate) (i.e., [base rate](#sec-baseRate)) and update our predictions based on the extent of new information (i.e., the [likelihood ratio](#sec-diagnosticLikelihoodRatio)).
Bayesian updating can also be applied to continuous outcomes like fantasy points.
For an example of applying Bayesian updating to fantasy points, see @Braun2012 (archived at <https://web.archive.org/web/20161028142225/http://www.bayesff.com/bayesian101/>).
Applying Bayes' theorem to continuous outcomes, the posterior distribution is proportional to the prior distribution multiplied by the likelihood of the data.
For instance, we can start with our prior belief (distribution) for a player's performance based on, for example, average draft position.
Then, we observe the Week 1 performance.
For instance, if Tom Brady scores 35 points in Week 1, the likelihood reflects how probable that 35-point performance (the data) is under each possible level of his true scoring ability.
Using Bayesian updating, we can then calculate a posterior distribution that represents our new best prediction moving forward for how many points Tom Brady will score in Week 2.
We then observe Week 2, generate a new likelihood and posterior distribution, and use that as our new prior distribution for Week 3, and so on.
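To make the continuous case concrete, a common simplification assumes a normal prior and a normal likelihood, which yields a closed-form posterior: a precision-weighted average of the prior mean and the observed score. The numbers below (a prior mean of 20 points with an SD of 5, and a game-level SD of 8) are made-up values for illustration, not estimates from data:

```{r}
# Normal-normal updating: posterior precision (1/variance) is the sum of
# the prior and observation precisions; the posterior mean is the
# precision-weighted average of the prior mean and the observed score.
update_normal <- function(priorMean, priorSD, obs, obsSD) {
  priorPrecision <- 1 / priorSD^2
  obsPrecision <- 1 / obsSD^2
  postPrecision <- priorPrecision + obsPrecision
  postMean <- (priorMean * priorPrecision + obs * obsPrecision) / postPrecision
  c(mean = postMean, sd = sqrt(1 / postPrecision))
}

# Prior belief: ~20 points per week (SD = 5); then a 35-point Week 1.
posterior <- update_normal(priorMean = 20, priorSD = 5, obs = 35, obsSD = 8)
posterior  # mean pulled toward 35 but anchored near the prior (~24.2)
```

Note that the posterior mean moves only part of the way toward the observed score; how far it moves depends on the relative precisions, which is exactly the anchor-and-adjust behavior described above.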
To keep things simple, we use a binary outcome below to demonstrate Bayesian updating.
Performing Bayesian updating involves comparing the relative probability of two outcomes, $P(C | R)$ versus $P(\text{not } C | R)$.
If we want to compare the relative probability of two outcomes, we can use the odds form of Bayes' theorem, as in @eq-bayes5:
$$
\begin{aligned}
P(C | R) &= \frac{P(R | C) \cdot P(C_i)}{P(R_i)} \\
P(\text{not } C | R) &= \frac{P(R | \text{not } C) \cdot P(\text{not } C_i)}{P(R_i)} \\
\frac{P(C | R)}{P(\text{not } C | R)} &= \frac{\frac{P(R | C) \cdot P(C_i)}{P(R_i)}}{\frac{P(R | \text{not } C) \cdot P(\text{not } C_i)}{P(R_i)}} \\
&= \frac{P(R | C) \cdot P(C_i)}{P(R | \text{not } C) \cdot P(\text{not } C_i)} \\
&= \frac{P(C_i)}{P(\text{not } C_i)} \times \frac{P(R | C)}{P(R | \text{not } C)} \\
\text{posterior odds} &= \text{prior odds} \times \text{likelihood ratio}
\end{aligned}
$$ {#eq-bayes5}
As presented in @eq-bayes5, the posttest (or posterior) odds are equal to the pretest odds multiplied by the [likelihood ratio](#sec-diagnosticLikelihoodRatio).
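Applied to the cab example from @sec-cabExample, where the pretest probability is $.15$ and the witness's likelihood ratio is $.80/.20 = 4$:

$$
\begin{aligned}
\text{posterior odds} &= \frac{.15}{.85} \times \frac{.80}{.20} \approx 0.176 \times 4 \approx 0.706 \\
P(\text{Blue}|\text{Identified as Blue}) &= \frac{0.706}{1 + 0.706} \approx .414
\end{aligned}
$$

which matches the posterior probability of 41.4% computed via Bayes' theorem in @sec-cabExample.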
Below, we describe the [likelihood ratio](#sec-diagnosticLikelihoodRatio).
#### Diagnostic Likelihood Ratio {#sec-diagnosticLikelihoodRatio}
A likelihood ratio is the ratio of two probabilities.
It can be used to compare the likelihood of two possibilities.
The diagnostic likelihood ratio is an index of the predictive validity of an instrument: it is the ratio of the probability of a given test result among people who have the condition to the probability of that same result among people who do not have the condition.
There are two types of diagnostic likelihood ratios: the [positive likelihood ratio](#sec-positiveLikelihoodRatio) and the [negative likelihood ratio](#sec-negativeLikelihoodRatio).
##### Positive Likelihood Ratio (LR+) {#sec-positiveLikelihoodRatio}
The positive likelihood ratio (LR+) is the probability that a person with the disease tested positive for the disease ([true positive rate](#sec-sensitivity)) divided by the probability that a person without the disease tested positive for the disease ([false positive rate](#sec-falsePositiveRate)).
That is, the positive likelihood ratio (LR+) compares (i.e., is a ratio of) the [true positive rate](#sec-sensitivity) to the [false positive rate](#sec-falsePositiveRate).
Positive likelihood ratio values range from 1 (an uninformative test) to infinity.\index{positive likelihood ratio}
Higher values reflect greater accuracy, because they indicate the degree to which a [true positive](#sec-truePositive) is more likely than a [false positive](#sec-falsePositive).
Testing positive on a test with a high LR+ increases the probability of disease.
The formula for calculating the positive likelihood ratio is in @eq-positiveLikelihoodRatio.
$$
\begin{aligned}
\text{positive likelihood ratio (LR+)} &= \frac{\text{TPR}}{\text{FPR}} \\
&= \frac{P(R|C)}{P(R|\text{not } C)} \\
&= \frac{P(R|C)}{1 - P(\text{not } R|\text{not } C)} \\
&= \frac{\text{sensitivity}}{1 - \text{specificity}}
\end{aligned}
$$ {#eq-positiveLikelihoodRatio}
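As a quick sketch (the helper name is ours, not from a package): the cab-example witness, with sensitivity and specificity both .80, has an LR+ of 4:

```{r}
# LR+ = sensitivity / (1 - specificity)
positive_likelihood_ratio <- function(sensitivity, specificity) {
  sensitivity / (1 - specificity)
}

positive_likelihood_ratio(sensitivity = .80, specificity = .80)  # 4
```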
##### Negative Likelihood Ratio (LR−) {#sec-negativeLikelihoodRatio}
The negative likelihood ratio (LR−) is the probability that a person with the disease tested negative for the disease ([false negative rate](#sec-falseNegativeRate)) divided by the probability that a person without the disease tested negative for the disease ([true negative rate](#sec-specificity)).
That is, the negative likelihood ratio (LR−) compares (i.e., is a ratio of) the [false negative rate](#sec-falseNegativeRate) to the [true negative rate](#sec-specificity).
Negative likelihood ratio values range from 0 to 1, with 1 reflecting an uninformative test.
Smaller values reflect greater accuracy, because they indicate that a [false negative](#sec-falseNegative) is less likely than a [true negative](#sec-trueNegative).
Testing negative on a test with a low LR− decreases the probability of disease.
The formula for calculating the negative likelihood ratio is in @eq-negativeLikelihoodRatio.
$$
\begin{aligned}
\text{negative likelihood ratio } (\text{LR}-) &= \frac{\text{FNR}}{\text{TNR}} \\
&= \frac{P(\text{not } R|C)}{P(\text{not } R|\text{not } C)} \\
&= \frac{1 - P(R|C)}{P(\text{not } R|\text{not } C)} \\
&= \frac{1 - \text{sensitivity}}{\text{specificity}}
\end{aligned}
$$ {#eq-negativeLikelihoodRatio}
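A parallel sketch for the negative likelihood ratio (again, the helper name is ours): the cab-example witness has an LR− of .25:

```{r}
# LR- = (1 - sensitivity) / specificity
negative_likelihood_ratio <- function(sensitivity, specificity) {
  (1 - sensitivity) / specificity
}

negative_likelihood_ratio(sensitivity = .80, specificity = .80)  # 0.25
```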
#### Probability Nomogram {#sec-probabilityNomogram}
Using [Bayes' theorem](#sec-bayesTheorem) (described in @sec-bayesTheorem), solving for posttest odds (based on pretest odds and the [likelihood ratio](#sec-diagnosticLikelihoodRatio), as in @eq-bayes5), and converting odds to probabilities, we can use a Fagan probability nomogram to determine the posttest probability following a test result.
The calculation of posttest (posterior) probability is described in @sec-posttestProbability.
In its calculation, the probability nomogram automatically converts the pretest probability (i.e., base rate) to prior (pretest) odds and the posterior (posttest) odds to posterior probability, so you do not have to.
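The arithmetic the nomogram performs can also be written out directly. A sketch (the function name is ours), applied to the cab example with a pretest probability of .15 and an LR+ of 4:

```{r}
# Probability -> odds, multiply by the likelihood ratio, odds -> probability.
posttest_probability <- function(pretestProb, likelihoodRatio) {
  pretestOdds <- pretestProb / (1 - pretestProb)
  posttestOdds <- pretestOdds * likelihoodRatio
  posttestOdds / (1 + posttestOdds)
}

# Cab example: pretest probability .15, witness LR+ = .80/.20 = 4.
posttest_probability(pretestProb = .15, likelihoodRatio = 4)  # ≈ 0.414
```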