{{< include _chunk-timing.qmd >}}
# Simulation: Bootstrapping and the Monte Carlo Method {#sec-simulation}
This chapter provides an overview of various approaches to simulation, including [bootstrapping](#sec-simulationProjectedPointsBootstrapping) and the [Monte Carlo](#sec-simulationOverviewMonteCarlo) method.
## Getting Started {#sec-simulationGettingStarted}
### Load Packages {#sec-simulationLoadPackages}
```{r}
library("ffanalytics")
library("data.table")
library("future")
library("future.apply")
library("progressr")
library("SimDesign")
library("fitdistrplus")
library("sn")
library("tidyverse")
```
### Load Data {#sec-simulationLoadData}
```{r}
#| eval: false
#| include: false
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/players_projectedPoints_seasonal.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_playerIDs.RData", fsep = ""))
```
```{r}
load(file = "./data/players_projectedPoints_seasonal.RData")
load(file = "./data/nfl_playerIDs.RData")
```
## Overview {#sec-simulationOverview}
A simulation is an "imitative representation" of a phenomenon that could exist in the real world.
In statistics, simulations are computer-driven investigations to better understand a phenomenon by studying its behavior under different conditions.
For instance, we might want to determine a player's likely range of outcomes: the range of fantasy points the player might score over the course of a season.
Simulations can be conducted in various ways.
Two common ways of conducting simulations are via [bootstrapping](#sec-simulationOverviewBootstrapping) and via [Monte Carlo simulation](#sec-simulationOverviewMonteCarlo).
### Bootstrapping {#sec-simulationOverviewBootstrapping}
Bootstrapping involves repeated resampling (with replacement) from observed data.
For instance, if we have 100 sources provide projections for a player, we could estimate the most likely range of fantasy points for the player by repeatedly sampling from the 100 projections.
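As a minimal sketch of this idea (with made-up projection values), bootstrapping a player's mean projected points might look like this:

```{r}
#| eval: false
# Hypothetical season-long point projections for one player from 10 sources
projections <- c(250, 262, 241, 270, 255, 248, 266, 259, 244, 252)

# Resample with replacement many times, computing the mean each time,
# then summarize the bootstrap distribution of the mean
set.seed(1)
boot_means <- replicate(10000, mean(sample(projections, replace = TRUE)))
quantile(boot_means, c(.10, .50, .90))
```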
### Monte Carlo Simulation {#sec-simulationOverviewMonteCarlo}
Monte Carlo simulation is named after the Casino de Monte-Carlo in Monaco because it involves random sampling (random chance), as in casino gambling.
Monte Carlo simulation involves repeated random sampling from a known distribution [@Sigal2016].
For instance, if we know the population distribution for the likely outcomes for a player—e.g., a normal distribution with a known mean (e.g., 150 points) and standard deviation (e.g., 20 points)—we can repeatedly sample randomly from this distribution.
The distribution could be, as examples, a normal distribution, a log-normal distribution, a binomial distribution, a chi-square distribution, etc. [@Sigal2016].
The distribution provides a probability density function, which indicates the probability that any particular value would be observed if the data arose from that distribution.
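For example, assuming the hypothetical normal distribution above (a mean of 150 points and a standard deviation of 20 points), repeated random sampling looks like this:

```{r}
#| eval: false
set.seed(1)
draws <- rnorm(10000, mean = 150, sd = 20) # repeated random draws from the known distribution

quantile(draws, c(.10, .90)) # likely range of outcomes
mean(draws >= 180)           # simulated probability of scoring at least 180 points
```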
## Simulation of Projected Statistics and Points {#sec-simulationProjectedPoints}
Below, we perform [bootstrapping](#sec-simulationProjectedPointsBootstrapping) and [Monte Carlo](#sec-simulationProjectedPointsMonteCarlo) simulations of projected statistics and points.
However, it is worth noting that—as for any simulation—the quality of the results depends on the quality of the inputs.
In this case, the quality of the simulation depends on the quality of the projections.
If the projections are no good, the simulation results will not be trustworthy.
Garbage in, garbage out.
As we evaluated in @sec-accuracyProjections, projections tend to show moderate accuracy for fantasy performance, but they are not highly accurate.
Thus, we should treat simulation results arising from fantasy projections with a good dose of skepticism.
### Bootstrapping {#sec-simulationProjectedPointsBootstrapping}
In this example, we simulate players' likely distribution of (projected) fantasy points based on bootstrapping players' projected statistics (e.g., passing yards, passing touchdowns).
That is, for each iteration of the bootstrap, we randomly draw a projected value from a projection source (with replacement), for each of various statistical categories.
For instance, assume we have 10 projection sources for Tom Brady.
In iteration 1, we may draw Brady's projected passing yards from ESPN and projected passing touchdowns from numberFire (i.e., randomly selected projection sources).
In iteration 2, we may draw Brady's projected passing yards from Yahoo and projected passing touchdowns from CBS (i.e., randomly selected projection sources).
We do this for many iterations.
We can compute projected fantasy points at each iteration and then summarize the distribution of fantasy points for Tom Brady (and all players) across all iterations.
This bootstrap simulation is valuable because it gives us a fuller distribution of the likely range of players' performance than just the mean (or range) of projected points across sources.
The simulated distribution provides information about our [uncertainty](#sec-fantasyValueUncertainty) of a player's performance.
For instance, using these distributions, we can then sample from the simulated distribution of likely fantasy points for each player to identify the optimal lineup across simulation runs using our [lineup optimizer](#sec-lineupOptimization) (described in @sec-lineupOptimization).
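As a sketch of that idea: given a long data frame of bootstrap draws (here a hypothetical `playerDraws` object with `id` and `projectedPoints` columns), one plausible season outcome per player can be drawn for each simulation run:

```{r}
#| eval: false
# playerDraws is a hypothetical long data frame of bootstrap draws,
# with one row per player per iteration (columns: id, projectedPoints)
oneRun <- playerDraws %>%
  dplyr::group_by(id) %>%
  dplyr::slice_sample(n = 1) %>% # one random draw per player
  dplyr::ungroup()
```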
#### Prepare Data {#sec-simulationProjectedPointsBootstrappingPrepareData}
```{r}
all_proj <- dplyr::bind_rows(players_projectedPoints_seasonal)
```
```{r}
#| label: vars-to-bootstrap-by-position
vars_by_pos <- list(
  QB = c(
    "games",
    "pass_att", "pass_comp", "pass_inc", "pass_yds", "pass_tds", "pass_int",
    "rush_att", "rush_yds", "rush_tds",
    "fumbles_lost", "fumbles_total", "two_pts",
    "sacks",
    "pass_09_tds", "pass_1019_tds", "pass_2029_tds", "pass_3039_tds", "pass_4049_tds", "pass_50_tds",
    "pass_40_yds", "pass_250_yds", "pass_300_yds", "pass_350_yds", "pass_400_yds",
    "rush_40_yds", "rush_50_yds", "rush_100_yds", "rush_150_yds", "rush_200_yds"
  ),
  RB = c(
    "games",
    "rush_att", "rush_yds", "rush_tds",
    "rec_tgt", "rec", "rec_yds", "rec_tds", "rec_rz_tgt",
    "fumbles_lost", "fumbles_total", "two_pts",
    "return_yds", "return_tds",
    "rush_09_tds", "rush_1019_tds", "rush_2029_tds", "rush_3039_tds", "rush_4049_tds", "rush_50_tds",
    "rush_40_yds", "rush_50_yds", "rush_100_yds", "rush_150_yds", "rush_200_yds",
    "rec_40_yds", "rec_50_yds", "rec_100_yds", "rec_150_yds", "rec_200_yds"
  ),
  WR = c(
    "games",
    "pass_att", "pass_comp", "pass_inc", "pass_yds", "pass_tds", "pass_int",
    "rush_att", "rush_yds", "rush_tds",
    "rec_tgt", "rec", "rec_yds", "rec_tds", "rec_rz_tgt",
    "fumbles_lost", "fumbles_total", "two_pts",
    "return_yds", "return_tds",
    "rush_09_tds", "rush_1019_tds", "rush_2029_tds", "rush_3039_tds", "rush_4049_tds", "rush_50_tds",
    "rush_40_yds", "rush_50_yds", "rush_100_yds", "rush_150_yds", "rush_200_yds",
    "rec_40_yds", "rec_50_yds", "rec_100_yds", "rec_150_yds", "rec_200_yds"
  ),
  TE = c(
    "games",
    "pass_att", "pass_comp", "pass_inc", "pass_yds", "pass_tds", "pass_int",
    "rush_att", "rush_yds", "rush_tds",
    "rec_tgt", "rec", "rec_yds", "rec_tds", "rec_rz_tgt",
    "fumbles_lost", "fumbles_total", "two_pts",
    "return_yds", "return_tds",
    "rush_09_tds", "rush_1019_tds", "rush_2029_tds", "rush_3039_tds", "rush_4049_tds", "rush_50_tds",
    "rush_40_yds", "rush_50_yds", "rush_100_yds", "rush_150_yds", "rush_200_yds",
    "rec_40_yds", "rec_50_yds", "rec_100_yds", "rec_150_yds", "rec_200_yds"
  ),
  K = c(
    "fg_0019", "fg_2029", "fg_3039", "fg_4049", "fg_50", "fg_50_att",
    "fg_39", "fg_att_39", "fg_49", "fg_49_att",
    "fg", "fg_att", "fg_miss", "xp", "xp_att"
  ),
  D = c(
    "idp_solo", "idp_asst", "idp_sack", "idp_int", "idp_fum_force", "idp_fum_rec", "idp_pd", "idp_td", "idp_safety"
  ),
  DL = c(
    "idp_solo", "idp_asst", "idp_sack", "idp_int", "idp_fum_force", "idp_fum_rec", "idp_pd", "idp_td", "idp_safety"
  ),
  LB = c(
    "idp_solo", "idp_asst", "idp_sack", "idp_int", "idp_fum_force", "idp_fum_rec", "idp_pd", "idp_td", "idp_safety"
  ),
  DB = c(
    "idp_solo", "idp_asst", "idp_sack", "idp_int", "idp_fum_force", "idp_fum_rec", "idp_pd", "idp_td", "idp_safety"
  ),
  DST = c(
    "dst_fum_recvr", "dst_fum_rec", "dst_int", "dst_safety", "dst_sacks", "dst_td", "dst_blk",
    "dst_fumbles", "dst_tackles", "dst_yds_against", "dst_pts_against", "dst_pts_allowed", "dst_ret_yds"
  )
)
```
#### Bootstrapping Function {#sec-simulationProjectedPointsBootstrappingFunction}
For performing the bootstrapping, we leverage the `data.table` [@R-data.table] (`data.table::as.data.table()`; `data.table::data.table()`; `data.table::rbindlist()`), `future` [@R-future] (`future::plan()`; `future::multisession()`), and `future.apply` [@R-future.apply] (`future.apply::future_lapply()`) packages for speed (by using parallel processing) and memory efficiency.
We use the `progressr` [@R-progressr] package (`progressr::handlers()`; `progressr::with_progress()`; `progressr::progressor()`) to create a progress bar.
```{r}
#| label: function-for-bootstrap-simulation
bootstrapSimulation <- function(
    projectedStats,
    vars_by_pos,
    n_iter = 10000,
    seed = NULL,
    progress = TRUE) {
  dt <- data.table::as.data.table(projectedStats) # use data.table for speed
  all_ids <- unique(dt$id)

  if (!is.null(seed)) set.seed(seed)

  future::plan(future::multisession) # parallelize tasks across multiple background R sessions using multiple cores to speed up simulation

  if (progress) progressr::handlers("txtprogressbar") # specify progress-bar style

  results <- progressr::with_progress({ # wrap in with_progress for progress bar
    p <- if (progress) progressr::progressor(along = all_ids) else NULL # create progressor for progress bar

    future.apply::future_lapply(
      all_ids, # apply the function below to each player using a parallelized loop
      function(player_id) {
        if (!is.null(p)) p() # advance progress bar

        player_data <- dt[id == player_id]
        player_pos <- unique(player_data$pos)

        if (length(player_pos) != 1 || !player_pos %in% names(vars_by_pos)) return(NULL)

        stat_vars <- vars_by_pos[[player_pos]] # pull the relevant stat variables to simulate for this player's position

        out <- data.table(iteration = seq_len(n_iter), id = player_id, pos = player_pos)

        for (var in stat_vars) { # loop over each stat variable that should be simulated for the player's position
          if (var %in% names(player_data)) {
            non_na_values <- player_data[[var]][!is.na(player_data[[var]])] # pull non-missing values of the stat for the player (from all projection sources)

            if (length(non_na_values) > 0) {
              out[[var]] <- sample(non_na_values, n_iter, replace = TRUE) # if there are valid values, sample with replacement to simulate n_iter values
            } else {
              out[[var]] <- NA_real_ # specify a numeric missing value (if all values were missing)
            }
          } else {
            out[[var]] <- NA_real_ # specify a numeric missing value (if the stat variable doesn't exist)
          }
        }

        return(out)
      },
      future.seed = TRUE # ensures that each parallel process gets a reproducible random seed
    )
  })

  data.table::rbindlist(results, use.names = TRUE, fill = TRUE) # combines all the individual player results into one large data table, aligning columns by name; fill = TRUE ensures that missing columns are filled with NA where necessary
}
```
#### Run the Bootstrapping Simulation {#sec-simulationProjectedPointsBootstrappingRun}
```{r}
#| label: code-to-run-bootstrap-simulation
#| eval: false
bootstappedStats <- bootstrapSimulation(
  projectedStats = all_proj,
  vars_by_pos = vars_by_pos,
  n_iter = 5000,
  seed = 52242)
```
```{r}
#| label: run-bootstrap-simulation
#| include: false
bootstappedStats <- bootstrapSimulation(
  projectedStats = all_proj,
  vars_by_pos = vars_by_pos,
  n_iter = 5000,
  seed = 52242,
  progress = FALSE)
```
```{r}
#| label: free-up-memory1
#| include: false
rm(all_proj)
gc()
```
#### Score Fantasy Points from the Simulation {#sec-simulationProjectedPointsBootstrappingScore}
We use `data.table::setnames()` to rename the `iteration` column to `data_src`, to match the input format of the scoring function.
```{r}
#| label: put-simulated-data-in-format-for-scoring1
data.table::setnames(bootstappedStats, "iteration", "data_src") # rename the iteration column to data_src (data.table renames in place, by reference)

bootstappedStatsByPosition <- split(
  bootstappedStats,
  by = "pos",
  keep.by = TRUE)
```
```{r}
#| label: free-up-memory2
#| include: false
rm(bootstappedStats)
gc()
```
We then use `base::lapply()` with `data.table::setDF()` to convert each position's `data.table` to a `data.frame` before scoring.
```{r}
#| label: put-simulated-data-in-format-for-scoring2
bootstappedStatsByPosition <- lapply(
  bootstappedStatsByPosition,
  setDF)

attr(bootstappedStatsByPosition, "season") <- 2024
attr(bootstappedStatsByPosition, "week") <- 0
```
```{r}
#| label: free-up-memory3
#| include: false
gc()
```
```{r}
#| label: score-bootstrap-simulation
bootstrappedFantasyPoints <- ffanalytics:::source_points(
  data_result = bootstappedStatsByPosition,
  scoring_rules = ffanalytics::scoring)
```
```{r}
#| label: free-up-memory4
#| include: false
rm(bootstappedStatsByPosition)
gc()
```
```{r}
#| label: postprocessing-bootstrap-simulation
bootstrappedFantasyPoints$iteration <- bootstrappedFantasyPoints$data_src
bootstrappedFantasyPoints$data_src <- NULL

bootstrappedFantasyPoints <- bootstrappedFantasyPoints %>%
  left_join(
    nfl_playerIDs[,c("mfl_id","name","merge_name","team")],
    by = c("id" = "mfl_id")
  )

bootstrappedFantasyPoints <- bootstrappedFantasyPoints %>%
  rename(projectedPoints = raw_points)
```
```{r}
#| label: free-up-memory5
#| include: false
gc()
```
```{r}
#| label: compute-fantasy-points-example
#| eval: false
#| include: false
# https://github.com/FantasyFootballAnalytics/ffanalytics/blob/08cab3cbe6eb645b853bb814e671e186b78a0837/R/calc_projections.R#L218-L253
# This gives an example for the data format accepted by the source_points() function
l = ffanalytics:::scrape_espn(season = 2024, week = 0)
# The two important things for the "data_result" object going into the source_points()
# function are the format of the list (with positions as names) and the attributes:
# season and week
# Before sending the data_result to the source_points() function make sure it has these attributes set
# (e.g., this is for 2024 week 0)
attr(l, "season") <- 2024
attr(l, "week") <- 0
df <- ffanalytics:::source_points(data_result = l, scoring_rules = ffanalytics::scoring)
```
#### Summarize Players' Distribution of Projected Fantasy Points {#sec-simulationProjectedPointsBootstrappingSummarize}
```{r}
#| label: summarize-bootstrap-simulation-scores
bootstrappedFantasyPoints_summary <- bootstrappedFantasyPoints %>%
  group_by(id) %>%
  summarise(
    mean = mean(projectedPoints, na.rm = TRUE),
    SD = sd(projectedPoints, na.rm = TRUE),
    min = min(projectedPoints, na.rm = TRUE),
    max = max(projectedPoints, na.rm = TRUE),
    q10 = quantile(projectedPoints, .10, na.rm = TRUE), # 10th quantile
    q90 = quantile(projectedPoints, .90, na.rm = TRUE), # 90th quantile
    range = max(projectedPoints, na.rm = TRUE) - min(projectedPoints, na.rm = TRUE),
    IQR = IQR(projectedPoints, na.rm = TRUE),
    MAD = mad(projectedPoints, na.rm = TRUE),
    CV = SD/mean,
    median = median(projectedPoints, na.rm = TRUE),
    pseudomedian = DescTools::HodgesLehmann(projectedPoints, na.rm = TRUE),
    mode = petersenlab::Mode(projectedPoints, multipleModes = "mean"),
    skewness = psych::skew(projectedPoints, na.rm = TRUE),
    kurtosis = psych::kurtosi(projectedPoints, na.rm = TRUE)
  )
```
```{r}
#| label: free-up-memory6
#| include: false
gc()
```
#### View Players' Distribution of Projected Fantasy Points {#sec-simulationProjectedPointsBootstrappingView}
```{r}
#| label: merge-player-info
bootstrappedFantasyPoints_summary <- bootstrappedFantasyPoints_summary %>%
  left_join(
    nfl_playerIDs[,c("mfl_id","name","merge_name","team","position")],
    by = c("id" = "mfl_id")
  ) %>%
  select(name, team, position, mean:kurtosis, everything()) %>%
  arrange(-mean)
```
```{r}
#| label: free-up-memory7
#| include: false
gc()
```
```{r}
#| label: display-bootstrap-simulation-summaries-qb
bootstrappedFantasyPoints_summary %>%
  filter(position == "QB") %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-rb
bootstrappedFantasyPoints_summary %>%
  filter(position == "RB") %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-wr
bootstrappedFantasyPoints_summary %>%
  filter(position == "WR") %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-te
bootstrappedFantasyPoints_summary %>%
  filter(position == "TE") %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-k
bootstrappedFantasyPoints_summary %>%
  filter(position %in% c("K","PK")) %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-dl
bootstrappedFantasyPoints_summary %>%
  filter(position %in% c("DL","DT","DE")) %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-lb
bootstrappedFantasyPoints_summary %>%
  filter(position %in% c("LB","MLB","OLB")) %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: display-bootstrap-simulation-summaries-db
bootstrappedFantasyPoints_summary %>%
  filter(position %in% c("DB","S","CB")) %>%
  mutate(
    across(
      where(is.numeric),
      \(x) round(x, digits = 2)))
```
```{r}
#| label: identify-players-with-lo-or-hi-cv
#| eval: false
#| include: false
bootstrappedFantasyPoints_summary %>% filter(mean > 100 & position == "QB") %>% arrange(CV) %>% head
bootstrappedFantasyPoints_summary %>% filter(mean > 100 & position == "QB") %>% arrange(-CV) %>% head
```
An example distribution of projected fantasy points is in @fig-bootstrapping1.
```{r}
#| label: fig-bootstrapping1
#| fig-cap: "Distribution of Projected Fantasy Points for Patrick Mahomes from Bootstrapping."
#| fig-alt: "Distribution of Projected Fantasy Points for Patrick Mahomes from Bootstrapping."
ggplot2::ggplot(
  data = bootstrappedFantasyPoints %>%
    filter(pos == "QB" & name == "Patrick Mahomes"),
  mapping = aes(
    x = projectedPoints)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  #coord_cartesian(
  #  xlim = c(0,400)) +
  labs(
    x = "Fantasy Points",
    y = "Density",
    title = "Distribution of Projected Fantasy Points for Patrick Mahomes"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
```
Projections of two players—one with relatively narrow [uncertainty](#sec-fantasyValueUncertainty) and one with relatively wide [uncertainty](#sec-fantasyValueUncertainty)—are depicted in @fig-bootstrapping2.
```{r}
#| label: fig-bootstrapping2
#| fig-cap: "Distribution of Projected Fantasy Points for Two Players from Bootstrapping. There is relatively narrow uncertainty around projected fantasy points for Dak Prescott, whereas there is relatively wide uncertainty around the projected fantasy points for Drake Maye."
#| fig-alt: "Distribution of Projected Fantasy Points for Two Players from Bootstrapping. There is relatively narrow uncertainty around projected fantasy points for Dak Prescott, whereas there is relatively wide uncertainty around the projected fantasy points for Drake Maye."
ggplot2::ggplot(
  data = bootstrappedFantasyPoints %>%
    filter(pos == "QB" & (name %in% c("Dak Prescott", "Drake Maye"))),
  mapping = aes(
    x = projectedPoints,
    group = name,
    #color = name,
    fill = name)
) +
  geom_histogram(
    aes(y = after_stat(density))
  ) +
  geom_density(
    alpha = 0.6 # add transparency
  ) +
  coord_cartesian(
    xlim = c(0,NA),
    expand = FALSE) +
  #geom_rug() +
  labs(
    x = "Fantasy Points",
    y = "Density",
    fill = "",
    color = "",
    title = "Distribution of Projected Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
```
```{r}
#| label: free-up-memory8
#| include: false
rm(bootstrappedFantasyPoints)
gc()
```
### Monte Carlo Simulation {#sec-simulationProjectedPointsMonteCarlo}
In general, a Monte Carlo simulation study involves several steps:
1. generate the data for the simulation from the relevant probability distribution(s)
1. analyze the simulated data
1. summarize the results across the simulation replications
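These steps can be sketched in base R for a single hypothetical condition (a player whose points follow a normal distribution with a mean of 150 and a standard deviation of 20, with 10 projection sources generated per replication):

```{r}
#| eval: false
set.seed(1)
replications <- 1000

results <- replicate(replications, {
  dat <- rnorm(10, mean = 150, sd = 20) # 1. generate the data
  mean(dat)                             # 2. analyze the simulated data
})

c(mean = mean(results), SD = sd(results)) # 3. summarize across replications
```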
Below we conduct a Monte Carlo simulation to determine the range of each player's most likely points scored based on their projected points.
To do this, we use the `SimDesign` [@R-SimDesign; @Chalmers2020_packages] package.
#### `SimDesign` Package {#sec-simulationProjectedPointsMonteCarloSimDesign}
The `SimDesign` [@R-SimDesign; @Chalmers2020_packages] package uses a five-step process to perform Monte Carlo simulations:
1. Step 1—Design Grid: specify the fixed conditions to investigate in the simulation (e.g., sample size, generating distribution)
1. Step 2—Generate: specify how to generate the data for the simulation from the relevant probability distribution(s)
1. Step 3—Analyze: specify how to analyze the simulation data
1. Step 4—Summarize: specify how to summarize the results across the simulation replications
1. Step 5—Run the Simulation: run the simulation study (i.e., generate the data, analyze the simulated data, and summarize the results across the simulation replications)
You can generate a template for Monte Carlo simulations in the `SimDesign` [@R-SimDesign; @Chalmers2020_packages] package using the following code:
```{r}
SimDesign::SimFunctions()
```
```{r}
#| eval: false
#| include: false
#-------------------------------------------------------------------
Design <- SimDesign::createDesign(
  factor1 = NA,
  factor2 = NA)

#-------------------------------------------------------------------
Generate <- function(condition, fixed_objects) {
  dat <- data.frame()
  dat
}

Analyse <- function(condition, dat, fixed_objects) {
  ret <- nc(stat1 = NaN, stat2 = NaN)
  ret
}

Summarise <- function(condition, results, fixed_objects) {
  ret <- c(bias = NaN, RMSE = NaN)
  ret
}

#-------------------------------------------------------------------
res <- SimDesign::runSimulation(
  design = Design,
  replications = 2,
  generate = Generate,
  analyse = Analyse,
  summarise = Summarise)
res
```
When designing a Monte Carlo simulation, @Sigal2016 (p. 151) recommend the following:
- beginning with a small number of replications and conditions (i.e., supplying a design object with fewer rows) to ensure that the code runs correctly, and increasing replications/conditions only when the majority of the bugs have been worked out
- documenting strange aspects of your simulation or code that you are likely to forget
- saving the state of your simulation in case of power failure or other mechanical failures (e.g., using the `save = TRUE` and/or `save_results = TRUE`, and/or `debug` arguments in `SimDesign::runSimulation()`), and, as necessary, performing debugging within a particular function (`debug` argument in `SimDesign::runSimulation()`)
#### Prepare Data {#sec-simulationProjectedPointsMonteCarloPrepareData}
```{r}
#| include: false
all_proj <- dplyr::bind_rows(players_projectedPoints_seasonal)
```
```{r}
all_proj <- all_proj %>%
rename(projectedPoints = raw_points)
```
#### Optimal Distribution for Each Player {#sec-simulationProjectedPointsMonteCarloOptimalDistribution}
For each player, we identify the optimal distribution of their projected points across sources as either a normal or a skew-normal distribution.
The normal distribution was fit using the `fitdistrplus::fitdist()` function of the `fitdistrplus` package [@R-fitdistrplus; @DelignetteMuller2015_packages].
The skew-normal distribution was fit using the `sn::selm()` function of the `sn` package [@R-sn; @Azzalini2023_packages].
```{r}
# Function to identify the "best" distribution (normal vs. skew-normal) for every player
# (tries both families and picks by AIC; uses the empirical distribution if fewer than 2 unique scores)
fit_best <- function(x) {
  # Basic checks
  if (length(unique(x)) < 2 || all(is.na(x))) { # use the empirical distribution if there are fewer than 2 unique scores
    return(list(type = "empirical", empirical = x))
  }

  # Try normal distribution
  fit_norm <- tryCatch(
    fitdistrplus::fitdist(x, distr = "norm"),
    error = function(e) NULL
  )

  # Try skew-normal distribution
  fit_skew <- tryCatch(
    sn::selm(x ~ 1),
    error = function(e) NULL
  )

  # Handle bad fits: sd = NA, etc.
  if (!is.null(fit_norm) && any(is.na(fit_norm$estimate))) {
    fit_norm <- NULL
  }
  if (!is.null(fit_skew)) {
    pars <- tryCatch(coef(fit_skew, param.type = "dp"), error = function(e) NULL)
    if (is.null(pars) || any(is.na(pars))) {
      fit_skew <- NULL
    }
  }

  # Choose the best available fit
  if (!is.null(fit_norm) && !is.null(fit_skew)) {
    aic_norm <- AIC(fit_norm)
    aic_skew <- AIC(fit_skew)
    if (aic_skew + 2 < aic_norm) { # the skew-normal distribution is more complex (has more parameters) than the normal distribution, so only select it if it fits substantially better
      pars <- coef(fit_skew, param.type = "dp")
      return(list(
        type = "skewnorm",
        xi = pars["dp.location"],
        omega = pars["dp.scale"],
        alpha = pars["dp.shape"]))
    } else {
      return(list(
        type = "norm",
        mean = fit_norm$estimate["mean"],
        sd = fit_norm$estimate["sd"]))
    }
  } else if (!is.null(fit_norm)) {
    return(list(
      type = "norm",
      mean = fit_norm$estimate["mean"],
      sd = fit_norm$estimate["sd"]))
  } else {
    return(list(type = "empirical", empirical = x))
  }
}
```
```{r}
proj_dists_tbl <- all_proj %>%
  filter(!is.na(id) & id != "") %>%
  group_by(id) %>%
  summarise(
    dist_info = list(fit_best(projectedPoints)),
    n_proj = n(), # record of how many sources they have
    .groups = "drop"
  )

proj_dists <- proj_dists_tbl %>%
  filter(!is.na(id) & id != "") %>%
  distinct(id, .keep_all = TRUE) %>%
  (\(x) setNames(x$dist_info, x$id))()
```
```{r}
proj_dists_tbl %>%
  dplyr::mutate(
    dist_type = purrr::map_chr(dist_info, ~ .x$type)
  ) %>%
  dplyr::count(dist_type)
```
#### `SimDesign` Step 1: Design Grid {#sec-simulationProjectedPointsMonteCarloDesign}
First, we define what combinations of conditions to study (with one row per each unique combination), which is called our design grid.
For the design grid, the columns are the design factors (e.g., sample size, generating distribution), and each row indicates a unique combination of the factors (e.g., *N* = 200, normal distribution; *N* = 500, normal distribution; *N* = 200, Poisson distribution; *N* = 500, Poisson distribution).
When we execute the simulation, each row of the design grid object will be run sequentially.
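For instance, a design grid for the hypothetical factors above (sample size crossed with generating distribution; `Design_example` is an illustrative object, not used below) could be created as follows:

```{r}
#| eval: false
Design_example <- SimDesign::createDesign(
  N = c(200, 500),
  distribution = c("normal", "poisson"))

Design_example # 4 rows: one per combination of N and distribution
```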
In our simulation study, each player retains the same number of projections as they had in the original data, ensuring the simulation reflects real-world variation in data availability.
This number of projections defines our design grid.
The design factors (columns) are the player (`id`) and the number of projections available for that player (`n_sources`).
The rows, then, are the combinations of player and number of projections available for that player.
```{r}
Design <- proj_dists_tbl %>%
  dplyr::mutate(
    id,
    n_sources = n_proj, # number of projection sources available for player
    .keep = "none"
  )

Design
```
```{r}
#| include: false
#| eval: false
missing_ids <- setdiff(Design$id, names(proj_dists))
length(missing_ids) # should be 0
any(is.na(proj_dists_tbl$id)) # should be FALSE
any(is.na(Design$id)) # should be FALSE
any(is.na(names(proj_dists))) # should be FALSE
```
#### `SimDesign` Step 2: Generate {#sec-simulationProjectedPointsMonteCarloGenerate}
Second, we generate the data for the simulation from the relevant probability distribution function(s).
In this case, we use a normal distribution or skew-normal distribution as our probability distribution function for a given player, depending on which distribution fits better to that player's distribution of projected points across sources (stored in `proj_dists[["INSERT_ID"]]$type`).
The `Generate()` function has a `condition` argument, which takes a single row from the design grid object (`Design`), and which controls the data generation for subsequent analysis.
We use the (optional) `fixed_objects` argument so we can pass an object that should be held constant across all simulation replications—i.e., a "fixed" object.
In this case, we pass—via the `proj_dists` object—the best-fitting distribution for each player (normal or skew-normal) to be used for all simulation replications.
```{r}
Generate <- function(condition, fixed_objects = NULL) {
dist_info <- fixed_objects$proj_dists[[as.character(condition$id)]]
n_sources <- condition$n_sources
sim_points <- switch(
dist_info$type,
empirical = sample(
dist_info$empirical,
n_sources,
replace = TRUE),
norm = rnorm(
n_sources,
mean = dist_info$mean,
sd = dist_info$sd),
skewnorm = sn::rsn(
n_sources,
xi = dist_info$xi,
omega = dist_info$omega,
alpha = dist_info$alpha),
stop("Unknown distribution type: ", dist_info$type)
)
data.frame(
id = condition$id,
sim_points = sim_points)
}
```
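As a quick sanity check (outside of `SimDesign::runSimulation()`), we can call `Generate()` directly on a single row of the design grid; this illustrative chunk is not part of the simulation itself:

```{r}
#| eval: false
# Illustrative check: simulate projected points for the first design row
example_dat <- Generate(
  condition = Design[1, ],
  fixed_objects = list(proj_dists = proj_dists))

head(example_dat)
```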
#### `SimDesign` Step 3: Analyze {#sec-simulationProjectedPointsMonteCarloAnalyze}
Third, we analyze the simulated data.
For each iteration of the simulation, the `Analyse()` function will return a one-row tibble with the requested variables: the player (`id`), their simulated mean (`mean_pts`) and standard deviation (`sd_pts`) of projected fantasy points across sources, the 10th (`q10`) and 90th (`q90`) percentiles of their distribution of projected points, and the simulated probability of the player scoring at least 100 (`p100`), 150 (`p150`), 200 (`p200`), 250 (`p250`), 300 (`p300`), and 350 (`p350`) fantasy points.
We then combine all the rows into a tibble dataframe, with each player as a different row.
A separate dataset is created for each of the simulation replications.
```{r}
Analyse <- function(condition, dat, fixed_objects = NULL) {
tibble::tibble(
id = condition$id,
mean_pts = mean(dat$sim_points, na.rm = TRUE),
sd_pts = sd(dat$sim_points, na.rm = TRUE),
q10 = quantile(dat$sim_points, 0.10, na.rm = TRUE),
q90 = quantile(dat$sim_points, 0.90, na.rm = TRUE),
p100 = mean(dat$sim_points >= 100, na.rm = TRUE),
p150 = mean(dat$sim_points >= 150, na.rm = TRUE),
p200 = mean(dat$sim_points >= 200, na.rm = TRUE),
p250 = mean(dat$sim_points >= 250, na.rm = TRUE),
p300 = mean(dat$sim_points >= 300, na.rm = TRUE),
p350 = mean(dat$sim_points >= 350, na.rm = TRUE)
)
}
```
#### `SimDesign` Step 4: Summarize {#sec-simulationProjectedPointsMonteCarloSummarize}
Fourth, we summarize the results across the simulation replications.
In this case, we calculate—for each variable—each player's mean and standard deviation across the datasets from the different simulation replications.
Some potential metastatistics that can be computed include:
- mean: `base::mean()`
- standard deviation: `base::sd()`
- bias: `SimDesign::bias()`
- standardized bias: `SimDesign::bias(, type = "standardized")`
- relative bias: `SimDesign::bias(, type = "relative")`
- relative absolute bias: `SimDesign::RAB()` or `SimDesign::bias(, type = "abs_relative")`
- mean absolute error (MAE): `SimDesign::MAE()`
- normalized mean absolute error: `SimDesign::MAE(, type = "NMAE")`
- standardized mean absolute error: `SimDesign::MAE(, type = "SMAE")`
- root mean squared error (RMSE): `SimDesign::RMSE()`
- normalized root mean squared error: `SimDesign::RMSE(, type = "NRMSE")`
- standardized root mean squared error: `SimDesign::RMSE(, type = "SRMSE")`
- root mean squared log error (RMSLE): `SimDesign::RMSE(, type = "RMSLE")`
- integrated root mean squared error (IRMSE): `SimDesign::IRMSE()`
- mean-square relative standard error (MSRSE): `SimDesign::MSRSE()`
- coefficient of variation: `SimDesign::RMSE(, type = "CV")`
- relative efficiency: `SimDesign::RE()`
- empirical detection/rejection rate: `SimDesign::EDR()`
- empirical coverage rate: `SimDesign::ECR()`
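For instance, `SimDesign::bias()` compares a vector of estimates to a known parameter value; the estimates below are made up for illustration:

```{r}
#| eval: false
# Made-up estimates of a true parameter value of 10
est <- c(9.8, 10.3, 10.1)

SimDesign::bias(est, parameter = 10) # raw bias
SimDesign::bias(est, parameter = 10, type = "relative") # relative bias
```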
```{r}
Summarise <- function(condition, results, fixed_objects = NULL) {
dplyr::summarise(
results,
across(
where(is.numeric),
list(
mean = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE)),
.names = "{.col}_{.fn}"
),
.groups = "drop"
)
}
```
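The `across()` pattern used in `Summarise()` can be demonstrated on a small toy tibble (the data here are made up): each numeric column yields a `_mean` and `_sd` column in the output.

```{r}
#| eval: false
# Toy data to illustrate the across() summary pattern
toy <- tibble::tibble(a = c(1, 2, 3), b = c(4, 6, 8))

dplyr::summarise(
  toy,
  across(
    where(is.numeric),
    list(
      mean = ~mean(.x, na.rm = TRUE),
      sd = ~sd(.x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"))
```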
#### `SimDesign` Step 5: Run the Simulation {#sec-simulationProjectedPointsMonteCarloRun}
Fifth, we run the simulation study.
We run the simulation study using the `SimDesign::runSimulation()` function.
::: {#nte-monteCarloSimulation .callout-note title="Monte Carlo Simulation"}
Note: the following code that runs the simulation takes a while.
To save time, you can instead load the results object from the simulation (which has already been run) using this code:
```{r}
#| eval: false
load(url("https://osf.io/download/ues7n/"))
```
:::
```{r}
#| eval: false
monteCarloSim_results <- SimDesign::runSimulation(
design = Design,
replications = 1000,
generate = Generate,
analyse = Analyse,
summarise = Summarise,
fixed_objects = list(proj_dists = proj_dists), # passing a list of objects that should be held constant across all simulation replications — i.e., "fixed" objects
seed = SimDesign::genSeeds(Design, iseed = 52242), # for reproducibility
parallel = TRUE # for faster (parallel) processing
)
```
#### Simulation Results {#sec-simulationProjectedPointsMonteCarloResults}
```{r}
#| eval: false
monteCarloSim_results <- monteCarloSim_results %>%
left_join(
nfl_playerIDs[,c("mfl_id","name","merge_name","position","team")],
by = c("id" = "mfl_id")
) %>%
select(name, team, position, everything()) %>%
arrange(-mean_pts_mean)
```
```{r}
#| eval: false
#| include: false
save(
monteCarloSim_results,
file = "./data/monteCarloSim_results.RData")
```
```{r}
#| eval: false
#| include: false
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/monteCarloSim_results.RData", fsep = ""))
```
```{r}
#| include: false
load(file = "./data/monteCarloSim_results.RData")
```
```{r}
#| eval: false
#| include: false
petersenlab::robust_load("https://osf.io/download/ues7n/") #load(url("https://osf.io/download/ues7n/"))
```
The `pX` variables represent the probability of a player scoring at least X points.
For example, the `p300` variable represents the probability that a player scores at least 300 points.
However, it is important to note that these probabilities are based on the distribution of *projected* points.
```{r}
monteCarloSim_results
```
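Because `Summarise()` appended `_mean` and `_sd` suffixes to each analysis variable, the summarized probabilities appear as columns such as `p300_mean`. For example, we can list the players whose simulated probability of scoring at least 300 points is 50% or higher:

```{r}
#| eval: false
monteCarloSim_results %>%
  filter(p300_mean >= .50) %>%
  select(name, team, position, mean_pts_mean, p300_mean)
```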
```{r}
monteCarloSim_results %>%
filter(position == "QB")
```
```{r}
monteCarloSim_results %>%
filter(position == "RB")
```