@@ … @@
-= [2.091,\ 2.622],\ p = 0.139$). The same was true of knowledge estimates for
+= [2.091,\ 2.622],\ p = 0.138$). The same was true of knowledge estimates for
held-out }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions based on other }\textit{\DIFadd{Birth
of Stars}}\DIFadd{-related questions ($OR = 0.722,\ \lambda_{LR} = 5.115,\ 95\%\
\textnormal{CI} = [0.094,\ 0.146],\ p = 0.738$). As in our analysis that
@@ -880,12 +880,12 @@ \section*{Results}
Stars}\DIFdelbegin\DIFdel{questions
}\DIFdelend\DIFaddbegin\DIFadd{), we found that they now reliably predicted success on }\textit{\DIFadd{Four
Fundamental Forces}}\DIFadd{-related questions ($OR = 9.023,\ \lambda_{LR} = 18.707,\
-95\%\ \textnormal{CI} = [10.877,\ 22.222],\ p = 0.001$) but not on
+95\%\ \textnormal{CI} = [10.877,\ 22.222],\ p < 0.001$) but not on
}\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions }\DIFaddend (\DIFdelbegin\DIFdel{$\U = 7419,~p = 0.739$). Again, we suggest that this might reflect
a floor
effect whereby, at that point in the participants' training, their knowledge
about the content of the }\DIFdelend\DIFaddbegin\DIFadd{$OR = 0.306,\ \lambda_{LR} = 5.115,\
-95\%\ \textnormal{CI} = [4.624,\ 5.655],\ p = 0.055$). Here, we speculate that
+95\%\ \textnormal{CI} = [4.624,\ 5.655],\ p = 0.054$). Here, we speculate that
participants might have been guessing about the }\DIFaddend\textit{Birth of Stars} \DIFdelbegin\DIFdel{material is relatively low
everywhere in that region of text embedding space.
@@ … @@
-6.843],\ p = 0.043$). Given the results from our analyses that included all
+6.843],\ p = 0.042$). Given the results from our analyses that included all
questions and within-lecture predictions, we were surprised to find }\DIFaddend that the
knowledge \DIFdelbegin\DIFdel{predictions are generalizable across the content areas spanned by the two lectures, while also specific enough to }\DIFdelend\DIFaddbegin\DIFadd{estimates could reliably (if weakly) predict participants'
performance across content from different lectures. It is possible that this
@@ -967,10 +967,10 @@ \section*{Results}
responses to }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions did
}\textit{\DIFadd{not}} \DIFadd{reliably predict their success on }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related
@@ … @@
-p = 0.017$) using responses to questions about the other lecture's content.
+p = 0.016$) using responses to questions about the other lecture's content.
Across all three versions of these analyses, our results suggest that (by and
large) our knowledge estimates can reliably predict participants' abilities to
answer individual quiz questions, }\DIFaddend distinguish between questions about \DIFdelbegin\DIFdel{more subtly different contentwithin the
@@ -994,14 +994,7 @@ \section*{Results}
responses reflect a minimum level of ``real'' knowledge about both the content on
which these predictions are based and that for which they are made}\DIFaddend .

-%DIF > our approach works when participants have a minimal baseline level of knowledge about content predicted and used to predict
-%DIF > our approach generalizes when knowledge of content used to predict can be assumed to be a reasonable indicator of knowledge of content predicted
-%DIF > our approach has enough specificity to distinguish between content within the same lecture when it was just watched -- maybe when people forget a little bit they forget "randomly"?.
-\DIFaddbegin
-
-%DIF > potential new transition/motivation -- in the previous analyses, we identified a particular set of constraints on our estimates of participants' knowledge. This made us wonder about another potential constraint: how far away in topic space does the relevance of being able to answer a question extend and influence ability to answer a different question?
-
-\DIFaddend That the knowledge predictions derived from the text embedding space reliably
+That the knowledge predictions derived from the text embedding space reliably
distinguish between held-out correctly versus incorrectly answered questions
(Fig.~\ref{fig:predictions}) suggests that spatial relationships within this
space can help explain what participants know. But how far does this
@@ … @@
%DIF > %We chose this stopping criterion as a conceptual ``middle ground'' between two popular but opposing approaches to model selection that advocate (respectively) for either retaining the maximal model that allows convergence, regardless of singular fits~\citep[at the potential cost of decreased power; e.g.,~][]{BarrEtal13}, or testing individual parameters and achieving a parsimonious model by discarding all parameters that don't significantly decrease goodness of fit~\citep[at the potential cost of increased Type I error rates; e.g.,~][]{BateEtal15b}.
%DIF > Our threshold for inclusion of random effects is intended to achieve a reasonable balance between these trade-offs.

-%DIF > To assess the predictive value of our knowledge estimates for individual quiz questions, we used the same sets of observations used to fit each GLMM to fit a second set of ``null'' models (similarly with a logistic link function). We fit these models with the formula:
-
\DIFadd{To assess the predictive value of our knowledge estimates, we compared each
GLMM's ability to discriminate between correctly and incorrectly answered
questions to that of an analogous model that did }\textit{\DIFadd{not}} \DIFadd{consider estimated
@@ … @@
We then compared each full model to its reduced (null) equivalent using a likelihood-ratio test (LRT).
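For reference, $\lambda_{LR}$ here denotes the standard likelihood-ratio test statistic (a standard definition, stated for concreteness rather than drawn from the excerpt),
\[
\lambda_{LR} = 2\left(\ell_{\mathrm{full}} - \ell_{\mathrm{null}}\right),
\]
where $\ell_{\mathrm{full}}$ and $\ell_{\mathrm{null}}$ are the maximized log-likelihoods of the full and null models, respectively.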
Because the typical asymptotic $\chi^2_d$ approximation of the null distribution for the LRT statistic ($\lambda_{LR}$) is anti-conservative for models that differ in their random slope terms~\mbox{%DIFAUXCMD
-, we computed $p$-values for these tests using a parametric bootstrapping procedure~\mbox{%DIFAUXCMD
-\citep{HaleHojs14}}\hskip0pt%DIFAUXCMD
+, we computed $p$-values for these tests using a parametric bootstrap procedure~\mbox{%DIFAUXCMD
+\citep{DaviHink97,HaleHojs14}}\hskip0pt%DIFAUXCMD
.
For each of 1,000 bootstraps, we used the fitted null model to simulate a sample of observations of equal size to our original sample.
We then re-fit both the null and full models to this simulated sample and compared them via an LRT.
This yielded a distribution of $\lambda_{LR}$ statistics that we would expect to observe under our null hypothesis.
Following~\mbox{%DIFAUXCMD
-\citep{DaviHink97,NortEtal02}}\hskip0pt%DIFAUXCMD
-, we computed a corrected $p$-value for our observed $\lambda_{LR}$ as $\frac{r + 1}{n + 1}$, where $r$ is the number of simulated model comparisons that yielded a $\lambda_{LR}$ greater than or equal to our observed value and $n$ is the number of simulations we ran (1,000).
+\citet{Ewen03}}\hskip0pt%DIFAUXCMD
+, we computed a corrected $p$-value for our observed $\lambda_{LR}$ as $\frac{r}{n}$, where $r$ is the number of simulated model comparisons that yielded a $\lambda_{LR}$ greater than or equal to our observed value and $n$ is the number of simulations we ran (1,000).
}
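Under the corrected formula above, for example, $r = 42$ simulated exceedances across $n = 1{,}000$ bootstraps would yield $p = 42/1000 = 0.042$. The following is a minimal sketch of the full procedure; `fit_full`, `fit_null`, and `simulate_from` are hypothetical stand-ins for a GLMM-fitting interface (the text does not specify one), and only the resampling logic follows the description above:

```python
# Minimal sketch of the parametric bootstrap LRT described in the text.
# `fit_full`, `fit_null`, and `simulate_from` are hypothetical stand-ins
# for a GLMM-fitting interface; only the resampling logic is from the text.
import numpy as np


def bootstrap_lrt_pvalue(data, fit_full, fit_null, simulate_from,
                         n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)

    # Fit the full and reduced (null) models to the observed sample and
    # compute the observed likelihood-ratio statistic.
    full, null = fit_full(data), fit_null(data)
    lr_observed = 2.0 * (full.loglik - null.loglik)

    lr_simulated = np.empty(n_boot)
    for i in range(n_boot):
        # Use the fitted null model to simulate a sample of observations
        # of equal size to the original sample.
        sim = simulate_from(null, size=len(data), rng=rng)
        # Re-fit both models to the simulated sample and compare via an LRT.
        lr_simulated[i] = 2.0 * (fit_full(sim).loglik - fit_null(sim).loglik)

    # Corrected p-value r / n: the proportion of simulated comparisons whose
    # statistic is greater than or equal to the observed value.
    r = int(np.sum(lr_simulated >= lr_observed))
    return r / n_boot
```

In practice, this simulate-and-refit loop is the kind of computation that packages such as pbkrtest~\citep{HaleHojs14} automate for mixed models fit in R; the sketch is only meant to make the $r/n$ correction and the resampling scheme concrete.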
\DIFaddend\subsubsection*{Estimating the ``smoothness'' of knowledge}\label{subsec:smoothness}