@@ … @@
-= [2.091,\ 2.622],\ p = 0.139$). The same was true of knowledge estimates for
+= [2.091,\ 2.622],\ p = 0.138$). The same was true of knowledge estimates for
held-out }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions based on other }\textit{\DIFadd{Birth
of Stars}}\DIFadd{-related questions ($OR = 0.722,\ \lambda_{LR} = 5.115,\ 95\%\
\textnormal{CI} = [0.094,\ 0.146],\ p = 0.738$). As in our analysis that
@@ -880,12 +880,12 @@ \section*{Results}
Stars}\DIFdelbegin\DIFdel{questions
}\DIFdelend\DIFaddbegin\DIFadd{), we found that they now reliably predicted success on }\textit{\DIFadd{Four
Fundamental Forces}}\DIFadd{-related questions ($OR = 9.023,\ \lambda_{LR} = 18.707,\
-95\%\ \textnormal{CI} = [10.877,\ 22.222],\ p = 0.001$) but not on
+95\%\ \textnormal{CI} = [10.877,\ 22.222],\ p < 0.001$) but not on
}\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions }\DIFaddend (\DIFdelbegin\DIFdel{$\U = 7419,~p = 0.739$). Again, we suggest that this might reflect
a floor
effect whereby, at that point in the participants' training, their knowledge
about the content of the }\DIFdelend\DIFaddbegin\DIFadd{$OR = 0.306,\ \lambda_{LR} = 5.115,\
-95\%\ \textnormal{CI} = [4.624,\ 5.655],\ p = 0.055$). Here, we speculate that
+95\%\ \textnormal{CI} = [4.624,\ 5.655],\ p = 0.054$). Here, we speculate that
participants might have been guessing about the }\DIFaddend\textit{Birth of Stars} \DIFdelbegin\DIFdel{material is relatively low
everywhere in that region of text embedding space.
@@ … @@
-6.843],\ p = 0.043$). Given the results from our analyses that included all
+6.843],\ p = 0.042$). Given the results from our analyses that included all
questions and within-lecture predictions, we were surprised to find }\DIFaddend that the
knowledge \DIFdelbegin\DIFdel{predictions are generalizable across the content areas spanned by the two lectures, while also specific enough to }\DIFdelend\DIFaddbegin\DIFadd{estimates could reliably (if weakly) predict participants'
performance across content from different lectures. It is possible that this
@@ -967,10 +967,10 @@ \section*{Results}
responses to }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions did
}\textit{\DIFadd{not}} \DIFadd{reliably predict their success on }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related
@@ … @@
-p = 0.017$) using responses to questions about the other lecture's content.
+p = 0.016$) using responses to questions about the other lecture's content.
Across all three versions of these analyses, our results suggest that (by and
large) our knowledge estimates can reliably predict participants' abilities to
answer individual quiz questions, }\DIFaddend distinguish between questions about \DIFdelbegin\DIFdel{more subtly different contentwithin the
@@ -994,14 +994,7 @@ \section*{Results}
responses reflect a minimum level of ``real'' knowledge about both the content on
which these predictions are based and that for which they are made}\DIFaddend .

-%DIF > our approach works when participants have a minimal baseline level of knowledge about content predicted and used to predict
-%DIF > our approach generalizes when knowledge of content used to predict can be assumed to be a reasonable indicator of knowledge of content predicted
-%DIF > our approach has enough specificity to distinguish between content within the same lecture when it was just watched -- maybe when people forget a little bit they forget "randomly"?.
-\DIFaddbegin
-
-%DIF > potential new transition/motivation -- in the previous analyses, we identified a particular set of constraints on our estimates of participants' knowledge. This made us wonder about another potential constraint: how far away in topic space does the relevance of being able to answer a question extend and influence ability to answer a different question?
-
-\DIFaddend That the knowledge predictions derived from the text embedding space reliably
+That the knowledge predictions derived from the text embedding space reliably
distinguish between held-out correctly versus incorrectly answered questions
(Fig.~\ref{fig:predictions}) suggests that spatial relationships within this
space can help explain what participants know. But how far does this
@@ … @@
%DIF > %We chose this stopping criterion as a conceptual ``middle ground'' between two popular but opposing approaches to model selection that advocate (respectively) for either retaining the maximal model that allows convergence, regardless of singular fits~\citep[at the potential cost of decreased power; e.g.,~][]{BarrEtal13}, or testing individual parameters and achieving a parsimonious model by discarding all parameters that don't significantly decrease goodness of fit~\citep[at the potential cost of increased Type I error rates; e.g.,~][]{BateEtal15b}.
%DIF > Our threshold for inclusion of random effects is intended to achieve a reasonable balance between these trade-offs.

-%DIF > To assess the predictive value of our knowledge estimates for individual quiz questions, we used the same sets of observations used to fit each GLMM to fit a second set of ``null'' models (similarly with a logistic link function). We fit these models with the formula:
-
\DIFadd{To assess the predictive value of our knowledge estimates, we compared each
GLMM's ability to discriminate between correctly and incorrectly answered
questions to that of an analogous model that did }\textit{\DIFadd{not}} \DIFadd{consider estimated
@@ … @@
We then compared each full model to its reduced (null) equivalent using a likelihood-ratio test (LRT).
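For reference, $\lambda_{LR}$ here denotes the standard likelihood-ratio test statistic (a standard definition, stated for concreteness rather than drawn from the excerpt),
\[
\lambda_{LR} = 2\left(\ell_{\mathrm{full}} - \ell_{\mathrm{null}}\right),
\]
where $\ell_{\mathrm{full}}$ and $\ell_{\mathrm{null}}$ are the maximized log-likelihoods of the full and null models, respectively.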
Because the typical asymptotic $\chi^2_d$ approximation of the null distribution for the LRT statistic ($\lambda_{LR}$) is anti-conservative for models that differ in their random slope terms~\mbox{%DIFAUXCMD
-, we computed $p$-values for these tests using a parametric bootstrapping procedure~\mbox{%DIFAUXCMD
-\citep{HaleHojs14}}\hskip0pt%DIFAUXCMD
+, we computed $p$-values for these tests using a parametric bootstrap procedure~\mbox{%DIFAUXCMD
+\citep{DaviHink97,HaleHojs14}}\hskip0pt%DIFAUXCMD
.
For each of 1,000 bootstraps, we used the fitted null model to simulate a sample of observations of equal size to our original sample.
We then re-fit both the null and full models to this simulated sample and compared them via an LRT.
This yielded a distribution of $\lambda_{LR}$ statistics that we would expect to observe under our null hypothesis.
Following~\mbox{%DIFAUXCMD
-\citep{DaviHink97,NortEtal02}}\hskip0pt%DIFAUXCMD
-, we computed a corrected $p$-value for our observed $\lambda_{LR}$ as $\frac{r + 1}{n + 1}$, where $r$ is the number of simulated model comparisons that yielded a $\lambda_{LR}$ greater than or equal to our observed value and $n$ is the number of simulations we ran (1,000).
+\citet{Ewen03}}\hskip0pt%DIFAUXCMD
+, we computed a corrected $p$-value for our observed $\lambda_{LR}$ as $\frac{r}{n}$, where $r$ is the number of simulated model comparisons that yielded a $\lambda_{LR}$ greater than or equal to our observed value and $n$ is the number of simulations we ran (1,000).
}
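Under the corrected formula above, for example, $r = 42$ simulated exceedances across $n = 1{,}000$ bootstraps would yield $p = 42/1000 = 0.042$. The following is a minimal sketch of the full procedure; `fit_full`, `fit_null`, and `simulate_from` are hypothetical stand-ins for a GLMM-fitting interface (the text does not specify one), and only the resampling logic follows the description above:

```python
# Minimal sketch of the parametric bootstrap LRT described in the text.
# `fit_full`, `fit_null`, and `simulate_from` are hypothetical stand-ins
# for a GLMM-fitting interface; only the resampling logic is from the text.
import numpy as np


def bootstrap_lrt_pvalue(data, fit_full, fit_null, simulate_from,
                         n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)

    # Fit the full and reduced (null) models to the observed sample and
    # compute the observed likelihood-ratio statistic.
    full, null = fit_full(data), fit_null(data)
    lr_observed = 2.0 * (full.loglik - null.loglik)

    lr_simulated = np.empty(n_boot)
    for i in range(n_boot):
        # Use the fitted null model to simulate a sample of observations
        # of equal size to the original sample.
        sim = simulate_from(null, size=len(data), rng=rng)
        # Re-fit both models to the simulated sample and compare via an LRT.
        lr_simulated[i] = 2.0 * (fit_full(sim).loglik - fit_null(sim).loglik)

    # Corrected p-value r / n: the proportion of simulated comparisons whose
    # statistic is greater than or equal to the observed value.
    r = int(np.sum(lr_simulated >= lr_observed))
    return r / n_boot
```

In practice, this simulate-and-refit loop is the kind of computation that packages such as pbkrtest~\citep{HaleHojs14} automate for mixed models fit in R; the sketch is only meant to make the $r/n$ correction and the resampling scheme concrete.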
\DIFaddend\subsubsection*{Estimating the ``smoothness'' of knowledge}\label{subsec:smoothness}