
Commit ab95afe

cleanup to new fig 6 results description
1 parent e52726d commit ab95afe

File tree

2 files changed: +13 -21 lines changed


paper/main.pdf

-99 Bytes
Binary file not shown.

paper/main.tex

Lines changed: 13 additions & 21 deletions
@@ -565,18 +565,10 @@ \section*{Results}
 We carried out three different versions of the analyses described above, wherein we considered different sources of information in our estimates of participants' knowledge for each quiz question.
 First, we estimated knowledge at each question's embedding coordinate using \textit{all other} questions answered by the same participant on the same quiz (``All questions''; Fig.~\ref{fig:predictions}, top row).
 This test was intended to assess the overall predictive power of our approach.
-Second, we estimated knowledge for each question about one lecture using only questions (from the same participant and quiz) about the \textit{other} lecture (``Across-lecture''; Fig.~\ref{fig:predictions}, middle rows).
-This test was intended to assess the \textit{generalizability} of our approach by asking whether our predictions held across the content areas of the two lectures.
-Third, we estimated knowledge for each question about a given lecture using only the other questions (from the same participant and quiz) about that \textit{same} lecture (``Within-lecture''; Fig.~\ref{fig:predictions}, bottom rows).
+Second, we estimated knowledge for each question about a given lecture using only the other questions (from the same participant and quiz) about that \textit{same} lecture (``Within-lecture''; Fig.~\ref{fig:predictions}, middle rows).
 This test was intended to assess the \textit{specificity} of our approach by asking whether our predictions could distinguish between questions about different content covered by the same lecture.
-
-%When we estimated participants' knowledge for each Quiz~1 question based on all other Quiz~1 questions, we found an inverse relationship.
-%Specifically, higher estimated knowledge at the embedding coordinate at a held-out question was associated with a lower likelihood of answering the question correctly ($\textrm{odds ratio}\ (OR) = 0.136,\ \textrm{likelihood-ratio test statistic}\ (\lambda_{LR}) = 19.749,\ \textrm{95\% CI} = [14.352,\ 26.545],\ p = 0.001$).
-%However, this inverse relationship in fact represents the expected result under our null hypothesis (that estimated knowledge is \textit{not} predictive of success on a question).
-%An intuition for this can be taken from the expected outcome of same analysis based on the simple proportion correct, rather than estimated knowledge.
-%Suppose a participant answered $n$ out of 13 quiz questions correctly.
-%If we held out a single correctly answered question and computed the proportion of remaining questions answered correctly, that proportion would be $(n - 1) / 12$.
-%Whereas if we held out a single incorrectly answered question, the proportion of remaining questions answered correctly would be $n / 12$.
+Third, we estimated knowledge for each question about one lecture using only questions (from the same participant and quiz) about the \textit{other} lecture (``Across-lecture''; Fig.~\ref{fig:predictions}, bottom rows).
+This test was intended to assess the \textit{generalizability} of our approach by asking whether our predictions held across the content areas of the two lectures.

 In performing this set of analyses, our null hypothesis is that the knowledge estimates we compute based on the quiz questions' embedding coordinates do \textit{not} provide useful information about participants' abilities to answer those questions.
 What result might we expect to see if this is the case?
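To make the three estimation schemes concrete, the sketch below (Python; not the authors' code) shows one way the held-in question set could be selected for each version of the analysis. The function name, the `mode` strings, and the lecture labels are illustrative assumptions:

import numpy as np

def held_in_indices(lectures, held_out, mode):
    # Select which same-quiz questions inform the estimate for `held_out`.
    # `lectures` holds a per-question lecture label (e.g., "FFF" or "BoS");
    # `mode` mirrors the three analysis versions described above.
    lectures = np.asarray(lectures)
    idx = np.arange(len(lectures))
    idx = idx[idx != held_out]
    if mode == "all":      # "All questions": every other question on the quiz
        return idx
    if mode == "within":   # "Within-lecture": other questions about the same lecture
        return idx[lectures[idx] == lectures[held_out]]
    if mode == "across":   # "Across-lecture": questions about the other lecture
        return idx[lectures[idx] != lectures[held_out]]
    raise ValueError(f"unknown mode: {mode}")

# Example: 13 questions, the first 7 about one lecture and the rest about the other.
labels = ["FFF"] * 7 + ["BoS"] * 6
print(held_in_indices(labels, held_out=2, mode="across"))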
@@ -586,14 +578,14 @@ \section*{Results}
 Whereas if we held out a single \textit{incorrectly} answered question and did the same, that proportion would be $n / 12$.
 Thus for a given participant and quiz, a ``knowledge estimate'' computed as the simple (i.e., unweighted) remaining proportion-correct is perfectly inversely related to success on a held-out question: it will always be \textit{lower} for correctly answered questions than for incorrectly answered questions.
 Given that our knowledge estimates are computed as a weighted version of this same proportion-correct score (where each held-in question's weight reflects its embedding-space distance from the held-out question; see Eqn.~\ref{eqn:prop}), if these weights are uninformative (e.g., simply randomly distributed), then we should expect to see this same inverse relationship emerge, on average.
-It is only if the spatial relationships among the quiz questions' embedding coordinates map onto participants' knowledge in a meaningful way that we would we expect this relationship to be non-negative [\textbf{PHRASING}].
+It is only if the spatial relationships among the quiz questions' embedding coordinates map onto participants' knowledge in a meaningful way that we would expect this relationship to be non-negative.

 When we fit a GLMM to estimates of participants' knowledge for each Quiz~1 question based on all other Quiz~1 questions, we observed this null-hypothesized inverse relationship.
 Specifically, higher estimated knowledge at the embedding coordinate of a held-out Quiz~1 question was associated with a lower likelihood of answering the question correctly (odds ratio $(OR) = 0.136$, likelihood-ratio test statistic $(\lambda_{LR}) = 19.749$, 95\%\ $\textnormal{CI} = [14.352,\ 26.545],\ p = 0.001$).
 However, when we repeated this analysis for quizzes 2 and 3, the direction of this relationship reversed: higher estimated knowledge for a given question predicted a greater likelihood of answering it correctly (Quiz~2: $OR = 2.905,\ \lambda_{LR} = 17.333,\ 95\%\ \textnormal{CI} = [14.966,\ 29.309],\ p = 0.002$; Quiz~3: $OR = 3.238,\ \lambda_{LR} = 6.882,\ 95\%\ \textnormal{CI} = [6.228,\ 8.184],\ p = 0.017$).
 Taken together, these results suggest that our knowledge estimations can reliably predict participants' likelihood of success on individual quiz questions, provided they have at least some amount of structured knowledge about the underlying concepts being tested.
-In other words, when participants' correct responses primarily arise from knowledge about the content probed by each question (e.g., after watching one or both lectures), these successes can be predicted from their ability to answer other questions about conceptually similar content (as captured by embedding-space distance).
-However, when a sufficiently large portion of participants' correct responses (presumably) reflect successful random guessing (such as on a multiple-choice quiz taken before viewing either lecture), our approach fails to accurately predict these successes since they do not map onto embedding space distances in a meaningful way [\textbf{PHRASING}].
+In other words, when participants' correct responses arise primarily from knowledge about the content probed by each question (e.g., after watching one or both lectures), these successes can be predicted from their ability to answer other questions about conceptually similar content (as captured by embedding-space distance).
+However, when a sufficiently large portion of participants' correct responses (presumably) reflect successful random guessing (such as on a multiple-choice quiz taken before viewing either lecture), our approach fails to accurately predict these successes because they are not structured (with respect to spatial distance within the embedding space) in a meaningful way.

 We observed a similar pattern when we fit GLMMs to estimates of participants' knowledge for each question about one lecture derived from other questions about the \textit{same} lecture.
 Specifically, for questions that participants answered on Quiz~1, prior to watching either lecture, knowledge for the embedding coordinates of \textit{Four Fundamental Forces}-related questions estimated using other \textit{Four Fundamental Forces}-related questions did not reliably predict whether those questions were answered correctly ($OR = 1.891,\ \lambda_{LR} = 2.293,\ 95\%\ \textnormal{CI} = [2.091,\ 2.622],\ p = 0.139$).
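The distance-weighted proportion-correct estimate referred to above (Eqn.~\ref{eqn:prop} in the paper, which is not reproduced in this diff) can be sketched roughly as follows. The Gaussian weighting of embedding-space distances and the length_scale parameter are assumptions made for illustration, not the paper's actual weighting function:

import numpy as np

def estimate_knowledge(embeddings, correct, held_out, length_scale=1.0):
    # Distance-weighted proportion correct for one held-out question.
    # Each held-in question's 0/1 accuracy is weighted by its embedding-space
    # proximity to the held-out question; the RBF kernel and length_scale are
    # illustrative stand-ins for the paper's weighting function.
    idx = np.array([i for i in range(len(correct)) if i != held_out])
    dists = np.linalg.norm(embeddings[idx] - embeddings[held_out], axis=1)
    weights = np.exp(-(dists / length_scale) ** 2)  # closer questions count more
    return np.average(np.asarray(correct)[idx], weights=weights)

# Toy example: 13 questions with 2-D embedding coordinates and 0/1 accuracy.
rng = np.random.default_rng(0)
emb = rng.normal(size=(13, 2))
acc = rng.integers(0, 2, size=13)
print(estimate_knowledge(emb, acc, held_out=4))

With uniform weights, this reduces to the simple held-out proportion correct discussed above: $(n - 1) / 12$ when the held-out question was answered correctly and $n / 12$ when it was not, which is exactly the inverse relationship expected under the null hypothesis.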
@@ -606,16 +598,16 @@ \section*{Results}
 This might lead our approach to over-estimate knowledge for held-out questions about ``forgotten'' knowledge that participants answered incorrectly.
 % ALTERNATE EXPLANATION -- embedding space is essentially ``saturated'' with correctly answered questions, so just like how on quiz 1 when relatively few questions are correct, most questions ``around'' them will be incorrect, on quiz 3 when relatively few questions are incorrect, most questions nearby will be correct. And because of this, on average, when the ``held-out'' question is one of the few incorrect ones, there will tend to be more correct ones ``held in'' than there will be when the held-out question is correct.
 % Also, maybe worth noting: while negative relationship is significant, it's super weak -- per the model, a "1-unit" increase in estimated knowledge corresponds to only a 1.28% decrease in probability of correct answer (p = OR / (1 + OR)). For comparison, for quiz3/within-lecture/birth of stars, 1-unit increase in estimated knowledge corresponds to a 84.5% increase in probability. So decrease is sig. but basically negligible.
-Taken together, these results suggest that our approach can distinguish between questions about different content covered by a single lecture when participants have sufficiently structured knowledge about that lecture's content, though this specificity may decrease further in time from when the lecture in question was viewed.
+Taken together, these results suggest that our approach can distinguish between questions about different content covered by a single lecture when participants have sufficiently structured knowledge about its contents, though this specificity may decrease with time since the lecture in question was viewed.

-Finally, when we fit GLMMs to estimates of participants' knowledge for questions about one lecture based on questions (from the same quiz) about the other lecture, we observed a similar but slightly more nuanced pattern.
-Essentially, while the previous set of analyses suggest that our approach's ability to make \textit{specific} predictions within content areas depends on participants having a minimum level of knowledge about the given content, the across-lecture analyses we performed suggest that our ability to \textit{generalize} these predictions across different content areas requires that participants' level of knowledge about the content used to make predictions be reasonably similar to their level of knowledge about the content for which these predictions are made [\textbf{PHRASING}].
-We found that using questions answered on Quiz~1, participants abilities to correctly answer questions about \textit{Four Fundamental Forces} could be predicted from their responses to questions about \textit{Birth of Stars} ($OR = 1.896,\ \lambda_{LR} = 7.205,\ 95\%\ \textnormal{CI} = [6.224, 7.524],\ p = 0.039$) and their ability to correctly answer \textit{Birth of Stars}-related questions could be predicted from their responses to \textit{Four Fundamental Forces}-related questions ($OR = 1.522,\ \lambda_{LR} = 6.448,\ 95\%\ \textnormal{CI} = [5.656, 6.843],\ p = 0.043$).
-We note, however, that these Quiz~1 knowledge estimates suffer from the same ``noise'' due to the (presumably) higher rate of participants successfully guessing correct answers on Quiz~1 as noted above, and as a result provide the weakest signal of any of the knowledge estimates that we found to reliably predict success.
+Finally, when we fit GLMMs to estimates of participants' knowledge for questions about one lecture using questions they answered (on the same quiz) about the \textit{other} lecture, we observed a similar but slightly more nuanced pattern.
+Essentially, while the previous set of within-lecture analyses suggest that the \textit{specificity} of our predictions within a single content area depends on participants having a minimum level of knowledge about that content, these across-lecture analyses suggest that our ability to \textit{generalize} our predictions across different content areas requires that participants' level of knowledge about the content used to make predictions be reasonably similar to their level of knowledge about the content for which these predictions are made.
+Using questions answered on Quiz~1, we found that participants' abilities to correctly answer questions about \textit{Four Fundamental Forces} could be predicted from their responses to questions about \textit{Birth of Stars} ($OR = 1.896,\ \lambda_{LR} = 7.205,\ 95\%\ \textnormal{CI} = [6.224, 7.524],\ p = 0.039$) and similarly, that their ability to correctly answer \textit{Birth of Stars}-related questions could be predicted from their responses to \textit{Four Fundamental Forces}-related questions ($OR = 1.522,\ \lambda_{LR} = 6.448,\ 95\%\ \textnormal{CI} = [5.656, 6.843],\ p = 0.043$).
+We note, however, that these Quiz~1 knowledge estimates are subject to the same increased ``noise'' due to the (presumably) higher incidence of observed correct answers arising from successful random guessing (compared to the other two quizzes) as noted above, and as a result, provide the weakest signal of any of the knowledge estimates that we found reliably predicted success.
 When we repeated this analysis using questions from Quiz~2, we found participants' responses to \textit{Four Fundamental Forces}-related questions did not reliably predict their success on \textit{Birth of Stars}-related questions ($OR = 1.865,\ \lambda_{LR} = 3.205,\ 95\%\ \textnormal{CI} = [3.027, 3.600],\ p = 0.125$), nor did their responses to \textit{Birth of Stars}-related questions reliably predict their success on \textit{Four Fundamental Forces}-related questions ($OR = 3.490,\ \lambda_{LR} = 3.266,\ 95\%\ \textnormal{CI} = [3.033, 3.866],\ p = 0.094$).
 This pattern makes sense given that participants had not yet viewed \textit{Birth of Stars} at the time of Quiz~2: when predicting held-out \textit{Four Fundamental Forces} questions, the correct-versus-incorrect labels of the held-in \textit{Birth of Stars} questions are not meaningfully structured with respect to the embedding space, and when predicting held-out \textit{Birth of Stars} questions, whether or not the held-out question was answered correctly is not meaningfully related to the spatial structure of the correctly answered held-in questions in the embedding space.
-However, when we again computed these across-lecture knowledge predictions using questions from Quiz~3 (when participants had now viewed \textit{both} lectures, we found that we could again reliably predict success on questions about \textit{Four Fundamental Forces} ($OR = 11.294),\ \lambda_{LR} = 11.055,\ 95\%\ \textnormal{CI} = [9.126, 18.476],\ p = 0.004$) and \textit{Birth of Stars} ($OR = 7.302),\ \lambda_{LR} = 7.068,\ 95\%\ \textnormal{CI} = [6.490, 8.584],\ p = 0.017$).
-Across all three versions of these analyses, our results suggest that our knowledge estimations can reliably predict participants' abilities to answer individual quiz questions, distinguish between questions about similar content, and generalize across content areas, provided that participants' quiz responses reflect a minimum level of ``real'' knowledge about both content on which these predictions are based and that for which they are made [\textbf{PHRASING}].
+However, when we again computed these across-lecture knowledge predictions using questions from Quiz~3 (when participants had now viewed \textit{both} lectures), we found that we could again reliably predict success on questions about both \textit{Four Fundamental Forces} ($OR = 11.294,\ \lambda_{LR} = 11.055,\ 95\%\ \textnormal{CI} = [9.126, 18.476],\ p = 0.004$) and \textit{Birth of Stars} ($OR = 7.302,\ \lambda_{LR} = 7.068,\ 95\%\ \textnormal{CI} = [6.490, 8.584],\ p = 0.017$) using responses to questions about the other lecture's content.
+Across all three versions of these analyses, our results suggest that our knowledge estimations can reliably predict participants' abilities to answer individual quiz questions, distinguish between questions about similar content, and generalize across content areas, provided that participants' quiz responses reflect a minimum level of ``real'' knowledge about both the content on which these predictions are based and the content for which they are made.

 % our approach works when participants have a minimal baseline level of knowledge about content predicted and used to predict
 % our approach generalizes when knowledge of content used to predict can be assumed to be a reasonable indicator of knowledge of content predicted
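For context on how the reported statistics relate to one another, the sketch below illustrates the general form of the model comparison: the odds ratio is the exponentiated coefficient on estimated knowledge, and the likelihood-ratio statistic compares the fitted model against an intercept-only null. The paper fits GLMMs (mixed-effects logistic models); this sketch omits the random effects and runs a plain logistic regression on simulated data, so the column names and values are purely illustrative:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical long-format data: one row per (participant, question) pair,
# with a held-out knowledge estimate and a 0/1 correctness outcome.
rng = np.random.default_rng(1)
df = pd.DataFrame({"knowledge": rng.uniform(0, 1, 300)})
df["correct"] = rng.binomial(1, 1 / (1 + np.exp(-(2.0 * df["knowledge"] - 1.0))))

full = smf.logit("correct ~ knowledge", data=df).fit(disp=0)
null = smf.logit("correct ~ 1", data=df).fit(disp=0)

odds_ratio = np.exp(full.params["knowledge"])   # OR per one-unit increase in knowledge
lr_stat = 2 * (full.llf - null.llf)             # likelihood-ratio test statistic
p_value = stats.chi2.sf(lr_stat, df=1)
print(odds_ratio, lr_stat, p_value)

In the paper's analyses, the random effects (e.g., participant-level terms) and the reported confidence intervals would come from the GLMM fit rather than from this simplified fixed-effects model.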
