
Commit 3086bd0

committed: reworking results section text related to fig 6
1 parent 45c7274, commit 3086bd0

File tree

6 files changed (+235, -86 lines)


paper/changes.pdf

1.03 KB
Binary file not shown.

paper/changes.tex

Lines changed: 20 additions & 27 deletions
@@ -1,7 +1,7 @@
 \documentclass[10pt]{article}
 %DIF LATEXDIFF DIFFERENCE FILE
 %DIF DEL old.tex Mon Feb 19 07:49:49 2024
-%DIF ADD main.tex Mon Feb 19 07:48:14 2024
+%DIF ADD main.tex Mon Feb 19 12:49:27 2024
 \usepackage[utf8]{inputenc}
 \usepackage[english]{babel}
 \usepackage[font=small,labelfont=bf]{caption}
@@ -698,30 +698,23 @@ \section*{Results}
 Second, we \DIFdelbegin \DIFdel{used questions
 about one lecture to predict knowledge at the embedding coordinate of a held-out
 question about the }\textit{\DIFdel{other}} %DIFAUXCMD
-\DIFdel{lecture , }\DIFdelend \DIFaddbegin \DIFadd{estimated knowledge for each question about one lecture using only questions (}\DIFaddend from the same \DIFdelbegin \DIFdel{quiz and participant }\DIFdelend \DIFaddbegin \DIFadd{participant and quiz) about the }\textit{\DIFadd{other}} \DIFadd{lecture }\DIFaddend (``Across-lecture''\DIFdelbegin \DIFdel{in }\DIFdelend \DIFaddbegin \DIFadd{; }\DIFaddend Fig.~\ref{fig:predictions}\DIFaddbegin \DIFadd{, middle rows}\DIFaddend ).
-This test was intended to \DIFdelbegin \DIFdel{test }\DIFdelend \DIFaddbegin \DIFadd{assess }\DIFaddend the \textit{generalizability} of our approach by asking whether our \DIFdelbegin \DIFdel{knowledge }\DIFdelend predictions held across the content areas of the two lectures.
+\DIFdel{lecture , }\DIFdelend \DIFaddbegin \DIFadd{estimated knowledge for each question about a given lecture using only the other questions (}\DIFaddend from the same \DIFdelbegin \DIFdel{quiz and participant (``Across-lecture''in }\DIFdelend \DIFaddbegin \DIFadd{participant and quiz) about that }\textit{\DIFadd{same}} \DIFadd{lecture (``Within-lecture''; }\DIFaddend Fig.~\ref{fig:predictions}\DIFaddbegin \DIFadd{, middle rows}\DIFaddend ).
+This test was intended to \DIFdelbegin \DIFdel{test the }\DIFdelend \DIFaddbegin \DIFadd{assess the }\DIFaddend \textit{\DIFdelbegin \DIFdel{generalizability}\DIFdelend \DIFaddbegin \DIFadd{specificity}\DIFaddend } of our approach by asking whether our \DIFdelbegin \DIFdel{knowledge predictions held across the content areas of the two lectures}\DIFdelend \DIFaddbegin \DIFadd{predictions could distinguish between questions about different content covered by the same lecture}\DIFaddend .
 Third, we \DIFdelbegin \DIFdel{used questions about one lecture to predict knowledge at the embedding
 coordinate of a held-out question about the }\textit{\DIFdel{same}} %DIFAUXCMD
-\DIFdel{lecture , }\DIFdelend \DIFaddbegin \DIFadd{estimated knowledge for each question about a given lecture using only the other questions (}\DIFaddend from the same \DIFdelbegin \DIFdel{quiz and participant }\DIFdelend \DIFaddbegin \DIFadd{participant and quiz) about that }\textit{\DIFadd{same}} \DIFadd{lecture }\DIFaddend (``Within-lecture''\DIFdelbegin \DIFdel{in }\DIFdelend \DIFaddbegin \DIFadd{; }\DIFaddend Fig.~\ref{fig:predictions}\DIFaddbegin \DIFadd{, bottom rows}\DIFaddend ).
-This test was intended to \DIFdelbegin \DIFdel{test }\DIFdelend \DIFaddbegin \DIFadd{assess }\DIFaddend the \textit{specificity} of our approach by asking whether our \DIFdelbegin \DIFdel{knowledge }\DIFdelend predictions could distinguish between questions about different content covered by the same lecture.
-\DIFdelbegin \DIFdel{We repeated each of these
+\DIFdel{lecture , }\DIFdelend \DIFaddbegin \DIFadd{estimated knowledge for each question about one lecture using only questions (}\DIFaddend from the same \DIFdelbegin \DIFdel{quiz and participant (``Within-lecture''in }\DIFdelend \DIFaddbegin \DIFadd{participant and quiz) about the }\textit{\DIFadd{other}} \DIFadd{lecture (``Across-lecture''; }\DIFaddend Fig.~\ref{fig:predictions}\DIFaddbegin \DIFadd{, bottom rows}\DIFaddend ).
+This test was intended to \DIFdelbegin \DIFdel{test the }\DIFdelend \DIFaddbegin \DIFadd{assess the }\DIFaddend \textit{\DIFdelbegin \DIFdel{specificity}\DIFdelend \DIFaddbegin \DIFadd{generalizability}\DIFaddend } of our approach by asking whether our \DIFdelbegin \DIFdel{knowledge predictions could distinguish between questions about
+different content covered by the same lecture.
+We repeated each of these
 analyses using all possible held-out questions for each quiz and participant.
-}\DIFdelend
+}\DIFdelend \DIFaddbegin \DIFadd{predictions held across the content areas of the two lectures.
+}\DIFaddend
 
 \DIFdelbegin \DIFdel{For the initial quizzes participants took (prior to watching either lecture),
 predicted knowledge tended to be low overall, and relatively
 unstructured (Fig.
 ~\ref{fig:predictions}, left column).
-When }\DIFdelend %DIF > When we estimated participants' knowledge for each Quiz~1 question based on all other Quiz~1 questions, we found an inverse relationship.
-%DIF > Specifically, higher estimated knowledge at the embedding coordinate at a held-out question was associated with a lower likelihood of answering the question correctly ($\textrm{odds ratio}\ (OR) = 0.136,\ \textrm{likelihood-ratio test statistic}\ (\lambda_{LR}) = 19.749,\ \textrm{95\% CI} = [14.352,\ 26.545],\ p = 0.001$).
-%DIF > However, this inverse relationship in fact represents the expected result under our null hypothesis (that estimated knowledge is \textit{not} predictive of success on a question).
-%DIF > An intuition for this can be taken from the expected outcome of same analysis based on the simple proportion correct, rather than estimated knowledge.
-%DIF > Suppose a participant answered $n$ out of 13 quiz questions correctly.
-%DIF > If we held out a single correctly answered question and computed the proportion of remaining questions answered correctly, that proportion would be $(n - 1) / 12$.
-%DIF > Whereas if we held out a single incorrectly answered question, the proportion of remaining questions answered correctly would be $n / 12$.
-\DIFaddbegin
-
-\DIFadd{In performing this set of analyses, our null hypothesis is that the knowledge estimates we compute based on the quiz questions' embedding coordinates do }\textit{\DIFadd{not}} \DIFadd{provide useful information about participants' abilities to answer those questions.
+When }\DIFdelend \DIFaddbegin \DIFadd{In performing this set of analyses, our null hypothesis is that the knowledge estimates we compute based on the quiz questions' embedding coordinates do }\textit{\DIFadd{not}} \DIFadd{provide useful information about participants' abilities to answer those questions.
 What result might we expect to see if this is the case?
 To provide an intuition for this, consider the expected outcome if we carried out these same analyses using a simple proportion-correct measure in lieu of our knowledge estimates.
 Suppose a participant correctly answered $n$ out of 13 questions on a given quiz.
@@ -736,7 +729,7 @@ \section*{Results}
 Given that our knowledge estimates are computed as a weighted version of this same proportion-correct score (where each held-in question's weight reflects its embedding-space distance from the }\DIFaddend held-out question\DIFdelbegin \DIFdel{(``All questions''; $\U
 = 50587,~p = 0.723$), when we used questions from one lecture to predict
 knowledge }\DIFdelend \DIFaddbegin \DIFadd{; see Eqn.~\ref{eqn:prop}), if these weights are uninformative (e.g., simply randomly distributed), then we should expect to see this same inverse relationship emerge, on average.
-It is only if the spatial relationships among the quiz questions' embedding coordinates map onto participants' knowledge in a meaningful way that we would we expect this relationship to be non-negative }[\textbf{\DIFadd{PHRASING}}]\DIFadd{.
+It is only if the spatial relationships among the quiz questions' embedding coordinates map onto participants' knowledge in a meaningful way that we would expect this relationship to be non-negative.
 }
 
 \DIFadd{When we fit a GLMM to estimates of participants' knowledge for each Quiz~1 question based on all other Quiz~1 questions, we observed this null-hypothesized inverse relationship.
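
To unpack the held-out intuition in this hunk, here is a short sketch in LaTeX. The symbols are ours (standing in for the paper's Eqn.~\ref{eqn:prop}, which this diff does not show), with $c_i \in \{0, 1\}$ marking whether held-in question $i$ was answered correctly:

% Sketch of the null-hypothesis intuition (notation ours; the paper's exact
% estimator is Eqn.~\ref{eqn:prop}, not shown in this diff).
% Holding one of 13 questions out and averaging correctness over the other 12:
\[
\underbrace{\tfrac{n-1}{12}}_{\text{held-out answered correctly}}
\;<\;
\underbrace{\tfrac{n}{12}}_{\text{held-out answered incorrectly}},
\]
% so, under the null, higher held-in accuracy systematically accompanies
% incorrectly answered held-out questions: an inverse relationship.
% The knowledge estimate replaces the uniform average with weights w_i that
% shrink as embedding distance from the held-out question q grows:
\[
\hat{k}(q) = \frac{\sum_{i \neq q} w_i\, c_i}{\sum_{i \neq q} w_i},
\qquad c_i \in \{0, 1\}.
\]
% If the w_i carry no information, the same inverse relationship emerges on average.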
@@ -751,8 +744,8 @@ \section*{Results}
 questions from one lecture to predict knowledge at the embedding coordinate of a held-out question }\DIFdelend \DIFaddbegin \DIFadd{likelihood-ratio test statistic $(\lambda_{LR}) = 19.749$, 95\%\ $\textnormal{CI} = [14.352,\ 26.545],\ p = 0.001$).
 However, when we repeated this analysis for quizzes 2 and 3, the direction of this relationship reversed: higher estimated knowledge for a given question predicted a greater likelihood of answering it correctly (Quiz~2: $OR = 2.905,\ \lambda_{LR} = 17.333,\ 95\%\ \textnormal{CI} = [14.966,\ 29.309],\ p = 0.002$; Quiz~3: $OR = 3.238,\ \lambda_{LR} = 6.882,\ 95\%\ \textnormal{CI} = [6.228,\ 8.184],\ p = 0.017$).
 Taken together, these results suggest that our knowledge estimations can reliably predict participants' likelihood of success on individual quiz questions, provided they have at least some amount of structured knowledge about the underlying concepts being tested.
-In other words, when participants' correct responses primarily arise from knowledge about the content probed by each question (e.g., after watching one or both lectures), these successes can be predicted from their ability to answer other questions about conceptually similar content (as captured by embedding-space distance).
-However, when a sufficiently large portion of participants' correct responses (presumably) reflect successful random guessing (such as on a multiple-choice quiz taken before viewing either lecture), our approach fails to accurately predict these successes since they do not map onto embedding space distances in a meaningful way }[\textbf{\DIFadd{PHRASING}}]\DIFadd{.
+In other words, when participants' correct responses arise primarily from knowledge about the content probed by each question (e.g., after watching one or both lectures), these successes can be predicted from their ability to answer other questions about conceptually similar content (as captured by embedding-space distance).
+However, when a sufficiently large portion of participants' correct responses (presumably) reflect successful random guessing (such as on a multiple-choice quiz taken before viewing either lecture), our approach fails to accurately predict these successes because they are not structured (with respect to spatial distance within the embedding space) in a meaningful way.
 }
 
 \DIFadd{We observed a similar pattern when we fit GLMMs to estimates of participants' knowledge for each question about one lecture derived from other questions }\DIFaddend about the \textit{same} \DIFdelbegin \DIFdel{lecture (``Within-lecture'';
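
As a reading aid for the $OR$ and $\lambda_{LR}$ statistics reported in this hunk, the model form below is a generic logistic GLMM sketch; the paper's exact fixed- and random-effects specification is not shown in this diff, so treat the structure as an assumption:

% Generic logistic GLMM sketch (assumed form; the exact specification is not
% shown in this diff). For participant j answering held-out question i:
\[
\log\frac{\Pr(\mathrm{correct}_{ij})}{1 - \Pr(\mathrm{correct}_{ij})}
= \beta_0 + \beta_1 \hat{k}_{ij} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma^2),
\]
% where \hat{k}_{ij} is the estimated knowledge at question i's embedding
% coordinate. The reported odds ratio is OR = e^{\beta_1}: OR < 1 (e.g., 0.136
% on Quiz 1) means higher estimated knowledge predicts lower odds of success,
% while OR > 1 (e.g., 2.905 and 3.238 on Quizzes 2 and 3) means the reverse.
% The likelihood-ratio statistic compares the fitted model against the null
% model with \beta_1 = 0:
\[
\lambda_{LR} = 2\,(\ell_{\mathrm{full}} - \ell_{\mathrm{null}}).
\]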
@@ -823,18 +816,18 @@ \section*{Results}
 %DIF > ALTERNATE EXPLANATION -- embedding space is essentially ``saturated'' with correctly answered questions, so just like how on quiz 1 when relatively few questions are correct, most questions ``around'' them will be incorrect, on quiz 3 when relatively few questions are incorrect, most questions nearby will be correct. And because of this, on average, when the ``held-out'' question is one of the few incorrect ones, there will tend to be more correct ones ``held in'' than there will be when the held-out question is correct.
 %DIF > Also, maybe worth noting: while negative relationship is significant, it's super weak -- per the model, a "1-unit" increase in estimated knowledge corresponds to only a 1.28% decrease in probability of correct answer (p = OR / (1 + OR)). For comparison, for quiz3/within-lecture/birth of stars, 1-unit increase in estimated knowledge corresponds to a 84.5% increase in probability. So decrease is sig. but basically negligible.
 Taken together, \DIFdelbegin \DIFdel{the results in Figure~\ref{fig:predictions} indicate }\DIFdelend \DIFaddbegin \DIFadd{these results suggest }\DIFaddend that our approach can \DIFdelbegin \DIFdel{reliably predict acquired knowledge (especially about recently
-learned content), and that the knowledge predictions are generalizable across the content areas spanned by the two lectures, while also specific enough to }\DIFdelend \DIFaddbegin \DIFadd{distinguish between questions about different content covered by a single lecture when participants have sufficiently structured knowledge about that lecture's content, though this specificity may decrease further in time from when the lecture in question was viewed.
+learned content), and that the knowledge predictions are generalizable across the content areas spanned by the two lectures, while also specific enough to }\DIFdelend \DIFaddbegin \DIFadd{distinguish between questions about different content covered by a single lecture when participants have sufficiently structured knowledge about its contents, though this specificity may decrease with increasing time since the lecture was viewed.
 }
 
-\DIFadd{Finally, when we fit GLMMs to estimates of participants' knowledge for questions about one lecture based on questions (from the same quiz) about the other lecture, we observed a similar but slightly more nuanced pattern.
-Essentially, while the previous set of analyses suggest that our approach's ability to make }\textit{\DIFadd{specific}} \DIFadd{predictions within content areas depends on participants having a minimum level of knowledge about the given content, the across-lecture analyses we performed suggest that our ability to }\textit{\DIFadd{generalize}} \DIFadd{these predictions across different content areas requires that participants' level of knowledge about the content used to make predictions be reasonably similar to their level of knowledge about the content for which these predictions are made }[\textbf{\DIFadd{PHRASING}}]\DIFadd{.
-We found that using questions answered on Quiz~1, participants abilities to correctly answer questions about }\textit{\DIFadd{Four Fundamental Forces}} \DIFadd{could be predicted from their responses to questions about }\textit{\DIFadd{Birth of Stars}} \DIFadd{($OR = 1.896,\ \lambda_{LR} = 7.205,\ 95\%\ \textnormal{CI} = [6.224, 7.524],\ p = 0.039$) and their ability to correctly answer }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions could be predicted from their responses to }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions ($OR = 1.522,\ \lambda_{LR} = 6.448,\ 95\%\ \textnormal{CI} = [5.656, 6.843],\ p = 0.043$).
-We note, however, that these Quiz~1 knowledge estimates suffer from the same ``noise'' due to the (presumably) higher rate of participants successfully guessing correct answers on Quiz~1 as noted above, and as a result provide the weakest signal of any of the knowledge estimates that we found to reliably predict success.
+\DIFadd{Finally, when we fit GLMMs to estimates of participants' knowledge for questions about one lecture using questions they answered (on the same quiz) about the }\textit{\DIFadd{other}} \DIFadd{lecture, we observed a similar but slightly more nuanced pattern.
+Essentially, while the previous set of within-lecture analyses suggests that the }\textit{\DIFadd{specificity}} \DIFadd{of our predictions within a single content area depends on participants having a minimum level of knowledge about that content, these across-lecture analyses suggest that our ability to }\textit{\DIFadd{generalize}} \DIFadd{our predictions across different content areas requires that participants' level of knowledge about the content used to make predictions be reasonably similar to their level of knowledge about the content for which these predictions are made.
+Using questions answered on Quiz~1, we found that participants' abilities to correctly answer questions about }\textit{\DIFadd{Four Fundamental Forces}} \DIFadd{could be predicted from their responses to questions about }\textit{\DIFadd{Birth of Stars}} \DIFadd{($OR = 1.896,\ \lambda_{LR} = 7.205,\ 95\%\ \textnormal{CI} = [6.224, 7.524],\ p = 0.039$) and similarly, that their ability to correctly answer }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions could be predicted from their responses to }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions ($OR = 1.522,\ \lambda_{LR} = 6.448,\ 95\%\ \textnormal{CI} = [5.656, 6.843],\ p = 0.043$).
+We note, however, that these Quiz~1 knowledge estimates are subject to the same increased ``noise'' due to the (presumably) higher incidence of observed correct answers arising from successful random guessing (compared to the other two quizzes) as noted above, and as a result, provide the weakest signal of any of the knowledge estimates that we found reliably predicted success.
 When we repeated this analysis using questions from Quiz~2, we found participants' responses to }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions did not reliably predict their success on }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions ($OR = 1.865,\ \lambda_{LR} = 3.205,\ 95\%\ \textnormal{CI} = [3.027, 3.600],\ p = 0.125$), nor did their responses to }\textit{\DIFadd{Birth of Stars}}\DIFadd{-related questions reliably predict their success on }\textit{\DIFadd{Four Fundamental Forces}}\DIFadd{-related questions ($OR = 3.490,\ \lambda_{LR} = 3.266,\ 95\%\ \textnormal{CI} = [3.033, 3.866],\ p = 0.094$).
 }\textbf{\DIFadd{Sentence about why this makes sense given that participants hadn't viewed BoS yet. i.e., when predicting held-out FFF questions, correct vs. incorrect labels for held-in q's aren't meaningfully structured w.r.t. embedding space; when predicting held-out BoS q's, whether or not held-out q was correctly answered isn't meaningfully related to spatial structure of correctly answered q's in embedding space.}}
-\DIFadd{However, when we again computed these across-lecture knowledge predictions using questions from Quiz~3 (when participants had now viewed }\textit{\DIFadd{both}} \DIFadd{lectures, we found that we could again reliably predict success on questions about }\textit{\DIFadd{Four Fundamental Forces}} \DIFadd{($OR = 11.294),\ \lambda_{LR} = 11.055,\ 95\%\ \textnormal{CI} = [9.126, 18.476],\ p = 0.004$) and }\textit{\DIFadd{Birth of Stars}} \DIFadd{($OR = 7.302),\ \lambda_{LR} = 7.068,\ 95\%\ \textnormal{CI} = [6.490, 8.584],\ p = 0.017$).
+\DIFadd{However, when we again computed these across-lecture knowledge predictions using questions from Quiz~3 (when participants had now viewed }\textit{\DIFadd{both}} \DIFadd{lectures), we found that we could again reliably predict success on questions about both }\textit{\DIFadd{Four Fundamental Forces}} \DIFadd{($OR = 11.294,\ \lambda_{LR} = 11.055,\ 95\%\ \textnormal{CI} = [9.126, 18.476],\ p = 0.004$) and }\textit{\DIFadd{Birth of Stars}} \DIFadd{($OR = 7.302,\ \lambda_{LR} = 7.068,\ 95\%\ \textnormal{CI} = [6.490, 8.584],\ p = 0.017$) using responses to questions about the other lecture's content.
 Across all three versions of these analyses, our results suggest that our knowledge estimations can reliably predict participants' abilities to answer individual quiz questions, }\DIFaddend distinguish between questions about \DIFdelbegin \DIFdel{more subtly different content within the
-same lecture}\DIFdelend \DIFaddbegin \DIFadd{similar content, and generalize across content areas, provided that participants' quiz responses reflect a minimum level of ``real'' knowledge about both content on which these predictions are based and that for which they are made }[\textbf{\DIFadd{PHRASING}}]\DIFaddend .
+same lecture}\DIFdelend \DIFaddbegin \DIFadd{similar content, and generalize across content areas, provided that participants' quiz responses reflect a minimum level of ``real'' knowledge about both the content on which these predictions are based and the content for which they are made}\DIFaddend .
 
 %DIF > our approach works when participants have a minimal baseline level of knowledge about content predicted and used to predict
 %DIF > our approach generalizes when knowledge of content used to predict can be assumed to be a reasonable indicator of knowledge of content predicted
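
A note on the ``$p = OR / (1 + OR)$'' conversion used in the %DIF comment above: it follows directly from the definition of odds, sketched here in LaTeX (the worked number is ours, plugging in the Quiz~1 odds ratio from the text):

% Odds-to-probability identity behind the comment's back-of-the-envelope numbers:
\[
o = \frac{p}{1 - p} \quad\Longleftrightarrow\quad p = \frac{o}{1 + o}.
\]
% A 1-unit increase in estimated knowledge multiplies the odds of success by
% OR, so p = OR / (1 + OR) is the implied success probability when baseline
% odds are 1 (a 50/50 question); e.g., OR = 0.136 gives 0.136/1.136, or about 0.12.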

paper/compile.sh

Lines changed: 8 additions & 0 deletions
@@ -14,5 +14,13 @@ latex -interaction=nonstopmode supplement
 latex -interaction=nonstopmode supplement
 pdflatex -interaction=nonstopmode supplement
 
+latexdiff old.tex main.tex > changes.tex
+latex -interaction=nonstopmode changes
+bibtex changes
+latex -interaction=nonstopmode changes
+latex -interaction=nonstopmode changes
+pdflatex -interaction=nonstopmode changes
+
+
 rm *.cb* *.dvi *.log *.blg *.aux *.fff *.out

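For context on the added steps: latexdiff writes a marked-up merge of old.tex and main.tex, and the repeated latex passes are the usual cycle for converging citations and cross-references. An annotated sketch of the same pipeline (comments ours, not part of the commit):

latexdiff old.tex main.tex > changes.tex   # mark old-vs-new text as \DIFdel/\DIFadd
latex -interaction=nonstopmode changes     # first pass: record \cite/\ref keys in changes.aux
bibtex changes                             # build the bibliography (changes.bbl) from the .aux
latex -interaction=nonstopmode changes     # pull the .bbl in; references may still be stale
latex -interaction=nonstopmode changes     # extra pass so cross-references settle
pdflatex -interaction=nonstopmode changes  # final pass emits changes.pdf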
paper/main.pdf

1.05 KB
Binary file not shown.
