\documentclass[10pt]{article}
\usepackage{amsmath}
- \usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[font=small,labelfont=bf]{caption}
\usepackage{geometry}
\usepackage{setspace}
\usepackage{hyperref}
\usepackage{lineno}
-
\usepackage{xcolor}

\setcitestyle{notesep={; }}

- % supplemental tables
+ % supplementary tables
\newcommand{\questions}{1}
\newcommand{\topics}{2}
\newcommand{\matchTab}{3}
-
- % supplemental figures
+ % supplementary figures
\newcommand{\topicWordWeights}{1}
\newcommand{\topicWeights}{2}
\newcommand{\forcesCorrs}{3}
\newcommand{\bosCorrs}{4}
- \newcommand{\jaccard}{5}
- \newcommand{\ldaVsBERT}{6}
- \newcommand{\individualKnowledgeMapsA}{7}
- \newcommand{\individualKnowledgeMapsB}{8}
- \newcommand{\individualKnowledgeMapsC}{9}
- \newcommand{\individualLearningMapsA}{10}
- \newcommand{\individualLearningMapsB}{11}
-
- \newcommand{\U}{{\fontfamily{serif}\selectfont\ensuremath{\mathrm{U}}}}
-
+ \newcommand{\individualKnowledgeMapsA}{5}
+ \newcommand{\individualKnowledgeMapsB}{6}
+ \newcommand{\individualKnowledgeMapsC}{7}
+ \newcommand{\individualLearningMapsA}{8}
+ \newcommand{\individualLearningMapsB}{9}
+ \newcommand{\jaccard}{10}
+ \newcommand{\ldaVsBERT}{11}
+ % supplementary results
+ \newcommand{\suppResults}{\textit{Supplementary results}}

% simple command for inline comments
\newcommand{\comment}[1]{}
+
% italicize section names in \nameref and place in \mbox to prevent bug when
% text is split across lines or pages
\NewCommandCopy{\oldnameref}{\nameref}
@@ -939,15 +936,15 @@ \section*{Discussion}
conceptual knowledge. First, from a methodological standpoint, our modeling
framework provides a systematic means of mapping out and characterizing
knowledge in maps that have infinite (arbitrarily many) numbers of coordinates,
- and of ``filling out'' those maps using relatively small numbers of multiple
- choice quiz questions. Our experimental finding that we can use these maps to
- predict responses to held-out questions has several psychological implications
- as well. For example, concepts that are assigned to nearby coordinates by the
- text embedding model also appear to be ``known to a similar extent'' (as
- reflected by participants' responses to held-out questions;
+ and of ``filling out'' those maps using relatively small numbers of
+ multiple-choice quiz questions. Our experimental finding that we can use these
+ maps to predict success on held-out questions has several psychological
+ implications as well. For example, concepts that are assigned to nearby
+ coordinates by the text embedding model also appear to be ``known to a similar
+ extent'' (as reflected by participants' responses to held-out questions;
Fig.~\ref{fig:predictions}). This suggests that participants also
\textit{conceptualize} similarly the content reflected by nearby embedding
- coordinates. How participants' knowledge falls off with spatial distance is
+ coordinates. How participants' knowledge ``falls off'' with spatial distance is
captured by the knowledge maps we infer from their quiz responses
(e.g., Figs.~\ref{fig:smoothness},~\ref{fig:knowledge-maps}). In other words,
our study shows that knowledge about a given concept implies knowledge about
@@ -957,79 +954,76 @@ \section*{Discussion}
In our study, we characterize the ``coordinates'' of participants' knowledge
using a relatively simple ``bag of words'' text embedding model~\citep[LDA;
][]{BleiEtal03}. More sophisticated text embedding models, such as
- transformer-based models~\citep{ViswEtal17, DevlEtal18, ChatGPT, TouvEtal23}
- can learn complex grammatical and semantic relationships between words,
- higher-order syntactic structures, stylistic features, and more. We considered
- using transformer-based models in our study, but we found that the text
- embeddings derived from these models were surprisingly uninformative with
- respect to differentiating or otherwise characterizing the conceptual content
- of the lectures and questions we used. We suspect that this reflects a broader
- challenge in constructing models that are high-resolution within a given domain
- (e.g., the domain of physics lectures and questions) \textit{and} sufficiently
- broad so as to enable them to cover a wide range of domains. For example, we
- found that the embeddings derived even from much larger and more modern models
- like BERT~\citep{DevlEtal18}, GPT~\citep{ViswEtal17}, LLaMa~\citep{TouvEtal23},
- and others that are trained on enormous text corpora, end up yielding poor
- resolution within the content space spanned by individual course videos
- (Supp.~Fig.~\ldaVsBERT). Whereas the LDA embeddings of the lectures and
- questions are ``near'' each other (i.e., the convex hull enclosing the two
- lectures' trajectories is highly overlapping with the convex hull enclosing the
- questions' embeddings), the BERT embeddings of the lectures and questions are
- instead largely distinct (top row of Supp.~Fig.~\ldaVsBERT). The LDA embeddings
- of the questions for each lecture and the corresponding lecture's trajectory
- are also similar. For example, as shown in Fig.~\ref{fig:sliding-windows}C, the
- LDA embeddings for \textit{Four Fundamental Forces} questions (blue dots)
- appear closer to the \textit{Four Fundamental Forces} lecture trajectory (blue
- line), whereas the LDA embeddings for \textit{Birth of Stars} questions (green
- dots) appear closer to the \textit{Birth of Stars} lecture trajectory (green
- line). The BERT embeddings of the lectures and questions do not show this
- property (Supp.~Fig.~\ldaVsBERT). We also examined per-question ``content
- matches'' between individual questions and individual moments of each lecture
- (Fig.~\ref{fig:question-correlations}, Supp.~Fig.~\ldaVsBERT). The time series
- plot of individual questions' correlations are different from each other when
- computed using LDA (e.g., the traces can be clearly visually separated), whereas
- the correlations computed from BERT embeddings of different questions all look
- very similar. This tells us that LDA is capturing some differences in content
- between the questions, whereas BERT is not. The time series plots of individual
- questions' correlations have clear ``peaks'' when computed using LDA, but not
- when computed using BERT. This tells us that LDA is capturing a ``match''
- between the content of each question and a relatively well-defined time window
- of the corresponding lectures. The BERT embeddings appear to blur together the
- content of the questions versus specific moments of each lecture. Finally, we
- also compared the pairwise correlations between embeddings of questions within
- versus across content areas (i.e., content covered by the individual lectures,
- lecture-specific questions, and by the ``general physics knowledge''
- questions). The LDA embeddings show a strong contrast between same-content
- embeddings versus across-content embeddings. In other words, the embeddings of
- questions about the \textit{Four Fundamental Forces} material are highly
- correlated with the embeddings of the \textit{Four Fundamental Forces} lecture,
- but not with the embeddings of \textit{Birth of Stars}, questions about
- \textit{Birth of Stars}, or general physics knowledge questions. We see a
- similar pattern with the LDA embeddings of the \textit{Birth of Stars}
- questions (Fig.~\ref{fig:topics}, Supp.~Fig.~\topicWeights). In contrast, the
- BERT embeddings are all highly correlated with each other (Supp.
- Fig.~\ldaVsBERT). Taken together, these comparisons illustrate how LDA (trained
- on the specific content in question) provides both coverage of the requisite
- material and specificity at the level of the content covered by individual
- questions. BERT, on the other hand, essentially assigns both lectures and all
- of the questions (which are all broadly about ``physics'') into a tiny region
- of its embedding space, thereby blurring out meaningful distinctions between
- different specific concepts covered by the lectures and questions. We note that
- these are not criticisms of BERT (or other large language models trained on
- large and diverse corpora). Rather, our point is that simple fine-tuned models
- trained on a relatively small but specialized corpus can outperform much more
- complicated models trained on much larger corpora, when we are specifically
- interested in capturing subtle conceptual differences at the level of a single
- course lecture or question. Of course if our goal had been to find a model that
- generalized to many different content areas, we would expect our approach to
- perform comparatively poorly relative to BERT or other much larger models. We
- suggest that bridging the tradeoff between high resolution within each content
- area versus the ability to generalize to many different content areas will be
- an important challenge for future work in this domain.
+ transformer-based models~\citep{ViswEtal17, DevlEtal18, ChatGPT, TouvEtal23},
+ can leverage additional textual information such as complex grammatical and
+ semantic relationships between words, higher-order syntactic structures,
+ stylistic features, and more. We considered using transformer-based models in
+ our study, but we found that the text embeddings derived from these models were
+ surprisingly uninformative with respect to differentiating or otherwise
+ characterizing the conceptual content of the lectures and questions we used
+ (see \suppResults). We suspect that this reflects a broader challenge in
+ constructing models that are both high-resolution within a given domain (e.g.,
+ the domain of physics lectures and questions) \textit{and} sufficiently broad
+ to cover a wide range of domains. Essentially, these ``larger'' language models
+ learn such complex features of language by training on enormous and diverse
+ text corpora. But as a result, their embedding spaces also ``span'' an enormous
+ and diverse range of conceptual content, sacrificing a degree of specificity in
+ their capacity to distinguish subtle conceptual differences within a narrower
+ range of content. In comparing our LDA model (trained specifically on the
+ lectures used in our study) to a larger transformer-based model (BERT), we
+ found that our LDA model provides both coverage of the requisite material and
+ specificity at the level of individual questions, while BERT essentially
+ relegates the contents of both lectures and all quiz questions (which are all
+ broadly about ``physics'') to a tiny region of its embedding space, thereby
+ blurring out meaningful distinctions between different specific concepts
+ covered by the lectures and questions (Supp.~Fig.~\ldaVsBERT). We note that
+ these are not criticisms of BERT, nor of other large language models trained on
+ large and diverse corpora. Rather, our point is that simpler models trained on
+ relatively small but specialized corpora can outperform much more complex
+ models trained on much larger corpora when we are specifically interested in
+ capturing subtle conceptual differences at the level of a single course lecture
+ or quiz question. On the other hand, if our goal had been to choose a model
+ that generalized to many different content areas, we would expect our LDA model
+ to perform poorly relative to BERT or other much larger general-purpose models.
+ We suggest that bridging this tradeoff between high resolution within a single
+ content area and the ability to generalize to many diverse content areas will
+ be an important challenge for future work.
+
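+ In the comparisons that follow, the LDA-based ``match'' between a question and
+ a given moment of a lecture can be summarized as the correlation between their
+ topic-weight vectors,
+ \begin{equation*}
+ m_{q}(t) = \mathrm{corr}\!\left(\boldsymbol{\theta}_{q},
+ \boldsymbol{\theta}_{\ell, t}\right),
+ \end{equation*}
+ where $\boldsymbol{\theta}_{q}$ denotes the topic weights assigned to question
+ $q$'s text, and $\boldsymbol{\theta}_{\ell, t}$ denotes the topic weights
+ assigned to the sliding window of lecture $\ell$'s transcript centered at time
+ $t$ (notation used here for exposition only).
+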
+ At the opposite end of the spectrum from large language models, one could also
+ imagine using an even \textit{simpler} ``model'' than LDA that relates the
+ contents of course lectures and quiz questions through explicit word-overlap
+ metrics (rather than similarities in the latent topics they exhibit). In a
+ supplementary analysis (Supp.~Fig.~\jaccard), we compared the LDA-based
+ question-lecture matches shown in Figure~\ref{fig:question-correlations} with
+ analogous matches based on the Jaccard similarity between each question's text
+ and each sliding window from the corresponding lecture's transcript. As with
+ the embeddings derived from BERT, we found that this approach blurred
+ meaningful distinctions between concepts presented in different parts of each
+ lecture and tested by different quiz questions. But rather than characterizing
+ their contents at too \textit{broad} a semantic scale, the lack of specificity
+ in this approach arises from considering too \textit{narrow} a semantic scale:
+ the sorts of concepts typically conveyed in course lectures and tested by quiz
+ questions are not defined (and meaningful similarities and distinctions between
+ them do not tend to emerge) at the level of individual words.
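+ For reference, the Jaccard similarity between two texts $A$ and $B$ is the
+ number of unique words the texts share, divided by the total number of unique
+ words that appear in either text:
+ \begin{equation*}
+ J(A, B) = \frac{\left| W_{A} \cap W_{B} \right|}{\left| W_{A} \cup W_{B} \right|},
+ \end{equation*}
+ where $W_{A}$ and $W_{B}$ denote the sets of unique words in $A$ and $B$,
+ respectively.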
+
+ In other words, while the embedding spaces of more complex large language
+ models afford low resolution at the scale of individual course lectures and
+ questions because they ``zoom out'' too far, simpler word-matching measures
+ yield low resolution because they ``zoom \textit{in}'' too far. In this way, we
+ view our approach as occupying a sort of ``sweet spot'' between simpler and
+ more complex alternatives, in that it enables us to characterize the contents
+ of course materials at the appropriate semantic scale where relevant concepts
+ ``come into focus.'' Our approach lets us accurately and consistently identify
+ each question's content in a way that matches it with specific content from the
+ lectures and distinguishes it from other questions about similar content. In
+ turn, this enables us to construct accurate predictions about participants'
+ knowledge of the conceptual content tested by individual quiz questions
+ (Fig.~\ref{fig:predictions}).
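+ As a purely illustrative sketch (rather than a description of the exact
+ estimator underlying our knowledge maps), one simple way to operationalize the
+ idea that knowledge ``falls off'' smoothly with distance in embedding space is
+ to estimate knowledge at an arbitrary coordinate $\mathbf{x}$ as a
+ proximity-weighted average of a participant's scored responses:
+ \begin{equation*}
+ \hat{k}(\mathbf{x}) = \frac{\sum_{i} \exp\!\left(-\|\mathbf{x} -
+ \mathbf{x}_{i}\|^{2} / \lambda\right) r_{i}}{\sum_{i}
+ \exp\!\left(-\|\mathbf{x} - \mathbf{x}_{i}\|^{2} / \lambda\right)},
+ \end{equation*}
+ where $\mathbf{x}_{i}$ and $r_{i}$ denote the embedding coordinate and scored
+ response (e.g., correct or incorrect) for the $i$th quiz question, and
+ $\lambda$ controls how quickly estimated knowledge falls off with distance.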

Another application for large language models that does \textit{not} require
explicitly modeling the content of individual lectures or questions is to
- leverage the models' abilities to generate text. For example, generative text
+ leverage these models' abilities to generate text. For example, generative text
models like ChatGPT~\citep{ChatGPT} and LLaMa~\citep{TouvEtal23} are already
being used to build a new generation of interactive tutoring
systems~\citep[e.g.,][]{MannEtal23b}. Unlike the approach we have taken here,
@@ -1043,39 +1037,6 @@ \section*{Discussion}
needs in real time, and that are able to provide more nuanced feedback about
what learners know and what they do not know.

- At the opposite end of the spectrum from large language models, one could also
- imagine \textit{simplifying} some aspects of our LDA-based approach by
- computing simple word overlap metrics. For example, the Jaccard similarity
- between text $A$ and $B$ is computed as the number of unique words in the
- intersection of words from $A$ and $B$ divided by the number of unique words in
- the union of words from $A$ and $B$. In a supplementary analysis
- (Supp.~Fig.~\jaccard), we compared the LDA-based question-lecture matches we
- reported in Figure~\ref{fig:question-correlations} with the Jaccard similarities
- between each question and each sliding window of text from the corresponding
- lecture. As shown in Supplementary Figure~\jaccard, this simple word-matching
- approach does not appear to capture the same level of specificity as the
- LDA-based approach. Whereas the LDA-based approach often yields a clear peak in
- the time series of correlations between each question and the corresponding
- lecture, the Jaccard similarity-based approach does not. Furthermore, these
- LDA-based matches appear to capture conceptual overlaps between the questions
- and lectures (Supp.~Tab.~\matchTab), whereas simple word matching does not. For
- example, one of the example questions examined in Supplementary
- Figure~\jaccard~asks ``Which of the following occurs as a cloud of atoms gets
- more dense?'' The LDA-based matches identify lecture timepoints where the
- relevant \textit{topics} are discussed (e.g., when words like ``cloud,''
- ``atom,'' ``dense,'' etc., are mentioned \textit{together}). The Jaccard
- similarity-based matches, on the other hand, are strong when \textit{any} of
- these words are mentioned, even if they do not occur together.
-
- We view our approach as occupying a sort of ``sweet spot,'' between much larger
- language models and simple word matching-based approaches, that enables us to
- capture the relevant conceptual content of course materials at an appropriate
- semantic scale. Our approach enables us to accurately and consistently identify
- each question's content in a way that also matches up with what is presented in
- the lectures. In turn, this enables us to construct accurate predictions about
- participants' knowledge of the conceptual content tested by held-out questions
- (Fig.~\ref{fig:predictions}).
-

One limitation of our approach is that topic models contain no explicit
internal representations of more complex aspects of ``knowledge,'' like
knowledge graphs, dependencies or associations between concepts, causality, and