Commit 4feb03e

Merge pull request #108 from paxtonfitzpatrick/revision-3
updates to discussion, supp figs and tables, add supp results section
2 parents: c7bf630 + 48ec6de

File tree

6 files changed: +497, -314 lines changed

paper/figs/model-comparison.pdf
-113 KB (binary file not shown)
3.29 KB (binary file not shown)

paper/main.pdf
-2.48 KB (binary file not shown)

paper/main.tex

Lines changed: 86 additions & 125 deletions
@@ -1,6 +1,5 @@
 \documentclass[10pt]{article}
 \usepackage{amsmath}
-\usepackage[utf8]{inputenc}
 \usepackage[english]{babel}
 \usepackage[font=small,labelfont=bf]{caption}
 \usepackage{geometry}
@@ -10,34 +9,32 @@
 \usepackage{setspace}
 \usepackage{hyperref}
 \usepackage{lineno}
-
 \usepackage{xcolor}
 
 \setcitestyle{notesep={; }}
 
-% supplemental tables
+% supplementary tables
 \newcommand{\questions}{1}
 \newcommand{\topics}{2}
 \newcommand{\matchTab}{3}
-
-% supplemental figures
+% supplementary figures
 \newcommand{\topicWordWeights}{1}
 \newcommand{\topicWeights}{2}
 \newcommand{\forcesCorrs}{3}
 \newcommand{\bosCorrs}{4}
-\newcommand{\jaccard}{5}
-\newcommand{\ldaVsBERT}{6}
-\newcommand{\individualKnowledgeMapsA}{7}
-\newcommand{\individualKnowledgeMapsB}{8}
-\newcommand{\individualKnowledgeMapsC}{9}
-\newcommand{\individualLearningMapsA}{10}
-\newcommand{\individualLearningMapsB}{11}
-
-\newcommand{\U}{{\fontfamily{serif}\selectfont\ensuremath{\mathrm{U}}}}
-
+\newcommand{\individualKnowledgeMapsA}{5}
+\newcommand{\individualKnowledgeMapsB}{6}
+\newcommand{\individualKnowledgeMapsC}{7}
+\newcommand{\individualLearningMapsA}{8}
+\newcommand{\individualLearningMapsB}{9}
+\newcommand{\jaccard}{10}
+\newcommand{\ldaVsBERT}{11}
+% supplementary results
+\newcommand{\suppResults}{\textit{Supplementary results}}
 
 % simple command for inline comments
 \newcommand{\comment}[1]{}
+
 % italicize section names in \nameref and place in \mbox to prevent bug when
 % text is split across lines or pages
 \NewCommandCopy{\oldnameref}{\nameref}
@@ -939,15 +936,15 @@ \section*{Discussion}
 conceptual knowledge. First, from a methodological standpoint, our modeling
 framework provides a systematic means of mapping out and characterizing
 knowledge in maps that have infinite (arbitrarily many) numbers of coordinates,
-and of ``filling out'' those maps using relatively small numbers of multiple
-choice quiz questions. Our experimental finding that we can use these maps to
-predict responses to held-out questions has several psychological implications
-as well. For example, concepts that are assigned to nearby coordinates by the
-text embedding model also appear to be ``known to a similar extent'' (as
-reflected by participants' responses to held-out questions;
+and of ``filling out'' those maps using relatively small numbers of
+multiple-choice quiz questions. Our experimental finding that we can use these
+maps to predict success on held-out questions has several psychological
+implications as well. For example, concepts that are assigned to nearby
+coordinates by the text embedding model also appear to be ``known to a similar
+extent'' (as reflected by participants' responses to held-out questions;
 Fig.~\ref{fig:predictions}). This suggests that participants also
 \textit{conceptualize} similarly the content reflected by nearby embedding
-coordinates. How participants' knowledge falls off with spatial distance is
+coordinates. How participants' knowledge ``falls off'' with spatial distance is
 captured by the knowledge maps we infer from their quiz responses
 (e.g., Figs.~\ref{fig:smoothness},~\ref{fig:knowledge-maps}). In other words,
 our study shows that knowledge about a given concept implies knowledge about
@@ -957,79 +954,76 @@ \section*{Discussion}
 In our study, we characterize the ``coordinates'' of participants' knowledge
 using a relatively simple ``bag of words'' text embedding model~\citep[LDA;
 ][]{BleiEtal03}. More sophisticated text embedding models, such as
-transformer-based models~\citep{ViswEtal17, DevlEtal18, ChatGPT, TouvEtal23}
-can learn complex grammatical and semantic relationships between words,
-higher-order syntactic structures, stylistic features, and more. We considered
-using transformer-based models in our study, but we found that the text
-embeddings derived from these models were surprisingly uninformative with
-respect to differentiating or otherwise characterizing the conceptual content
-of the lectures and questions we used. We suspect that this reflects a broader
-challenge in constructing models that are high-resolution within a given domain
-(e.g., the domain of physics lectures and questions) \textit{and} sufficiently
-broad so as to enable them to cover a wide range of domains. For example, we
-found that the embeddings derived even from much larger and more modern models
-like BERT~\citep{DevlEtal18}, GPT~\citep{ViswEtal17}, LLaMa~\citep{TouvEtal23},
-and others that are trained on enormous text corpora, end up yielding poor
-resolution within the content space spanned by individual course videos
-(Supp.~Fig.~\ldaVsBERT). Whereas the LDA embeddings of the lectures and
-questions are ``near'' each other (i.e., the convex hull enclosing the two
-lectures' trajectories is highly overlapping with the convex hull enclosing the
-questions' embeddings), the BERT embeddings of the lectures and questions are
-instead largely distinct (top row of Supp.~Fig.~\ldaVsBERT). The LDA embeddings
-of the questions for each lecture and the corresponding lecture's trajectory
-are also similar. For example, as shown in Fig.~\ref{fig:sliding-windows}C, the
-LDA embeddings for \textit{Four Fundamental Forces} questions (blue dots)
-appear closer to the \textit{Four Fundamental Forces} lecture trajectory (blue
-line), whereas the LDA embeddings for \textit{Birth of Stars} questions (green
-dots) appear closer to the \textit{Birth of Stars} lecture trajectory (green
-line). The BERT embeddings of the lectures and questions do not show this
-property (Supp.~Fig.~\ldaVsBERT). We also examined per-question ``content
-matches'' between individual questions and individual moments of each lecture
-(Fig.~\ref{fig:question-correlations}, Supp.~Fig.~\ldaVsBERT). The time series
-plot of individual questions' correlations are different from each other when
-computed using LDA (e.g., the traces can be clearly visually separated), whereas
-the correlations computed from BERT embeddings of different questions all look
-very similar. This tells us that LDA is capturing some differences in content
-between the questions, whereas BERT is not. The time series plots of individual
-questions' correlations have clear ``peaks'' when computed using LDA, but not
-when computed using BERT. This tells us that LDA is capturing a ``match''
-between the content of each question and a relatively well-defined time window
-of the corresponding lectures. The BERT embeddings appear to blur together the
-content of the questions versus specific moments of each lecture. Finally, we
-also compared the pairwise correlations between embeddings of questions within
-versus across content areas (i.e., content covered by the individual lectures,
-lecture-specific questions, and by the ``general physics knowledge''
-questions). The LDA embeddings show a strong contrast between same-content
-embeddings versus across-content embeddings. In other words, the embeddings of
-questions about the \textit{Four Fundamental Forces} material are highly
-correlated with the embeddings of the \textit{Four Fundamental Forces} lecture,
-but not with the embeddings of \textit{Birth of Stars}, questions about
-\textit{Birth of Stars}, or general physics knowledge questions. We see a
-similar pattern with the LDA embeddings of the \textit{Birth of Stars}
-questions (Fig.~\ref{fig:topics}, Supp.~Fig.~\topicWeights). In contrast, the
-BERT embeddings are all highly correlated with each other (Supp.
-Fig.~\ldaVsBERT). Taken together, these comparisons illustrate how LDA (trained
-on the specific content in question) provides both coverage of the requisite
-material and specificity at the level of the content covered by individual
-questions. BERT, on the other hand, essentially assigns both lectures and all
-of the questions (which are all broadly about ``physics'') into a tiny region
-of its embedding space, thereby blurring out meaningful distinctions between
-different specific concepts covered by the lectures and questions. We note that
-these are not criticisms of BERT (or other large language models trained on
-large and diverse corpora). Rather, our point is that simple fine-tuned models
-trained on a relatively small but specialized corpus can outperform much more
-complicated models trained on much larger corpora, when we are specifically
-interested in capturing subtle conceptual differences at the level of a single
-course lecture or question. Of course if our goal had been to find a model that
-generalized to many different content areas, we would expect our approach to
-perform comparatively poorly relative to BERT or other much larger models. We
-suggest that bridging the tradeoff between high resolution within each content
-area versus the ability to generalize to many different content areas will be
-an important challenge for future work in this domain.
+transformer-based models~\citep{ViswEtal17, DevlEtal18, ChatGPT, TouvEtal23},
+can leverage additional textual information such as complex grammatical and
+semantic relationships between words, higher-order syntactic structures,
+stylistic features, and more. We considered using transformer-based models in
+our study, but we found that the text embeddings derived from these models were
+surprisingly uninformative with respect to differentiating or otherwise
+characterizing the conceptual content of the lectures and questions we used
+(see \suppResults). We suspect that this reflects a broader challenge in
+constructing models that are both high-resolution within a given domain (e.g.,
+the domain of physics lectures and questions) \textit{and} sufficiently broad
+as to enable them to cover a wide range of domains. Essentially, these
+``larger'' language models learn these more complex features of language through
+training on enormous and diverse text corpora. But as a result, their
+embedding spaces also ``span'' an enormous and diverse range of conceptual
+content, sacrificing a degree of specificity in their capacities to distinguish
+subtle conceptual differences within a more narrow range of content.
+In comparing our LDA model (trained specifically on the lectures used in our
+study) to a larger transformer-based model (BERT), we found that our LDA model provides
+both coverage of the requisite material and specificity at the level of
+individual questions, while BERT essentially relegates the contents of both
+lectures and all quiz questions (which are all broadly about ``physics'') to a
+tiny region of its embedding space, thereby blurring out meaningful distinctions
+between different specific concepts covered by the lectures and questions
+(Supp.~Fig.~\ldaVsBERT). We note that these are not criticisms of BERT, nor of
+other large language models trained on large and diverse corpora. Rather, our
+point is that simpler models trained on relatively small but specialized
+corpora can outperform much more complex models trained on much larger corpora
+when we are specifically interested in capturing subtle conceptual differences
+at the level of a single course lecture or quiz question. On the other hand, if
+our goal had been to choose a model that generalized to many different content
+areas, we would expect our LDA model to perform comparatively poorly to BERT or
+other much larger general-purpose models. We suggest that bridging this tradeoff
+between high resolution within a single content area and the ability to
+generalize to many diverse content areas will be an important challenge for
+future work.
+
+At the opposite end of the spectrum from large language models, one could also
+imagine using an even \textit{simpler} ``model'' than LDA that relates the
+contents of course lectures and quiz questions through explicit word-overlap
+metrics (rather than similarities in the latent topics they exhibit). In a
+supplementary analysis (Supp.~Fig.~\jaccard), we compared the LDA-based
+question-lecture matches shown in Figure~\ref{fig:question-correlations} with
+analogous matches based on the Jaccard similarity between each question's text
+and each sliding window from the corresponding lecture's transcript. Similarly
+to the embeddings derived from BERT, we found that this approach also blurred
+meaningful distinctions between concepts presented in different parts of each
+lecture and tested by different quiz questions. But rather than characterizing
+their contents at too \textit{broad} a semantic scale, the lack of specificity
+in this approach arises from considering too \textit{narrow} a semantic scale:
+the sorts of concepts typically conveyed in course lectures and tested by quiz
+questions are not defined (and meaningful similarities and distinctions between
+them do not tend to emerge) at the level of individual words.
+
+In other words, while the embedding spaces of more complex large language models
+afford low resolution at the scale of individual course lectures and questions
+because they ``zoom out'' too far, simpler word-matching measures yield low
+resolution because they ``zoom \textit{in}'' too far. In this way, we view our
+approach as occupying a sort of ``sweet spot'' between simpler and more complex
+alternatives, in that it enables us to characterize the contents of course
+materials at the appropriate semantic scale where relevant concepts ``come into
+focus.'' Our approach enables us to accurately and consistently identify each
+question's content in a way that matches it with specific content from the
+lectures and distinguishes it from other questions about similar content. In
+turn, this enables us to construct accurate predictions about participants'
+knowledge of the conceptual content tested by individual quiz questions
+(Fig.~\ref{fig:predictions}).
 
 Another application for large language models that does \textit{not} require
 explicitly modeling the content of individual lectures or questions is to
-leverage the models' abilities to generate text. For example, generative text
+leverage these models' abilities to generate text. For example, generative text
 models like ChatGPT~\citep{ChatGPT} and LLaMa~\citep{TouvEtal23} are already
 being used to build a new generation of interactive tutoring
 systems~\citep[e.g.,][]{MannEtal23b}. Unlike the approach we have taken here,
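
The revised passage above contrasts question-to-lecture "content matches" computed from topic-model embeddings with matches computed from raw Jaccard word overlap against sliding windows of each lecture's transcript. Below is a minimal, hypothetical sketch of the two kinds of measures being contrasted (this is not the authors' analysis code; sliding_windows, jaccard, and embed_text are illustrative names, and embed_text stands in for the transform step of whatever fitted topic model is assumed to be available):

    import numpy as np

    def sliding_windows(sentences, width=10):
        """Overlapping windows of `width` consecutive transcript sentences."""
        return [" ".join(sentences[i:i + width])
                for i in range(max(len(sentences) - width + 1, 1))]

    def jaccard(text_a, text_b):
        """Unique-word overlap: |words in both| / |words in either|."""
        a, b = set(text_a.lower().split()), set(text_b.lower().split())
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def question_match_timeseries(question, windows, embed_text):
        """Per-window match scores for one question under both measures.

        `embed_text` is assumed to map a text to a vector of topic weights
        (e.g., the transform step of a fitted LDA model).
        """
        q_vec = embed_text(question)
        topic_scores = np.array([np.corrcoef(q_vec, embed_text(w))[0, 1]
                                 for w in windows])
        overlap_scores = np.array([jaccard(question, w) for w in windows])
        return topic_scores, overlap_scores

On the paper's account, the topic-based traces show clear, question-specific peaks at the relevant moments of each lecture, whereas the word-overlap traces respond whenever any shared word appears, blurring distinctions between questions (Supp. Fig. \jaccard).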
@@ -1043,39 +1037,6 @@ \section*{Discussion}
 needs in real time, and that are able to provide more nuanced feedback about
 what learners know and what they do not know.
 
-At the opposite end of the spectrum from large language models, one could also
-imagine \textit{simplifying} some aspects of our LDA-based approach by
-computing simple word overlap metrics. For example, the Jaccard similarity
-between text $A$ and $B$ is computed as the number of unique words in the
-intersection of words from $A$ and $B$ divided by the number of unique words in
-the union of words from $A$ and $B$. In a supplementary analysis
-(Supp.~Fig.~\jaccard), we compared the LDA-based question-lecture matches we
-reported in Figure~\ref{fig:question-correlations} with the Jaccard similarities
-between each question and each sliding window of text from the corresponding
-lecture. As shown in Supplementary Figure~\jaccard, this simple word-matching
-approach does not appear to capture the same level of specificity as the
-LDA-based approach. Whereas the LDA-based approach often yields a clear peak in
-the time series of correlations between each question and the corresponding
-lecture, the Jaccard similarity-based approach does not. Furthermore, these
-LDA-based matches appear to capture conceptual overlaps between the questions
-and lectures (Supp.~Tab.~\matchTab), whereas simple word matching does not. For
-example, one of the example questions examined in Supplementary
-Figure~\jaccard~asks ``Which of the following occurs as a cloud of atoms gets
-more dense?'' The LDA-based matches identify lecture timepoints where the
-relevant \textit{topics} are discussed (e.g., when words like ``cloud,''
-``atom,'' ``dense,'' etc., are mentioned \textit{together}). The Jaccard
-similarity-based matches, on the other hand, are strong when \textit{any} of
-these words are mentioned, even if they do not occur together.
-
-We view our approach as occupying a sort of ``sweet spot,'' between much larger
-language models and simple word matching-based approaches, that enables us to
-capture the relevant conceptual content of course materials at an appropriate
-semantic scale. Our approach enables us to accurately and consistently identify
-each question's content in a way that also matches up with what is presented in
-the lectures. In turn, this enables us to construct accurate predictions about
-participants' knowledge of the conceptual content tested by held-out questions
-(Fig.~\ref{fig:predictions}).
-
 One limitation of our approach is that topic models contain no explicit
 internal representations of more complex aspects of ``knowledge,'' like
 knowledge graphs, dependencies or associations between concepts, causality, and
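
The paragraph removed in this hunk defines the Jaccard similarity verbally. In equation form (writing $W_A$ and $W_B$ for the sets of unique words in texts $A$ and $B$; this notation is introduced here for illustration and is not taken from the paper), the definition reads:

    \[
      J(A, B) = \frac{\lvert W_A \cap W_B \rvert}{\lvert W_A \cup W_B \rvert}.
    \]

As the removed text notes, this measure is large whenever any content words are shared (e.g., a window mentioning ``cloud,'' ``atom,'' or ``dense'' in isolation), regardless of whether those words occur together in a shared topical context.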

paper/supplement.pdf
-96.1 KB (binary file not shown)
