
Commit 48ec6de

minor rephrasings to clarify intent of BERT/LDA comparisons
1 parent 99c42e7 commit 48ec6de

4 files changed: +20 -21 lines


paper/main.pdf: 177 Bytes (binary file not shown)

paper/main.tex

Lines changed: 9 additions & 10 deletions
@@ -951,9 +951,6 @@ \section*{Discussion}
 related concepts, and we also show how estimated knowledge falls off with
 distance in text embedding space.
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
 In our study, we characterize the ``coordinates'' of participants' knowledge
 using a relatively simple ``bag of words'' text embedding model~\citep[LDA;
 ][]{BleiEtal03}. More sophisticated text embedding models, such as
@@ -969,12 +966,12 @@ \section*{Discussion}
 the domain of physics lectures and questions) \textit{and} sufficiently broad
 as to enable them to cover a wide range of domains. Essentially, these
 ``larger'' language models learn these more complex features of language through
-pre-training on enormous and diverse text corpora. But as a result, their
+training on enormous and diverse text corpora. But as a result, their
 embedding spaces also ``span'' an enormous and diverse range of conceptual
 content, sacrificing a degree of specificity in their capacities to distinguish
 subtle conceptual differences within a more narrow range of content.
 In comparing our LDA model (trained specifically on the lectures used in our
-study) to a larger transformer-based model (BERT), we found that LDA provides
+study) to a larger transformer-based model (BERT), we found that our LDA model provides
 both coverage of the requisite material and specificity at the level of
 individual questions, while BERT essentially relegates the contents of both
 lectures and all quiz questions (which are all broadly about ``physics'') to a
@@ -985,11 +982,13 @@
 point is that simpler models trained on relatively small but specialized
 corpora can outperform much more complex models trained on much larger corpora
 when we are specifically interested in capturing subtle conceptual differences
-at the level of a single course lecture or question. On the other hand, if our
-goal had been to choose a model that generalized to many different content
-domains, we would expect our LDA model to perform comparatively poorly to BERT
-or other much larger models. We suggest that bridging this tradeoff will be
-an important challenge for future work.
+at the level of a single course lecture or quiz question. On the other hand, if
+our goal had been to choose a model that generalized to many different content
+areas, we would expect our LDA model to perform comparatively poorly to BERT or
+other much larger general-purpose models. We suggest that bridging this tradeoff
+between high resolution within a single content area and the ability to
+generalize to many diverse content areas will be an important challenge for
+future work.
 
 At the opposite end of the spectrum from large language models, one could also
 imagine using an even \textit{simpler} ``model'' than LDA that relates the
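
The corpus-specific LDA versus general-purpose BERT contrast in this hunk can be made concrete in code. The sketch below is purely illustrative and is not the authors' pipeline: the window size, stride, topic count, tokenization, BERT checkpoint ("bert-base-uncased"), and mean-pooling strategy are all assumptions. It fits a gensim LDA model on overlapping sliding windows of a toy lecture transcript, embeds two quiz questions under both models, and compares their cosine similarities; the expectation described in the discussion is that the BERT similarity sits near ceiling while the topic mixtures remain distinguishable.

# Illustrative sketch (not the authors' code): corpus-specific LDA vs. generic
# BERT embeddings. Window size, stride, topic count, and the BERT checkpoint
# are assumptions.
import numpy as np
import torch
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from transformers import AutoModel, AutoTokenizer


def sliding_windows(tokens, size=50, stride=10):
    """Overlapping windows of `size` tokens, advancing by `stride` tokens."""
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - size, 1), stride)]


def lda_embed(lda, dictionary, tokens):
    """Topic-proportion vector (length = num_topics) for one token list."""
    bow = dictionary.doc2bow(tokens)
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec


def bert_embed(texts, checkpoint="bert-base-uncased"):
    """Mean-pooled final-layer BERT features, one row per input text."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state              # (batch, tokens, dim)
        mask = enc["attention_mask"].unsqueeze(-1).float()   # (batch, tokens, 1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for a lecture transcript and two quiz questions.
lecture_tokens = "force equals mass times acceleration for a rigid body".split()
questions = ["What is the net force on the block?",
             "How does acceleration change if the mass doubles?"]

windows = sliding_windows(lecture_tokens)
dictionary = Dictionary(windows)
corpus = [dictionary.doc2bow(w) for w in windows]
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10, random_state=0)

lda_q = [lda_embed(lda, dictionary, q.lower().split()) for q in questions]
bert_q = bert_embed(questions)

# The discussion's claim, in miniature: question pairs that are all "physics"
# look near-identical to BERT, while course-specific topic mixtures can still
# tell them apart.
print("LDA  cosine similarity:", cosine(lda_q[0], lda_q[1]))
print("BERT cosine similarity:", cosine(bert_q[0], bert_q[1]))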

paper/supplement.pdf: 38 Bytes (binary file not shown)

paper/supplement.tex

Lines changed: 11 additions & 11 deletions
@@ -437,17 +437,17 @@ \section*{Supplementary results}
 Allocation;~\citealp{BleiEtal03}) trained on overlapping sliding windows of
 text from the lectures' transcripts (see \topicModelMethods). In comparing our
 approach to various alternative methods of characterizing the lectures' and
-questions' contents, we found that LDA embeddings exhibited a number of
-desirable properties that both simpler and more complex alternatives did not.
-For example, we found that analogous embeddings derived from more modern and
-complex models trained on much larger and more diverse text corpora (e.g.,
-BERT,~\citealp{DevlEtal18}; GPT,~\citealp{ViswEtal17};
-LLaMa,~\citealp{TouvEtal23}) tended to afford poor resolution within the content
-space spanned by individual course lectures and quiz questions. To illustrate
-this contrast, in Supplementary Figure~\ref{fig:compare-bert} we display a
-subset of visualizations from the main text based on our LDA embeddings (left
-column) alongside analogous visualizations based instead on BERT embeddings
-(right column).
+questions' contents, we found that embeddings derived from our corpus-specific
+LDA model exhibited a number of desirable properties that both simpler and more
+complex alternatives did not. For example, we found that analogous embeddings
+derived from more modern and complex models trained on much larger and more
+diverse text corpora (e.g., BERT,~\citealp{DevlEtal18};
+GPT,~\citealp{ViswEtal17}; LLaMa,~\citealp{TouvEtal23}) tended to afford poor
+resolution within the content space spanned by individual course lectures and
+quiz questions. To illustrate this contrast, in Supplementary
+Figure~\ref{fig:compare-bert} we display a subset of visualizations from the
+main text based on our LDA embeddings (left column) alongside analogous
+visualizations based instead on BERT embeddings (right column).
 
 The top row of Supplementary Figure~\ref{fig:compare-bert} displays PCA
 projections of the two lectures' trajectories and 39 quiz questions'
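
The PCA projections referenced in this hunk (the top row of the supplementary comparison figure) can be sketched in a few lines. This is a hypothetical reconstruction, not the figure's actual code: the embedding shapes, shared-fit choice, and plotting details are assumptions. The idea is to fit a 2D PCA on a lecture's sliding-window embeddings together with the quiz-question embeddings and plot both in the shared projection.

# Hypothetical sketch of a PCA-projection panel like the supplementary figure's
# top row; the real figure's preprocessing and styling are not part of this diff.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def plot_projection(lecture_traj, question_embs, title):
    """Project a lecture's sliding-window embeddings (a trajectory) and the
    quiz-question embeddings into a shared 2D PCA space."""
    pca = PCA(n_components=2).fit(np.vstack([lecture_traj, question_embs]))
    traj = pca.transform(lecture_traj)
    qs = pca.transform(question_embs)
    plt.plot(traj[:, 0], traj[:, 1], "-", alpha=0.6, label="lecture trajectory")
    plt.scatter(qs[:, 0], qs[:, 1], marker="x", label="quiz questions")
    plt.title(title)
    plt.legend()


# Toy data standing in for, e.g., the LDA embeddings from the previous sketch
# (80 sliding windows and 39 quiz questions in a 10-topic space).
rng = np.random.default_rng(0)
plot_projection(rng.random((80, 10)), rng.random((39, 10)), "LDA (toy data)")
plt.show()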
