(analogous to main text Fig.~\topicVariability B). The LDA embeddings show a strong contrast between same-content embeddings and across-content embeddings. In other words, the amount of ``information'' about different features (i.e., topics) reflected in each lecture's content is highly similar to the amount of information reflected in the contents of its corresponding quiz questions, and dissimilar from that of other subsets of quiz questions. The BERT embeddings, by contrast, do not exhibit these clear distinctions between content areas, and instead represent \textit{all} course materials used in our study highly similarly overall. This indicates that, relative to the enormous breadth of conceptual content that BERT embeddings can represent, representations of concepts specifically related to \textit{Four Fundamental Forces}, concepts specifically related to \textit{Birth of Stars}, and concepts related to physics more generally are so similar as to be almost indistinguishable.
Additionally, the relative similarities (i.e., correlations) between lectures and question sets indicate that both lectures are represented more similarly to each other than to any subset of quiz questions, and each subset of quiz questions is represented more similarly to other quiz questions than to either lecture. This mirrors the separation between the lectures and questions visible in their PCA projections above, and suggests that BERT's representations of the subtle, specific semantic features that link the lecture and questions within each content area, and distinguish between different content areas, are sufficiently weak that they are overpowered by representations of \textit{syntactic} features that tend to distinguish questions from lectures more broadly (e.g., beginning with words like ``What'', ``Which'', or ``Why'', ending with a question mark, etc.). Taken together, the first two rows of Supplementary Figure~\ref{fig:compare-bert} suggest that a simple LDA model trained on the specific lectures used in our study can ``match'' quiz questions to their corresponding lectures by similarities in their embedding weights, while BERT (trained on a much broader text corpus) cannot.
We also examined both models' abilities to match the quiz questions to specific temporal intervals within the lectures. The third and fourth rows of Supplementary Figure~\ref{fig:compare-bert} display the time series of correlations between the embedding weights for each quiz question and each timepoint of its corresponding lecture (third row: \textit{Four Fundamental Forces}; fourth row: \textit{Birth of Stars}; analogous to main text Fig.~\questionCorrs). When computed using LDA embeddings, the questions' correlation time series generally appear fairly different from each other (i.e., individual questions' traces can be easily visually separated), whereas their correlation time series computed using BERT embeddings are all highly similar. This indicates that the LDA embeddings are capturing some differences in content between the questions about a given lecture that the BERT embeddings are not. The time series plots of individual questions' correlations also exhibit clear ``peaks'' when computed using LDA embeddings, but not when computed using BERT embeddings. This indicates that the LDA embeddings are capturing specific ``matches'' between individual quiz questions and relatively well-defined time periods of lecture content, whereas the BERT embeddings appear to blur these specific correspondences. Taken together, these comparisons suggest that our LDA model trained specifically on the two lectures' contents is sensitive to subtle differences between the concepts probed by individual quiz questions, and can identify the relevant periods in the lectures where those concepts were presented (Fig.~\questionCorrs, Supp.~Figs.~\ref{fig:forces-peaks},~\ref{fig:bos-peaks}, Supp.~Tab.~\ref{tab:matches}). Meanwhile, BERT, whose embedding weights must be able to represent a far broader range of concepts, consequently blurs these subtle distinctions between concepts conveyed within the span of a single course lecture, thereby affording a ``lower resolution'' depiction of its contents.
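For concreteness, each question's correlation time series may be written as (with the symbols below introduced here purely as notation; they do not appear in the main text)
\begin{equation*}
r_q(t) = \mathrm{corr}\!\left(\mathbf{e}_q,\, \mathbf{e}_{\ell(q)}(t)\right),
\end{equation*}
where $\mathbf{e}_q$ denotes the embedding of question $q$ (its topic vector under LDA, or its feature vector under BERT), $\mathbf{e}_{\ell(q)}(t)$ denotes the embedding of timepoint $t$ of the lecture $\ell(q)$ to which question $q$ pertains, and $\mathrm{corr}$ denotes the correlation measure used to compare embeddings throughout.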
We also compared the within-lecture question matches identified by our LDA-based approach to analogous matches obtained via a simpler approach that did \textit{not} entail using a text embedding model. Instead of projecting the text of the lectures and quiz questions into a common space and relating them by the similarities (i.e., correlations) between their coordinates (i.e., topic vectors), in this approach we related them by a measure of similarity computed directly from their texts. Specifically, we first parsed the two lectures' transcripts into overlapping sliding windows and performed the same preprocessing as we did for our LDA-based approach (see~\topicModelMethods). We then computed the Jaccard similarity between the text of each lecture-related quiz question and each sliding window from its corresponding lecture's transcript. Here, the Jaccard similarity between a question $Q$ and a lecture window $W$ is defined as the number of unique words that appear in the texts of both $Q$ and $W$ (i.e., $Q \cap W$) divided by the number of unique words that appear in either $Q$ or $W$ (i.e., $Q \cup W$).
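Written out explicitly (taking $Q$ and $W$ to denote the sets of unique preprocessed words in a question and a lecture window, respectively, and introducing $J$ here simply as a label for the measure), the Jaccard similarity is
\begin{equation*}
J(Q, W) = \frac{\left|\, Q \cap W \,\right|}{\left|\, Q \cup W \,\right|},
\end{equation*}
which equals 0 when the question and window share no words and 1 when they comprise identical word sets.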
As shown in Supplementary Figure~\ref{fig:compare-wordcount}, this simple word-matching approach does not appear to capture the same level of specificity as our LDA-based approach. Panels A and B display each question's distribution of topic vector correlations with all lecture timepoints (blue dots) alongside its distribution of Jaccard similarities to all sliding windows (orange dots). In these panels, both similarity measures have been normalized (independently, and separately for each question) to range between 0 and 1, such that a similarity of 1 reflects each question's ``best matching'' sample of lecture content, and a similarity of 0 reflects the least similar lecture content. In most cases, the distributions of topic vector correlations are dominated by extreme values (i.e., values close to 0 or 1) with comparatively few intermediate values. In other words, this approach yields strong and specific matches: each question's similarity is near its maximum value for a particular set of timepoints, and tends to be close to its minimum value for most other timepoints. The distributions of Jaccard similarities show the opposite pattern, consisting primarily of intermediate values with very few extreme values. In other words, this approach tends to weakly match each question to a large proportion of the lecture content.
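This normalization, described above as mapping each question's least- and most-similar samples of lecture content to 0 and 1, is consistent with a standard min--max rescaling applied independently to each question $q$ and each similarity measure $s$ (notation introduced here for clarity):
\begin{equation*}
\tilde{s}_q(t) = \frac{s_q(t) - \min_{t'} s_q(t')}{\max_{t'} s_q(t') - \min_{t'} s_q(t')},
\end{equation*}
where $s_q(t)$ is either the topic vector correlation or the Jaccard similarity between question $q$ and the lecture content at timepoint $t$.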
These patterns are reflected in the similarity time series plots shown in panels C and D (analogous to main text Fig.~\questionCorrs). The questions' time series of topic vector correlations exhibit visually distinct ``peaks'' during specific sections of the lectures, outside of which their correlations tend to be relatively low. In other words, the set of timepoints whose topic vectors are maximally correlated with the topic vector for a given question tends to be not only specific, but also temporally contiguous. This indicates that topic vector correlation matches questions with sustained, cohesive periods of lecture content where the relevant underlying \textit{concepts} are discussed, rather than with a sporadic assortment of timepoints where the question and lecture content share more superficial similarities. The time series of Jaccard similarities, by contrast, are less cleanly structured, and their more scattered, brief, and weak ``peaks'' appear to reflect just such superficial content matches. As an example, panel E displays the text of the two quiz questions highlighted in green and purple in panels A through D, along with the text from the best-matching intervals of the lectures (identified based on topic vector correlation). Both questions include words that---given the subject matter of the two lectures---appear frequently throughout their transcripts (e.g., ``\textit{force}'', ``\textit{atom}'', ``\textit{nucleus}'', ``\textit{cloud}'', ``\textit{dense}'', etc.). While the presence or absence of any one (or more) of these words will lead to differences in the degree of ``match'' based on Jaccard similarity (resulting in the ``jagged'' time series shown in panel D), LDA is able to capture some notion of the \textit{context} in which different usages of the same words occur. As a result, the model is able to successfully assign similar topic vectors to quiz questions and lecture timepoints specifically when their shared combinations of words denote the same deeper meaning.