
Commit 4e514f1

jeremymanning and claude committed
feat: Renumber lectures for X-hour integration (weeks 3-10)
- Week 3: lectures 7,8 → 9,11; added new X-hour lecture 10
- Week 4: lectures 9-11 → 12-14
- Week 5: lectures 12-14 → 15-17
- Week 6: lectures 15-17 → 18-20
- Week 7: lectures 18-20 → 21-23
- Week 9: lectures 21-23 → 24-26
- Week 10: lecture 24 → 27

Updated all internal lecture number references in titles. New lecture structure: 27 lectures total (4 per week for weeks 1-3).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 2d69e2f commit 4e514f1

File tree: 19 files changed (+400, -18 lines)

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ footer: ''

 <!-- _class: lead -->

-# Lecture 24: Final Project Work Session
+# Lecture 27: Final Project Work Session

 ## Models of Language and Conversation 🤖

 **PSYC 51.07: Models of Language and Communication**

slides/week3/lecture10.md

Lines changed: 382 additions & 0 deletions
@@ -0,0 +1,382 @@
---
marp: true
theme: cdl-theme
paginate: true
header: 'PSYC 51.07: Models of Language and Communication'
footer: ''
---

<!-- _class: lead -->

# Lecture 10: X-Hour Embeddings Workshop
## Week 3: Hands-On Dimensionality Reduction and Word Vectors

**PSYC 51.07: Models of Language and Communication**

---

# Learning Objectives

By the end of this session, you will:

1. Implement classic dimensionality reduction (LSA, LDA)
2. Train and analyze Word2Vec embeddings
3. Visualize high-dimensional embeddings using UMAP
4. Compare different embedding methods on real data
5. Understand semantic relationships captured by embeddings

**Workshop format:** Hands-on coding with the 20 Newsgroups dataset

---

# Workshop Overview

**Today's Agenda:**

1. **Part 1:** Why embeddings? From sparse to dense representations
2. **Part 2:** LSA - Latent Semantic Analysis with SVD
3. **Part 3:** LDA - Latent Dirichlet Allocation for topic modeling
4. **Part 4:** Word2Vec - Neural word embeddings
5. **Part 5:** Visualizing embeddings with UMAP
6. **Part 6:** Comparing methods and document classification

**Companion notebook:** `xhour_embeddings_demo.ipynb`

---

# Part 1: Why Embeddings?

**The problem with sparse representations:**

<div class="columns">
<div class="column">

**Last week (BoW, TF-IDF):**
- High dimensional (vocab size)
- Sparse (mostly zeros)
- No semantic similarity
- "dog" and "puppy" are orthogonal

</div>
<div class="column">

**Embeddings:**
- Low dimensional (50-300 dims)
- Dense (all non-zero)
- Similar words cluster together
- "dog" and "puppy" are close!

</div>
</div>
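To make this concrete, here is a toy sketch (the vectors are illustrative, not taken from the dataset) of why one-hot vectors give zero similarity while dense vectors can place "dog" and "puppy" close together:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One-hot (sparse) vectors: "dog" and "puppy" share no dimensions
dog_onehot = np.array([[1, 0, 0, 0, 0]])
puppy_onehot = np.array([[0, 1, 0, 0, 0]])
print(cosine_similarity(dog_onehot, puppy_onehot))  # 0.0 -- orthogonal

# Dense (made-up) embeddings: related words point in similar directions
dog_dense = np.array([[0.8, 0.3, 0.1]])
puppy_dense = np.array([[0.7, 0.4, 0.2]])
print(cosine_similarity(dog_dense, puppy_dense))  # ~0.98 -- nearly parallel
```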
---

# The Magic of Word Vectors

**Famous example:** king - man + woman = queen

<div class="callout info">
<div class="callout-title">Vector Arithmetic</div>

Word embeddings capture semantic relationships as directions in space:
- Gender direction: woman - man
- Royalty direction: king - man
- Pluralization: words - word

</div>

**Key insight:** Meaning encoded as geometry!

---

# Part 2: Latent Semantic Analysis (LSA)

**Using SVD to find latent topics:**

$$X \approx U_k \Sigma_k V_k^T$$

<div class="columns">
<div class="column">

**Algorithm:**
1. Build TF-IDF matrix $X$
2. Apply Singular Value Decomposition
3. Keep top $k$ dimensions
4. Use $U_k$ as word embeddings

</div>
<div class="column">

**Interpretation:**
- $U$: word-topic associations
- $\Sigma$: topic strengths
- $V^T$: doc-topic associations

</div>
</div>

---

# LSA in Code

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF matrix
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(documents)

# Apply LSA
lsa = TruncatedSVD(n_components=100, random_state=42)
doc_embeddings = lsa.fit_transform(tfidf_matrix)
word_embeddings = lsa.components_.T

print(f"Explained variance: {lsa.explained_variance_ratio_.sum():.2%}")
```

**Try it:** Find similar words using cosine similarity!
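One way to try it, building on the variables above (`most_similar_lsa` is an illustrative helper, not part of the companion notebook):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vocab = tfidf.get_feature_names_out()
word_to_idx = {w: i for i, w in enumerate(vocab)}

def most_similar_lsa(word, topn=5):
    # Cosine similarity between one word's LSA vector and every other word's
    sims = cosine_similarity(word_embeddings[[word_to_idx[word]]], word_embeddings)[0]
    best = np.argsort(sims)[::-1][1:topn + 1]  # skip the word itself
    return [(vocab[i], float(sims[i])) for i in best]

print(most_similar_lsa('space'))
```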
---

# Part 3: LDA for Topic Modeling

**A probabilistic approach:**

<div class="callout tip">
<div class="callout-title">Generative Story</div>

LDA imagines documents are created by:
1. Choosing a mixture of topics
2. For each word, picking a topic
3. Sampling a word from that topic

</div>

**Key difference from LSA:**
- Probabilistic interpretation
- Non-negative weights
- More interpretable topics
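The companion notebook walks through fitting; a rough scikit-learn sketch (parameter choices here are illustrative) looks like this:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA works on raw word counts rather than TF-IDF weights
counts = CountVectorizer(max_features=5000, stop_words='english')
count_matrix = counts.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(count_matrix)  # per-document topic mixtures

# Print the top words for each topic
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:7]]
    print(f"Topic {k}: {', '.join(top)}")
```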
---

# LDA Example Output

```
Topic 0: hockey, game, team, player, season, nhl, play
Topic 1: space, nasa, launch, orbit, shuttle, satellite
Topic 2: computer, software, program, file, windows, system
Topic 3: medical, doctor, patient, disease, health, treatment
Topic 4: government, president, congress, law, political
```

<div class="callout info">

Each document is a **mixture** of topics:
Document #42: 60% Space + 25% Computer + 15% Other

</div>

---

# Part 4: Word2Vec

**Learning embeddings from context:**

<div class="columns">
<div class="column">

**Skip-gram:**
Given target word, predict context

"The **cat** sat on mat"
- cat → the, sat, on

**CBOW:**
Given context, predict target

the, sat, on → **cat**

</div>
<div class="column">

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,
    window=5,
    min_count=5,
    sg=1  # Skip-gram
)
```

</div>
</div>

---

# Word2Vec: Semantic Similarity

```python
# Find similar words
model.wv.most_similar('computer', topn=5)
# [('software', 0.82), ('program', 0.79), ('system', 0.75), ...]

# Word analogies
model.wv.most_similar(
    positive=['woman', 'king'],
    negative=['man']
)
# [('queen', 0.71), ...]
```

<div class="callout warning">
<div class="callout-title">Hands-on Exercise</div>

Try creating your own word analogies! What works? What fails?

</div>

---

# Part 5: Visualizing with UMAP

**Projecting 100D → 2D:**

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    metric='cosine'
)

embeddings_2d = reducer.fit_transform(word_vectors)
```

**UMAP advantages:**
- Faster than t-SNE
- Preserves global structure
- Better cluster separation
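To actually look at the projection, a minimal plotting sketch (assuming matplotlib, and that `word_vectors` came from `model.wv.vectors` above):

```python
import matplotlib.pyplot as plt

words = model.wv.index_to_key  # vocabulary aligned with word_vectors

plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=5, alpha=0.5)

# Label a few words to see where they land
for w in ['hockey', 'nasa', 'computer', 'doctor']:
    if w in words:
        i = words.index(w)
        plt.annotate(w, embeddings_2d[i])

plt.title('Word2Vec embeddings projected to 2D with UMAP')
plt.show()
```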
---

# What You Should See

When you visualize embeddings:

<div class="columns">
<div class="column">

**Sports cluster:**
- hockey, baseball, player, team, game

**Space cluster:**
- nasa, shuttle, orbit, launch, space

</div>
<div class="column">

**Tech cluster:**
- computer, software, program, windows

**Medical cluster:**
- doctor, patient, hospital, treatment

</div>
</div>

<div class="callout tip">

Related words should cluster together even though we never told the model they were related!

</div>

---

# Part 6: Comparing Methods

| Method | Speed | Interpretability | Quality | Data Needed |
|--------|-------|------------------|---------|-------------|
| LSA | Fast | Medium | Medium | Small-Medium |
| LDA | Medium | High | Medium | Medium |
| Word2Vec | Medium | Low | High | Large |

**Recommendations:**
- **Quick exploration:** LSA
- **Interpretable topics:** LDA
- **Best semantic quality:** Word2Vec

---
# Document Classification with Embeddings

**Using embeddings as features:**

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def document_vector(doc, model):
    """Average word vectors for document."""
    tokens = preprocess(doc)  # preprocess: tokenization helper assumed from earlier in the workshop
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)

# Train classifier (w2v is the Word2Vec model trained above)
X_train = [document_vector(doc, w2v) for doc in train_docs]
clf = LogisticRegression()
clf.fit(X_train, y_train)
```

**Compare to TF-IDF baseline!**
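For the baseline, a sketch using the same classifier on TF-IDF features (assuming a held-out `test_docs` / `y_test` split from the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Baseline: identical classifier, sparse TF-IDF features instead of embeddings
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = tfidf.fit_transform(train_docs)
X_test_tfidf = tfidf.transform(test_docs)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train_tfidf, y_train)
print(f"TF-IDF baseline accuracy: {baseline.score(X_test_tfidf, y_test):.3f}")
```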
---

# Key Takeaways

1. **Embeddings capture semantic meaning** - similar words have similar vectors

2. **Different methods, different strengths:**
   - LSA: Fast, linear, interpretable
   - LDA: Probabilistic, topic-focused
   - Word2Vec: Neural, best for similarity

3. **Visualization reveals structure** - UMAP shows semantic clusters

4. **Limitations:**
   - Static (one vector per word, no context)
   - Requires substantial data
   - Can encode biases

**Next week:** Contextual embeddings (BERT, GPT)!

---

# Discussion Questions

1. **Why does vector arithmetic work?** What does "king - man + woman" really mean geometrically?

2. **Bias in embeddings:** If Word2Vec learns from news articles, what biases might it capture?

3. **Window size matters:** What happens with window=2 vs window=10?

4. **Out-of-vocabulary problem:** How do you handle words not in your vocabulary?

5. **When to use what:** For a sentiment analysis task, would you choose LSA, LDA, or Word2Vec?

---

# Next Steps

**For Assignment 2:**
- Use embeddings to improve your classifier
- Compare at least 2 embedding methods
- Visualize your embeddings

**Coming up in Lecture 11:**
- Modern neural word embeddings
- GloVe and FastText
- Subword tokenization

**Office hours:** Available if you need help with the notebook!
