Commit 415b3b3

docs: Expand data warehouse design answer with scale requirements, architecture diagram, and star schema details.
1 parent: 17e668e

File tree: 3 files changed (+5392, -137 lines)

docs/Interview-Questions/Natural-Language-Processing.md

Lines changed: 316 additions & 10 deletions
@@ -23,19 +23,325 @@ This document provides a curated list of 100 NLP interview questions commonly as

??? success "View Answer"

**Core Components:**

1. **Self-Attention:** Weighs importance of each token
2. **Multi-Head Attention:** Parallel attention for different relationships
3. **Positional Encoding:** Adds position information
4. **Feed-Forward Networks:** Per-position transformations
## Architecture Overview

The Transformer architecture ("Attention Is All You Need", Vaswani et al., 2017) revolutionized NLP by replacing recurrence with attention mechanisms, enabling parallelization and better long-range dependency modeling.

**Key Parameters (BERT-base)** (checked against the published config in the sketch below):

- **Layers:** 12 encoder layers
- **Hidden size (d_model):** 768
- **Attention heads:** 12
- **Parameters:** 110M
- **Max sequence length:** 512 tokens
- **FFN dimension:** 3072 (4× d_model)
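
These values can be read straight off the published checkpoint configuration; a minimal sketch, assuming the `transformers` package is installed and the Hugging Face Hub is reachable:

```python
# Sanity-check the BERT-base hyperparameters from the published config
# (assumes the `transformers` package and network access to the Hub).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)        # 12
print(config.hidden_size)              # 768
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 512
print(config.vocab_size)               # 30522
```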

## Core Architecture

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Key Innovation:** Parallelizable (unlike RNNs), captures long-range dependencies.

**Multi-Head Attention:**

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
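
Before the full implementation below, the attention equation can be sanity-checked in a few lines against PyTorch's fused kernel (`torch.nn.functional.scaled_dot_product_attention`, available in PyTorch 2.0+); a minimal sketch with random tensors:

```python
# Check softmax(QK^T / sqrt(d_k)) V against PyTorch's built-in kernel
# (requires PyTorch >= 2.0 for F.scaled_dot_product_attention).
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(2, 12, 10, 64)   # [batch, heads, seq_len, d_k]
K = torch.randn(2, 12, 10, 64)
V = torch.randn(2, 12, 10, 64)

manual = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1) @ V
fused = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, fused, atol=1e-5))  # True
```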

## Production Implementation (PyTorch)

```python
# transformer.py
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention.

    Time: O(n² × d) where n=seq_len, d=d_model
    Space: O(n²) for the attention matrix
    """

    def __init__(self, d_model=768, num_heads=12, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 per head for BERT-base

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Scaled Dot-Product Attention.

        Args:
            Q, K, V: [batch, heads, seq_len, d_k]
            mask: [batch, 1, 1, seq_len] for padding

        Returns:
            output: [batch, heads, seq_len, d_k]
            attention_weights: [batch, heads, seq_len, seq_len]
        """
        # scores: [batch, heads, seq_len, seq_len]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask (padding positions get -inf before softmax)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Attention weights
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply to values
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def split_heads(self, x):
        """[batch, seq, d_model] → [batch, heads, seq, d_k]"""
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        """[batch, heads, seq, d_k] → [batch, seq, d_model]"""
        batch_size, _, seq_len, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

    def forward(self, query, key, value, mask=None):
        # Linear projections and split heads
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        # Attention
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Combine heads and final projection
        output = self.combine_heads(attn_output)
        output = self.W_o(output)

        return output, attn_weights


class PositionWiseFeedForward(nn.Module):
    """
    FFN(x) = GELU(xW₁ + b₁)W₂ + b₂  (the original paper uses ReLU; BERT uses GELU)

    Applied independently to each position.
    """

    def __init__(self, d_model=768, d_ff=3072, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))


class TransformerEncoderLayer(nn.Module):
    """Single encoder layer: self-attention + FFN with post-norm residuals."""

    def __init__(self, d_model=768, num_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()

        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = PositionWiseFeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual + norm
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # FFN + residual + norm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))

        return x


class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding.

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """

    def __init__(self, d_model=768, max_len=512):
        super().__init__()

        # Create PE matrix [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()

        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]


class TransformerEncoder(nn.Module):
    """Complete Transformer encoder (BERT-style)."""

    def __init__(
        self,
        vocab_size=30522,
        d_model=768,
        num_layers=12,
        num_heads=12,
        d_ff=3072,
        max_len=512,
        dropout=0.1
    ):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

    def forward(self, input_ids, attention_mask=None):
        """
        Args:
            input_ids: [batch, seq_len]
            attention_mask: [batch, seq_len] (1=real token, 0=padding)

        Returns:
            [batch, seq_len, d_model]
        """
        # Token embeddings + scaling
        x = self.embedding(input_ids) * math.sqrt(self.d_model)

        # Add positional encoding
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Prepare attention mask [batch, 1, 1, seq_len]
        if attention_mask is not None:
            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

        # Pass through layers
        for layer in self.layers:
            x = layer(x, attention_mask)

        return x


# Example
if __name__ == "__main__":
    model = TransformerEncoder(
        vocab_size=30522,  # BERT vocab
        d_model=768,
        num_layers=12,
        num_heads=12
    )

    input_ids = torch.randint(0, 30522, (2, 10))  # batch=2, seq=10
    mask = torch.ones(2, 10)

    output = model(input_ids, mask)
    print(f"Output shape: {output.shape}")  # [2, 10, 768]

    total_params = sum(p.numel() for p in model.parameters())
    print(f"Parameters: {total_params:,}")  # ~110M
```
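
As a rough cross-check, the parameter count above is close to the published BERT-base checkpoint. The real model uses learned position embeddings, token-type embeddings, and a pooler rather than sinusoidal encodings, so the totals match only approximately; a sketch, assuming the `transformers` package:

```python
# Compare against the published BERT-base checkpoint (assumes `transformers`
# is installed and the weights can be downloaded from the Hub).
from transformers import BertModel

hf_bert = BertModel.from_pretrained("bert-base-uncased")
print(f"Hugging Face BERT-base parameters: {hf_bert.num_parameters():,}")  # ~109M
```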

## Architecture Comparison

| Model | Type | Params | Context | Training | Best For |
|-------|------|--------|---------|----------|----------|
| **BERT** | Encoder | 110M-340M | 512 | Days (TPUs) | Classification, NER, QA |
| **GPT-3** | Decoder | 175B | 2048 | Months (GPUs) | Generation, few-shot |
| **T5** | Enc-Dec | 220M-11B | 512 | Weeks (TPUs) | Translation, summarization |
| **LLaMA** | Decoder | 7B-65B | 2048 | Weeks (GPUs) | Open-source generation |
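
The "Best For" column maps directly onto which model head you load in practice; a minimal sketch, assuming the `transformers` package (checkpoint names are the standard Hub ones):

```python
# One head per architecture family (assumes the `transformers` package).
from transformers import (
    AutoModelForSequenceClassification,  # encoder (BERT-style): classification
    AutoModelForCausalLM,                # decoder (GPT-style): generation
    AutoModelForSeq2SeqLM,               # encoder-decoder (T5-style): translation, summarization
)

classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
generator = AutoModelForCausalLM.from_pretrained("gpt2")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```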

## Key Innovations Explained

**1. Why √d_k Scaling?**

- **Problem:** For large d_k, dot products grow large → softmax saturates
- **Impact:** Gradients vanish, training fails
- **Solution:** Divide by √d_k to normalize variance
- **Math:** For zero-mean, unit-variance components, Var(q·k) = d_k, so Var(q·k/√d_k) = 1 (checked numerically in the sketch below)
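
A quick numeric check of the variance argument, with illustrative values and a fixed seed:

```python
# Why the sqrt(d_k) scaling matters: variance of dot products and softmax saturation.
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10_000, d_k)   # zero-mean, unit-variance components
k = torch.randn(10_000, d_k)

scores = (q * k).sum(dim=-1)                # unscaled dot products
print(scores.var().item())                  # ≈ 64 (= d_k)
print((scores / d_k ** 0.5).var().item())   # ≈ 1 after scaling

# Saturation: with large logits the softmax puts nearly all mass on one entry,
# so gradients through the other positions vanish.
logits = torch.randn(8) * d_k ** 0.5        # std ≈ sqrt(d_k), like unscaled scores
print(torch.softmax(logits, dim=-1).max().item())               # typically close to 1.0
print(torch.softmax(logits / d_k ** 0.5, dim=-1).max().item())  # much flatter
```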

**2. Multi-Head Attention Benefits**

- Different heads learn different patterns, for example:
    - **Head 1:** Syntactic dependencies (subject-verb agreement)
    - **Head 2:** Semantic relationships
    - **Head 3:** Coreference resolution
- **12 heads × 64-dim = 768-dim** (same total width as a single head)

**3. Position Encoding**

- **Why needed:** Self-attention is permutation-equivariant (it has no notion of order): permuting the input tokens simply permutes the outputs (see the sketch below)
- **Sinusoidal advantage:** Generalizes to sequences longer than those seen in training
- **Modern alternatives:** Learned PE, RoPE, ALiBi
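
A minimal sketch of the permutation point above, using PyTorch's built-in `nn.MultiheadAttention` (standalone, not the classes defined earlier):

```python
# Without positional encoding, shuffling the tokens just shuffles the outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True).eval()

x = torch.randn(1, 5, 64)      # batch=1, 5 tokens, d_model=64
perm = torch.randperm(5)

with torch.no_grad():
    out, _ = attn(x, x, x)                                       # original order
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])       # shuffled tokens

# The output for the shuffled input is the shuffled output of the original.
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))  # True
```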

## Common Pitfalls

| Pitfall | Impact | Solution |
|---------|--------|----------|
| **O(n²) memory** | OOM for long sequences (>4K) | FlashAttention, sparse attention patterns |
| **No positional info** | Model ignores token order | Positional encoding |
| **Padding inefficiency** | Wasted compute | Dynamic batching, sequence packing |
| **Attention collapse** | All weights near-uniform | Proper initialization, gradient clipping |
| **512 token limit** | Can't process long documents | Longformer, BigBird, chunking (sketched below) |
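
For the 512-token pitfall, a minimal sliding-window chunking sketch; `chunk_token_ids` is a hypothetical helper for illustration, and real pipelines typically rely on the tokenizer's overflow handling (e.g. `return_overflowing_tokens=True`) instead:

```python
# Split a long token-id sequence into overlapping windows that fit the model.
def chunk_token_ids(token_ids, max_len=512, stride=128):
    """Return overlapping chunks of at most max_len ids; adjacent chunks share `stride` ids."""
    chunks = []
    step = max_len - stride
    for start in range(0, max(len(token_ids) - stride, 1), step):
        chunks.append(token_ids[start:start + max_len])
    return chunks

doc = list(range(1300))                    # stand-in for 1300 token ids
chunks = chunk_token_ids(doc, max_len=512, stride=128)
print([len(c) for c in chunks])            # [512, 512, 512, 148]; consecutive chunks overlap by 128 ids
```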

## Real-World Systems

**Google BERT (2018):**

- **Training:** 16 TPUs × 4 days on Wikipedia + BooksCorpus
- **Impact:** State of the art on 11 NLP tasks; kicked off the pre-training revolution
- **Production:** Powers query understanding in Google Search

**OpenAI GPT-3 (2020):**

- **Scale:** 175B params, 96 layers, 96 heads
- **Training:** ~$4.6M estimated compute cost, 300B tokens
- **Innovation:** Few-shot learning without fine-tuning
- **Limitation:** 2K context window (GPT-4 later extended this to 128K)

**Meta LLaMA (2023):**

- **Efficiency:** 65B params competitive with GPT-3's 175B
- **Improvements:** RoPE, SwiGLU, RMSNorm
- **Training:** 1.4T tokens on 2048 A100 GPUs
- **Open source:** Democratized LLM access

!!! tip "Interviewer's Insight"

A good answer explains why $\sqrt{d_k}$ scaling matters and can walk through the attention mechanism step by step. **Strong candidates** go further:

- Can implement scaled dot-product attention from scratch with correct tensor shapes
- Explain √d_k scaling mathematically (it prevents softmax saturation)
- Understand the O(n²) cost and its mitigations (FlashAttention avoids materializing the n×n attention matrix, reducing memory to O(n))
- Know when to use BERT vs GPT vs T5 (classification vs generation vs sequence-to-sequence)
- Mention recent advances: RoPE for longer context, FlashAttention for efficiency
- Discuss production concerns: quantization (INT8 inference, sketched below), distillation (DistilBERT), ONNX export
- Reference real impact: Google reported that BERT helps it better understand roughly 1 in 10 English search queries
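
One of the production levers above, post-training dynamic quantization, takes only a few lines; a sketch that assumes the `TransformerEncoder` class from the implementation earlier in this answer and runs on CPU:

```python
# Dynamic INT8 quantization of the Linear layers (the bulk of Transformer weights).
# Assumes TransformerEncoder from the implementation above is already defined.
import torch

model = TransformerEncoder(vocab_size=30522, d_model=768, num_layers=12, num_heads=12).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

input_ids = torch.randint(0, 30522, (1, 32))
with torch.no_grad():
    out = quantized(input_ids)
print(out.shape)  # torch.Size([1, 32, 768]); Linear weights stored in INT8, roughly 4x smaller
```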

---