
Day 4: Experiment 1 - Embeddings Validation Detailed Learning Notes

Date: January 1, 2026
Focus: Running 7 hands-on experiments to validate embeddings theory from Days 1-3
Structure: Detailed learning objectives for each test, mapping theory to practice

This document captures the deep learning objectives for each test in Experiment 1. Each test validates a specific concept from Days 1-3 and teaches a practical insight about how embeddings work in RAG systems.


Foundation Coming Into Day 4

From Day 1-3: Complete Learning Foundation

Key Concepts to Validate:

  • Day 1: Embeddings are pre-trained lookups (not runtime calculations), latent dimensions capture semantic meaning, dimensionality trade-offs
  • Day 2-3: Transformer architecture, attention mechanisms, working memory vs knowledge, hallucinations and how context prevents them

Why Day 4 Matters:

  • Day 1-3: Theory (understanding mechanisms)
  • Day 4: Practice (seeing mechanisms work in real code)
  • Week 1 Project: Building a system that relies on all these insights

Test 1: Pre-trained Models - Deterministic Lookup

Core Insight

Embeddings are pre-computed lookups, not runtime calculations. The same input always returns the identical vector.

What You're Testing

text = "The cat sat on the mat"
embedding_1 = get_embedding(text)  # Call 1
embedding_2 = get_embedding(text)  # Call 2
# Expected: embedding_1 == embedding_2 (bitwise identical)
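
As a concrete, hedged sketch, get_embedding could be implemented with the OpenAI Python SDK like this (the helper itself is an assumption; the model name matches the one discussed in Test 3):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # Single API call; the deployed model is frozen, so the output is deterministic
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

text = "The cat sat on the mat"
print(get_embedding(text) == get_embedding(text))  # Expected: True (identical vectors)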

What You Learn

The Realization: Embeddings don't change. They're not random. They're not computed fresh each time.

Validates Day 1 (lines 20-33):

"Models are pre-trained. You just look up embeddings. During training, the model discovers latent dimensions automatically. At runtime: you just do word-by-word lookup → get pre-computed embedding values"

Practical understanding: When you call client.embeddings.create(), you're not training anything and you're not getting a random result. You're running inference through a frozen, pre-trained model whose weights never change, so the same input always maps to the same vector.

Why This Matters

This validates the entire caching strategy for your RAG system.

If embeddings changed randomly:

  • You'd need to re-embed code every time
  • Your ChromaDB would be stale
  • Caching would be useless

Since embeddings are deterministic:

  • Cache them indefinitely
  • Unchanged code files never need re-embedding
  • System scales efficiently
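
Because of this determinism, a simple content-hash cache is enough. A minimal sketch, assuming the get_embedding helper from the sketch above (the cache file layout is illustrative, not the project's embedder.py):

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("embedding_cache.json")

def cached_embedding(text: str) -> list[float]:
    # Key the cache on a hash of the exact text: identical text -> identical key
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key not in cache:
        cache[key] = get_embedding(text)  # call the API only for unseen text
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]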

Connection to Day 1-2 Learning

From Day 1 notes on model training phases:

  • Phase 1: Model trains on internet data, discovers latent dimensions
  • Phase 2: Frozen model is deployed with fixed weights
  • Runtime: No training. Only inference (lookup)

Test 1 proves Phase 2 is actually happening.

What Passing vs Failing Means

✅ PASS: Embeddings are identical

  • Decision: Safe to cache embeddings indefinitely
  • Implication: embedder.py can use persistent caching

❌ FAIL: Embeddings differ between calls

  • Problem: Caching strategy becomes invalid
  • Requires: Architecture redesign

Test 2: Semantic Clustering - Latent Dimensions Capture Meaning

Core Insight

Similar code clusters together in embedding space regardless of language syntax. Semantic meaning is encoded in the latent dimensions.

What You're Testing

code_1 = "def add(a, b): return a + b"           # Python add
code_2 = "function add(a, b) { return a + b; }" # JavaScript add
code_3 = "def multiply(a, b): return a * b"     # Python multiply

Expected: similarity(py_add, js_add) > similarity(py_add, py_multiply)
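
A minimal sketch of the comparison, assuming NumPy and the get_embedding helper sketched in Test 1 (cosine_similarity here is an illustrative helper, reused in the later tests):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    # Angle-based similarity: dot product divided by the product of magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_py_add = get_embedding(code_1)
emb_js_add = get_embedding(code_2)
emb_py_multiply = get_embedding(code_3)

print(cosine_similarity(emb_py_add, emb_js_add))       # expected higher: same operation
print(cosine_similarity(emb_py_add, emb_py_multiply))  # expected lower: different operation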

What You Learn

The Realization: The embedding model understands what code does, not just how it looks.

From Day 1 (lines 36-65):

"Latent dimensions compress similar concepts together. Similar semantic meaning → similar embedding patterns"

Example results might show:

  • Python add ↔ JavaScript add: 0.85 (HIGH - same operation)
  • Python add ↔ Python multiply: 0.42 (LOW - different operation)

The insight: An entire dimension (or group of dimensions) represents "addition operation", and this dimension is language-agnostic.

Why This Matters

Without semantic clustering, you'd have to index each language separately. With it, you get multi-language search for free.

Connection to Day 1-2 Learning

From Day 2-3 on transformer attention:

  • Attention allows the model to focus on relevant tokens
  • "add" keyword + operands trigger attention to arithmetic concepts
  • Language-specific syntax is filtered out

What Passing vs Failing Means

✅ PASS: Semantic clustering works across languages

  • Decision: Multi-language indexing is viable
  • Implication: Can include Python/JS/TypeScript in same collection

❌ FAIL: Language syntax is a stronger signal than semantics

  • Problem: Can't reliably search across languages
  • Requires: Separate indexes per language

Test 3: Dimensionality - 1536 vs 384 Dimensions

Core Insight

More dimensions = finer semantic detail. Fewer dimensions = faster computation but information loss.

What You're Testing

# Get 1536-dim embeddings for two related snippets
embedding_full = get_embedding(code_complex)
embedding_similar_full = get_embedding(code_similar)
assert len(embedding_full) == 1536

# Truncate both to 384 dimensions
embedding_truncated = embedding_full[:384]
embedding_similar_truncated = embedding_similar_full[:384]

# Compare similarity at both dimensionalities
similarity_full = cosine_similarity(embedding_full, embedding_similar_full)
similarity_truncated = cosine_similarity(embedding_truncated, embedding_similar_truncated)

# Expected: similarity_full >= similarity_truncated
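
Worth noting: for the text-embedding-3 models, the OpenAI API also accepts a dimensions parameter, so a shorter vector can be requested directly instead of slicing locally (a sketch, reusing the client and code_complex from above):

# Request a 384-dimensional vector directly instead of truncating by hand
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=code_complex,
    dimensions=384,
)
embedding_384 = response.data[0].embedding
assert len(embedding_384) == 384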

What You Learn

Example results:

Similarity (1536 dims):  0.78  (captures fine detail)
Similarity (384 dims):   0.71  (less nuanced)
Difference:              0.07

The realization: Full dimensionality preserves more semantic nuance. The 7% difference quantifies the accuracy trade-off.

From Day 1 (lines 69-87):

"More dimensions = More nuance, more accurate, but slower"

Why This Matters

Cost/benefit analysis for your project:

  • Storage: 1536-dim = 4x larger than 384-dim
  • Computation: Still fast (cosine is O(n) in dims)
  • Accuracy: 7% improvement is meaningful for hard queries
  • OpenAI cost: Same regardless (pay per token, not per dimension)
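
A quick back-of-the-envelope check on the storage claim above, assuming 4-byte float32 values per dimension:

bytes_full = 1536 * 4       # 6144 bytes, roughly 6 KB per embedding
bytes_truncated = 384 * 4   # 1536 bytes, roughly 1.5 KB per embedding
print(bytes_full / bytes_truncated)  # 4.0 -> the "4x larger" figure above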

Decision for Week 1: Use 1536-dim because the 7% accuracy improvement justifies the cost.

Connection to Day 1 Learning

From Day 1 notes on model architecture:

  • Transformers have embedding dimension
  • Larger dimensions = more "memory" to encode information
  • Information density increases with dimensions, but with diminishing returns

What Passing vs Failing Means

✅ PASS: Full dims capture more semantic detail

  • Decision: text-embedding-3-small (1536 dims) is worth the cost
  • Implication: Store all 1536 dimensions in ChromaDB

❌ FAIL: Truncation improves similarity (counterintuitive)

  • Suggests: Noise in higher dimensions
  • Requires: Investigation

Test 4: Distance Metrics - Cosine vs Euclidean

Core Insight

For high-dimensional embeddings, cosine similarity (angle-based) outperforms Euclidean distance (magnitude-based) because embeddings vary in magnitude.

What You're Testing

# Short code snippet vs verbose equivalent
code_short = "def sum_values(arr): return sum(arr)"
code_long = "def calculate_sum(array):\n    total = 0\n    for item in array:\n        total += item\n    return total"

# Different semantic meaning
code_different = "def fetch_user(id): return db.query(User).get(id)"

# Calculate both metrics
euclidean_short_long = euclidean_distance(emb_short, emb_long)
cosine_short_long = cosine_similarity(emb_short, emb_long)

euclidean_short_diff = euclidean_distance(emb_short, emb_different)
cosine_short_diff = cosine_similarity(emb_short, emb_different)
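
A minimal sketch of the pieces the block above assumes: a NumPy-based euclidean_distance helper plus the three embeddings (cosine_similarity as sketched in Test 2):

import numpy as np

def euclidean_distance(a: list[float], b: list[float]) -> float:
    # Straight-line distance between the two vectors; sensitive to magnitude
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Embeddings for the three snippets above, using get_embedding from Test 1
emb_short = get_embedding(code_short)
emb_long = get_embedding(code_long)
emb_different = get_embedding(code_different)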

What You Learn

Example results:

Cosine similarity (short ↔ long):       0.89  (SAME FUNCTION)
Cosine similarity (short ↔ different):  0.24  (DIFFERENT FUNCTION)
Difference:                              0.65  (CLEAR SEPARATION)

Euclidean distance (short ↔ long):      2.14  (LARGE - confused by magnitude)
Euclidean distance (short ↔ different): 1.98  (SIMILAR - confused)
Difference:                              0.16  (POOR SEPARATION)

The realization: The shorter code has smaller magnitude. Euclidean distance measures "straight-line distance", which includes magnitude effects. Cosine measures "angle", which ignores magnitude.

From Day 1 (lines 12-15):

"Cosine distance: The angle between the two vectors (ignores magnitude)"

Why This Matters

For semantic matching, angle matters, magnitude doesn't:

  • Two code snippets doing same thing have similar angle
  • But different magnitude (length differences)
  • Cosine correctly identifies them as similar
  • Euclidean gets confused by magnitude

Connection to Day 1-2 Learning

From Days 2-3 transformer notes:

  • Attention softmax produces normalized weights
  • Output embeddings vary in magnitude based on input length
  • For semantic comparison, we only care about direction (angle), not magnitude

What Passing vs Failing Means

✅ PASS: Cosine distinguishes semantic similarity better

  • Decision: Configure the ChromaDB collection to use cosine distance
  • Implication: retriever.py uses cosine similarity

❌ FAIL: Euclidean performs as well as cosine

  • Suggests: The embeddings are already normalized to unit length (for unit vectors, Euclidean distance and cosine similarity produce the same ranking)
  • Requires: Investigation to confirm; if so, either metric works
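
For reference, a hedged sketch of configuring a ChromaDB collection for cosine distance (the collection name is an assumption; the hnsw:space metadata key is how ChromaDB selects the distance function):

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="code_chunks",
    metadata={"hnsw:space": "cosine"},  # request cosine explicitly rather than relying on the default
)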

Test 5: Semantic Relationships - Programming Paradigm Clustering

Core Insight

Embeddings capture domain-specific semantic relationships. Code from the same programming language clusters together based on learned paradigm patterns.

What You're Testing

Python concepts (similar paradigms):

- "Python list comprehension [x for x in items]"
- "Python dictionary comprehension {k: v for k, v in items}"
- "Python generator expression (x for x in items)"

JavaScript concepts (different paradigm):

- "JavaScript array map items.map(x => x)"
- "JavaScript array filter items.filter(x => x > 0)"
- "JavaScript array reduce items.reduce((a, b) => a + b)"

Expected: Within-language similarity > Cross-language similarity
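
A minimal sketch of how the averages could be computed, assuming the get_embedding and cosine_similarity helpers from the earlier sketches:

from itertools import combinations, product

python_snippets = [
    "Python list comprehension [x for x in items]",
    "Python dictionary comprehension {k: v for k, v in items}",
    "Python generator expression (x for x in items)",
]
js_snippets = [
    "JavaScript array map items.map(x => x)",
    "JavaScript array filter items.filter(x => x > 0)",
    "JavaScript array reduce items.reduce((a, b) => a + b)",
]

py_embs = [get_embedding(s) for s in python_snippets]
js_embs = [get_embedding(s) for s in js_snippets]

within_py = [cosine_similarity(a, b) for a, b in combinations(py_embs, 2)]
within_js = [cosine_similarity(a, b) for a, b in combinations(js_embs, 2)]
cross = [cosine_similarity(a, b) for a, b in product(py_embs, js_embs)]

print(sum(within_py) / len(within_py))  # average within-Python similarity
print(sum(within_js) / len(within_js))  # average within-JavaScript similarity
print(sum(cross) / len(cross))          # average cross-language similarity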

What You Learn

Example results:

Average Python concept similarity:      0.76  (HIGH - same paradigm)
Average JavaScript concept similarity:  0.72  (HIGH - same paradigm)
Average cross-language similarity:      0.48  (MODERATE - different paradigm)

The realization: The embedding model learned more than just "this is code". It learned language-specific idioms:

  • Python emphasizes list/dict comprehensions (declarative)
  • JavaScript emphasizes map/filter/reduce (functional)

These are different approaches to the same problems.

From Day 1 (lines 227-234):

"Embeddings capture semantic relationships. Networks learn relationships as measurable dimensions."

Why This Matters

You get a choice:

  1. No filtering: Return top-5 matches regardless of language (user sees diverse examples)
  2. With filtering: Return top-5 matches in same language (user sees idiomatic examples)

Both options are possible because semantics + language paradigm are both encoded.

Connection to Day 1-2 Learning

From Day 1 notes on how embeddings emerge:

"During training on diverse code, model learns language-specific idioms"

The model detected these patterns and encoded them as separate dimensions.

What Passing vs Failing Means

✅ PASS: Within-language > Cross-language similarity

  • Confirms: Embeddings understand language paradigms
  • Decision: Language metadata is optional (nice feature)
  • Implication: Can support both filtered and unfiltered search

❌ FAIL: Within-language ≈ Cross-language

  • Means: Embeddings are language-agnostic (also valid)
  • Problem: Can't leverage language paradigm patterns
  • Decision: Either accept it or need language-specific processing

Test 6: Chunking by Semantic Units - Function-Level vs Fixed-Size

Core Insight

How you split code fundamentally changes retrieval quality. Semantic boundaries (functions) outperform arbitrary boundaries (fixed character counts).

What You're Testing

Same code file, two chunking strategies:

Strategy A: Semantic chunks (by function) - 3 complete functions
Strategy B: Fixed-size chunks (100-char splits) - functions broken across chunks

Query: "How do I authenticate a user?"
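
A minimal sketch of function-level chunking using Python's built-in ast module, with fixed-size chunking shown for contrast (both helpers are illustrative, not the project's chunker.py; only top-level functions are captured):

import ast

def chunk_by_function(source: str) -> list[str]:
    # Use the AST to recover exact function boundaries, so each chunk is one complete function
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

def chunk_fixed_size(source: str, size: int = 100) -> list[str]:
    # Arbitrary character windows: function bodies get split mid-definition
    return [source[i:i + size] for i in range(0, len(source), size)]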

What You Learn

Example results:

SEMANTIC CHUNKING:
authenticate_user function score:  0.87  (perfect match - TOP RESULT)
fetch_user_profile function score: 0.45
update_user_settings function score: 0.42

FIXED-SIZE CHUNKING:
chunk_1 (docstring only):  0.62  (fragment)
chunk_2 (middle of func):  0.71  (incomplete)
chunk_3 (end of func):     0.58  (fragment)

The realization: Even with identical embeddings, semantic chunking gives a clear winner. Fixed-size chunks scatter the meaning across multiple lower-scoring fragments.

From Day 2-3 (lines 666-668, 735-736):

"Good chunking strategy = good tool output = better model response Your chunks become working memory"

Why This Matters

Your RAG pipeline quality directly depends on chunking:

Good chunking:
  Top result: complete authenticate_user function (0.87)
  → Claude gets full context
  → Accurate answer

Bad chunking:
  Top result: docstring only (0.62)
  → Claude has incomplete context
  → Risk of hallucination

The 0.25 difference (0.87 vs 0.62) = difference between confident answer and hallucination risk.

Connection to Day 2-3 Learning

From Day 2-3 on working memory vs knowledge:

"Working memory: specific, current, reliable. Model learns: working memory is more trusted than knowledge."

Good chunking = complete working memory = Claude trusts it.

What Passing vs Failing Means

✅ PASS: Semantic chunking outperforms fixed-size

  • Decision: config.yaml: chunk_strategy: by_function
  • Implication: chunker.py parses and chunks by function boundaries

❌ FAIL: Fixed-size works as well

  • Investigation: Are functions very short?
  • Possible outcome: Might be acceptable if structure is simple

Deeper Learning

Professional tools use multiple strategies:

  • AST parsing (get exact function boundaries)
  • Semantic search (find related chunks)
  • LSP integration (understand scope and imports)
  • Fallback heuristics (when parsing fails)

Test 6 validates the principle (semantic > fixed). Real systems validate the practice (combine methods).


Test 7: Working Memory for RAG - Retrieval Quality Prevents Hallucinations

Core Insight

The entire RAG pattern works: good retrieval provides relevant context (working memory) that Claude uses instead of hallucinating from training data.

What You're Testing

Relevant code chunks (what user needs):

def calculate_tax(amount, rate):
    return amount * rate

def apply_discount(price, discount_pct):
    return price * (1 - discount_pct)

Irrelevant code chunks (unrelated):

def connect_database(host, port):
    return DatabaseConnection(host, port)

def log_error(message):
    logger.error(message)

Query: "How do I calculate price after tax and discount?"
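
A minimal sketch of the retrieval step itself, assuming the get_embedding and cosine_similarity helpers from the earlier sketches (the chunk strings and the top_k value are illustrative):

chunks = [
    "def calculate_tax(amount, rate):\n    return amount * rate",
    "def apply_discount(price, discount_pct):\n    return price * (1 - discount_pct)",
    "def connect_database(host, port):\n    return DatabaseConnection(host, port)",
    "def log_error(message):\n    logger.error(message)",
]

query = "How do I calculate price after tax and discount?"
query_emb = get_embedding(query)

# Score every chunk against the query, highest similarity first
scored = sorted(
    ((cosine_similarity(query_emb, get_embedding(chunk)), chunk) for chunk in chunks),
    reverse=True,
)

top_k = 2
retrieved = [chunk for score, chunk in scored[:top_k]]  # working memory handed to Claude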

What You Learn

Example results:

RELEVANT CHUNKS:
calculate_tax function:    0.82  (strong match)
apply_discount function:   0.79  (strong match)

IRRELEVANT CHUNKS:
connect_database function: 0.31  (weak - correctly filtered)
log_error function:        0.28  (weak - correctly filtered)

Retrieval Quality Gap: 0.82 - 0.31 = 0.51  (EXCELLENT SEPARATION)

The realization: Semantic search actually filters for relevance. It's not magic. It's a mathematical property of embeddings in high-dimensional space.

From Day 2-3 (lines 611-615):

"Hallucinations: Model fills gaps when context is insufficient. Working memory: Specific, current, reliable context prevents gaps."

Why This Matters

Without RAG (knowledge-only):

  • Claude generates generic answer using training knowledge

With RAG but bad retrieval (broken working memory):

  • Search retrieves wrong functions (error logging, database)
  • Claude hallucinates about tax calculation in database context

With RAG and good retrieval (healthy working memory):

  • Search retrieves correct functions (calculate_tax, apply_discount)
  • Claude provides accurate, contextualized answer

Test 7 validates we're in the third scenario.

Connection to Day 2-3 Learning

From Day 2-3 notes on transformer architecture:

"Transformers use attention to focus on relevant information"

Your RAG system applies the same principle:

  • Embedding space acts like an attention mechanism
  • High similarity = high attention weight
  • Low similarity = low attention weight

What Passing vs Failing Means

✅ PASS: Relevant scores >> Irrelevant scores, top-k are all relevant

  • Confirms: Semantic search reduces Claude's hallucination risk
  • Decision: Top-k retrieval strategy is safe and effective
  • Implication: RAG pattern works end-to-end

❌ FAIL: Relevant and irrelevant chunks score similarly (gap < 0.2)

  • Means: Embeddings not distinguishing relevance well
  • Problem: Top-k might include irrelevant code
  • Investigation: Is embedding model trained for code? Are chunks sized correctly?

Deeper Learning: Empirical Top-k Determination

Test 7 validates the principle (retrieval works). But it doesn't answer:

How many chunks do you actually need?

The answer varies:

  • Simple queries: 1-2 chunks sufficient
  • Complex queries: 5-10 chunks
  • Edge cases: 20+ chunks

This is why config.yaml has top_k as a tuning parameter.


How Tests 1-7 Connect

Test    | Validates                               | Critical For
Test 1  | Embeddings are deterministic            | Caching strategy
Test 2  | Semantic clustering across languages    | Multi-language support
Test 3  | 1536 dims > 384 dims                    | Dimension choice
Test 4  | Cosine > Euclidean distance             | Distance metric
Test 5  | Language paradigm clustering            | Metadata filtering
Test 6  | Function-level chunks > fixed-size      | Chunking strategy
Test 7  | Good retrieval prevents hallucinations  | RAG pattern validation

The Progression:

  • Tests 1-3: Embeddings fundamentals
  • Tests 4-5: Distance metrics and semantic relationships
  • Test 6: Chunking strategy
  • Test 7: Complete RAG system validation

Key Takeaways for RAG-Code-QA Implementation

From Test 1: Caching is Safe

  • Embeddings are stable
  • Unchanged code doesn't need re-embedding
  • Safe to cache indefinitely

From Test 2: Multi-Language Works

  • Semantic meaning transcends syntax
  • Python, JavaScript, TypeScript can share same index
  • User can search across languages

From Test 3: 1536 Dimensions is Optimal

  • 7% better accuracy than 384 dims
  • Worth the storage/computation cost

From Test 4: Cosine is Correct

  • Handles magnitude variation
  • Configure the ChromaDB collection for cosine distance

From Test 5: Language Paradigms Matter

  • Embeddings learn conventions
  • Metadata filtering is optional but valuable

From Test 6: Chunking is Critical

  • Semantic units (functions) > arbitrary splits
  • Highest-impact design decision

From Test 7: RAG Works

  • Good retrieval = complete working memory = accurate answers
  • Sharply reduces hallucination risk when done well

These learning notes map each test to specific theory from Days 1-3. Understanding what each test teaches you prepares you for implementation with the right architecture decisions.