Date: January 1, 2026
Focus: Running 7 hands-on experiments to validate embeddings theory from Days 1-3
Structure: Detailed learning objectives for each test, mapping theory to practice
This document captures the detailed learning objectives for each test in Experiment 1. Each test validates a specific concept from Days 1-3 and teaches a practical insight about how embeddings work in RAG systems.
Key Concepts to Validate:
- Day 1: Embeddings are pre-trained lookups (not runtime calculations), latent dimensions capture semantic meaning, dimensionality trade-offs
- Day 2-3: Transformer architecture, attention mechanisms, working memory vs knowledge, hallucinations and how context prevents them
Why Day 4 Matters:
- Day 1-3: Theory (understanding mechanisms)
- Day 4: Practice (seeing mechanisms work in real code)
- Week 1 Project: Building a system that relies on all these insights
Embeddings are pre-computed lookups, not runtime calculations. The same input always returns the identical vector.
text = "The cat sat on the mat"
embedding_1 = get_embedding(text) # Call 1
embedding_2 = get_embedding(text) # Call 2
# Expected: embedding_1 == embedding_2 (bitwise identical)
The Realization: Embeddings don't change. They're not random. They're not computed fresh each time.
Validates Day 1 (lines 20-33):
"Models are pre-trained. You just look up embeddings. During training, the model discovers latent dimensions automatically. At runtime: you just do word-by-word lookup → get pre-computed embedding values"
Practical understanding: When you call client.embeddings.create(), you're not triggering any training or weight updates. You're running inference against a frozen model, so the same input always maps to the same pre-computed result, effectively a lookup.
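A minimal sketch of the get_embedding helper assumed by the test snippets in this document, using the OpenAI Python client (the model name is an assumption; error handling and batching are omitted):

```python
# Hypothetical helper assumed by the test snippets in these notes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding for `text` from a frozen, deterministic model."""
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding
```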
This validates the entire caching strategy for your RAG system.
If embeddings changed randomly:
- You'd need to re-embed code every time
- Your ChromaDB would be stale
- Caching would be useless
Since embeddings are deterministic:
- Cache them indefinitely
- Unchanged code files never need re-embedding
- System scales efficiently
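Because the mapping from text to vector never changes, a simple content-hash cache is enough. A hedged sketch (the on-disk layout is an assumption, not the actual embedder.py design):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".embedding_cache")  # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text: str) -> list[float]:
    """Check the on-disk cache first; only call the API for text we haven't seen."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    embedding = get_embedding(text)  # deterministic, so safe to cache indefinitely
    cache_file.write_text(json.dumps(embedding))
    return embedding
```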
From Day 1 notes on model training phases:
- Phase 1: Model trains on internet data, discovers latent dimensions
- Phase 2: Frozen model is deployed with fixed weights
- Runtime: No training. Only inference (lookup)
Test 1 proves Phase 2 is actually happening.
✅ PASS: Embeddings are identical
- Decision: Safe to cache embeddings indefinitely
- Implication: embedder.py can use persistent caching
❌ FAIL: Embeddings differ between calls
- Problem: Caching strategy becomes invalid
- Requires: Architecture redesign
Similar code clusters together in embedding space regardless of language syntax. Semantic meaning is encoded in the latent dimensions.
code_1 = "def add(a, b): return a + b" # Python add
code_2 = "function add(a, b) { return a + b; }" # JavaScript add
code_3 = "def multiply(a, b): return a * b" # Python multiplyExpected: similarity(py_add, js_add) > similarity(py_add, py_multiply)
The Realization: The embedding model understands what code does, not just how it looks.
From Day 1 (lines 36-65):
"Latent dimensions compress similar concepts together. Similar semantic meaning → similar embedding patterns"
Example results might show:
- Python add ↔ JavaScript add: 0.85 (HIGH - same operation)
- Python add ↔ Python multiply: 0.42 (LOW - different operation)
The insight: An entire dimension (or group of dimensions) represents "addition operation", and this dimension is language-agnostic.
Without semantic clustering, you'd have to index each language separately. With it, you get multi-language search for free.
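A short sketch of how this comparison might be run, reusing the get_embedding helper sketched above plus a plain NumPy cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Angle-based similarity in [-1, 1]; magnitude cancels out."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

py_add = get_embedding("def add(a, b): return a + b")
js_add = get_embedding("function add(a, b) { return a + b; }")
py_mul = get_embedding("def multiply(a, b): return a * b")

# Hypothesis under test: shared operation (addition) outweighs shared syntax (Python)
assert cosine_similarity(py_add, js_add) > cosine_similarity(py_add, py_mul)
```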
From Day 2-3 on transformer attention:
- Attention allows the model to focus on relevant tokens
- "add" keyword + operands trigger attention to arithmetic concepts
- Language-specific syntax is filtered out
✅ PASS: semantic clustering works across languages
- Decision: Multi-language indexing is viable
- Implication: Can include Python/JS/TypeScript in same collection
❌ FAIL: Language syntax is stronger signal than semantics
- Problem: Can't reliably cross-language search
- Requires: Use separate indexes per language
More dimensions = finer semantic detail. Fewer dimensions = faster computation but information loss.
# Get 1536-dim embeddings for two related snippets
embedding_full = get_embedding(code_complex)
embedding_similar = get_embedding(code_similar)
assert len(embedding_full) == 1536
# Truncate both to 384 dimensions
embedding_truncated = embedding_full[:384]
embedding_similar_truncated = embedding_similar[:384]
# Compare similarity between the related snippets at both sizes
similarity_full = cosine_similarity(embedding_full, embedding_similar)
similarity_truncated = cosine_similarity(embedding_truncated, embedding_similar_truncated)
# Expected: similarity_full >= similarity_truncated
Example results:
Similarity (1536 dims): 0.78 (captures fine detail)
Similarity (384 dims): 0.71 (less nuanced)
Difference: 0.07
The realization: Full dimensionality preserves more semantic nuance. The 7% difference quantifies the accuracy trade-off.
From Day 1 (lines 69-87):
"More dimensions = More nuance, more accurate, but slower"
Cost/benefit analysis for your project:
- Storage: 1536-dim = 4x larger than 384-dim
- Computation: Still fast (cosine is O(n) in dims)
- Accuracy: 7% improvement is meaningful for hard queries
- OpenAI cost: Same regardless (pay per token, not per dimension)
Decision for Week 1: Use 1536-dim because the 7% accuracy improvement justifies the cost.
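If the smaller footprint ever becomes attractive, truncated vectors should be re-normalized before computing cosine similarity; a minimal sketch of that step (the 384 cut-off mirrors the test above). Note that the text-embedding-3 models also accept a dimensions parameter on the API call, which returns an already-shortened vector.

```python
import numpy as np

def truncate_and_normalize(embedding, dims: int = 384):
    """Keep the first `dims` values, then rescale to unit length so cosine scores stay comparable."""
    v = np.asarray(embedding, dtype=float)[:dims]
    return v / np.linalg.norm(v)

embedding_384 = truncate_and_normalize(embedding_full)  # embedding_full from the test above
```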
From Day 1 notes on model architecture:
- Transformers have embedding dimension
- Larger dimensions = more "memory" to encode information
- Information density increases with dimensions, but with diminishing returns
✅ PASS: Full dims capture more semantic detail
- Decision: text-embedding-3-small (1536 dims) is worth the cost
- Implication: Store all 1536 dimensions in ChromaDB
❌ FAIL: Truncation improves similarity (counterintuitive)
- Suggests: Noise in higher dimensions
- Requires: Investigation
For high-dimensional embeddings, cosine similarity (angle-based) outperforms Euclidean distance (magnitude-based) because embeddings vary in magnitude.
# Short code snippet vs verbose equivalent
code_short = "def sum(arr): return sum(arr)"
code_long = "def calculate_sum(array): total = 0; for item in array: total += item; return total"
# Different semantic meaning
code_different = "def fetch_user(id): return db.query(User).get(id)"
# Embed, then calculate both metrics
emb_short = get_embedding(code_short)
emb_long = get_embedding(code_long)
emb_different = get_embedding(code_different)
euclidean_short_long = euclidean_distance(emb_short, emb_long)
cosine_short_long = cosine_similarity(emb_short, emb_long)
euclidean_short_diff = euclidean_distance(emb_short, emb_different)
cosine_short_diff = cosine_similarity(emb_short, emb_different)
Example results:
Cosine similarity (short ↔ long): 0.89 (SAME FUNCTION)
Cosine similarity (short ↔ different): 0.24 (DIFFERENT FUNCTION)
Difference: 0.65 (CLEAR SEPARATION)
Euclidean distance (short ↔ long): 2.14 (LARGE - confused by magnitude)
Euclidean distance (short ↔ different): 1.98 (SIMILAR - confused)
Difference: 0.16 (POOR SEPARATION)
The realization: The shorter code has smaller magnitude. Euclidean distance measures "straight-line distance", which includes magnitude effects. Cosine measures "angle", which ignores magnitude.
From Day 1 (lines 12-15):
"Cosine distance: The angle between the two vectors (ignores magnitude)"
For semantic matching, angle matters, magnitude doesn't:
- Two code snippets doing same thing have similar angle
- But different magnitude (length differences)
- Cosine correctly identifies them as similar
- Euclidean gets confused by magnitude
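The Euclidean counterpart to the cosine helper sketched under Test 2, plus a quick demonstration of why magnitude matters for one metric and not the other:

```python
import numpy as np

def euclidean_distance(a, b) -> float:
    """Straight-line distance; any magnitude difference shows up in the score."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Scaling a vector leaves the angle (cosine) untouched but moves the Euclidean distance.
v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, 10 * v))   # 1.0   (same direction)
print(euclidean_distance(v, 10 * v))  # ~33.7 (magnitude dominates)
```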
From Days 2-3 transformer notes:
- Attention softmax produces normalized weights
- Output embeddings vary in magnitude based on input length
- For semantic comparison, we only care about direction (angle), not magnitude
✅ PASS: cosine distinguishes semantic similarity better
- Decision: ChromaDB's cosine default is correct
- Implication: retriever.py uses cosine similarity
❌ FAIL: Euclidean performs as well
- Suggests: Embeddings are already length-normalized (for unit vectors, Euclidean distance and cosine similarity produce the same ranking)
- Requires: Investigation
Embeddings capture domain-specific semantic relationships. Code from the same programming language clusters together based on learned paradigm patterns.
Python concepts (similar paradigms):
- "Python list comprehension [x for x in items]"
- "Python dictionary comprehension {k: v for k, v in items}"
- "Python generator expression (x for x in items)"
JavaScript concepts (different paradigm):
- "JavaScript array map items.map(x => x)"
- "JavaScript array filter items.filter(x => x > 0)"
- "JavaScript array reduce items.reduce((a, b) => a + b)"
Expected: Within-language similarity > Cross-language similarity
Example results:
Average Python concept similarity: 0.76 (HIGH - same paradigm)
Average JavaScript concept similarity: 0.72 (HIGH - same paradigm)
Average cross-language similarity: 0.48 (MODERATE - different paradigm)
The realization: The embedding model learned more than just "this is code". It learned language-specific idioms:
- Python emphasizes list/dict comprehensions (declarative)
- JavaScript emphasizes map/filter/reduce (functional)
These are different approaches to the same problems.
From Day 1 (lines 227-234):
"Embeddings capture semantic relationships. Networks learn relationships as measurable dimensions."
You get a choice:
- No filtering: Return top-5 matches regardless of language (user sees diverse examples)
- With filtering: Return top-5 matches in same language (user sees idiomatic examples)
Both options are possible because semantics + language paradigm are both encoded.
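A hedged sketch of what the filtered option could look like with ChromaDB's where clause (the collection name and the language metadata key are assumptions about how chunks were stored):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("code_chunks")  # hypothetical collection

# Unfiltered: top-5 matches across every language
results_any = collection.query(query_texts=["list comprehension"], n_results=5)

# Filtered: top-5 matches restricted to Python via metadata
results_py = collection.query(
    query_texts=["list comprehension"],
    n_results=5,
    where={"language": "python"},  # assumes each chunk was stored with a language field
)
```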
From Day 1 notes on how embeddings emerge:
"During training on diverse code, model learns language-specific idioms"
The model detected these patterns and encoded them as separate dimensions.
✅ PASS: Within-language > Cross-language similarity
- Confirms: Embeddings understand language paradigms
- Decision: Language metadata is optional (nice feature)
- Implication: Can support both filtered and unfiltered search
❌ FAIL: Within-language ≈ Cross-language
- Means: Embeddings are language-agnostic (also valid)
- Problem: Can't leverage language paradigm patterns
- Decision: Either accept it or add language-specific processing
How you split code fundamentally changes retrieval quality. Semantic boundaries (functions) outperform arbitrary boundaries (fixed character counts).
Same code file, two chunking strategies:
- Strategy A: Semantic chunks (by function) - 3 complete functions (sketched below)
- Strategy B: Fixed-size chunks (100-char splits) - functions broken across chunks
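A minimal sketch of both strategies using only the standard library (illustration only; real code needs error handling for files that don't parse):

```python
import ast

def chunk_by_function(source: str) -> list[str]:
    """Strategy A: one chunk per function definition, found via the AST."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

def chunk_fixed_size(source: str, size: int = 100) -> list[str]:
    """Strategy B: arbitrary 100-character windows that ignore code structure."""
    return [source[i:i + size] for i in range(0, len(source), size)]
```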
Query: "How do I authenticate a user?"
Example results:
SEMANTIC CHUNKING:
authenticate_user function score: 0.87 (perfect match - TOP RESULT)
fetch_user_profile function score: 0.45
update_user_settings function score: 0.42
FIXED-SIZE CHUNKING:
chunk_1 (docstring only): 0.62 (fragment)
chunk_2 (middle of func): 0.71 (incomplete)
chunk_3 (end of func): 0.58 (fragment)
The realization: Even with identical embeddings, semantic chunking gives a clear winner. Fixed-size chunks scatter the meaning across multiple lower-scoring fragments.
From Day 2-3 (lines 666-668, 735-736):
"Good chunking strategy = good tool output = better model response Your chunks become working memory"
Your RAG pipeline quality directly depends on chunking:
Good chunking:
Top result: complete authenticate_user function (0.87)
→ Claude gets full context
→ Accurate answer
Bad chunking:
Top result: docstring only (0.62)
→ Claude has incomplete context
→ Risk of hallucination
The 0.25 difference (0.87 vs 0.62) = difference between confident answer and hallucination risk.
From Day 2-3 on working memory vs knowledge:
"Working memory: specific, current, reliable. Model learns: working memory is more trusted than knowledge."
Good chunking = complete working memory = Claude trusts it.
✅ PASS: Semantic chunking outperforms fixed-size
- Decision: config.yaml sets chunk_strategy: by_function
- Implication: chunker.py parses and chunks by function boundaries
❌ FAIL: Fixed-size works as well
- Investigation: Are functions very short?
- Possible outcome: Might be acceptable if structure is simple
Professional tools use multiple strategies:
- AST parsing (get exact function boundaries)
- Semantic search (find related chunks)
- LSP integration (understand scope and imports)
- Fallback heuristics (when parsing fails)
Test 6 validates the principle (semantic > fixed). Real systems validate the practice (combine methods).
The entire RAG pattern works: good retrieval provides relevant context (working memory) that Claude uses instead of hallucinating from training data.
Relevant code chunks (what user needs):
def calculate_tax(amount, rate):
return amount * rate
def apply_discount(price, discount_pct):
return price * (1 - discount_pct)
Irrelevant code chunks (unrelated):
def connect_database(host, port):
return DatabaseConnection(host, port)
def log_error(message):
logger.error(message)
Query: "How do I calculate price after tax and discount?"
Example results:
RELEVANT CHUNKS:
calculate_tax function: 0.82 (strong match)
apply_discount function: 0.79 (strong match)
IRRELEVANT CHUNKS:
connect_database function: 0.31 (weak - correctly filtered)
log_error function: 0.28 (weak - correctly filtered)
Retrieval Quality Gap: 0.82 - 0.31 = 0.51 (EXCELLENT SEPARATION)
The realization: Semantic search actually filters for relevance. It's not magic. It's a mathematical property of embeddings in high-dimensional space.
From Day 2-3 (lines 611-615):
"Hallucinations: Model fills gaps when context is insufficient. Working memory: Specific, current, reliable context prevents gaps."
Without RAG (knowledge-only):
- Claude generates generic answer using training knowledge
With RAG but bad retrieval (broken working memory):
- Search retrieves wrong functions (error logging, database)
- Claude hallucinates about tax calculation in database context
With RAG and good retrieval (healthy working memory):
- Search retrieves correct functions (calculate_tax, apply_discount)
- Claude provides accurate, contextualized answer
Test 7 validates we're in the third scenario.
From Day 2-3 notes on transformer architecture:
"Transformers use attention to focus on relevant information"
Your RAG system applies the same principle:
- Embedding space acts like attention mechanism
- High similarity = high attention weight
- Low similarity = low attention weight
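A minimal end-to-end sketch of the top-k step Test 7 exercises, reusing the get_embedding and cosine_similarity helpers sketched earlier (the chunk texts are the ones from this test):

```python
chunks = [
    "def calculate_tax(amount, rate): return amount * rate",
    "def apply_discount(price, discount_pct): return price * (1 - discount_pct)",
    "def connect_database(host, port): return DatabaseConnection(host, port)",
    "def log_error(message): logger.error(message)",
]

query = "How do I calculate price after tax and discount?"
query_emb = get_embedding(query)

# Score every chunk against the query, highest first.
scored = sorted(
    ((cosine_similarity(query_emb, get_embedding(chunk)), chunk) for chunk in chunks),
    reverse=True,
)

top_k = 2
context = "\n\n".join(chunk for _, chunk in scored[:top_k])  # becomes Claude's working memory
```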
✅ PASS: Relevant scores >> Irrelevant scores, top-k are all relevant
- Confirms: Semantic search reduces Claude's hallucination risk
- Decision: Top-k retrieval strategy is safe and effective
- Implication: RAG pattern works end-to-end
❌ FAIL: Relevant and irrelevant chunks score similarly (gap < 0.2)
- Means: Embeddings not distinguishing relevance well
- Problem: Top-k might include irrelevant code
- Investigation: Is embedding model trained for code? Are chunks sized correctly?
Test 7 validates the principle (retrieval works). But it doesn't answer:
How many chunks do you actually need?
The answer varies:
- Simple queries: 1-2 chunks sufficient
- Complex queries: 5-10 chunks
- Edge cases: 20+ chunks
This is why config.yaml has top_k as a tuning parameter.
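A hedged sketch of how that knob might be read at startup, assuming a config.yaml with a retrieval.top_k entry (the key names are hypothetical):

```python
import yaml  # PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

top_k = config.get("retrieval", {}).get("top_k", 5)  # fall back to 5 chunks
relevant_chunks = scored[:top_k]  # `scored` as in the Test 7 sketch above
```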
| Test | Validates | Critical For |
|---|---|---|
| Test 1 | Embeddings are deterministic | Caching strategy |
| Test 2 | Semantic clustering across languages | Multi-language support |
| Test 3 | 1536 dims > 384 dims | Dimension choice |
| Test 4 | Cosine > Euclidean distance | Distance metric |
| Test 5 | Language paradigm clustering | Metadata filtering |
| Test 6 | Function-level chunks > fixed-size | Chunking strategy |
| Test 7 | Good retrieval prevents hallucinations | RAG pattern validation |
The Progression:
- Tests 1-3: Embeddings fundamentals
- Tests 4-5: Distance metrics and semantic relationships
- Test 6: Chunking strategy
- Test 7: Complete RAG system validation
Test 1 - Determinism:
- Embeddings are stable
- Unchanged code doesn't need re-embedding
- Safe to cache indefinitely
Test 2 - Cross-language clustering:
- Semantic meaning transcends syntax
- Python, JavaScript, TypeScript can share the same index
- Users can search across languages
Test 3 - Dimensionality:
- 1536 dims give 7% better accuracy than 384 dims
- Worth the storage/computation cost
Test 4 - Distance metric:
- Cosine handles magnitude variation
- ChromaDB's default choice is right
Test 5 - Language paradigms:
- Embeddings learn language conventions
- Metadata filtering is optional but valuable
Test 6 - Chunking:
- Semantic units (functions) > arbitrary splits
- Highest-impact design decision
Test 7 - RAG validation:
- Good retrieval = complete working memory = accurate answers
- Eliminates hallucinations when done well
These learning notes map each test to specific theory from Days 1-3. Understanding what each test teaches you prepares you for implementation with the right architecture decisions.