- Bun: v1.3.0
- Ollama: v0.15.6
- Models: llama3.2 (2.0 GB), nomic-embed-text (274 MB)
- OS: macOS Darwin 25.2.0
- Project: `holmes-context-engineering/`
Status: completed
Implementation: src/00-download.ts
bun run src/00-download.ts
ls data/ # 9 .txt files
grep -l "PROJECT GUTENBERG" data/*.txt # nothing returned — clean
- Missing: file naming convention. The task says "save each volume as a clean `.txt` file in `./data/`" but doesn't specify the naming pattern. Should say e.g. `{gutenberg_id}-{slugified-title}.txt` to avoid ambiguity.
- Missing: batch API call option. The Gutendex API supports querying multiple IDs at once (`?ids=244,2097,...`). The task implies individual fetches per book. Mentioning the batch endpoint would save HTTP round-trips and teach a useful API pattern.
- Gotcha: the `text/plain` format key varies. Gutendex returns format keys like `text/plain; charset=utf-8` and `text/plain; charset=us-ascii`. The task should mention that the student needs to pick the right MIME type key, not just "download plain text." A student who does `formats["text/plain"]` will get `undefined`.
- Gotcha: table of contents gets mangled. The line-unwrapping algorithm joins TOC entries into a single long line because they're not separated by blank lines. This is cosmetically ugly but functionally harmless. Could mention this as an expected artifact.
- Setup: Ollama install command is Linux-only. The prerequisites section gives `curl -fsSL https://ollama.com/install.sh | sh`, which only works on Linux. macOS users install via `brew install ollama` or the .app download. Should provide platform-specific instructions.
- Minor: course numbering inconsistency. The overview table numbers techniques 1-8, but modules are numbered 0-8. The table should start at 0 or the modules at 1.
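The MIME-key and batch-endpoint findings above can be sketched together. This is a minimal sketch, assuming the Gutendex response shape and `?ids=` parameter described in the notes; field names here are assumptions to verify against the Gutendex docs.

```typescript
// Assumed Gutendex response shape (fields per the notes above).
type GutendexBook = { id: number; title: string; formats: Record<string, string> };

// The "text/plain" key varies by charset suffix, so match by prefix
// instead of an exact formats["text/plain"] lookup (often undefined).
function pickPlainTextUrl(formats: Record<string, string>): string | null {
  const key = Object.keys(formats).find((k) => k.startsWith("text/plain"));
  return key ? formats[key] : null;
}

// Batch query: one HTTP round-trip for all volumes instead of one per book.
async function fetchBooks(ids: number[]): Promise<GutendexBook[]> {
  const res = await fetch(`https://gutendex.com/books?ids=${ids.join(",")}`);
  const json = (await res.json()) as { results: GutendexBook[] };
  return json.results;
}
```

Matching by prefix sidesteps the charset-suffix problem entirely rather than hardcoding the observed variants.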
Status: completed
Implementation: src/01-chunking.ts
bun run src/01-chunking.ts

| Strategy | Total Chunks | Avg Length | Min | Max |
|---|---|---|---|---|
| fixed | 7928 | 500 | 22 | 500 |
| paragraph | 9881 | 359 | 7 | 3322 |
| story-aware | 169 | 21098 | 4 | 141934 |
All three strategies found the "eliminated the impossible" quote from The Sign of the Four.
- Wrong quote phrasing. The course says to search for Holmes's quote about "eliminating the impossible" but the actual text in The Sign of the Four is "eliminated the impossible" (past tense). Students will get zero results if they search for the infinitive form. Should give the exact phrase or at least a case-insensitive substring.
- Paragraph-based merge threshold too low. The task says "merge short paragraphs (<100 chars) with the next one" — but many legitimate paragraphs (short dialogue lines) are under 100 chars. The merge logic is ambiguous: should you merge with the next paragraph or the previous? Students will make different choices. Clarify direction and consider a higher threshold or a different heuristic.
- Story-aware chunker produces enormous chunks. Without a maximum chunk size fallback, the story-aware strategy produces chunks up to 141K chars (entire volumes without clear chapter boundaries). The task should mention adding a secondary split within oversized story-aware chunks, or at least warn students to expect this.
- No guidance on what to do with tiny trailing chunks. Both fixed-size (22-char trailing chunk) and paragraph-based (7-char chunks) produce very small fragments. The task could suggest filtering out chunks below a minimum threshold.
- Missing: explicit instruction to export chunks for Exercise 2. Exercise 2 needs the paragraph-based chunks. The task should say "export your chunking functions — you'll reuse them in Exercise 2" or suggest writing chunks to a JSON file.
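One possible resolution of the merge ambiguity called out above, sketched under two explicit assumptions: short paragraphs accumulate forward into the next one, and a short trailing fragment attaches to the previous chunk (the course specifies neither).

```typescript
// Merge paragraphs under `minLen` chars into the NEXT paragraph; a short
// trailing fragment is attached to the previous chunk instead of being
// emitted as a tiny standalone chunk.
function mergeShortParagraphs(paragraphs: string[], minLen = 100): string[] {
  const out: string[] = [];
  let carry = "";
  for (const p of paragraphs) {
    const merged = carry ? carry + "\n\n" + p : p;
    if (merged.length < minLen) {
      carry = merged; // still too short: keep accumulating forward
    } else {
      out.push(merged);
      carry = "";
    }
  }
  if (carry) {
    // trailing fragment: fold into the last chunk (handles the 7-char chunks)
    if (out.length > 0) out[out.length - 1] += "\n\n" + carry;
    else out.push(carry);
  }
  return out;
}
```

Merging forward keeps dialogue attributions with the paragraph they introduce; merging backward is equally defensible, which is exactly why the task should pick one.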
Status: completed
Implementation: src/02-embeddings.ts
bun run src/02-embeddings.ts
# Took ~5 minutes to embed 9881 chunks
- 9,881 paragraph-based chunks embedded and stored in `holmes.db`
- Query "dangerous hound on the moor": all 5 results from Hound of the Baskervilles (correct)
- Search latency: ~370-400ms including Ollama embed call. Brute-force similarity alone is fast.
- Noticed: "THE END" chunks (tiny text) rank artificially high for broad queries due to embedding behavior
- `better-sqlite3` fails with Bun due to native ABI mismatch. The course specifically recommends `better-sqlite3` and even explains why ("stable extension loading") — but it crashes on Bun v1.3.0 with a `NODE_MODULE_VERSION` mismatch. The fix is to use `bun:sqlite` (Bun's built-in SQLite) or run `bun install --force` to recompile. The course should mention this fallback prominently, since the whole course runs on Bun.
- Missing: `book_title` column. The task only specifies `id, text, source, embedding` in the schema. Exercise 4 later needs `book_title`. Adding it now avoids a painful schema migration later. The task should include it from the start.
- The `ollama` npm package API name is not mentioned. The course just says "embed each chunk using Ollama's `nomic-embed-text` model" but doesn't show the npm package method. The actual API is `ollama.embed({ model, input })`, returning `{ embeddings: number[][] }`. Giving at least the method signature would save students debugging time.
- "< 200ms" target is misleading. The course says brute-force search should be under 200ms, but doesn't distinguish between the embedding call latency (~200ms) and the similarity computation (~100ms). The total is ~400ms. Should clarify: "similarity computation should be < 200ms; embedding the query adds its own latency."
- Short/empty chunks create noise. "THE END" (a 7-char chunk) ranked #1 for "What is the relationship between Holmes and Watson?" because tiny chunks get concentrated embeddings. The task should recommend filtering out chunks below ~50 chars before embedding.
- Embedding 9,881 chunks takes ~5 minutes. The course doesn't warn about this. Adding a note "This step takes several minutes — go make tea" would help set expectations. Also should suggest caching (checking if DB already has data before re-embedding).
Status: completed
Implementation: src/03-rag.ts
bun run src/03-rag.ts "How was the Red-Headed League scheme uncovered?"
bun run src/03-rag.ts # runs 3 default test queries
- Pipeline works end-to-end: retrieves chunks, builds prompt, calls llama3.2, prints answer with sources.
- Answer quality is poor for 2 of 3 test queries because retrieval returns noise chunks ("THE END", table of contents).
- The Red-Headed League question failed because no relevant chunks were retrieved (the story is in The Adventures, but generic crime-related chunks from other books ranked higher).
- The Irene Adler question also failed — no A Scandal in Bohemia passages were retrieved.
- Module import side effects. Since exercises import from each other (`03-rag.ts` imports `02-embeddings.ts`, which imports `01-chunking.ts`), all three `main()` functions run when you execute exercise 3. The course should instruct students to guard their main function with `if (import.meta.main)` (Bun's equivalent of Python's `if __name__ == "__main__"`). This is not mentioned anywhere.
- The Ollama chat API call is not shown. The course says "send it to Ollama's `llama3.2` model" but doesn't show the method. Students need `ollama.chat({ model, messages })`. At minimum, show the import and method signature.
- Expected poor results should be acknowledged. The test queries are designed to expose RAG weaknesses (the course says "modules 4-8 exist to fix these failure modes"), but the verification table implies the answers should be correct. Should add a note: "Don't worry if the answers are disappointing — that's the point. Exercises 4+ will fix this."
- Missing: `bun:sqlite` alternative. The course only mentions `better-sqlite3`, but it fails on Bun. Should mention `bun:sqlite` as the native alternative.
- CLI argument handling. The course says to use a "command-line argument" but doesn't specify `process.argv[2]` or `Bun.argv[2]`. A one-liner example would help beginners.
Status: completed
Implementation: src/04-reranking.ts
bun run src/04-reranking.ts "What clues did Holmes find at the crime scene in A Study in Scarlet?"
- Metadata filtering correctly detected "A Study in Scarlet" and restricted search to that book only
- All 5 final results from the correct book (vs. mixed sources in Exercise 3)
- Reranking took ~12s (20 LLM calls)
- MMR produced diverse chunks covering different scenes (RACHE, blood, Gregson's investigation)
- Answer quality improved but still limited by llama3.2 3B model
- Schema migration is not addressed. Exercise 2 creates the chunks table without `book_title`. Exercise 4 says "add a `book_title` column to your chunks table" but doesn't explain how (ALTER TABLE? Drop and recreate? Re-embed everything?). If a student already has 10K embedded chunks, telling them to re-embed is costly. Should suggest either: (a) add the column from the start in Exercise 2, or (b) use `ALTER TABLE chunks ADD COLUMN book_title TEXT` and populate it from the existing `source` filename.
- The MMR code snippet has a return type issue. The provided `mmrSelect` function returns `typeof candidates[0]` (a single item), but it should be called iteratively to build a list. The snippet doesn't show the iterative selection loop — students must figure out that MMR is greedy and selects one chunk at a time.
- LLM score parsing is fragile. The task says "Respond with ONLY a number", but `llama3.2` often responds with "7/10" or "Score: 7", or just rambles. Should mention needing robust number extraction (regex) and a fallback score.
- Reranking latency is high but not discussed. 20 LLM calls at ~500ms each = ~10-12s. The course mentions this motivates cross-encoder models but should suggest parallelizing the reranking calls (Promise.all) or using a smaller model for scoring. With Ollama's serial processing, parallelizing doesn't actually help, but it's worth noting.
- Keyword matching for book detection is brittle. "hound" triggers Hound of the Baskervilles, but a question about "a hound in another story" would be wrongly filtered. The task should mention this limitation and perhaps suggest LLM-based book detection as an improvement.
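The greedy MMR loop and defensive score parsing described above might look like this sketch; the `lambda` weight, 0-10 score scale, and fallback value are assumptions, and `sim` stands in for whatever similarity function the student already has (cosine in these exercises).

```typescript
type Scored = { text: string; queryScore: number };

// Greedy MMR: pick one chunk per iteration, trading off query relevance
// against maximum similarity to chunks already selected.
function mmrSelect(
  candidates: Scored[],
  sim: (a: string, b: string) => number,
  k: number,
  lambda = 0.7,
): Scored[] {
  const selected: Scored[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => sim(pool[i].text, s.text)))
        : 0;
      const score = lambda * pool[i].queryScore - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}

// Robust score extraction for reranker replies like "7/10" or "Score: 7".
function parseScore(raw: string, fallback = 5): number {
  const m = raw.match(/\d+(\.\d+)?/); // first number anywhere in the reply
  return m ? Math.min(10, parseFloat(m[0])) : fallback;
}
```

Returning a list (not a single item) is the fix for the return-type issue noted above: the greedy selection lives inside the function instead of being left to the caller.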
Status: completed
Implementation: src/05-memory.ts
bun run src/05-memory.ts --test # non-interactive 3-turn test
bun run src/05-memory.ts # interactive mode
- Conversation-aware query rewriting works: "Why does he respect her?" → "why does sherlock holmes respect irene adler"
- Fact extraction stores memories after every 3 turns
- Memory retrieval returns relevant stored facts in subsequent queries
- Answer quality limited by the retrieval step (the actual A Scandal in Bohemia passages weren't retrieved)
- No guidance on handling piped/non-interactive input. The task says "read user input from stdin" but doesn't address how to test non-interactively. Piped input causes `readline` to close prematurely in Bun. Should suggest a `--test` flag or scripted test mode.
- Missing: `readline` import. The course doesn't mention which readline API to use. Node/Bun's `createInterface` from `readline` works, but students might try `process.stdin.on('data')` or `Bun.stdin` approaches, which behave differently.
- The long-term memory test is impractical. The course says "start a new conversation and ask 'What character was I interested in last time?'" — but this requires stopping and restarting the process. Should clarify that "new conversation" means resetting the messages array but keeping the DB connection, not restarting the process.
- Fact extraction quality with llama3.2. The 3B model extracts verbose and sometimes incorrect facts (e.g., it claimed the passage was from The Case-Book when it was actually from The Adventures). The task should warn that small models produce lower-quality extractions and suggest validating extracted facts.
- The prompt structure change from Exercise 3 is significant. Exercise 3 uses separate system/user messages for `ollama.chat()`. Exercise 5 puts everything (system, memories, history, context, question) into a single user message. The course should be explicit about which approach to use and why.
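One possible `--test`-flag pattern for the interactive loop; the flag name and the scripted turn list are this review's convention, not the course's.

```typescript
import { createInterface } from "node:readline/promises";

// True when the process was launched with --test anywhere in its arguments.
function isTestMode(argv: string[]): boolean {
  return argv.includes("--test");
}

// Yields user turns: a fixed script under --test, stdin lines otherwise.
async function* turns(argv: string[]): AsyncGenerator<string> {
  if (isTestMode(argv)) {
    // scripted turns make the exercise verifiable without a TTY
    for (const t of [
      "Tell me about Irene Adler.",
      "Why does he respect her?",
      "What character was I interested in?",
    ]) yield t;
    return;
  }
  const rl = createInterface({ input: process.stdin, output: process.stdout });
  for (;;) {
    const line = await rl.question("> ");
    if (line.trim() === "exit") break;
    yield line;
  }
  rl.close();
}
```

The chat loop then becomes `for await (const q of turns(process.argv)) { ... }` and is identical in both modes, which is what makes the scripted runs a fair test.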
Status: completed
Implementation: src/06-compression.ts
bun run src/06-compression.ts --test # 6-turn test conversation
bun run src/06-compression.ts # interactive mode
- Token budget dashboard prints at every turn, showing allocation across system/memories/conversation/context/query
- Total tokens never exceeded budget (stayed around 10-20% utilization with 4096 budget)
- Query rewriting resolved "that one" → "A Study in Scarlet", "the two villains" → comparison query
- Conversation summarization triggered when history exceeds threshold
- Chunk compression reduces lower-ranked chunks to single sentences
- The 4096-token budget is unrealistically large for this dataset. With paragraph-based chunks averaging ~360 chars (~90 tokens), 5 chunks = ~450 tokens. The conversation rarely approaches the 4096 budget, so compression only triggers in extended conversations. Should either use a smaller budget (e.g., 2000) or warn students that compression may not trigger in short test sessions.
- `llama3.2`'s default context may be 2048, not 4096. Ollama's `llama3.2` defaults to a 2048-token context window. The course's 4096-token budget may exceed what the model can actually handle unless `num_ctx: 4096` is passed in the Ollama options. This should be explicitly mentioned.
- The token approximation is rough. `text.length / 4` assumes English prose. For JSON, code, or special characters, this underestimates. The course should mention this is an approximation and suggest verifying against actual model token counts.
- Compression quality feedback loop. The task doesn't address how to verify that compression preserves key information. Students should be told to print the summary and manually check it, or compare compressed vs. uncompressed answers to the same question.
- Missing: the 10-turn test sequence from the course. The course specifies a 6-turn test (Turns 1-6), but the verification says "have a 10-turn conversation." The mismatch is confusing — should consistently specify 6 turns for the core test and 10 as an optional stress test.
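The chars/4 heuristic and a simple budget check can be sketched as below; the part names mirror the dashboard categories above, and the `num_ctx` option is the assumed way to raise the model's window.

```typescript
// chars/4 is the rough token heuristic the course uses; it underestimates
// for JSON and code, so treat results as approximate.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

type BudgetParts = { system: string; memories: string; history: string; context: string; query: string };

function totalTokens(parts: BudgetParts): number {
  return Object.values(parts).reduce((sum, p) => sum + estimateTokens(p), 0);
}

function withinBudget(parts: BudgetParts, budget = 2000): boolean {
  return totalTokens(parts) <= budget;
}

// llama3.2 defaults to a 2048-token window under Ollama, so a larger budget
// only works if the window is raised explicitly (assumed option shape):
// await ollama.chat({ model: "llama3.2", messages, options: { num_ctx: 4096 } });
```

A 2000-token default budget (rather than 4096) makes compression actually trigger in short test sessions, per the first finding above.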
Status: completed
Implementation: src/07-assembly.ts
bun run src/07-assembly.ts --test # 5-turn test covering each intent type
bun run src/07-assembly.ts # interactive mode
- Intent classification worked for 4 of 5 test queries:
- "Hi there!" → GREETING (correct, 0 chunks, fast response)
- "In which story does Holmes fake his death?" → FACTUAL (correct, 5 chunks)
- "How does that compare to how Irene Adler outsmarted him?" → ANALYTICAL (correct, 10 chunks from 6 books)
- "What makes a great detective villain?" → ANALYTICAL (correct, 10 chunks from 7 books)
- "How do you find your answers?" → ANALYTICAL (WRONG, should be META — 0 chunks expected)
- Dynamic system prompt switches correctly per intent
- MMR diversity applied for ANALYTICAL queries
- Routing log prints at every turn with chunk count, books, history, memories, token usage
- Intent classification is unreliable with llama3.2. The 3B model misclassified "How do you find your answers?" as ANALYTICAL instead of META. The task should warn about this and suggest adding few-shot examples to the classification prompt (e.g., provide 1-2 examples per intent). Alternatively, suggest using keyword-based heuristics as a fallback.
- FOLLOW_UP detection never triggered. In the test, "How does that compare..." was classified as ANALYTICAL rather than FOLLOW_UP, even though it references "that" (a pronoun needing resolution). The boundary between FOLLOW_UP and ANALYTICAL is unclear. The task should explain which takes priority when both apply, or suggest merging them.
- The route table doesn't specify what "All available" history means. For FOLLOW_UP, the task says "All available" for conversation history. How much is "all"? The full raw messages? The summary + recent turns? Should give a concrete number or explain the strategy.
- No error handling for misclassification. If the model returns garbage (not one of the 5 intents), the task doesn't say what to do. Should specify a default intent (FACTUAL is a good fallback).
- Exercise is very slow. A single 5-turn test takes 3-5 minutes due to multiple LLM calls per turn (classify + rewrite + retrieval + generation). The course should mention expected runtime and suggest optimizations (e.g., running classification and embedding in parallel).
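A defensive intent-normalization sketch along the lines suggested above; the META-rescue regex and the FACTUAL fallback are this review's assumptions, not the course's design.

```typescript
const INTENTS = ["GREETING", "FACTUAL", "ANALYTICAL", "FOLLOW_UP", "META"] as const;
type Intent = (typeof INTENTS)[number];

// Normalize whatever the 3B model returns ("Intent: FACTUAL.", "follow-up", ...)
// and fall back to FACTUAL on garbage rather than crashing the router.
function normalizeIntent(raw: string, query: string): Intent {
  const compact = raw.toUpperCase().replace(/[^A-Z]/g, ""); // drop punctuation/underscores
  const hit = INTENTS.find((i) => compact.includes(i.replace("_", "")));
  if (hit) return hit;
  // cheap keyword heuristic rescuing self-referential questions the model mislabels
  if (/\byour?\b.*\b(find|answer|work|know)/i.test(query)) return "META";
  return "FACTUAL"; // safest default: retrieval still happens
}
```

Few-shot examples in the classification prompt remain the better fix; this normalization is the safety net underneath them.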
Status: completed
Implementation: src/08-capstone.ts, src/08-eval.ts, src/lib/ (8 modules), data/eval.json
bun run src/08-eval.ts
- Refactored all shared logic into `src/lib/` modules: `db.ts`, `llm.ts`, `embeddings.ts`, `reranker.ts`, `memory.ts`, `compressor.ts`, `assembler.ts`, `chunking.ts`
- Capstone chatbot (`08-capstone.ts`) integrates all modules with streaming output
- Evaluation harness (`08-eval.ts`) runs 10 test cases from `data/eval.json`
- Evaluation scores:
  - Retrieval accuracy: 7/10 (70%) — target was ≥80%
  - Answer accuracy: 2/10 (20%) — target was ≥70%
  - Overall score: 45%
- Retrieval failures: 3 questions retrieved wrong books (e.g., the "Red-Headed League" question pulled chunks from wrong stories)
- Answer failures: Even when correct chunks were retrieved, llama3.2 3B frequently omitted expected keywords or hallucinated details
- The accuracy targets are unrealistic with llama3.2 3B. The course sets ≥80% retrieval and ≥70% answer accuracy, but the 3B model regularly ignores retrieved context, omits expected keywords, and hallucinates. With a larger model (e.g., llama3.1 8B or 70B), these targets would be achievable. The course should either lower the targets for 3B models or recommend a larger model for the eval.
- Keyword-based answer evaluation is brittle. A correct answer that paraphrases rather than using the exact expected keyword scores as a failure. For example, the model might say "disguised himself" instead of the expected keyword "disguise". The course should mention this limitation and suggest fuzzy matching or LLM-based evaluation as alternatives.
- No guidance on what `src/lib/` modules should export. The course says "refactor into reusable modules" but doesn't specify which functions go where, or what the module boundaries should be. Students will make wildly different architectural choices. Should provide a suggested module list with key exports.
- Streaming is mentioned but not taught. The course says to "add streaming" but doesn't show Ollama's streaming API (`ollama.chat({ stream: true })` returns an async iterable). This is a significant implementation detail to leave as an exercise.
- The eval.json test set is self-authored. The course says "create 10+ test questions" but doesn't provide a sample eval.json. Students must write their own test set AND implementation simultaneously, making it hard to know if poor scores reflect bad tests or bad code. Should provide at least 5 reference test cases.
- No baseline comparison. The evaluation runs once with the full pipeline but doesn't compare against a baseline (e.g., Exercise 3's naive RAG). Without a before/after comparison, students can't measure how much their improvements actually helped. Should suggest running the same eval.json against the Exercise 3 pipeline first.
- A `budgetCheck` function is mentioned in the route log but not defined in the course. The capstone assembles all context and should validate total tokens against the budget, but the course doesn't specify when to warn vs. truncate vs. error.
All 9 exercises (0-8) completed successfully. The course teaches a solid progression of context engineering techniques, building from raw data acquisition through to a fully integrated chatbot with evaluation. However, there are significant gaps between what the instructions describe and what students actually need to know.
- `better-sqlite3` vs `bun:sqlite`: The course's recommended SQLite library doesn't work with Bun out of the box. This is the single biggest blocker a student will hit. Must be addressed in prerequisites.
- API signatures never shown: The course tells students what to call (Ollama embed, chat, etc.) but never shows the actual method signatures. Students spend significant time reading `ollama` npm docs. Adding a 3-line code snippet per API call would save hours.
- `import.meta.main` guard not mentioned: Without this, importing from previous exercises causes their `main()` functions to execute as side effects. This breaks every exercise from Exercise 3 onward. Should be introduced in Exercise 1.
- llama3.2 3B limitations not acknowledged: The model is too small for reliable intent classification, fact extraction, LLM-based reranking, and keyword-accurate answers. The course should set expectations: "results will be approximate with a 3B model; accuracy improves significantly with 8B+ models."
- Schema evolution not planned: Exercise 2 creates the schema, Exercise 4 needs an extra column. The course should either include `book_title` from the start or teach `ALTER TABLE` migration.
- No expected runtimes: Embedding 10K chunks takes ~5 min, reranking takes ~12s, a 5-turn conversation takes ~3-5 min. Students with no benchmarks may think something is broken.
- Interactive testing is difficult: Exercises 5-7 are interactive chat loops with no built-in test mode. The course should suggest a `--test` flag pattern from the start.
| Exercise | Status | Key Issue |
|---|---|---|
| 0 - Data Acquisition | ✓ Completed | Gutendex MIME key varies |
| 1 - Chunking | ✓ Completed | Wrong quote phrasing in instructions |
| 2 - Embeddings | ✓ Completed | better-sqlite3 ABI crash |
| 3 - RAG Pipeline | ✓ Completed | Import side effects, poor retrieval |
| 4 - Reranking | ✓ Completed | Schema migration not addressed |
| 5 - Memory | ✓ Completed | Interactive testing not addressed |
| 6 - Compression | ✓ Completed | Budget too large to trigger compression |
| 7 - Assembly | ✓ Completed | Intent classification unreliable with 3B |
| 8 - Capstone + Eval | ✓ Completed | Accuracy targets unrealistic for 3B model |
- Retrieval accuracy: 70% (7/10)
- Answer accuracy: 20% (2/10)
- Overall: 45%