- Bun: v1.3.0
- Ollama: v0.15.6
- Models: llama3.2 (2.0 GB), nomic-embed-text (274 MB)
- OS: macOS Darwin 25.2.0
- Project: `holmes-context-engineering/`
Status: completed
Implementation: src/00-download.ts
bun run src/00-download.ts
ls data/ # 9 .txt files
grep -l "PROJECT GUTENBERG" data/*.txt # nothing returned — clean
- Missing: file naming convention. The task says "save each volume as a clean `.txt` file in `./data/`" but doesn't specify the naming pattern. Should say e.g. `{gutenberg_id}-{slugified-title}.txt` to avoid ambiguity.
- Missing: batch API call option. The Gutendex API supports querying multiple IDs at once (`?ids=244,2097,...`). The task implies individual fetches per book. Mentioning the batch endpoint would save HTTP round-trips and teach a useful API pattern.
- Gotcha: the `text/plain` format key varies. Gutendex returns format keys like `text/plain; charset=utf-8` and `text/plain; charset=us-ascii`. The task should mention that the student needs to pick the right MIME type key, not just "download plain text." A student who does `formats["text/plain"]` will get `undefined`.
- Gotcha: table of contents gets mangled. The line-unwrapping algorithm joins TOC entries into a single long line because they're not separated by blank lines. This is cosmetically ugly but functionally harmless. Could mention this as an expected artifact.
- Setup: Ollama install command is Linux-only. The prerequisites section gives `curl -fsSL https://ollama.com/install.sh | sh`, which only works on Linux. macOS users install via `brew install ollama` or the .app download. Should provide platform-specific instructions.
- Minor: course numbering inconsistency. The overview table numbers techniques 1-8, but modules are numbered 0-8. The table should start at 0 or the modules at 1.
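The MIME-key and batch-endpoint findings above can be sketched together. This is a minimal sketch, assuming the Gutendex response shape and `?ids=` parameter described in the notes; field names here are assumptions to verify against the Gutendex docs.

```typescript
// Assumed Gutendex response shape (fields per the notes above).
type GutendexBook = { id: number; title: string; formats: Record<string, string> };

// The "text/plain" key varies by charset suffix, so match by prefix
// instead of an exact formats["text/plain"] lookup (often undefined).
function pickPlainTextUrl(formats: Record<string, string>): string | null {
  const key = Object.keys(formats).find((k) => k.startsWith("text/plain"));
  return key ? formats[key] : null;
}

// Batch query: one HTTP round-trip for all volumes instead of one per book.
async function fetchBooks(ids: number[]): Promise<GutendexBook[]> {
  const res = await fetch(`https://gutendex.com/books?ids=${ids.join(",")}`);
  const json = (await res.json()) as { results: GutendexBook[] };
  return json.results;
}
```

Matching by prefix sidesteps the charset-suffix problem entirely rather than hardcoding the observed variants.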
Status: completed
Implementation: src/01-chunking.ts
bun run src/01-chunking.ts

| Strategy | Total Chunks | Avg Length | Min | Max |
|---|---|---|---|---|
| fixed | 7928 | 500 | 22 | 500 |
| paragraph | 9881 | 359 | 7 | 3322 |
| story-aware | 169 | 21098 | 4 | 141934 |
All three strategies found the "eliminated the impossible" quote from The Sign of the Four.
- Wrong quote phrasing. The course says to search for Holmes's quote about "eliminating the impossible" but the actual text in The Sign of the Four is "eliminated the impossible" (past tense). Students will get zero results if they search for the infinitive form. Should give the exact phrase or at least a case-insensitive substring.
- Paragraph-based merge threshold too low. The task says "merge short paragraphs (<100 chars) with the next one" — but many legitimate paragraphs (short dialogue lines) are under 100 chars. The merge logic is ambiguous: should you merge with the next paragraph or the previous? Students will make different choices. Clarify direction and consider a higher threshold or a different heuristic.
- Story-aware chunker produces enormous chunks. Without a maximum chunk size fallback, the story-aware strategy produces chunks up to 141K chars (entire volumes without clear chapter boundaries). The task should mention adding a secondary split within oversized story-aware chunks, or at least warn students to expect this.
- No guidance on what to do with tiny trailing chunks. Both fixed-size (22-char trailing chunk) and paragraph-based (7-char chunks) produce very small fragments. The task could suggest filtering out chunks below a minimum threshold.
- Missing: explicit instruction to export chunks for Exercise 2. Exercise 2 needs the paragraph-based chunks. The task should say "export your chunking functions — you'll reuse them in Exercise 2" or suggest writing chunks to a JSON file.
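One possible resolution of the merge ambiguity called out above, sketched under two explicit assumptions: short paragraphs accumulate forward into the next one, and a short trailing fragment attaches to the previous chunk (the course specifies neither).

```typescript
// Merge paragraphs under `minLen` chars into the NEXT paragraph; a short
// trailing fragment is attached to the previous chunk instead of being
// emitted as a tiny standalone chunk.
function mergeShortParagraphs(paragraphs: string[], minLen = 100): string[] {
  const out: string[] = [];
  let carry = "";
  for (const p of paragraphs) {
    const merged = carry ? carry + "\n\n" + p : p;
    if (merged.length < minLen) {
      carry = merged; // still too short: keep accumulating forward
    } else {
      out.push(merged);
      carry = "";
    }
  }
  if (carry) {
    // trailing fragment: fold into the last chunk (handles the 7-char chunks)
    if (out.length > 0) out[out.length - 1] += "\n\n" + carry;
    else out.push(carry);
  }
  return out;
}
```

Merging forward keeps dialogue attributions with the paragraph they introduce; merging backward is equally defensible, which is exactly why the task should pick one.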
Status: completed
Implementation: src/02-embeddings.ts
bun run src/02-embeddings.ts
# Took ~5 minutes to embed 9881 chunks
- 9,881 paragraph-based chunks embedded and stored in `holmes.db`
- Query "dangerous hound on the moor": all 5 results from Hound of the Baskervilles (correct)
- Search latency: ~370-400ms including Ollama embed call. Brute-force similarity alone is fast.
- Noticed: "THE END" chunks (tiny text) rank artificially high for broad queries due to embedding behavior
- `better-sqlite3` fails with Bun due to native ABI mismatch. The course specifically recommends `better-sqlite3` and even explains why ("stable extension loading") — but it crashes on Bun v1.3.0 with a `NODE_MODULE_VERSION` mismatch. The fix is to use `bun:sqlite` (Bun's built-in SQLite) or run `bun install --force` to recompile. The course should mention this fallback prominently, since the whole course runs on Bun.
- Missing: `book_title` column. The task only specifies `id, text, source, embedding` in the schema. Exercise 4 later needs `book_title`. Adding it now avoids a painful schema migration later. The task should include it from the start.
- The `ollama` npm package API name is not mentioned. The course just says "embed each chunk using Ollama's `nomic-embed-text` model" but doesn't show the npm package method. The actual API is `ollama.embed({ model, input })`, returning `{ embeddings: number[][] }`. Giving at least the method signature would save students debugging time.
- "< 200ms" target is misleading. The course says brute-force search should be under 200ms, but doesn't distinguish between the embedding call latency (~200ms) and the similarity computation (~100ms). The total is ~400ms. Should clarify: "similarity computation should be < 200ms; embedding the query adds its own latency."
- Short/empty chunks create noise. "THE END" (a 7-char chunk) ranked #1 for "What is the relationship between Holmes and Watson?" because tiny chunks get concentrated embeddings. The task should recommend filtering out chunks below ~50 chars before embedding.
- Embedding 9,881 chunks takes ~5 minutes. The course doesn't warn about this. Adding a note "This step takes several minutes — go make tea" would help set expectations. Also should suggest caching (checking if DB already has data before re-embedding).
Status: completed
Implementation: src/03-rag.ts
bun run src/03-rag.ts "How was the Red-Headed League scheme uncovered?"
bun run src/03-rag.ts # runs 3 default test queries
- Pipeline works end-to-end: retrieves chunks, builds prompt, calls llama3.2, prints answer with sources.
- Answer quality is poor for 2 of 3 test queries because retrieval returns noise chunks ("THE END", table of contents).
- The Red-Headed League question failed because no relevant chunks were retrieved (the story is in The Adventures, but generic crime-related chunks from other books ranked higher).
- The Irene Adler question also failed — no A Scandal in Bohemia passages were retrieved.
- Module import side effects. Since exercises import from each other (`03-rag.ts` imports `02-embeddings.ts`, which imports `01-chunking.ts`), all three `main()` functions run when you execute exercise 3. The course should instruct students to guard their main function with `if (import.meta.main)` (Bun's equivalent of Python's `if __name__ == "__main__"`). This is not mentioned anywhere.
- The Ollama chat API call is not shown. The course says "send it to Ollama's `llama3.2` model" but doesn't show the method. Students need `ollama.chat({ model, messages })`. At minimum, show the import and method signature.
- Expected poor results should be acknowledged. The test queries are designed to expose RAG weaknesses (the course says "modules 4-8 exist to fix these failure modes"), but the verification table implies the answers should be correct. Should add a note: "Don't worry if the answers are disappointing — that's the point. Exercises 4+ will fix this."
- Missing: `bun:sqlite` alternative. The course only mentions `better-sqlite3`, but it fails on Bun. Should mention `bun:sqlite` as the native alternative.
- CLI argument handling. The course says to use a "command-line argument" but doesn't specify `process.argv[2]` or `Bun.argv[2]`. A one-liner example would help beginners.
Status: completed
Implementation: src/04-reranking.ts
bun run src/04-reranking.ts "What clues did Holmes find at the crime scene in A Study in Scarlet?"
- Metadata filtering correctly detected "A Study in Scarlet" and restricted search to that book only
- All 5 final results from the correct book (vs. mixed sources in Exercise 3)
- Reranking took ~12s (20 LLM calls)
- MMR produced diverse chunks covering different scenes (RACHE, blood, Gregson's investigation)
- Answer quality improved but still limited by llama3.2 3B model
- Schema migration is not addressed. Exercise 2 creates the chunks table without `book_title`. Exercise 4 says "add a `book_title` column to your chunks table" but doesn't explain how (ALTER TABLE? Drop and recreate? Re-embed everything?). If a student already has 10K embedded chunks, telling them to re-embed is costly. Should suggest either: (a) add the column from the start in Exercise 2, or (b) use `ALTER TABLE chunks ADD COLUMN book_title TEXT` and populate it from the existing `source` filename.
- The MMR code snippet has a return type issue. The provided `mmrSelect` function returns `typeof candidates[0]` (a single item), but it should be called iteratively to build a list. The snippet doesn't show the iterative selection loop — students must figure out that MMR is greedy and selects one chunk at a time.
- LLM score parsing is fragile. The task says "Respond with ONLY a number", but `llama3.2` often responds with "7/10" or "Score: 7", or just rambles. Should mention needing robust number extraction (regex) and a fallback score.
- Reranking latency is high but not discussed. 20 LLM calls at ~500ms each = ~10-12s. The course mentions this motivates cross-encoder models but should suggest parallelizing the reranking calls (Promise.all) or using a smaller model for scoring. With Ollama's serial processing, parallelizing doesn't actually help, but it's worth noting.
- Keyword matching for book detection is brittle. "hound" triggers Hound of the Baskervilles, but a question about "a hound in another story" would be wrongly filtered. The task should mention this limitation and perhaps suggest LLM-based book detection as an improvement.
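The greedy MMR loop and defensive score parsing described above might look like this sketch; the `lambda` weight, 0-10 score scale, and fallback value are assumptions, and `sim` stands in for whatever similarity function the student already has (cosine in these exercises).

```typescript
type Scored = { text: string; queryScore: number };

// Greedy MMR: pick one chunk per iteration, trading off query relevance
// against maximum similarity to chunks already selected.
function mmrSelect(
  candidates: Scored[],
  sim: (a: string, b: string) => number,
  k: number,
  lambda = 0.7,
): Scored[] {
  const selected: Scored[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => sim(pool[i].text, s.text)))
        : 0;
      const score = lambda * pool[i].queryScore - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}

// Robust score extraction for reranker replies like "7/10" or "Score: 7".
function parseScore(raw: string, fallback = 5): number {
  const m = raw.match(/\d+(\.\d+)?/); // first number anywhere in the reply
  return m ? Math.min(10, parseFloat(m[0])) : fallback;
}
```

Returning a list (not a single item) is the fix for the return-type issue noted above: the greedy selection lives inside the function instead of being left to the caller.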
Status: completed
Implementation: src/05-memory.ts
bun run src/05-memory.ts --test # non-interactive 3-turn test
bun run src/05-memory.ts # interactive mode
- Conversation-aware query rewriting works: "Why does he respect her?" → "why does sherlock holmes respect irene adler"
- Fact extraction stores memories after every 3 turns
- Memory retrieval returns relevant stored facts in subsequent queries
- Answer quality limited by the retrieval step (the actual A Scandal in Bohemia passages weren't retrieved)
- No guidance on handling piped/non-interactive input. The task says "read user input from stdin" but doesn't address how to test non-interactively. Piped input causes `readline` to close prematurely in Bun. Should suggest a `--test` flag or scripted test mode.
- Missing: `readline` import. The course doesn't mention which readline API to use. Node/Bun's `createInterface` from `readline` works, but students might try `process.stdin.on('data')` or `Bun.stdin` approaches, which behave differently.
- The long-term memory test is impractical. The course says "start a new conversation and ask 'What character was I interested in last time?'" — but this requires stopping and restarting the process. Should clarify that "new conversation" means resetting the messages array but keeping the DB connection, not restarting the process.
- Fact extraction quality with llama3.2. The 3B model extracts verbose and sometimes incorrect facts (e.g., it claimed the passage was from The Case-Book when it was actually from The Adventures). The task should warn that small models produce lower-quality extractions and suggest validating extracted facts.
- The prompt structure change from Exercise 3 is significant. Exercise 3 uses separate system/user messages for `ollama.chat()`. Exercise 5 puts everything (system, memories, history, context, question) into a single user message. The course should be explicit about which approach to use and why.
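One possible `--test`-flag pattern for the interactive loop; the flag name and the scripted turn list are this review's convention, not the course's.

```typescript
import { createInterface } from "node:readline/promises";

// True when the process was launched with --test anywhere in its arguments.
function isTestMode(argv: string[]): boolean {
  return argv.includes("--test");
}

// Yields user turns: a fixed script under --test, stdin lines otherwise.
async function* turns(argv: string[]): AsyncGenerator<string> {
  if (isTestMode(argv)) {
    // scripted turns make the exercise verifiable without a TTY
    for (const t of [
      "Tell me about Irene Adler.",
      "Why does he respect her?",
      "What character was I interested in?",
    ]) yield t;
    return;
  }
  const rl = createInterface({ input: process.stdin, output: process.stdout });
  for (;;) {
    const line = await rl.question("> ");
    if (line.trim() === "exit") break;
    yield line;
  }
  rl.close();
}
```

The chat loop then becomes `for await (const q of turns(process.argv)) { ... }` and is identical in both modes, which is what makes the scripted runs a fair test.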
Status: completed
Implementation: src/06-compression.ts
bun run src/06-compression.ts --test # 6-turn test conversation
bun run src/06-compression.ts # interactive mode
- Token budget dashboard prints at every turn, showing allocation across system/memories/conversation/context/query
- Total tokens never exceeded budget (stayed around 10-20% utilization with 4096 budget)
- Query rewriting resolved "that one" → "A Study in Scarlet", "the two villains" → comparison query
- Conversation summarization triggered when history exceeds threshold
- Chunk compression reduces lower-ranked chunks to single sentences
- The 4096-token budget is unrealistically large for this dataset. With paragraph-based chunks averaging ~360 chars (~90 tokens), 5 chunks = ~450 tokens. The conversation rarely approaches the 4096 budget, so compression only triggers in extended conversations. Should either use a smaller budget (e.g., 2000) or warn students that compression may not trigger in short test sessions.
- `llama3.2`'s default context may be 2048, not 4096. Ollama's `llama3.2` defaults to a 2048-token context window. The course's 4096-token budget may exceed what the model can actually handle unless `num_ctx: 4096` is passed in the Ollama options. This should be explicitly mentioned.
- The token approximation is rough. `text.length / 4` assumes English prose. For JSON, code, or special characters, this underestimates. The course should mention this is an approximation and suggest verifying against actual model token counts.
- Compression quality feedback loop. The task doesn't address how to verify that compression preserves key information. Students should be told to print the summary and manually check it, or compare compressed vs. uncompressed answers to the same question.
- Missing: the 10-turn test sequence from the course. The course specifies a 6-turn test (Turns 1-6), but the verification says "have a 10-turn conversation." The mismatch is confusing — should consistently specify 6 turns for the core test and 10 as an optional stress test.
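The chars/4 heuristic and a simple budget check can be sketched as below; the part names mirror the dashboard categories above, and the `num_ctx` option is the assumed way to raise the model's window.

```typescript
// chars/4 is the rough token heuristic the course uses; it underestimates
// for JSON and code, so treat results as approximate.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

type BudgetParts = { system: string; memories: string; history: string; context: string; query: string };

function totalTokens(parts: BudgetParts): number {
  return Object.values(parts).reduce((sum, p) => sum + estimateTokens(p), 0);
}

function withinBudget(parts: BudgetParts, budget = 2000): boolean {
  return totalTokens(parts) <= budget;
}

// llama3.2 defaults to a 2048-token window under Ollama, so a larger budget
// only works if the window is raised explicitly (assumed option shape):
// await ollama.chat({ model: "llama3.2", messages, options: { num_ctx: 4096 } });
```

A 2000-token default budget (rather than 4096) makes compression actually trigger in short test sessions, per the first finding above.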
Status: completed
Implementation: src/07-assembly.ts
bun run src/07-assembly.ts --test # 5-turn test covering each intent type
bun run src/07-assembly.ts # interactive mode
- Intent classification worked for 4 of 5 test queries:
- "Hi there!" → GREETING (correct, 0 chunks, fast response)
- "In which story does Holmes fake his death?" → FACTUAL (correct, 5 chunks)
- "How does that compare to how Irene Adler outsmarted him?" → ANALYTICAL (correct, 10 chunks from 6 books)
- "What makes a great detective villain?" → ANALYTICAL (correct, 10 chunks from 7 books)
- "How do you find your answers?" → ANALYTICAL (WRONG, should be META — 0 chunks expected)
- Dynamic system prompt switches correctly per intent
- MMR diversity applied for ANALYTICAL queries
- Routing log prints at every turn with chunk count, books, history, memories, token usage
- Intent classification is unreliable with llama3.2. The 3B model misclassified "How do you find your answers?" as ANALYTICAL instead of META. The task should warn about this and suggest adding few-shot examples to the classification prompt (e.g., provide 1-2 examples per intent). Alternatively, suggest using keyword-based heuristics as a fallback.
- FOLLOW_UP detection never triggered. In the test, "How does that compare..." was classified as ANALYTICAL rather than FOLLOW_UP, even though it references "that" (a pronoun needing resolution). The boundary between FOLLOW_UP and ANALYTICAL is unclear. The task should explain which takes priority when both apply, or suggest merging them.
- The route table doesn't specify what "All available" history means. For FOLLOW_UP, the task says "All available" for conversation history. How much is "all"? The full raw messages? The summary + recent turns? Should give a concrete number or explain the strategy.
- No error handling for misclassification. If the model returns garbage (not one of the 5 intents), the task doesn't say what to do. Should specify a default intent (FACTUAL is a good fallback).
- Exercise is very slow. A single 5-turn test takes 3-5 minutes due to multiple LLM calls per turn (classify + rewrite + retrieval + generation). The course should mention expected runtime and suggest optimizations (e.g., running classification and embedding in parallel).
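A defensive intent-normalization sketch along the lines suggested above; the META-rescue regex and the FACTUAL fallback are this review's assumptions, not the course's design.

```typescript
const INTENTS = ["GREETING", "FACTUAL", "ANALYTICAL", "FOLLOW_UP", "META"] as const;
type Intent = (typeof INTENTS)[number];

// Normalize whatever the 3B model returns ("Intent: FACTUAL.", "follow-up", ...)
// and fall back to FACTUAL on garbage rather than crashing the router.
function normalizeIntent(raw: string, query: string): Intent {
  const compact = raw.toUpperCase().replace(/[^A-Z]/g, ""); // drop punctuation/underscores
  const hit = INTENTS.find((i) => compact.includes(i.replace("_", "")));
  if (hit) return hit;
  // cheap keyword heuristic rescuing self-referential questions the model mislabels
  if (/\byour?\b.*\b(find|answer|work|know)/i.test(query)) return "META";
  return "FACTUAL"; // safest default: retrieval still happens
}
```

Few-shot examples in the classification prompt remain the better fix; this normalization is the safety net underneath them.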
Status: completed
Implementation: src/08-capstone.ts, src/08-eval.ts, src/lib/ (8 modules), data/eval.json
bun run src/08-eval.ts
- Refactored all shared logic into `src/lib/` modules: `db.ts`, `llm.ts`, `embeddings.ts`, `reranker.ts`, `memory.ts`, `compressor.ts`, `assembler.ts`, `chunking.ts`
- Capstone chatbot (`08-capstone.ts`) integrates all modules with streaming output
- Evaluation harness (`08-eval.ts`) runs 10 test cases from `data/eval.json`
- Evaluation scores:
  - Retrieval accuracy: 7/10 (70%) — target was ≥80%
  - Answer accuracy: 2/10 (20%) — target was ≥70%
  - Overall score: 45%
- Retrieval failures: 3 questions retrieved wrong books (e.g., the "Red-Headed League" question pulled chunks from wrong stories)
- Answer failures: Even when correct chunks were retrieved, llama3.2 3B frequently omitted expected keywords or hallucinated details
- The accuracy targets are unrealistic with llama3.2 3B. The course sets ≥80% retrieval and ≥70% answer accuracy, but the 3B model regularly ignores retrieved context, omits expected keywords, and hallucinates. With a larger model (e.g., llama3.1 8B or 70B), these targets would be achievable. The course should either lower the targets for 3B models or recommend a larger model for the eval.
- Keyword-based answer evaluation is brittle. A correct answer that paraphrases rather than using the exact expected keyword scores as a failure. For example, the model might say "disguised himself" instead of the expected keyword "disguise". The course should mention this limitation and suggest fuzzy matching or LLM-based evaluation as alternatives.
- No guidance on what `src/lib/` modules should export. The course says "refactor into reusable modules" but doesn't specify which functions go where, or what the module boundaries should be. Students will make wildly different architectural choices. Should provide a suggested module list with key exports.
- Streaming is mentioned but not taught. The course says to "add streaming" but doesn't show Ollama's streaming API (`ollama.chat({ stream: true })` returns an async iterable). This is a significant implementation detail to leave as an exercise.
- The eval.json test set is self-authored. The course says "create 10+ test questions" but doesn't provide a sample eval.json. Students must write their own test set AND implementation simultaneously, making it hard to know if poor scores reflect bad tests or bad code. Should provide at least 5 reference test cases.
- No baseline comparison. The evaluation runs once with the full pipeline but doesn't compare against a baseline (e.g., Exercise 3's naive RAG). Without a before/after comparison, students can't measure how much their improvements actually helped. Should suggest running the same eval.json against the Exercise 3 pipeline first.
- A `budgetCheck` function is mentioned in the route log but not defined in the course. The capstone assembles all context and should validate total tokens against the budget, but the course doesn't specify when to warn vs. truncate vs. error.
All 9 exercises (0-8) completed successfully. The course teaches a solid progression of context engineering techniques, building from raw data acquisition through to a fully integrated chatbot with evaluation. However, there are significant gaps between what the instructions describe and what students actually need to know.
- `better-sqlite3` vs `bun:sqlite`: The course's recommended SQLite library doesn't work with Bun out of the box. This is the single biggest blocker a student will hit. Must be addressed in prerequisites.
- API signatures never shown: The course tells students what to call (Ollama embed, chat, etc.) but never shows the actual method signatures. Students spend significant time reading `ollama` npm docs. Adding a 3-line code snippet per API call would save hours.
- `import.meta.main` guard not mentioned: Without this, importing from previous exercises causes their `main()` functions to execute as side effects. This breaks every exercise from Exercise 3 onward. Should be introduced in Exercise 1.
- llama3.2 3B limitations not acknowledged: The model is too small for reliable intent classification, fact extraction, LLM-based reranking, and keyword-accurate answers. The course should set expectations: "results will be approximate with a 3B model; accuracy improves significantly with 8B+ models."
- Schema evolution not planned: Exercise 2 creates the schema, Exercise 4 needs an extra column. The course should either include `book_title` from the start or teach `ALTER TABLE` migration.
- No expected runtimes: Embedding 10K chunks takes ~5 min, reranking takes ~12s, a 5-turn conversation takes ~3-5 min. Students with no benchmarks may think something is broken.
- Interactive testing is difficult: Exercises 5-7 are interactive chat loops with no built-in test mode. The course should suggest a `--test` flag pattern from the start.
| Exercise | Status | Key Issue |
|---|---|---|
| 0 - Data Acquisition | ✓ Completed | Gutendex MIME key varies |
| 1 - Chunking | ✓ Completed | Wrong quote phrasing in instructions |
| 2 - Embeddings | ✓ Completed | better-sqlite3 ABI crash |
| 3 - RAG Pipeline | ✓ Completed | Import side effects, poor retrieval |
| 4 - Reranking | ✓ Completed | Schema migration not addressed |
| 5 - Memory | ✓ Completed | Interactive testing not addressed |
| 6 - Compression | ✓ Completed | Budget too large to trigger compression |
| 7 - Assembly | ✓ Completed | Intent classification unreliable with 3B |
| 8 - Capstone + Eval | ✓ Completed | Accuracy targets unrealistic for 3B model |
- Retrieval accuracy: 70% (7/10)
- Answer accuracy: 20% (2/10)
- Overall: 45%