
Context Engineering for LLMs: A Weekend Course

Build a "Sherlock Holmes Expert" chatbot from scratch — and learn every major context engineering technique along the way.

You will spend this weekend building a system that can answer deep questions about Arthur Conan Doyle's Sherlock Holmes stories. Not by stuffing entire novels into a prompt, but by engineering the context — deciding what information to retrieve, how to format it, when to compress it, and what to remember across conversations. By Sunday evening, you'll have a working chatbot backed by a local SQLite vector database, conversation memory, dynamic prompt assembly, and reranking — all running on your machine with no cloud dependencies.


What is context engineering?

Prompt engineering is about crafting a good question. Context engineering is about preparing the entire briefing packet before the question is even asked.

Andrej Karpathy defined it as "the delicate art and science of filling the context window with just the right information for the next step." Anthropic formalized it further: context engineering means "finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."

The mental model: think of the LLM as a CPU and the context window as RAM. The CPU is powerful but can only work with what's loaded into RAM. Your job as a context engineer is to design the system that loads the right data into RAM at the right time — not too much (noise drowns the signal), not too little (the model hallucinates), and in the right format (structure aids comprehension).

This matters because, as Philipp Schmid put it: "Most agent failures are not model failures anymore — they are context failures."

What you will learn

This course covers 8 core techniques, each building on the previous:

| # | Technique | What it solves |
|---|-----------|----------------|
| 1 | Text acquisition & preprocessing | Getting clean data into your system |
| 2 | Chunking strategies | Breaking documents into retrievable units |
| 3 | Embeddings & vector search | Finding relevant chunks by meaning |
| 4 | RAG pipeline | Answering questions from external knowledge |
| 5 | Reranking & filtering | Improving retrieval precision |
| 6 | Memory systems | Remembering across conversation turns |
| 7 | Context compression | Fitting more into less space |
| 8 | Dynamic prompt assembly | Orchestrating all of the above at runtime |

How this course is structured

Every module follows the same pattern:

  • 📖 LEARN — A brief explanation of the technique: what it is, why it matters, and the key concepts. Includes links to go deeper.
  • 🔨 PRACTICE — A hands-on exercise with:
    • The problem: what's broken or missing right now
    • Your task: what to build
    • How to verify: a concrete test that proves it works
    • Stretch goal: an optional harder challenge

Estimated total time: 10–14 hours across two days.


Prerequisites & setup

What you need installed

  • Bun (v1.1+): curl -fsSL https://bun.sh/install | bash
  • Ollama: curl -fsSL https://ollama.com/install.sh | sh

Pull two models (do this first — the downloads take a few minutes):

ollama pull llama3.2        # 3B chat model (~2GB)
ollama pull nomic-embed-text # embedding model (~274MB)

Project setup

mkdir holmes-context-engineering && cd holmes-context-engineering
bun init -y
bun add ollama better-sqlite3
bun add -D @types/better-sqlite3
mkdir data src

Why better-sqlite3 instead of bun:sqlite? This course sticks to brute-force vector search in plain SQLite, but if you later outgrow that, the sqlite-vec extension is the natural upgrade, and better-sqlite3 has stable extension loading. If you prefer bun:sqlite, the API is nearly identical — adapt as needed.

Why this stack?

| Choice | Reason |
|--------|--------|
| Bun | Built-in TypeScript, fast startup, no build step |
| SQLite | Zero-config database, single file, runs everywhere |
| Ollama | Local LLM and embedding inference, no API keys needed |
| No frameworks | You'll build each piece yourself so you understand what frameworks abstract away |

The dataset: Sherlock Holmes

We'll use Arthur Conan Doyle's Sherlock Holmes stories from Project Gutenberg — 4 novels and 56 short stories across 9 volumes. The entire canon is public domain and freely downloadable as plain text.

Why Holmes?

  • Multiple document types: short stories (~8K words each) and novels (~40–60K words) in one corpus
  • Natural structure: stories have titles, chapters have headings, dialogue is abundant
  • Cross-document relationships: characters (Moriarty, Irene Adler, Lestrade) recur across stories
  • Engaging queries: "How does Holmes solve the Red-Headed League case?" is more fun to debug than "Summarize paragraph 3 of document 7"

Module 0: Data Acquisition

📖 LEARN

Every context engineering system starts with clean source data. Garbage in, garbage out — if your text has boilerplate headers, encoding artifacts, or broken paragraph boundaries, your embeddings will be noisy and your retrieval will suffer.

Project Gutenberg texts come with legal headers and footers that must be stripped. The text body uses hard line wrapping at ~72 characters, which means paragraphs look like this:

Holmes was seated at his side-table clad in his
dressing-gown, and working hard over a chemical
investigation.

That's one paragraph across three lines. If you split naively on \n, you'll shatter paragraphs into fragments.
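The unwrapping step can be sketched in a few lines (a sketch only; `unwrapParagraphs` is a name I've chosen, not part of the exercise):

```typescript
// Join hard-wrapped Gutenberg lines back into paragraphs: blank lines
// mark paragraph boundaries; every other newline becomes a space.
function unwrapParagraphs(raw: string): string {
  return raw
    .split(/\n\s*\n/)                                // paragraph boundaries
    .map(p => p.replace(/\s*\n\s*/g, " ").trim())    // join wrapped lines
    .filter(p => p.length > 0)
    .join("\n\n");
}
```

Splitting on blank lines first, then collapsing the remaining newlines, preserves exactly the structure the Gutenberg format encodes.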

Key concept: the Gutendex API. Gutendex (https://gutendex.com/) is a free REST API for Project Gutenberg metadata. It returns JSON with download URLs for every format, so you don't need to hardcode file paths.

🔗 Further reading: Gutendex API docs · Project Gutenberg

🔨 PRACTICE — Exercise 0: "The Game Is Afoot"

The problem: You need the complete Sherlock Holmes corpus as clean, parseable text files on disk — stripped of Gutenberg boilerplate and with paragraphs properly joined.

Your task:

Create src/00-download.ts that:

  1. Fetches metadata for all 9 Holmes volumes from the Gutendex API
  2. Downloads the plain text for each
  3. Strips the Gutenberg header (everything before *** START OF THE PROJECT GUTENBERG EBOOK) and footer (everything after *** END OF THE PROJECT GUTENBERG EBOOK)
  4. Unwraps hard-wrapped lines into proper paragraphs (join lines that don't end a paragraph, preserve blank-line paragraph boundaries)
  5. Saves each volume as a clean .txt file in ./data/

Here are the Gutenberg IDs:

const HOLMES_BOOKS = [
  { id: 244,   title: "A Study in Scarlet" },
  { id: 2097,  title: "The Sign of the Four" },
  { id: 1661,  title: "The Adventures of Sherlock Holmes" },
  { id: 834,   title: "The Memoirs of Sherlock Holmes" },
  { id: 2852,  title: "The Hound of the Baskervilles" },
  { id: 108,   title: "The Return of Sherlock Holmes" },
  { id: 2350,  title: "His Last Bow" },
  { id: 3289,  title: "The Valley of Fear" },
  { id: 69700, title: "The Case-Book of Sherlock Holmes" },
];

How to verify:

bun run src/00-download.ts
# Should produce 9 files in ./data/
ls data/
# No file should contain "PROJECT GUTENBERG" or "*** START"
grep -l "PROJECT GUTENBERG" data/*.txt
# Should return nothing

Spot-check: open any file. The first line should be the book title or opening text, not a license header. Paragraphs should flow naturally, not be chopped at 72 characters.

Stretch goal: Split each short story collection into individual story files by detecting story title boundaries (e.g., "ADVENTURE I. A SCANDAL IN BOHEMIA"). You'll end up with ~60 individual story files instead of 9 volumes — much better granularity for retrieval later.
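If you attempt the stretch goal, a heuristic title detector is enough to get started (a sketch; the patterns are assumptions about Gutenberg's heading conventions, so spot-check them against your actual files):

```typescript
// Heuristic story/chapter boundary detector: title lines tend to be
// short, ALL CAPS, and start with "ADVENTURE", "CHAPTER", or a Roman
// numeral like "I." (assumption based on common Gutenberg formatting).
function isStoryTitle(line: string): boolean {
  const t = line.trim();
  if (t.length === 0 || t.length > 60) return false;
  if (t !== t.toUpperCase()) return false;          // must be ALL CAPS
  return /^(ADVENTURE\b|CHAPTER\b|[IVXLC]+\.)/.test(t);
}
```

Scan each volume line by line and start a new story file whenever this returns true; tune the patterns per volume as needed.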


Module 1: Chunking Strategies

📖 LEARN

Chunking is how you break documents into units that can be independently retrieved. It is arguably the single most impactful variable in RAG quality — often more important than your choice of embedding model or vector database.

The core tension: chunks that are too large average multiple ideas into a single embedding vector, making retrieval imprecise. Chunks that are too small lose context that even a human would need to understand the passage.

Five strategies, from simple to sophisticated:

1. Fixed-size chunking. Split every N tokens with M tokens of overlap. Simple, fast, predictable. But it cuts mid-sentence and mid-thought. Start here as your baseline.

2. Recursive character splitting. Try splitting by paragraph (\n\n). If any chunk is still too big, split by sentence (.). Still too big? Split by word. This naturally respects document structure. It's the default in LangChain's RecursiveCharacterTextSplitter for good reason.

3. Sentence-based chunking. Group N consecutive sentences per chunk. Preserves complete thoughts but produces variable-length chunks.

4. Semantic chunking. Embed each sentence, then measure cosine similarity between adjacent sentences. When similarity drops below a threshold, that's a chunk boundary. The most intelligent automated approach — it finds natural topic shifts. But it requires computing embeddings during indexing.

5. Document-structure-aware chunking. Split along structural boundaries: markdown headers, HTML tags, story titles, chapter breaks. For Holmes stories, this means one chunk per story section or chapter. Ideal when your documents have reliable structure.

The rule of thumb: if a chunk makes sense when read alone by a human, it will make sense to the LLM.
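As a concrete baseline, strategy 1 fits in a few lines (a sketch; `chunkFixed` is my own name for it):

```typescript
// Fixed-size chunking with overlap: each chunk is up to `size` chars,
// and consecutive chunks share `overlap` chars so a sentence cut at
// one boundary still appears whole in the neighboring chunk.
function chunkFixed(text: string, size: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = size - overlap;     // advance by size minus overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```

Every other strategy in the list is a refinement of where this loop is allowed to cut.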

🔗 Further reading: Weaviate — Chunking Strategies for RAG · Greg Kamradt — 5 Levels of Text Splitting (YouTube) · ChonkieJS on GitHub

🔨 PRACTICE — Exercise 1: "The Science of Deduction"

The problem: You have 9 large text files. You can't embed or search an entire novel at once — the embedding would be a meaningless average of thousands of ideas. You need to break them into retrievable units, but how you break them dramatically affects what you can find later.

Your task:

Create src/01-chunking.ts that implements three chunking strategies and compares them:

  1. Fixed-size: 500 characters with 50-character overlap
  2. Paragraph-based: split on double newlines, merge short paragraphs (< 100 chars) with the next one
  3. Story-aware: detect story/chapter boundaries by title patterns (e.g., lines that are ALL CAPS or match "CHAPTER", "ADVENTURE") and chunk within those boundaries

Each chunk should be an object:

interface Chunk {
  text: string;
  source: string;      // filename
  strategy: string;    // "fixed" | "paragraph" | "story-aware"
  index: number;       // chunk number within source
  charCount: number;
}

Print a comparison table at the end:

Strategy       | Total Chunks | Avg Length | Min | Max
fixed          |         1847 |        498 |  50 |  550
paragraph      |         1203 |        764 | 101 | 3211
story-aware    |          312 |       2874 | 423 | 8901

How to verify:

Pick a test passage you can find by eye — for example, Holmes's famous quote about eliminating the impossible from The Sign of the Four. Search for it (with grep or includes()) in the chunks from each strategy:

  • Fixed-size: Is the quote split across two chunks? (Likely yes.)
  • Paragraph-based: Is the quote in a single chunk with surrounding context? (Should be.)
  • Story-aware: Is the quote in a large section that includes the full scene? (Should be.)

There's no single "right" strategy — the verification here is that you can articulate the tradeoffs and see them in your data.

Stretch goal: Implement semantic chunking. Embed each sentence with Ollama's nomic-embed-text, compute cosine similarity between consecutive sentences, and split where similarity drops below a threshold (start with 0.5). Compare the boundaries it finds with your paragraph-based boundaries.


Module 2: Embeddings & Vector Search

📖 LEARN

An embedding is a dense vector representation of text — a list of numbers (typically 384 to 1536 floats) that captures semantic meaning. Texts with similar meanings produce vectors that are close together in this high-dimensional space.

How it works: An embedding model (like nomic-embed-text) processes your text through a transformer neural network and outputs a fixed-size vector. Two pieces of text that mean similar things — even if they use completely different words — will have vectors with high cosine similarity (close to 1.0). Unrelated texts will have low similarity (close to 0.0).

Cosine similarity measures the angle between two vectors, ignoring magnitude. It's the standard similarity metric for text embeddings:

cos(A, B) = (A · B) / (||A|| × ||B||)

Where A · B is the dot product and ||A|| is the vector's length (L2 norm).

Vector search means: given a query, embed it, then find the stored vectors with the highest cosine similarity. This is a "nearest neighbors" search in high-dimensional space.

For small datasets (< 100K vectors), you can compute cosine similarity against every stored vector (brute force). For larger datasets, you need approximate nearest neighbor (ANN) algorithms. sqlite-vec provides this as a SQLite extension — it adds a virtual table type that supports KNN queries using a MATCH clause.

However, for this course (thousands of chunks, not millions), brute-force search in plain SQLite is perfectly fast and simpler to understand. We'll store vectors as JSON arrays and compute similarity in TypeScript. This removes the sqlite-vec dependency and teaches you what vector databases actually do under the hood.
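The brute-force search described above is just a scored sort. A sketch over an in-memory array (in the exercise, the rows come out of SQLite and the query vector from nomic-embed-text; `topK` and `Stored` are names I've chosen):

```typescript
interface Stored { text: string; embedding: number[] }

// Cosine similarity: angle between two vectors, ignoring magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbors: score every stored vector against the
// query vector, sort descending, keep the top K.
function topK(queryVec: number[], stored: Stored[], k: number) {
  return stored
    .map(s => ({ ...s, score: cosineSimilarity(queryVec, s.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

This linear scan is O(n × d) per query, which is exactly the cost ANN indexes exist to avoid at scale.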

🔗 Further reading: Ollama Embedding Models · sqlite-vec on GitHub (for when you outgrow brute force) · What Are Embeddings? (Vicki Boykis)

🔨 PRACTICE — Exercise 2: "A Study in Vectors"

The problem: Your chunks are plain text. You can search them with keyword matching (includes()), but that fails for semantic queries. "What methods does Holmes use to solve crimes?" won't match a passage about "deductive reasoning" or "examining boot prints" because the exact words don't appear.

You need to represent each chunk as a vector so you can search by meaning.

Your task:

Create src/02-embeddings.ts that:

  1. Takes the paragraph-based chunks from Exercise 1
  2. Embeds each chunk using Ollama's nomic-embed-text model
  3. Stores chunks + embeddings in a SQLite database:
CREATE TABLE chunks (
  id INTEGER PRIMARY KEY,
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  embedding TEXT NOT NULL  -- JSON array of floats
);
  4. Implements a search(query: string, topK: number) function that:

    • Embeds the query
    • Computes cosine similarity against every stored embedding
    • Returns the top-K most similar chunks
  5. Runs three test queries and prints results:

    • "How does Holmes use disguises?" — should find passages about Holmes's famous disguise abilities
    • "What is the relationship between Holmes and Watson?" — should find passages about their friendship
    • "dangerous hound on the moor" — should find Hound of the Baskervilles passages

Starter code for the cosine similarity function:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

How to verify:

For the query "dangerous hound on the moor", the top 3 results should all come from The Hound of the Baskervilles (source file contains "2852" or "Hound"). If results come from random other stories, your embeddings or similarity function are broken.

Measure search latency:

const start = performance.now();
const results = search("dangerous hound", 5);
console.log(`Search took ${(performance.now() - start).toFixed(1)}ms`);

With a few thousand chunks, brute-force search should complete in < 200ms. If it's seconds, you're likely re-parsing JSON on every search — cache the parsed float arrays.

Stretch goal: Try two embedding approaches and compare:

  • Ollama's nomic-embed-text (768 dimensions)
  • A smaller model like all-minilm via Ollama (384 dimensions)

Do the same queries return the same top results? Is the smaller model noticeably faster? Does it miss relevant passages?


Module 3: The RAG Pipeline

📖 LEARN

Retrieval-Augmented Generation (RAG) is the foundational context engineering pattern. Instead of relying on the LLM's training data (which may be wrong, outdated, or missing), you retrieve relevant information from your own knowledge base and inject it into the prompt.

The basic RAG pipeline has three steps:

Query → Retrieve (vector search) → Generate (LLM with retrieved context)

The prompt template looks like this:

You are a Sherlock Holmes expert. Answer the question using ONLY
the provided context. If the context doesn't contain the answer,
say so.

Context:
{retrieved_chunks}

Question: {user_query}

Why RAG works: The LLM gets exactly the relevant passages it needs and generates an answer grounded in source material. This reduces hallucination, allows the system to cite sources, and works with any documents — no fine-tuning required.

Why naive RAG fails: If your retrieval returns the wrong chunks (bad chunking, bad embeddings, or too few results), the LLM either hallucinates or gives a vague answer. If your retrieval returns too many chunks, the LLM gets overwhelmed and may miss the key passage ("lost in the middle" effect). Modules 4–8 exist to fix these failure modes.
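The assembly step of the pipeline is plain string interpolation. A sketch (helper names are mine; the real version would send the result to llama3.2 via Ollama's chat endpoint):

```typescript
interface Retrieved { text: string; source: string }

// Assemble the RAG prompt: numbered, source-labeled passages inside
// <context> tags, with the question placed last so the model attends
// to it most strongly.
function buildRagPrompt(chunks: Retrieved[], question: string): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join("\n\n");
  return `You are a Sherlock Holmes expert. Answer the question using ONLY
the provided context. If the context doesn't contain the answer, say so.

<context>
${context}
</context>

Question: ${question}`;
}
```

Numbering each passage and labeling its source is what makes citations possible in the answer.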

🔗 Further reading: Anthropic — Effective context engineering for AI agents · Pinecone — RAG

🔨 PRACTICE — Exercise 3: "The Adventure of the Retrieval Pipeline"

The problem: You can embed and search, but you can't answer questions yet. A human asking "How was the Red-Headed League scheme uncovered?" doesn't want a list of text chunks — they want a coherent answer with details from the story.

Your task:

Create src/03-rag.ts that:

  1. Takes a natural language question as a command-line argument
  2. Retrieves the top 5 most similar chunks from your SQLite database
  3. Assembles a prompt with the retrieved context
  4. Sends it to Ollama's llama3.2 model
  5. Prints the answer with source citations
bun run src/03-rag.ts "How was the Red-Headed League scheme uncovered?"

Your prompt template should:

  • Include a system instruction defining the role ("Sherlock Holmes literary expert")
  • Inject retrieved chunks inside clearly delimited tags (use <context> / </context>)
  • Number each chunk with its source so the LLM can cite them
  • Place the question after the context (models attend better to content at the end)

Example prompt structure:

<system>
You are an expert on the Sherlock Holmes stories by Arthur Conan Doyle.
Answer questions using ONLY the provided context passages. Cite the
passage numbers you used. If the context doesn't contain enough
information, say so honestly.
</system>

<context>
[1] (The Adventures of Sherlock Holmes)
Holmes laughed at the ingenious...

[2] (The Adventures of Sherlock Holmes)
"You see, Watson, the Red-Headed League...

[3] ...
</context>

Question: How was the Red-Headed League scheme uncovered?

How to verify:

Test with these three questions and evaluate the answers manually:

| Question | Should mention | Should NOT do |
|----------|----------------|---------------|
| "How was the Red-Headed League scheme uncovered?" | Jabez Wilson, tunnel to bank, John Clay | Make up details not in the stories |
| "Describe the Hound of the Baskervilles." | Phosphorus, Stapleton, the moor | Confuse it with other stories |
| "What does Holmes think about Irene Adler?" | "The Woman", respect, photograph | Invent a romance subplot |

If an answer is clearly wrong or hallucinates details, your retrieval is likely returning irrelevant chunks. Print the retrieved chunks alongside the answer so you can debug which passages the LLM was working from.

Stretch goal: Implement query rewriting. Before searching, send the user's question to the LLM with this prompt: "Rewrite this question as 3 different search queries that would help find relevant passages in the Sherlock Holmes stories." Search with all 3 queries, deduplicate results, then use the combined context. Does this improve answer quality for vague queries like "Tell me about Moriarty"?


Module 4: Reranking & Filtering

📖 LEARN

Vector search retrieves chunks that are similar to the query embedding. But similarity ≠ relevance. A passage about "Watson's description of the London fog" might be similar to a query about "the weather in Holmes stories" in embedding space, but a passage about "Holmes examining footprints in the fog" is far more relevant.

Reranking adds a second pass that re-scores retrieved results for relevance to the specific query. The standard approach is a two-stage funnel:

  1. Stage 1: Retrieve broadly. Pull the top 20–50 chunks using fast vector search (maximize recall).
  2. Stage 2: Rerank precisely. Score each chunk against the query using a more expensive method, then take the top 5 (maximize precision).

Three reranking approaches, from simplest to most powerful:

LLM-based reranking: Ask your LLM to score each chunk's relevance on a 1–10 scale. This is the most accurate method and requires no additional models, but it's slow because you make one LLM call per chunk. Practical when reranking 10–20 candidates, not 200.

Keyword + vector hybrid scoring: Combine the cosine similarity score with a keyword overlap score (e.g., BM25 or simple TF-IDF). This catches cases where the exact terms from the query appear in a chunk but the embeddings don't rank it highly.

MMR (Maximal Marginal Relevance): After reranking, you may have 5 chunks that all say basically the same thing. MMR selects results that are relevant to the query and diverse from each other. The formula penalizes a chunk if it's too similar to one already selected. This ensures your context covers multiple angles rather than repeating one perspective.

Metadata filtering is the simplest and most overlooked technique: if the user asks about The Hound of the Baskervilles, filter to only chunks from that book before doing vector search. This eliminates irrelevant results instantly.
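One practical wrinkle with LLM-based reranking: even when told to "respond with ONLY a number", small models sometimes add extra text. Parse the score defensively (a sketch; `parseScore` is my own helper, and the fallback value is an assumption):

```typescript
// Extract a 1-10 relevance score from an LLM reply, tolerating extra
// text like "Relevance: 9/10" or "Score: 7.". Falls back to a neutral
// score if no digits appear at all.
function parseScore(reply: string, fallback = 5): number {
  const match = reply.match(/\d+/);          // first run of digits
  if (!match) return fallback;
  const n = parseInt(match[0], 10);
  return Math.min(10, Math.max(1, n));       // clamp to the 1-10 scale
}
```

A neutral fallback keeps one malformed reply from silently promoting or burying a chunk.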

🔗 Further reading: Weaviate — Cross-Encoders as Reranker · Pinecone — Rerankers

🔨 PRACTICE — Exercise 4: "The Deduction Engine"

The problem: Your RAG pipeline returns the top 5 chunks by embedding similarity, but when you ask "What clues did Holmes find at the crime scene in A Study in Scarlet?", some of the returned chunks are from other stories that happen to mention crime scenes. The answer mixes up details from multiple cases.

Your task:

Create src/04-reranking.ts that upgrades your retrieval with:

  1. Metadata filtering: Add a book_title column to your chunks table. When the query mentions a specific book or story (detect by keyword matching), filter to only that book's chunks before searching.

  2. Broader initial retrieval: Retrieve top 20 instead of top 5.

  3. LLM-based reranking: For each of the 20 retrieved chunks, ask the LLM:

On a scale of 1-10, how relevant is this passage to answering the question?
Question: {query}
Passage: {chunk_text}
Respond with ONLY a number.
  4. MMR diversity selection: From the reranked list, select the top 5 using MMR with λ=0.7 (70% relevance, 30% diversity). Implement MMR as:
function mmrSelect(
  candidates: { text: string; embedding: number[]; score: number }[],
  selected: { embedding: number[] }[],
  lambda: number
): typeof candidates[0] {
  let best = candidates[0];
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    // Penalty: similarity to the closest already-selected chunk
    const maxSimToSelected = selected.length === 0
      ? 0
      : Math.max(...selected.map(s => cosineSimilarity(candidate.embedding, s.embedding)));
    const mmrScore = lambda * candidate.score - (1 - lambda) * maxSimToSelected;
    if (mmrScore > bestScore) {
      best = candidate;
      bestScore = mmrScore;
    }
  }
  return best;
}

How to verify:

Run a controlled comparison. Ask the same question with both the old (Exercise 3) and new pipeline:

Question: "What clues did Holmes find at the crime scene in A Study in Scarlet?"

Old pipeline: Print the book sources of all 5 returned chunks. How many are from A Study in Scarlet?

New pipeline: Print the same. With metadata filtering + reranking, all 5 should be from the correct book, and they should cover different clues (the word "RACHE", the blood, the pill box, etc.) rather than repeating the same scene.

Also compare the wall-clock time. The reranking pass makes 20 LLM calls — how much slower is it? (This motivates why production systems use cross-encoder models instead of LLM reranking.)

Stretch goal: Implement a simple hybrid score that combines cosine similarity (0–1) with keyword overlap. Count how many words from the query appear in the chunk, normalize to 0–1, then use finalScore = 0.7 * cosineSim + 0.3 * keywordScore. Does this catch passages that embeddings miss?


Module 5: Memory Systems

📖 LEARN

Everything we've built so far is stateless — each question is answered independently. But real conversations require memory. If a user asks "What does Holmes think of Moriarty?" and then follows up with "When did they first meet?", the system needs to know that "they" means Holmes and Moriarty.

Context engineering recognizes three types of memory:

Short-term memory (conversation history): The raw message history of the current session. The simplest form: append every user message and assistant response to an array, include it in the prompt. The problem: conversations grow, and context windows are finite.

Long-term memory (semantic memory): Extracted facts that persist across sessions. After a conversation, the system might store: "The user is particularly interested in Moriarty as a character" or "The user prefers detailed plot summaries over thematic analysis." These are retrieved by relevance at the start of future conversations.

Episodic memory: Summaries of past interactions stored as examples. "Last time you asked about Moriarty, I explained the Reichenbach Falls encounter and you wanted more detail about the mathematics." This enables the system to learn from past successes.

For this course, we'll implement short-term memory with a sliding window and long-term fact extraction. The key challenge is context allocation: your context window has a fixed token budget, and you must divide it between retrieved knowledge, conversation history, memories, and the system prompt.
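The sliding window for short-term memory is worth seeing in miniature (a sketch; the helper names are mine):

```typescript
interface Message { role: "user" | "assistant"; content: string }

// Sliding window: keep only the last n messages for the prompt. Older
// turns simply fall off (Module 6 replaces dropping with summarization).
function windowMessages(history: Message[], n: number): Message[] {
  return history.slice(-n);
}

// Render the window into the conversation_history section of the prompt.
function renderHistory(history: Message[], n: number): string {
  return windowMessages(history, n)
    .map(m => `${m.role === "user" ? "User" : "Assistant"}: ${m.content}`)
    .join("\n");
}
```

The simplicity is the point: the hard part of memory is not storing messages but deciding how much of the token budget they deserve.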

🔗 Further reading: LangChain — Memory for Agents · Letta — Agent Memory · Mem0 Research

🔨 PRACTICE — Exercise 5: "Watson's Notebook"

The problem: Ask your chatbot "What does Holmes think of Irene Adler?" — it answers well. Now ask "Why does he call her that?" — it has no idea what "that" refers to. Every query is independent. There's no conversation.

Your task:

Create src/05-memory.ts that adds two types of memory to your chatbot:

Part A: Conversation history (short-term)

  1. Create an interactive chat loop (read user input from stdin)
  2. Maintain a messages array of { role: "user" | "assistant", content: string } objects
  3. Include the last N messages (start with N=6, i.e., 3 turns) in the prompt, placed between the system instruction and the retrieved context
  4. The prompt structure becomes:
<system>...</system>

<conversation_history>
User: What does Holmes think of Irene Adler?
Assistant: Holmes regards Irene Adler with great respect...
User: Why does he call her that?
</conversation_history>

<context>
{retrieved chunks — now also informed by conversation history}
</context>

Question: Why does he call her that?
  5. Critical detail: When searching for relevant chunks, use the current question combined with recent context, not just the raw question. Create a search query by asking the LLM: "Given this conversation, what search query would find relevant information? Conversation: {last 2 turns}. Respond with just the search query."

Part B: Fact extraction (long-term)

  1. Add a memories table to your SQLite database:
CREATE TABLE memories (
  id INTEGER PRIMARY KEY,
  fact TEXT NOT NULL,
  embedding TEXT NOT NULL,
  created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
  2. After every 3 conversation turns, extract facts by prompting the LLM:
Extract 1-3 key facts about the user's interests from this conversation.
Format each as a single sentence. Only extract facts that would be useful
for future conversations. If nothing notable, respond with "NONE".

Conversation:
{recent messages}
  3. At the start of each new query, retrieve the top 2 most relevant memories and include them in the prompt.

How to verify:

Run this exact conversation sequence:

You: What does Holmes think about Irene Adler?
Bot: [should answer about "The Woman", admiration, etc.]
You: Why does he respect her so much?
Bot: [should understand "her" = Irene Adler from conversation history, discuss the photograph/disguise]
You: Which story is this from?
Bot: [should answer "A Scandal in Bohemia" — requires connecting the full conversation thread]

If the bot loses track of who "her" or "this" refers to, your conversation history isn't being included properly. Print the full assembled prompt to debug.

For long-term memory, start a conversation about Moriarty. End it. Start a new conversation and ask "What character was I interested in last time?" — the memory system should retrieve the stored fact about the user's interest in Moriarty.

Stretch goal: Implement a token budget allocator. Given a context window of 4096 tokens:

  • System prompt: ~200 tokens (fixed)
  • Memories: up to 200 tokens
  • Conversation history: up to 800 tokens
  • Retrieved context: up to 2500 tokens
  • Query + buffer: ~300 tokens

If conversation history exceeds 800 tokens, summarize older turns. If retrieved context exceeds 2500 tokens, drop the lowest-ranked chunks. Count tokens approximately using text.length / 4.


Module 6: Context Compression

📖 LEARN

Context windows are finite. As your system gets more sophisticated — adding conversation history, memories, tool results, retrieved chunks — you'll hit the limit. And even before you hit it, performance degrades as context grows. Research on "context rot" (a failure mode Anthropic's context-engineering guide highlights) shows that model recall drops steadily as you pack more into the window. More context ≠ better answers.

Compression techniques, from simple to sophisticated:

Truncation: Simply drop the oldest content. Fast but brutal — you might lose critical early context.

Rolling summarization: When history exceeds a threshold, compress older turns into a summary. Keep recent turns verbatim. The summary gets updated incrementally — each time you compress, you merge the new messages into the existing summary rather than re-summarizing everything from scratch.

Map-reduce summarization for documents: For long retrieved passages, summarize each individually (map), then combine the summaries (reduce). This fits more source material into less space.

Selective compression: Not all parts of the context are equally important. Tool outputs (like full web pages) can often be compressed aggressively while conversation turns should be preserved. JetBrains Research found that replacing old tool observations with "[details omitted]" while keeping reasoning steps reduced costs 7–11% while improving success rates.

The key insight: compression is a context allocation strategy, not a last resort. You should design your system with compression as a first-class operation, not bolt it on when you run out of space.
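The bookkeeping around rolling summarization is mechanical; only the summary itself needs an LLM call. A sketch of the split (names are my own; the LLM summarization happens outside this function, on the `toSummarize` half):

```typescript
interface Msg { role: string; content: string }

// The course's rough token estimate: ~4 characters per token.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// When history exceeds the budget, split it: everything except the last
// `keepTurns` turns (2 messages per turn) goes to the summarizer; the
// rest stays verbatim in the prompt.
function splitForSummary(
  history: Msg[],
  budgetTokens: number,
  keepTurns: number
): { toSummarize: Msg[]; keep: Msg[] } {
  const total = history.reduce((sum, m) => sum + approxTokens(m.content), 0);
  if (total <= budgetTokens) return { toSummarize: [], keep: history };
  const keepCount = keepTurns * 2;               // user + assistant per turn
  return {
    toSummarize: history.slice(0, -keepCount),
    keep: history.slice(-keepCount),
  };
}
```

Keeping the split pure makes it trivial to test, which matters because off-by-one errors here silently corrupt every later turn.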

🔗 Further reading: Factory.ai — Compressing Context · LLMLingua — Prompt Compression (Microsoft Research) · JetBrains — Efficient Context Management

🔨 PRACTICE — Exercise 6: "The Reichenbach Compression"

The problem: Your chatbot works for 3–4 turns, but then conversations get long. The conversation history, retrieved chunks, memories, and system prompt together exceed the model's comfort zone. Answers get vague. The model starts ignoring the retrieved context. (With llama3.2 at 3B parameters, this happens faster than with larger models.)

You need to compress older context without losing important information.

Your task:

Create src/06-compression.ts that implements:

  1. A token counter. Approximate tokens as Math.ceil(text.length / 4). Track total context size at every turn.

  2. Rolling conversation summarization. When conversation history exceeds 800 tokens:

    • Take all messages except the last 2 turns (4 messages)
    • Send them to the LLM: "Summarize this conversation in 2-3 sentences, preserving key topics, character names, and any questions the user asked: {messages}"
    • Replace the old messages with a single { role: "system", content: "[Summary of earlier conversation]: ..." } entry
    • Keep the last 2 turns verbatim
  3. Retrieved context compression. When the combined retrieved chunks exceed 2000 tokens:

    • Keep the top 2 highest-scored chunks in full
    • For chunks 3–5, ask the LLM: "Summarize this passage in one sentence, preserving key facts and character names: {chunk}"
    • Use the summaries instead of the full chunks
  4. A context budget dashboard. Print the token allocation at every turn:

--- Context Budget (4096 tokens) ---
System prompt:       187 tokens  ( 5%)
Memories:            143 tokens  ( 3%)
Conversation:        612 tokens (15%)  [includes summary]
Retrieved context:  1847 tokens (45%)  [2 full + 3 compressed]
Query:               89 tokens  ( 2%)
Available:          1218 tokens (30%)
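
One possible shape for the counter and dashboard (the 4096 budget and section labels mirror the example output above; the exact formatting helpers are assumptions, not a required API):

```typescript
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Print a context-budget dashboard like the example above and return total usage.
function printBudget(sections: Record<string, string>, budget = 4096): number {
  console.log(`--- Context Budget (${budget} tokens) ---`);
  let used = 0;
  for (const [label, text] of Object.entries(sections)) {
    const t = estimateTokens(text);
    used += t;
    const pct = Math.round((t / budget) * 100);
    console.log(
      `${label.padEnd(19)}${String(t).padStart(5)} tokens (${String(pct).padStart(2)}%)`,
    );
  }
  console.log(`${"Available:".padEnd(19)}${String(budget - used).padStart(5)} tokens`);
  return used;
}
```

Returning the total makes it easy to assert `used <= budget` at every turn during the verification run.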

How to verify:

Have a 10-turn conversation. At each turn, print the context budget dashboard. Check that:

  • Total never exceeds 4096 tokens (or whatever budget you set)
  • The summary accurately reflects older conversation. After 6+ turns, print the summary. Does it capture the key topics? Ask the bot to recall something from turn 1 — can it answer from the summary?
  • Compressed chunks still convey useful information. Compare answers to the same question with and without compression. If the compressed version hallucinates or is significantly worse, your compression is too aggressive.

Run this test sequence:

Turn 1: "Tell me about The Hound of the Baskervilles"
Turn 2: "Who is the villain?"
Turn 3: "How does Holmes solve the case?"
Turn 4: "Now tell me about A Study in Scarlet"
Turn 5: "Who is the villain in that one?"
Turn 6: "Compare the two villains"  ← requires memory of both books

At Turn 6, the bot must reference details from both Turns 1–3 (Stapleton) and Turns 4–5 (Jefferson Hope), even though the early turns are now compressed into a summary. If it can't, your summary is losing critical information.

Stretch goal: Implement incremental summarization. Instead of re-summarizing from scratch each time, update the existing summary: "Current summary: {existing}. New messages: {new_messages}. Update the summary to include the new information." This is more token-efficient and preserves better continuity. Compare the quality of summaries after 10 turns between from-scratch and incremental approaches.


Module 7: Dynamic Prompt Assembly

📖 LEARN

In production LLM applications, prompts are not static strings. They are dynamically assembled at runtime from modular components based on the current state of the conversation, the user's query, and the results of various retrieval and analysis steps.

Think of it as a pipeline with decision points:

User query arrives
  → Classify intent (question, follow-up, meta-question, off-topic)
  → If question: determine which books/stories are relevant
  → Retrieve context (with reranking and filtering)
  → Check conversation history (is this a follow-up?)
  → Retrieve relevant memories
  → Apply compression if over budget
  → Select the right system prompt variant
  → Assemble final prompt from all components
  → Send to LLM
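
The pipeline above can be sketched as a function that wires together pluggable steps. Every step here is a stub interface you'd back with the real implementations from earlier modules — the names are illustrative, not a fixed API:

```typescript
type Intent = "GREETING" | "FACTUAL" | "ANALYTICAL" | "FOLLOW_UP" | "META";

// Hypothetical step interfaces; each maps to an earlier module.
interface Steps {
  classify(query: string): Promise<Intent>;
  retrieve(query: string, intent: Intent): Promise<string[]>;
  memories(query: string, intent: Intent): Promise<string[]>;
  compress(parts: string[], budget: number): Promise<string[]>;
  systemPrompt(intent: Intent): string;
}

async function assemble(query: string, steps: Steps, budget = 4096): Promise<string> {
  const intent = await steps.classify(query);
  const chunks =
    intent === "GREETING" || intent === "META"
      ? [] // greetings and meta-questions skip retrieval entirely
      : await steps.retrieve(query, intent);
  const mems = await steps.memories(query, intent);
  const parts = await steps.compress([...mems, ...chunks], budget);
  return [steps.systemPrompt(intent), ...parts, `User: ${query}`].join("\n\n");
}
```

The decision points live in data (the intent) rather than in branching prompt strings, which keeps each stage independently testable.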

Key principles from Anthropic's engineering blog:

System prompt altitude: Write system prompts at the right level of abstraction. Too specific ("Always answer in exactly 3 paragraphs") creates brittle behavior. Too vague ("Be helpful") gives the model no useful guidance. The sweet spot: describe principles and examples ("When discussing plot details, reference specific story events. When you're unsure, say so rather than guessing.").

Just-in-time context injection: Don't load everything upfront. If the user hasn't asked about a specific story, don't waste tokens including its context. Inject only when needed — and remove it when the conversation moves on.

Prompt chaining: Some queries need multiple LLM calls. "Compare Holmes's deductive methods across all four novels" might require: (1) search each novel separately, (2) summarize findings per novel, (3) generate a comparison. This is a three-step chain, not a single prompt.
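
That three-step chain might look like the sketch below, where `llm` is a hypothetical single-call helper and `search` a hypothetical per-book retrieval function — both are assumptions standing in for your Module 2 and Module 0 code:

```typescript
type Llm = (prompt: string) => Promise<string>;
type Search = (query: string, book: string) => Promise<string[]>;

async function compareAcrossNovels(
  topic: string,
  books: string[],
  search: Search,
  llm: Llm,
): Promise<string> {
  // Steps 1 + 2: search each novel separately, then summarize the findings per novel.
  const perBook = await Promise.all(
    books.map(async (book) => {
      const passages = await search(topic, book);
      return llm(`Summarize what these passages say about ${topic}:\n${passages.join("\n")}`);
    }),
  );
  // Step 3: generate the comparison from the per-novel summaries only.
  return llm(`Compare ${topic} across these novels:\n${perBook.join("\n---\n")}`);
}
```

The final call never sees the raw passages — only the per-novel summaries — which is exactly what keeps a four-novel comparison inside the context budget.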

🔗 Further reading: Philipp Schmid — The New Skill in AI is Not Prompting, It's Context Engineering · LangChain — Context Engineering for Agents · 12-Factor Agents

🔨 PRACTICE — Exercise 7: "The Consulting Detective"

The problem: Your chatbot currently uses the same prompt structure for every query. But "What is Holmes's address?" (simple factual lookup), "Compare Moriarty and Stapleton as villains" (multi-document synthesis), and "Hello!" (casual greeting) all need fundamentally different context strategies. A greeting doesn't need 2000 tokens of retrieved context. A comparison needs retrieval from multiple books.

Your task:

Create src/07-assembly.ts that implements an intelligent prompt assembler:

  1. Intent classifier. Before doing anything else, classify the user's query by prompting the LLM:
Classify this message into one category:
- GREETING: casual hello, small talk
- FACTUAL: specific question with a clear answer
- ANALYTICAL: requires comparison, analysis, or synthesis
- FOLLOW_UP: refers to something said earlier in conversation
- META: question about the chatbot itself or how it works

Message: "{user_message}"
Respond with only the category name.
  2. Route-specific context strategies:

| Intent | Retrieved chunks | Conversation history | Memories |
|---|---|---|---|
| GREETING | 0 | Last 1 turn | Top 1 |
| FACTUAL | Top 5, from a single relevant book if detectable | Last 2 turns | 0 |
| ANALYTICAL | Top 10, from multiple books, with MMR diversity | Last 3 turns | Top 2 |
| FOLLOW_UP | Top 5, using conversation-aware query rewriting | All available | Top 2 |
| META | 0 | Last 2 turns | 0 |

  3. Dynamic system prompt. Maintain a library of system prompt snippets:

const PROMPTS = {
  base: "You are an expert on the Sherlock Holmes stories by Arthur Conan Doyle.",
  factual: "Answer precisely and cite the specific story. If unsure, say so.",
  analytical: "Compare and contrast thoroughly. Reference specific scenes and quotes.",
  greeting: "Respond warmly and briefly. Mention you can discuss Sherlock Holmes stories.",
  meta: "Explain your capabilities honestly. You search a database of Sherlock Holmes texts.",
};

Assemble the system prompt dynamically: base + route-specific snippet.

  4. Context budget enforcement. After assembling all components, check total tokens. If over budget, compress in this priority order: (a) reduce retrieved chunks (drop lowest-ranked), (b) compress conversation history, (c) trim memories.

  5. Logging. At each turn, print a one-line summary of what the assembler decided:

[ANALYTICAL] 10 chunks retrieved from 3 books | 3 turns of history | 2 memories | 2847/4096 tokens
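
The budget-enforcement priority order can be sketched as a loop over increasingly aggressive fallbacks. The `Ctx` shape and the stand-in of "drop oldest history" for summarization are assumptions for illustration:

```typescript
interface Ctx {
  chunks: string[];   // retrieved chunks, ranked best-first
  history: string[];  // conversation turns, oldest first
  memories: string[]; // memories, ranked best-first
}

const estimateTokens = (text: string) => Math.ceil(text.length / 4);
const size = (c: Ctx) =>
  [...c.chunks, ...c.history, ...c.memories].reduce((n, s) => n + estimateTokens(s), 0);

// Shrink in priority order: (a) drop lowest-ranked chunks, (b) drop oldest
// history (a crude stand-in for summarizing it), (c) trim memories.
function enforceBudget(ctx: Ctx, budget: number): Ctx {
  const c: Ctx = { chunks: [...ctx.chunks], history: [...ctx.history], memories: [...ctx.memories] };
  while (size(c) > budget && c.chunks.length > 0) c.chunks.pop();
  while (size(c) > budget && c.history.length > 2) c.history.shift();
  while (size(c) > budget && c.memories.length > 0) c.memories.pop();
  return c;
}
```

Note the floor of two history turns — the routing table above never allocates fewer, so enforcement should not cut below it either.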

How to verify:

Run this conversation and check the routing log at each turn:

You: Hi there!
  → Should classify as GREETING, retrieve 0 chunks, respond briefly.

You: In which story does Holmes fake his death?
  → Should classify as FACTUAL, retrieve from specific story.

You: How does that compare to how Irene Adler outsmarted him?
  → Should classify as FOLLOW_UP (refers to "that") or ANALYTICAL.
  → Search query should reference Holmes faking his death + Irene Adler.

You: What makes a great detective villain?
  → Should classify as ANALYTICAL, retrieve from multiple books.

You: How do you find your answers?
  → Should classify as META, no retrieval needed.

Verify that GREETING turns are fast (no retrieval delay) and ANALYTICAL turns produce richer answers (more source material). If the intent classifier miscategorizes, adjust the classification prompt or add examples.

Stretch goal: Implement prompt chaining for ANALYTICAL queries. Break "Compare Moriarty and Stapleton" into: (1) Retrieve and summarize Moriarty passages → (2) Retrieve and summarize Stapleton passages → (3) Generate comparison from both summaries. Compare the chained answer to a single-retrieval answer. The chained version should cover both characters more thoroughly.


Module 8: Capstone — The Master Detective

📖 LEARN

This final module ties everything together into a production-quality (ish) system. But beyond integration, there's one more concept: evaluation.

How do you know your context engineering is actually good? In production, teams use both automated and human evaluation:

Automated evaluation with a test set of question-answer pairs. You create 10–20 questions with known correct answers, run them through your system, and score the outputs. This is your regression test — it catches when a change to your chunking strategy breaks answers that previously worked.

Faithfulness checking: Does the answer only contain information from the retrieved context? Or does it hallucinate details? You can check this with a second LLM call: "Does this answer contain any claims not supported by the provided context?"

Retrieval evaluation: Separately from answer quality, measure whether the retrieval found the right passages. If you know which story contains the answer, check whether that story appears in the retrieved chunks.

🔗 Further reading: RAGAS — RAG evaluation framework · Anthropic — Evaluating AI Agents

🔨 PRACTICE — Exercise 8: "The Final Problem"

The problem: You have all the pieces — chunking, embeddings, retrieval, reranking, memory, compression, dynamic assembly. Now integrate them into one clean system and prove it works with a rigorous evaluation.

Your task:

Create src/08-capstone.ts — a polished interactive chatbot that uses everything from Modules 0–7. Then create src/08-eval.ts — an automated evaluation harness.

Part A: The integrated chatbot

Refactor your code into clean modules:

src/
  lib/
    db.ts          — SQLite connection, schema, queries
    embeddings.ts  — embed(), search(), cosineSimilarity()
    chunking.ts    — chunkText(), strategies
    reranker.ts    — rerank(), mmrSelect()
    memory.ts      — addMemory(), getRelevantMemories(), extractFacts()
    compressor.ts  — summarizeHistory(), compressChunks(), tokenCount()
    assembler.ts   — classifyIntent(), assemblePrompt(), budgetCheck()
    llm.ts         — generate(), generateStreaming()
  08-capstone.ts   — interactive chat loop wiring everything together
  08-eval.ts       — automated evaluation

The chatbot should:

  • Start with a welcome message
  • Support multi-turn conversation
  • Log the context assembly decisions at each turn
  • Handle gracefully when it doesn't know something
  • Stream the response token by token (Ollama supports this)

Part B: The evaluation harness

Create a test set of 10 questions in data/eval.json:

[
  {
    "question": "What is the address of Sherlock Holmes?",
    "expected_keywords": ["221B", "Baker Street"],
    "expected_source": "multiple"
  },
  {
    "question": "How does Holmes determine the owner of a hat in 'The Blue Carbuncle'?",
    "expected_keywords": ["cubic capacity", "intellect", "hat"],
    "expected_source": "Adventures"
  },
  {
    "question": "Who is Professor Moriarty?",
    "expected_keywords": ["Napoleon of crime", "mathematician", "criminal"],
    "expected_source": "Memoirs"
  },
  {
    "question": "What breed is the hound in The Hound of the Baskervilles?",
    "expected_keywords": ["phosphorus", "Stapleton"],
    "expected_source": "Hound"
  },
  {
    "question": "How does Watson describe his first meeting with Holmes?",
    "expected_keywords": ["chemical", "laboratory", "Stamford", "Bart's"],
    "expected_source": "Study in Scarlet"
  }
]

(Add 5 more questions yourself, covering different stories and question types.)

The evaluation script should:

  1. Run each question through the full pipeline
  2. Check if the answer contains the expected keywords (case-insensitive)
  3. Check if the retrieved chunks include the expected source book
  4. Calculate a retrieval score (% of questions where the right source was retrieved)
  5. Calculate an answer score (% of questions where all expected keywords appear)
  6. Print a report:
===== EVALUATION REPORT =====
Total questions: 10

Retrieval accuracy: 8/10 (80%)
  ✓ "What is the address of Sherlock Holmes?" — correct source retrieved
  ✗ "How does Watson describe..." — expected "Study in Scarlet", got "Memoirs"

Answer accuracy: 7/10 (70%)
  ✓ "What is the address..." — found: 221B, Baker Street
  ✗ "What breed is the hound..." — missing: phosphorus

Overall score: 75%
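
The two checks at the heart of the harness can be sketched as pure functions over the pipeline's outputs. Field names follow the data/eval.json schema above; treating "multiple" as matching any source is an assumption based on the first test case:

```typescript
interface EvalCase {
  question: string;
  expected_keywords: string[];
  expected_source: string;
}

// Answer accuracy: the generated answer must contain every expected keyword
// (case-insensitive).
function answerCorrect(answer: string, c: EvalCase): boolean {
  const a = answer.toLowerCase();
  return c.expected_keywords.every((k) => a.includes(k.toLowerCase()));
}

// Retrieval accuracy: the expected source book must appear among the sources
// of the retrieved chunks ("multiple" means any source counts).
function retrievalCorrect(retrievedSources: string[], c: EvalCase): boolean {
  return (
    c.expected_source === "multiple" ||
    retrievedSources.some((s) => s.includes(c.expected_source))
  );
}
```

Keeping these as pure functions means the same checks can run in a unit test without touching the LLM — only the pipeline outputs are needed.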

How to verify:

Your targets:

  • Retrieval accuracy ≥ 80% — the right source book appears in retrieved chunks
  • Answer accuracy ≥ 70% — expected keywords appear in the generated answer

If you're below these thresholds, go back and adjust:

  • Low retrieval? Try different chunk sizes, or add story-level metadata filtering
  • Low answer quality? Try different prompt templates, or increase the number of retrieved chunks
  • Inconsistent? Run the eval 3 times — LLM outputs are non-deterministic, so average the scores

Stretch goal: Add a faithfulness check. For each answer, ask the LLM: "Does the following answer contain any claims NOT supported by the provided context passages? Answer YES or NO, then list any unsupported claims." Report a faithfulness score (% of answers with no unsupported claims). Target: ≥ 90%.


Where to go next

You've built a complete context engineering system from scratch. Here's where each technique leads in the real world:

Vector databases at scale: When your corpus grows beyond what SQLite can handle, look at purpose-built vector databases — Qdrant (open source, Rust), ChromaDB (Python-native, simple), or Weaviate (feature-rich). They add approximate nearest neighbor search, sharding, and filtering optimizations.

Agentic context engineering: The system you built is still "pull-based" — the user drives every query. AI agents are "push-based" — they decide themselves what context to fetch next. Read Anthropic's Building Effective Agents and the 12-Factor Agents manifesto.

Production prompt management: When your system has 20 prompt variants across 5 routes, you need version control for prompts. Tools like PromptHub, Langfuse, and Braintrust help track prompt versions, run A/B tests, and monitor quality.

Evaluation frameworks: RAGAS, DeepEval, and Langfuse provide structured evaluation for RAG systems — going far beyond the simple keyword checking in Exercise 8.

Context caching: Anthropic and OpenAI offer prompt caching that stores the KV attention states for static prompt prefixes, reducing costs by up to 90%. Structure your prompts with static content first to maximize cache hits. Anthropic — Prompt Caching.


Essential reading list

These are the foundational texts on context engineering, in recommended reading order:

  1. Karpathy — "Software Is Changing (Again)" (YC AI Startup School talk, June 2025) — The CPU/RAM mental model. [Transcript available via YouTube]
  2. Anthropic — "Effective context engineering for AI agents" (September 2025) — The practitioner bible. anthropic.com/engineering/effective-context-engineering-for-ai-agents
  3. Philipp Schmid — "The New Skill in AI is Not Prompting, It's Context Engineering" (June 2025) — The clearest definition. philschmid.de/context-engineering
  4. LangChain — "Context Engineering" (June 2025) — The write/select/compress/isolate framework. blog.langchain.com/context-engineering-for-agents
  5. Dex Horthy — 12-Factor Agents (April 2025) — "Own your context window." github.com/humanlayer/12-factor-agents
  6. LlamaIndex — "Context Engineering: What it is" (2025) — Adds global state and workflow engineering. llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider
  7. Weaviate — "Chunking Strategies for RAG" — The best chunking reference. weaviate.io/blog/chunking-strategies-for-rag
  8. Mei et al. — "A Survey of Context Engineering for Large Language Models" (arXiv 2507.13334, July 2025) — The 166-page academic survey. arxiv.org/abs/2507.13334
  9. Simon Willison — "Context engineering" (June 2025) — Concise endorsement with links. simonwillison.net/2025/jun/27/context-engineering
  10. Addy Osmani — "Context Engineering: Bringing Engineering Discipline to Prompts" (2025) — Engineering discipline applied to prompt systems. addyo.substack.com/p/context-engineering-bringing-engineering