---
title: "LLMs 101: A Practical Introduction"
description: "A hands-on, code-first introduction to large language models for Cookbook readers."
last_updated: "2025-08-24"
---

# LLMs 101: A Practical Introduction

> **Who this is for.** Developers who want a fast, working understanding of large language models and the knobs that matter in real apps.

## At a glance

```
Text prompt
    ↓ (tokenization)
Tokens → Embeddings → [Transformer layers × N] → Next‑token probabilities
    ↓ (sampling: temperature/top_p)
Next token appended → repeat until a stop condition
    ↓ (detokenization)
Output text
```

- **LLMs** are neural networks (usually **transformers**) trained on lots of text to predict the next token.
- **Tokenization** splits text into subword units; **embeddings** map tokens to vectors; transformer layers build context‑aware representations (see the token‑counting sketch below).
- Generation repeats next‑token sampling until a stop condition (length or stop sequences) is met.
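
To make token counting concrete, here is a small sketch using the `tiktoken` library; the `cl100k_base` encoding is an assumption for illustration, so pick the encoding that matches your model.

```python
# pip install tiktoken
import tiktoken

# Assumption: cl100k_base is used here for illustration; the right encoding depends on the model.
enc = tiktoken.get_encoding("cl100k_base")

text = "In one paragraph, explain what a token is in an LLM."
token_ids = enc.encode(text)

print(len(text), "characters ->", len(token_ids), "tokens")
print(token_ids[:8])                              # integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids[:8]])   # the subword pieces they correspond to
```

Counting like this is how you budget prompts against the context window discussed later.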

---

## Quick start: generate text

### Python

```python
from openai import OpenAI

client = OpenAI()
resp = client.responses.create(
    model="gpt-4o",
    instructions="You are a concise technical explainer.",
    input="In one paragraph, explain what a token is in an LLM."
)
print(resp.output_text)
```

### JavaScript / TypeScript

```js
import OpenAI from "openai";
const client = new OpenAI();

const resp = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a concise technical explainer." },
    { role: "user", content: "In one paragraph, explain what a token is in an LLM." }
  ]
});
console.log(resp.choices[0].message.content);
```

> **Tip.** Model names evolve; check your Models list before shipping. Prefer streaming for chat‑like UIs (see below).

---

## What can LLMs do?

Despite the name, LLMs can be **multi‑modal** when the model and inputs support it, handling text, code, and sometimes images or audio. Core text tasks:

- **Generate**: draft, rewrite, continue, or brainstorm.
- **Transform**: translate, rephrase, format, classify, extract.
- **Analyze**: summarize, compare, tag, or answer questions.
- **Tool use / agents**: call functions or APIs as part of a loop to act.

These patterns compose into search, assistants, form‑fillers, data extraction, QA, and more.

---

## How LLMs work (just enough to be dangerous)

1. **Tokenization.** Input text → tokens (IDs). Whitespace and punctuation matter—“token‑budget math” is a real constraint.
2. **Embeddings.** Each token ID becomes a vector; positions are encoded so order matters.
3. **Transformer layers.** Self‑attention mixes information across positions so each token’s representation becomes **contextual** (richer than the raw embedding).
4. **Decoding.** The model outputs a probability distribution over the next token.
5. **Sampling.** Choose how “adventurous” generation is (see knobs below), append the token, and repeat until done.
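
Steps 4–5 repeat in a loop. Below is a toy version of that loop; `next_token_distribution` is a hypothetical stand‑in for a real model's forward pass, not a library call.

```python
import random

def generate(prompt_tokens, next_token_distribution, eos_id, max_new_tokens=64):
    """Toy decode loop: pick one token at a time until EOS or the length cap."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical model forward pass: returns {token_id: probability} for the next position.
        probs = next_token_distribution(tokens)
        ids, weights = zip(*probs.items())
        next_id = random.choices(ids, weights=weights, k=1)[0]   # the sampling step
        if next_id == eos_id:                                    # stop condition
            break
        tokens.append(next_id)                                   # append and repeat with the longer context
    return tokens
```

Everything in the next section is about how that sampling step chooses a token.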

---

## The knobs you’ll touch most

- **Temperature** *(0.0–2.0)* — Lower → more deterministic/boring; higher → more diverse/creative.
- **Top‑p (nucleus)** *(0–1)* — Sample only from the smallest set of highest‑probability tokens whose cumulative probability reaches *p* (see the sketch after this list).
- **Max output tokens** — Hard limit on output length; controls latency and cost.
- **System / instructions** — Up‑front role, constraints, and style to steer behavior.
- **Stop sequences** — Cleanly cut off output at known boundaries.
- **Streaming** — Receive tokens as they’re generated; improves perceived latency.

**Practical defaults:** `temperature=0.2–0.7`, `top_p=1.0`, set a **max output** that fits your UI, and **stream** by default for chat UX.
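
To see what temperature and top‑p actually do to the next‑token distribution, here is a minimal numpy sketch; the logits at the end are made up purely for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature-scale the logits, keep the smallest set of top tokens whose
    cumulative probability reaches top_p, then sample from that nucleus."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                        # token indices, most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1   # smallest prefix reaching top_p
    nucleus = order[:cutoff]

    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Made-up logits over a 5-token vocabulary, purely for illustration.
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -3.0], temperature=0.2, top_p=0.9))
```

Lower temperature sharpens the distribution before the nucleus is formed; a smaller top‑p trims the long tail of unlikely tokens.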

---

## Make context do the heavy lifting

- **Context window.** Inputs + outputs share a finite token budget; plan prompts and retrieval to fit.
- **Ground with your data (RAG).** Retrieve relevant snippets and include them in the prompt to improve factuality (see the sketch after this list).
- **Structured outputs.** Ask for JSON (and validate) when you need machine‑readable results.
- **Few‑shot examples.** Provide 1–3 compact exemplars to stabilize format and tone.
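
The retrieval step can be sketched in a few lines, assuming the `text-embedding-3-small` embedding model and a small in‑memory document list; real systems add chunking and a vector store.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Context windows cap how many tokens the model can attend to at once.",
    "Temperature controls how adventurous sampling is.",
    "Stop sequences cut generation off at known boundaries.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

question = "What limits how much text I can send to the model?"
q_vec = embed([question])[0]

# Cosine similarity: pick the most relevant snippet and ground the answer in it.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
context = docs[int(np.argmax(scores))]

answer = client.responses.create(
    model="gpt-4o",
    instructions="Answer using only the provided context.",
    input=f"Context:\n{context}\n\nQuestion: {question}",
)
print(answer.output_text)
```

In production you would chunk documents, cache embeddings, and retrieve the top‑k snippets rather than a single one.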

---

## Minimal streaming example

### Python

```python
from openai import OpenAI
client = OpenAI()

with client.responses.stream(
    model="gpt-4o",
    input="Stream a two-sentence explanation of context windows."
) as stream:
    for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="")
```

### JavaScript

```js
import OpenAI from "openai";
const client = new OpenAI();

const stream = await client.responses.stream({
  model: "gpt-4o",
  input: "Stream a two-sentence explanation of context windows."
});

for await (const event of stream) {
  if (event.type === "response.output_text.delta") {
    process.stdout.write(event.delta);
  }
}
```

---

## Limitations (design around these)

- **Hallucinations.** Models can generate plausible but false statements. Ground with citations/RAG; validate critical outputs.
- **Recency.** Models don’t inherently know the latest facts; retrieve or provide current data.
- **Ambiguity.** Vague prompts → vague answers; specify domain, audience, length, and format.
- **Determinism.** Even at `temperature=0`, responses may vary across runs/envs. Don’t promise bit‑for‑bit reproducibility.
- **Cost & latency.** Longer prompts and bigger models are slower and costlier; iterate toward the smallest model that meets quality.

---

## Common gotchas

- **Characters ≠ tokens.** Budget both input and output to avoid truncation.
- **Over‑prompting.** Prefer simple, testable instructions; add examples sparingly.
- **Leaky formats.** If you need JSON, enforce it (schema + validators) and add a repair step (see the sketch after this list).
- **One prompt for everything.** Separate prompts per task/endpoint; keep them versioned and testable.
- **Skipping evaluation.** Keep a tiny dataset of real tasks; score changes whenever you tweak prompts, models, or retrieval.
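
For the leaky‑formats gotcha, a minimal validate‑then‑repair step might look like this; the one‑line schema and repair prompt are illustrative assumptions, not a production pattern.

```python
import json
from openai import OpenAI

client = OpenAI()

def ask_for_json(prompt):
    """Request JSON, validate it, and make one repair attempt before failing."""
    resp = client.responses.create(
        model="gpt-4o",
        instructions='Reply with valid JSON only, shaped like {"answer": string}.',  # illustrative schema
        input=prompt,
    )
    text = resp.output_text
    try:
        return json.loads(text)                      # validate before trusting the output
    except json.JSONDecodeError:
        repaired = client.responses.create(          # one repair pass, then fail loudly
            model="gpt-4o",
            instructions="Fix this so it is valid JSON. Reply with JSON only.",
            input=text,
        )
        return json.loads(repaired.output_text)      # raises if the repair also fails

print(ask_for_json("What is a context window? Answer briefly."))
```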

---

## Glossary

- **Token** — Small unit of text (≈ subword) used by models.
- **Embedding** — Vector representation of a token or text span.
- **Context window** — Max tokens the model can attend to at once (prompt + output).
- **Temperature / top‑p** — Randomness controls during sampling.
- **System / instructions** — Up‑front guidance that shapes responses.
- **RAG** — Retrieval‑Augmented Generation; retrieve data and include it in the prompt.

---

## Where to go next

- Prompt patterns for **structured outputs**
- **Retrieval‑augmented generation (RAG)** basics
- **Evaluating** LLM quality (offline + online)
- **Streaming UX** patterns and backpressure handling
- **Safety** and policy‑aware prompting

> Adapted from a shorter draft and expanded with code-first guidance.