Commit 987e46e

Merge branch 'context-refactoring'
2 parents a1ab889 + 83ede9f

5 files changed: +603 -224 lines

notes/EMBEDDING.md

Lines changed: 132 additions & 0 deletions

## Are there cases where llama.cpp can produce more than one embedding from a single text input?

In **llama.cpp**, whether you get **one embedding** or **multiple embeddings** from a text input depends on:

1. **Pooling type (`llama_pooling_type`)**

   * `LLAMA_POOLING_TYPE_NONE` → no pooling is applied.
     * You get an embedding **per token** (so length = number of tokens).
     * You retrieve it with `llama_get_embeddings(ctx)` after `llama_encode()` or `llama_decode()`.
   * `LLAMA_POOLING_TYPE_MEAN`, `LLAMA_POOLING_TYPE_CLS`, `LLAMA_POOLING_TYPE_LAST` → pooling is applied.
     * You get **one embedding per sequence ID**.
     * You retrieve it with `llama_get_embeddings_seq(ctx, seq_id)`.

2. **Number of sequence IDs (`seq_id`) in the batch**

   * Each sequence in the batch (distinguished by `seq_id`) can produce its own pooled embedding if pooling is enabled.
   * Example: if you feed 3 sentences in parallel with `seq_id = 0,1,2` and pooling = MEAN, you’ll get **3 embeddings**.

3. **How you mark logits/outputs in the batch** (see the sketch after this list)

   * With pooling enabled, you only need to request an output on one token per sequence (often the last token).
   * Without pooling, you must request logits for every token you want embeddings for.
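
To make these three points concrete, here is a minimal sketch of feeding three sentences in one batch with MEAN pooling. It assumes a pooling-enabled context `ctx` (see the next section) and that `tokens[s]` / `n_tok[s]` (hypothetical names) already hold the tokenized sentences; error handling is omitted:

```c
// Sketch: 3 sentences, seq_id = 0,1,2, pooling = MEAN.
// `tokens[s]` / `n_tok[s]` are assumed to hold the tokenized sentences.
llama_batch batch = llama_batch_init(/*n_tokens=*/512, /*embd=*/0, /*n_seq_max=*/1);

for (int s = 0; s < 3; s++) {
    for (int i = 0; i < n_tok[s]; i++) {
        const int k = batch.n_tokens++;
        batch.token   [k]    = tokens[s][i];
        batch.pos     [k]    = i;                    // positions restart per sequence
        batch.n_seq_id[k]    = 1;
        batch.seq_id  [k][0] = s;                    // sequence ID = sentence index
        batch.logits  [k]    = (i == n_tok[s] - 1);  // request output on the last token
    }
}

llama_decode(ctx, batch);

for (int s = 0; s < 3; s++) {
    const float *emb = llama_get_embeddings_seq(ctx, s); // one pooled vector per seq_id
    // ... use emb[0 .. llama_n_embd(model) - 1] ...
}

llama_batch_free(batch);
```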
---

So, to directly answer:

* **Single text input, pooling enabled** → **1 embedding** (sequence-level).
* **Single text input, pooling disabled** → **N embeddings** (token-level, one per token).
* **Multiple texts in one batch** → **N embeddings** (one per sequence if pooling, or token-level if not).

---

In your embedding use case (sounds like you’re doing what `llama.cpp` calls **sentence embedding**), you probably want **pooling enabled** (MEAN or CLS) and therefore **one embedding per text input**.
## Is there a way to force one embedding (sentence embedding)?

Yes, in **llama.cpp** you can force sentence-level embeddings by enabling a **pooling mode** in the context parameters. That way you’ll always get exactly **one embedding per sequence ID**, regardless of how many tokens the text expands into.

---

### How to do it

When you create the context (`llama_context_params`), set:

```c
struct llama_context_params cparams = llama_context_default_params();

// needed so llama.cpp extracts embeddings at all:
cparams.embeddings = true;

// choose your pooling mode:
cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // common for sentence embeddings
// or: LLAMA_POOLING_TYPE_CLS
// or: LLAMA_POOLING_TYPE_LAST

struct llama_context *ctx = llama_new_context_with_model(model, cparams);
```

Then, when you process text with `llama_encode()` or `llama_decode()`:

* Use a single `seq_id` for that text (e.g. `seq_id = 0`).
* After the call, retrieve the **sequence embedding**:

```c
const float *embedding = llama_get_embeddings_seq(ctx, seq_id);
// embedding size = llama_n_embd(model)
```

That’s it: you’ll get **one embedding vector** for the entire input string.
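
For completeness, a hedged end-to-end sketch of the submission step (tokenize, build a batch under `seq_id = 0`, decode, read the pooled vector). The sample text is arbitrary, error handling is omitted, and the `llama_tokenize()` signature shown matches recent llama.cpp headers; check yours:

```c
#include <string.h>  // strlen

// Sketch: one text, one seq_id, one pooled embedding out.
// Assumes `model` and the pooling-enabled `ctx` from above.
const char *text = "the quick brown fox";

llama_token toks[512];
const int n = llama_tokenize(model, text, (int) strlen(text), toks, 512,
                             /*add_special=*/true, /*parse_special=*/false);

llama_batch batch = llama_batch_init(n, 0, 1);
for (int i = 0; i < n; i++) {
    batch.token   [i]    = toks[i];
    batch.pos     [i]    = i;
    batch.n_seq_id[i]    = 1;
    batch.seq_id  [i][0] = 0;            // the whole text is sequence 0
    batch.logits  [i]    = (i == n - 1); // one output, on the last token
}
batch.n_tokens = n;

llama_decode(ctx, batch);

const float *embedding = llama_get_embeddings_seq(ctx, 0); // the sentence vector
llama_batch_free(batch);
```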
---

### Contrast with token-level embeddings

* If `pooling_type == LLAMA_POOLING_TYPE_NONE` → `llama_get_embeddings(ctx)` gives you **token-level embeddings** (matrix: tokens × dim).
* If `pooling_type != NONE` → `llama_get_embeddings_seq(ctx, seq_id)` gives you **sentence-level embeddings** (one vector per sequence).
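
To illustrate the token-level layout, a small sketch; `n_out` (the number of tokens whose output flag was set) is an assumed parameter, and `llama.h` is assumed to be included:

```c
#include <stddef.h>  // size_t

// Sketch: reading token-level embeddings when pooling is disabled.
// `n_out` = number of tokens whose output flag was set in the batch.
void read_token_embeddings(struct llama_context *ctx,
                           const struct llama_model *model, int n_out) {
    const int    n_embd   = llama_n_embd(model);
    const float *tok_embs = llama_get_embeddings(ctx); // row-major: n_out x n_embd

    for (int i = 0; i < n_out; i++) {
        const float *row = tok_embs + (size_t) i * n_embd; // embedding of output token i
        // ... use row[0 .. n_embd - 1] ...
    }
}
```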
---

**So the way to force one embedding per text is:**

* Set `pooling_type` in the context params (`MEAN`, `CLS`, or `LAST`).
* Use `llama_get_embeddings_seq()` instead of `llama_get_embeddings()`.

---
## Is the pooling_type related to the model? Can I be sure that all models support all the pooling types?

The **`pooling_type`** in **llama.cpp** is **not a property of the model itself** – it’s purely how **llama.cpp** aggregates the **token embeddings** it already computes. All decoder-only LLaMA-style models output a hidden state per token, so pooling is just a post-processing step that the library offers.

---

### What this means

* Any model you load with **llama.cpp** produces **per-token embeddings**.
* Pooling (`MEAN`, `CLS`, `LAST`) is applied on top of those token embeddings to compress them into **one vector**.
* So yes: **all models supported by llama.cpp can use all pooling types**.
* The only requirement is that you create the context with `cparams.pooling_type` set, so llama.cpp knows to allocate memory for pooled embeddings.

---

### Differences between pooling modes
* **MEAN** → average of all token embeddings in the sequence.
  * Default / most common for sentence embeddings & semantic search.
* **CLS** → use the embedding of the first token (analogous to `[CLS]` in BERT-style models).
  * Works, but LLaMA models weren’t trained with a `[CLS]` objective, so performance may vary.
* **LAST** → use the embedding of the final token.
  * Sometimes useful for autoregressive tasks; not typically the best for embeddings.
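
Conceptually, MEAN pooling is nothing more than an average over the token vectors. A sketch of what it computes (not llama.cpp's internal code):

```c
#include <stddef.h>  // size_t

// Conceptual sketch of MEAN pooling: average n_tokens per-token
// embeddings (row-major, n_tokens x n_embd) into one output vector.
void mean_pool(const float *tok_embs, int n_tokens, int n_embd, float *out) {
    for (int j = 0; j < n_embd; j++) {
        out[j] = 0.0f;
    }
    for (int i = 0; i < n_tokens; i++) {
        for (int j = 0; j < n_embd; j++) {
            out[j] += tok_embs[(size_t) i * n_embd + j];
        }
    }
    for (int j = 0; j < n_embd; j++) {
        out[j] /= (float) n_tokens;
    }
}
```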
---

### Important

The quality of the **sentence embedding** you get *does* depend on the **model training objective**:

* Models explicitly trained for embedding tasks (e.g. `all-MiniLM-L6-v2`, `text-embedding-ada-002`, `nomic-embed`) will usually outperform a raw LLaMA model, even if you apply pooling.
* LLaMA-style models with pooling still give you usable vectors, but they weren’t optimized for semantic similarity.

---

So in summary:

* Pooling is **always available** in llama.cpp.
* All models supported by llama.cpp can use **MEAN / CLS / LAST** pooling.
* The *choice of pooling* affects embedding quality, but you won’t get an error from the library.

---
