## Are there cases where llama.cpp can produce more than one embedding from a single text input?
In **llama.cpp**, whether you get **one embedding** or **multiple embeddings** from a text input depends on:
1. **Pooling type (`llama_pooling_type`)**

   * `LLAMA_POOLING_TYPE_NONE` → no pooling is applied.

     * You get an embedding **per token** (so length = number of tokens).
     * You retrieve it with `llama_get_embeddings(ctx)` after `llama_encode()` or `llama_decode()`.
   * `LLAMA_POOLING_TYPE_MEAN`, `LLAMA_POOLING_TYPE_CLS`, `LLAMA_POOLING_TYPE_LAST` → pooling is applied.

     * You get **one embedding per sequence ID**.
     * You retrieve it with `llama_get_embeddings_seq(ctx, seq_id)`.

2. **Number of sequence IDs (`seq_id`) in the batch**

   * Each sequence in the batch (distinguished by `seq_id`) can produce its own pooled embedding if pooling is enabled.
   * Example: if you feed 3 sentences in parallel with `seq_id = 0,1,2` and pooling = MEAN, you’ll get **3 embeddings** (see the batch sketch after this list).

3. **How you mark logits/outputs in the batch**

   * With pooling enabled, you only need to request an output on one token per sequence (often the last token).
   * Without pooling, you must request logits for every token you want embeddings for.
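
To make points 1 and 2 concrete, here is a minimal sketch of feeding three sentences through one batch with MEAN pooling, using the raw `llama_batch` fields of the C API. `tokens[s]`, `n_tok[s]`, and `n_total_tokens` are hypothetical variables holding the already-tokenized sentences, not part of llama.cpp:

```c
// Sketch: 3 sentences in one batch, MEAN pooling -> 3 pooled embeddings.
// Assumes ctx was created with cparams.embeddings = true and
// cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN.
llama_batch batch = llama_batch_init(n_total_tokens, 0, 3);

for (int s = 0; s < 3; s++) {
    for (int i = 0; i < n_tok[s]; i++) {
        const int idx = batch.n_tokens++;
        batch.token[idx]     = tokens[s][i];
        batch.pos[idx]       = i;   // position within its own sequence
        batch.n_seq_id[idx]  = 1;
        batch.seq_id[idx][0] = s;   // the seq_id separates the three sentences
        batch.logits[idx]    = 1;   // request output for this token
    }
}

llama_decode(ctx, batch);

for (int s = 0; s < 3; s++) {
    // one pooled vector per sequence ID
    const float * emb = llama_get_embeddings_seq(ctx, s);
    // ... use emb; its length is the model embedding dimension ...
}

llama_batch_free(batch);
```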
---
So, to directly answer:
* **Single text input, pooling enabled** → **1 embedding** (sequence-level).
* **Single text input, pooling disabled** → **N embeddings** (token-level, one per token).
* **Multiple texts in one batch** → **N embeddings** (one per sequence if pooling, or token-level if not).
---
In your embedding use case (it sounds like you’re doing what `llama.cpp` calls **sentence embedding**), you probably want **pooling enabled** (MEAN or CLS) and therefore **one embedding per text input**.
## Is there a way to force one embedding (sentence embedding)?
Yes, in **llama.cpp** you can force sentence-level embeddings by enabling a **pooling mode** in the context parameters. That way you’ll always get exactly **one embedding per sequence ID**, regardless of how many tokens the text expands into.

---
### How to do it
When you create the context (`llama_context_params`), set the pooling type and enable embeddings. A minimal sketch, assuming a recent llama.cpp C API (older releases use `llama_new_context_with_model()` in place of `llama_init_from_model()`):
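
```c
// Minimal sketch: a context that returns one pooled embedding per sequence.
llama_context_params cparams = llama_context_default_params();
cparams.embeddings   = true;                     // compute embeddings, not just logits
cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // or LLAMA_POOLING_TYPE_CLS / _LAST

llama_context * ctx = llama_init_from_model(model, cparams);

// ... tokenize the text, submit it as one sequence (seq_id = 0), run llama_decode() ...

// one vector for the whole input, regardless of how many tokens it became:
const float * emb = llama_get_embeddings_seq(ctx, 0);
```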
That’s it: you’ll get **one embedding vector** for the entire input string.

---
### Contrast with token-level embeddings
* If `pooling_type == LLAMA_POOLING_TYPE_NONE` → `llama_get_embeddings(ctx)` gives you **token-level embeddings** (matrix: tokens × dim; see the sketch after this list).
* If `pooling_type != NONE` → `llama_get_embeddings_seq(ctx, seq_id)` gives you **sentence-level embeddings** (one vector per sequence).
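
A sketch of reading that token-level matrix, assuming outputs were requested for every token and `n_tokens` is a hypothetical variable holding the number of tokens you submitted:

```c
// pooling_type == LLAMA_POOLING_TYPE_NONE: one vector per token
const float * all_emb = llama_get_embeddings(ctx);   // flat row-major array
const int     n_embd  = llama_model_n_embd(model);   // embedding dimension

for (int i = 0; i < n_tokens; i++) {
    const float * tok_emb = all_emb + (size_t) i * n_embd; // row i = token i
    // ... use tok_emb ...
}
```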
---
**So the way to force one embedding per text is:**
* Set `pooling_type` in the context params (`MEAN`, `CLS`, or `LAST`).
* Use `llama_get_embeddings_seq()` instead of `llama_get_embeddings()` (a runtime check follows below).
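
If you want to verify at runtime which mode the context actually ended up in, the C API exposes it; a small sanity-check snippet:

```c
// confirm the context will produce pooled (sentence-level) embeddings
if (llama_pooling_type(ctx) == LLAMA_POOLING_TYPE_NONE) {
    fprintf(stderr, "pooling is disabled; llama_get_embeddings_seq() returns no data\n");
}
```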
---
## Is the pooling_type related to the model? Can I be sure that all models support all the pooling types?
The `pooling_type` in **llama.cpp** is **not a property of the model itself**; it is purely how **llama.cpp** aggregates the **token embeddings** it already computes. All decoder-only LLaMA-style models output a hidden state per token, so pooling is just a post-processing step that the library offers.

---
### What this means
* Any model you load with **llama.cpp** produces **per-token embeddings**.
* Pooling (`MEAN`, `CLS`, `LAST`) is applied on top of those token embeddings to compress them into **one vector**.
* So yes: **all models supported by llama.cpp can use all pooling types**.
* The only requirement is that you create the context with `cparams.pooling_type` set, so llama.cpp knows to allocate memory for pooled embeddings.
---
### Differences between pooling modes
* **MEAN** → average of all token embeddings in the sequence.

  * Default / most common for sentence embeddings & semantic search (see the sketch after this list).
* **CLS** → use the embedding of the first token (analogous to `[CLS]` in BERT-style models).

  * Works, but LLaMA models weren’t trained with a `[CLS]` objective, so performance may vary.
* **LAST** → use the embedding of the final token.

  * Sometimes useful for autoregressive tasks; not typically the best for embeddings.
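
As a concrete illustration of what MEAN pooling computes, here is a hand-rolled version for intuition only (llama.cpp performs this internally when MEAN pooling is enabled; this helper is not part of its API):

```c
#include <stddef.h>

// MEAN pooling by hand: average n_tokens per-token vectors into one vector.
// tok_emb is a flat [n_tokens x n_embd] array; out has room for n_embd floats.
void mean_pool(const float * tok_emb, int n_tokens, int n_embd, float * out) {
    for (int j = 0; j < n_embd; j++) out[j] = 0.0f;
    for (int i = 0; i < n_tokens; i++) {
        for (int j = 0; j < n_embd; j++) {
            out[j] += tok_emb[(size_t) i * n_embd + j];
        }
    }
    for (int j = 0; j < n_embd; j++) out[j] /= (float) n_tokens;
}
```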
---
### Important
The quality of the **sentence embedding** you get *does* depend on the **model training objective**:
* Models explicitly trained for embedding tasks (e.g. `all-MiniLM-L6-v2`, `text-embedding-ada-002`, `nomic-embed`) will usually outperform a raw LLaMA model, even if you apply pooling.
* LLaMA-style models with pooling still give you usable vectors, but they weren’t optimized for semantic similarity.
---
So in summary:
* Pooling is **always available** in llama.cpp.
* All models supported by llama.cpp can use **MEAN / CLS / LAST** pooling.
* The *choice of pooling* affects embedding quality, but you won’t get an error from the library.
---

The context-size guard from the commit this discussion accompanies:

```c
// even with chunking, decoder embeddings need the full sequence to be in the KV once
if (n_tokens > n_ctx) {
    sqlite3_free(embedding);
    sqlite_context_result_error(context, SQLITE_TOOBIG,
        "Input too large for model context: %d tokens > n_ctx %d. "
        "Create a context with a n_ctx value higher than %d.",
        n_tokens, n_ctx, n_tokens);
    return;
}
```