## Are there cases where llama.cpp can produce more than one embedding from a single text input?

In **llama.cpp**, whether you get **one embedding** or **multiple embeddings** from a text input depends on:

1. **Pooling type (`llama_pooling_type`)**

   * `LLAMA_POOLING_TYPE_NONE` → no pooling is applied.

     * You get an embedding **per token** (so length = number of tokens).
     * You retrieve it with `llama_get_embeddings(ctx)` after `llama_encode()` or `llama_decode()`.
   * `LLAMA_POOLING_TYPE_MEAN`, `LLAMA_POOLING_TYPE_CLS`, `LLAMA_POOLING_TYPE_LAST` → pooling is applied.

     * You get **one embedding per sequence ID**.
     * You retrieve it with `llama_get_embeddings_seq(ctx, seq_id)`.

2. **Number of sequence IDs (`seq_id`) in the batch**

   * Each sequence in the batch (distinguished by `seq_id`) can produce its own pooled embedding if pooling is enabled.
   * Example: if you feed 3 sentences in parallel with `seq_id = 0,1,2` and pooling = MEAN, you’ll get **3 embeddings** (see the sketch after this list).

3. **How you mark logits/outputs in the batch**

   * With pooling enabled, you only need to request an output on one token per sequence (often the last token).
   * Without pooling, you must request logits for every token you want embeddings for.

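To make point 2 concrete, here is a minimal sketch of batching three pre-tokenized sentences under `seq_id` 0, 1, and 2 with pooling enabled. The names `toks[s]` / `n_toks[s]` are placeholders for the token arrays of sentence `s` (not llama.cpp API), and the batch fields are filled the way recent llama.cpp versions lay them out:

```c
// Sketch only: three pre-tokenized sentences in one batch, one seq_id each.
// Assumes `ctx` was created with pooling enabled and `toks[s]` / `n_toks[s]`
// hold the tokens of sentence s (placeholder names, not llama.cpp API).
struct llama_batch batch = llama_batch_init(512, 0, 3);  // room for 512 tokens, 3 sequences

for (int s = 0; s < 3; s++) {
    for (int i = 0; i < n_toks[s]; i++) {
        const int j = batch.n_tokens++;
        batch.token[j]     = toks[s][i];
        batch.pos[j]       = i;      // positions restart for each sequence
        batch.n_seq_id[j]  = 1;
        batch.seq_id[j][0] = s;      // sentence s goes into sequence s
        batch.logits[j]    = 1;      // request an output for this token
    }
}

llama_decode(ctx, batch);            // or llama_encode() for encoder-style models

for (int s = 0; s < 3; s++) {
    const float *emb = llama_get_embeddings_seq(ctx, s);  // one pooled vector per sequence
    (void) emb;                      // each vector has llama_n_embd(model) floats
}

llama_batch_free(batch);
```
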
---

So, to directly answer:

* **Single text input, pooling enabled** → **1 embedding** (sequence-level).
* **Single text input, pooling disabled** → **N embeddings** (token-level, one per token).
* **Multiple texts in one batch** → **N embeddings** (one per sequence if pooling, or token-level if not).

---

In your embedding use case (sounds like you’re doing what `llama.cpp` calls **sentence embedding**), you probably want **pooling enabled** (MEAN or CLS) and therefore **one embedding per text input**.

## Is there a way to force one embedding (sentence embedding)?

Yes, in **llama.cpp** you can force sentence-level embeddings by enabling a **pooling mode** in the context parameters. That way you’ll always get exactly **one embedding per sequence ID**, regardless of how many tokens the text expands into.

---

### How to do it

When you create the context (`llama_context_params`), set:

```c
struct llama_context_params cparams = llama_context_default_params();

// enable embedding extraction so the context computes them at all
cparams.embeddings = true;

// choose your pooling mode:
cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // common for sentence embeddings
// or: LLAMA_POOLING_TYPE_CLS
// or: LLAMA_POOLING_TYPE_LAST

struct llama_context *ctx = llama_new_context_with_model(model, cparams);
```

Then, when you process text with `llama_encode()` or `llama_decode()`:

* Use a single `seq_id` for that text (e.g. `seq_id = 0`).
* After the call, retrieve the **sequence embedding**:

```c
const float *embedding = llama_get_embeddings_seq(ctx, seq_id);
// embedding size = llama_n_embd(model)
```

That’s it: you’ll get **one embedding vector** for the entire input string.

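Putting the steps together, a rough end-to-end sketch for one input string might look like the following. The `llama_tokenize` signature shown here (model-based, with `add_special` / `parse_special` flags) matches recent llama.cpp releases but may differ slightly in yours:

```c
// Sketch: tokenize one string, feed it as a single sequence (seq_id 0),
// then read the pooled embedding as shown above. Error handling omitted.
#include <string.h>

const char *text = "the quick brown fox";
llama_token tokens[512];
const int n_tok = llama_tokenize(model, text, (int) strlen(text),
                                 tokens, 512,
                                 /*add_special=*/ true, /*parse_special=*/ false);

struct llama_batch batch = llama_batch_init(n_tok, 0, 1);
for (int i = 0; i < n_tok; i++) {
    batch.token[i]     = tokens[i];
    batch.pos[i]       = i;
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = 0;   // the single seq_id for this text
    batch.logits[i]    = 1;   // request outputs so embeddings are computed
}
batch.n_tokens = n_tok;

llama_decode(ctx, batch);     // ctx created with pooling enabled, as above

const float *embedding = llama_get_embeddings_seq(ctx, 0);
llama_batch_free(batch);
```
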
---

### Contrast with token-level embeddings

* If `pooling_type == LLAMA_POOLING_TYPE_NONE` → `llama_get_embeddings(ctx)` gives you **token-level embeddings** (matrix: tokens × dim).
* If `pooling_type != NONE` → `llama_get_embeddings_seq(ctx, seq_id)` gives you **sentence-level embeddings** (one vector per sequence).

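For completeness, with pooling disabled the token-level result comes back as one flat array. A small sketch of how it is laid out, assuming outputs were requested for every token:

```c
// Token-level embeddings (pooling_type == LLAMA_POOLING_TYPE_NONE):
// a flat array of n_output_tokens rows, each llama_n_embd(model) floats long.
const int    n_embd = llama_n_embd(model);
const float *all    = llama_get_embeddings(ctx);

int i = 2;                                        // e.g. the 3rd output token
const float *tok_i = all + (size_t) i * n_embd;   // its embedding vector
// llama_get_embeddings_ith(ctx, i) returns the same row directly.
```
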
---

**So the way to force one embedding per text is:**

* Set `pooling_type` in the context params (`MEAN`, `CLS`, or `LAST`).
* Use `llama_get_embeddings_seq()` instead of `llama_get_embeddings()`.

---

## Is the pooling_type related to the model? Can I be sure that all models support all the pooling types?

The `pooling_type` in **llama.cpp** is **not a property of the model itself**; it is purely how **llama.cpp** aggregates the **token embeddings** it already computes. All decoder-only LLaMA-style models output a hidden state per token, so pooling is just a post-processing step that the library offers.

---

### What this means

* Any model you load with **llama.cpp** produces **per-token embeddings**.
* Pooling (`MEAN`, `CLS`, `LAST`) is applied on top of those token embeddings to compress them into **one vector**.
* So yes: **all models supported by llama.cpp can use the MEAN / CLS / LAST pooling types**.
* The only requirement is that you create the context with `cparams.pooling_type` set, so llama.cpp knows to allocate memory for pooled embeddings.

---

### Differences between pooling modes

* **MEAN** → average of all token embeddings in the sequence.

  * Default / most common for sentence embeddings & semantic search.
* **CLS** → use the embedding of the first token (analogous to `[CLS]` in BERT-style models).

  * Works, but LLaMA models weren’t trained with a `[CLS]` objective, so performance may vary.
* **LAST** → use the embedding of the final token.

  * Sometimes useful for causal (decoder-only) models, since the last token is the only one whose hidden state has attended to the entire input; quality varies by model.

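As a rough illustration of what these modes compute, here is MEAN pooling written out by hand over token-level embeddings. This is a sketch assuming pooling is disabled, outputs were requested for every token, and `n_out` (a placeholder) holds the number of output tokens:

```c
// Hand-rolled MEAN pooling over token-level embeddings (illustration only;
// normally you just set cparams.pooling_type and let llama.cpp do this).
const int    n_embd = llama_n_embd(model);
const float *tok    = llama_get_embeddings(ctx);   // n_out rows x n_embd floats

float pooled[8192] = {0};                          // assumes n_embd <= 8192
for (int i = 0; i < n_out; i++) {
    for (int d = 0; d < n_embd; d++) {
        pooled[d] += tok[(size_t) i * n_embd + d];
    }
}
for (int d = 0; d < n_embd; d++) {
    pooled[d] /= (float) n_out;                    // MEAN: average over tokens
}
// CLS would instead take row 0, LAST would take row n_out - 1.
```
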
---

### Important

The quality of the **sentence embedding** you get *does* depend on the **model training objective**:

* Models explicitly trained for embedding tasks (e.g. `all-MiniLM-L6-v2`, `text-embedding-ada-002`, `nomic-embed`) will usually outperform a raw LLaMA model, even if you apply pooling.
* LLaMA-style models with pooling still give you usable vectors, but they weren’t optimized for semantic similarity.

---

So in summary:

* Pooling is **always available** in llama.cpp.
* All models supported by llama.cpp can use **MEAN / CLS / LAST** pooling.
* The *choice of pooling* affects embedding quality, but you won’t get an error from the library.

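For the semantic-similarity use case, pooled vectors are usually compared with cosine similarity. A tiny helper along these lines works on any two vectors returned by `llama_get_embeddings_seq()`; `cosine_sim` is just a name chosen here, not a llama.cpp function:

```c
#include <math.h>

// Cosine similarity between two pooled embeddings of length n
// (e.g. n = llama_n_embd(model)).
static float cosine_sim(const float *a, const float *b, int n) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrtf(na) * sqrtf(nb) + 1e-8f);
}

// usage: float s = cosine_sim(emb_query, emb_doc, llama_n_embd(model));
```
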
---