Commit bf0bf46 ("Update API.md", 1 parent 3b205a1)

API.md: 85 additions, 10 deletions
@@ -79,28 +79,99 @@ SELECT llm_model_free();

---

## `llm_context_create(context_settings TEXT)`

**Parameters:** `context_settings`: comma-separated `key=value` pairs (see [context settings](#context_settings)).

**Returns:** `NULL`

**Description:**
Creates a new inference context from a comma-separated `key=value` configuration.

**A context must be explicitly created before performing any AI operation!**

## context_settings

The following keys are available in `context_settings`:

### General

| Key | Type | Meaning |
| ----------------------- | -------- | ---------------------------------------------------------------- |
| `generate_embedding` | `1 or 0` | Force the model to generate embeddings. |
| `normalize_embedding` | `1 or 0` | Force normalization during embedding generation (defaults to `1`). |
| `json_output` | `1 or 0` | Force JSON output during embedding generation (defaults to `0`). |
| `max_tokens` | `number` | Maximum number of input tokens; if the input is larger, an error is returned. |
| `n_predict` | `number` | Maximum number of tokens generated during text generation. |
| `embedding_type` | `FLOAT32, FLOAT16, BFLOAT16, UINT8, INT8` | Sets the model's native type; mandatory during embedding generation. |
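
Taken together, an embedding-oriented configuration can be expressed in a single call. A minimal sketch using only keys from the table above (the values are illustrative, not required):

```sql
SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,embedding_type=FLOAT32,max_tokens=512');
```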

### Core sizing & threading

| Key | Type | Meaning |
| ------------------------ | -------- | ---------------------------------------------------------------- |
| `context_size` | `number` | Sets both `n_ctx = N` and `n_batch = N`. |
| `n_ctx` | `number` | Text context length (tokens). `0` = from model. |
| `n_batch` | `number` | **Logical** max batch size submitted to `llama_decode`. |
| `n_ubatch` | `number` | **Physical** max micro-batch size. |
| `n_seq_max` | `number` | Max concurrent sequences (parallel states for recurrent models). |
| `n_threads` | `number` | Threads for generation. |
| `n_threads_batch` | `number` | Threads for batch processing. |
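
Since `context_size` is shorthand for setting both `n_ctx` and `n_batch`, the two calls below should be equivalent (a sketch with arbitrary values):

```sql
SELECT llm_context_create('context_size=4096,n_threads=8');
-- should be equivalent to:
-- SELECT llm_context_create('n_ctx=4096,n_batch=4096,n_threads=8');
```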

### Attention, pooling & flash-attention

| Key | Type | Meaning |
| ----------------- | -------------------------------------------- | ------------------------------------------------- |
| `pooling_type` | `none, unspecified, mean, cls, last or rank` | How to aggregate token embeddings (e.g., `mean`). |
| `attention_type` | `unspecified, causal, non_causal` | Attention type used for embeddings. |
| `flash_attn_type` | `auto, disabled, enabled` | Controls when/if Flash Attention is used. |
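
For embedding workloads these keys are typically combined with the General ones; a sketch of a mean-pooled, non-causal embedding context (illustrative, not a required setup):

```sql
SELECT llm_context_create('generate_embedding=1,embedding_type=FLOAT32,pooling_type=mean,attention_type=non_causal');
```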

### RoPE & YaRN (positional scaling)

| Key | Type | Meaning |
| ------------------- | ------------------------------------------- | ------------------------------------------------- |
| `rope_scaling_type` | `unspecified, none, linear, yarn, longrope` | RoPE scaling strategy. |
| `rope_freq_base` | `float number` | RoPE base frequency. `0` = from model. |
| `rope_freq_scale` | `float number` | RoPE frequency scaling factor. `0` = from model. |
| `yarn_ext_factor` | `float number` | YaRN extrapolation mix factor. `<0` = from model. |
| `yarn_attn_factor` | `float number` | YaRN magnitude scaling factor. |
| `yarn_beta_fast` | `float number` | YaRN low correction dimension. |
| `yarn_beta_slow` | `float number` | YaRN high correction dimension. |
| `yarn_orig_ctx` | `number` | YaRN original context size. |
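
A sketch of stretching a model's context with YaRN; the numbers are purely illustrative, and appropriate values depend on the model's training context:

```sql
SELECT llm_context_create('n_ctx=16384,rope_scaling_type=yarn,rope_freq_scale=0.25,yarn_orig_ctx=4096');
```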

### KV cache types (experimental)

| Key | Type | Meaning |
| -------- | ---------------- | -------------------------- |
| `type_k` | [ggml_type](https://github.com/ggml-org/llama.cpp/blob/00681dfc16ba4cebb9c7fbd2cf2656e06a0692a4/ggml/include/ggml.h#L377) | Data type for the K cache. |
| `type_v` | [ggml_type](https://github.com/ggml-org/llama.cpp/blob/00681dfc16ba4cebb9c7fbd2cf2656e06a0692a4/ggml/include/ggml.h#L377) | Data type for the V cache. |
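
Assuming these keys take the numeric `ggml_type` enum values from the linked header (an assumption, since the accepted spelling is not documented here; in that enum, `GGML_TYPE_F16 = 1` and `GGML_TYPE_Q8_0 = 8`), a quantized KV cache might be requested like this:

```sql
-- assumption: 8 = GGML_TYPE_Q8_0 in the linked ggml_type enum
SELECT llm_context_create('type_k=8,type_v=8');
```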

### Flags

> Place booleans at the end of your option string if you’re copy-by-value mirroring a struct; otherwise order doesn’t matter.

| Key | Type | Meaning |
| -------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `embeddings` | `1 or 0` | If `1`, extract embeddings (along with logits). Used by the embedding preset. |
| `offload_kqv` | `1 or 0` | Offload KQV ops (including the KV cache) to the GPU. |
| `no_perf` | `1 or 0` | Disable performance timing. |
| `op_offload` | `1 or 0` | Offload host tensor operations to the device. |
| `swa_full` | `1 or 0` | Use a full-size SWA cache. When `0` and `n_seq_max > 1`, performance may degrade. |
| `kv_unified` | `1 or 0` | Use a unified buffer across input sequences during attention. Try disabling when `n_seq_max > 1` and the sequences do not share a long prefix. |
| `defrag_thold` | `float number` | **Deprecated.** Defragment the KV cache if `holes/size > thold`. `<= 0` disables defragmentation. |
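
A GPU-leaning context that offloads the KV cache and disables timing could be sketched like this (an illustrative combination, not a recommendation):

```sql
SELECT llm_context_create('n_ctx=4096,offload_kqv=1,no_perf=1');
```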

---

**Example:**

```sql
SELECT llm_context_create('n_ctx=2048,n_threads=6,n_batch=256');
```

---

## `llm_context_create_embedding(context_settings TEXT)`

**Parameters:** `context_settings` (optional): comma-separated `key=value` pairs to override or extend the default settings (see [context settings](#context_settings) in `llm_context_create`).

**Returns:** `NULL`

**Description:**
Creates a new inference context specifically set for embedding generation.

It is equivalent to `SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,pooling_type=mean');`

**A context must be explicitly created before performing any AI operation!**

**Example:**

```sql
SELECT llm_context_create_embedding();
```
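
Because `context_settings` is optional, the embedding defaults can be overridden in the same call; a sketch using keys from the tables above:

```sql
SELECT llm_context_create_embedding('embedding_type=FLOAT32,max_tokens=512');
```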

---

## `llm_context_create_chat(context_settings TEXT)`

**Parameters:** `context_settings` (optional): comma-separated `key=value` pairs to override or extend the default settings (see [context settings](#context_settings) in `llm_context_create`).

**Returns:** `NULL`

@@ -138,7 +211,9 @@ SELECT llm_context_create_chat();

---

## `llm_context_create_textgen(context_settings TEXT)`

**Parameters:** `context_settings` (optional): comma-separated `key=value` pairs to override or extend the default settings (see [context settings](#context_settings) in `llm_context_create`).

**Returns:** `NULL`
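
As with the other presets, the optional settings can tune the context; for instance, capping generated output via `n_predict` (an illustrative value):

```sql
SELECT llm_context_create_textgen('n_predict=256');
```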
