--user-data-dir USER_DATA_DIR Path to the user data directory. Default: auto-detected.
--multi-user Multi-user mode. Chat histories are not saved or automatically loaded. Warning: this is likely not safe for sharing publicly.
--model MODEL Name of the model to load by default.
--lora LORA [LORA ...] The list of LoRAs to load. If you want to load more than one LoRA, write the names separated by spaces.

Image model:
Quantization method for image model.

Model loader:
--loader LOADER Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, ExLlamav3_HF, ExLlamav3, TensorRT-LLM.

Context and cache:
--ctx-size, --n_ctx, --max_seq_len N Context size in tokens. llama.cpp: 0 = auto if gpu-layers is also -1.
--cache-type, --cache_type N KV cache type; valid options: llama.cpp - fp16, q8_0, q4_0; ExLlamaV3 - fp16, q2 to q8 (can specify k_bits and v_bits separately, e.g. q4_q8).
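As a sketch of the combined ExLlamaV3 notation described above (the function is hypothetical, not taken from the webui source), a value like "q4_q8" splits into separate k_bits and v_bits, while a single value applies to both:

```python
# Hypothetical helper: split an ExLlamaV3 cache-type string into (k_bits, v_bits).
def split_cache_type(cache_type: str) -> tuple[str, str]:
    parts = cache_type.split("_")
    if len(parts) == 2 and all(p.startswith("q") for p in parts):
        return parts[0], parts[1]   # e.g. "q4_q8" -> ("q4", "q8")
    return cache_type, cache_type   # e.g. "q6" or "fp16" applies to both
```

Note that llama.cpp's "q8_0" stays whole, since its second half is not a quantization prefix.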

Speculative decoding:
--model-draft MODEL_DRAFT Path to the draft model for speculative decoding.
--spec-ngram-min-hits SPEC_NGRAM_MIN_HITS Minimum n-gram hits for ngram-map speculative decoding.

llama.cpp:
--gpu-layers, --n-gpu-layers N Number of layers to offload to the GPU. -1 = auto.
--cpu-moe Move the experts to the CPU (for MoE models).
--mmproj MMPROJ Path to the mmproj file for vision models.
--streaming-llm Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.
--threads THREADS Number of threads to use.
--threads-batch THREADS_BATCH Number of threads to use for batches/prompt processing.
--numa Activate NUMA task allocation for llama.cpp.
--parallel PARALLEL Number of parallel request slots. The context size is divided equally among slots. For example, to have 4 slots with 8192 context each, set ctx_size to 32768.
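The even split described for --parallel can be checked with a one-line sketch (illustrative only, not the webui's own code):

```python
# Each request slot sees ctx_size // parallel tokens of context.
def per_slot_context(ctx_size: int, parallel: int) -> int:
    return ctx_size // parallel
```

So 4 slots with 8192 context each do indeed require a total ctx_size of 32768.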
--fit-target FIT_TARGET Target VRAM margin per device for auto GPU layers, comma-separated list of values in MiB. A single value is broadcast across all devices. Default: 1024.
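The broadcast rule for --fit-target can be sketched as follows. This is an illustration of the documented behavior, not the webui's internals; in particular, the error raised on a length mismatch is an assumption:

```python
# Hypothetical parser for a comma-separated MiB list such as "1024" or "512,2048".
def parse_fit_target(value: str, num_devices: int) -> list[int]:
    margins = [int(v) for v in value.split(",")]
    if len(margins) == 1:
        return margins * num_devices  # a single value is broadcast to all devices
    if len(margins) != num_devices:
        raise ValueError("provide one value, or one per device")
    return margins
```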
--extra-flags EXTRA_FLAGS Extra flags to pass to llama-server. Format: "flag1=value1,flag2,flag3=value3". Example: "override-tensor=exps=CPU"
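A minimal sketch of the documented --extra-flags format (the parser name is made up; the webui's actual implementation may differ):

```python
# Parse "flag1=value1,flag2,flag3=value3" into a dict; bare flags carry no value.
def parse_extra_flags(spec: str) -> dict:
    flags = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")  # split at the first "=" only
        flags[key] = value if sep else None
    return flags
```

Splitting at the first "=" only is what lets a value itself contain "=", as in the "override-tensor=exps=CPU" example.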

Transformers/Accelerate:
--cpu Use the CPU to generate text. Warning: Training on CPU is extremely slow.
--cpu-memory CPU_MEMORY Maximum CPU memory in GiB. Use this for CPU offloading.
--disk If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk.
--disk-cache-dir DISK_CACHE_DIR Directory to save the disk cache to.
--load-in-8bit Load the model with 8-bit precision (using bitsandbytes).
--bf16 Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU.
--no-cache Set use_cache to False while generating text. This reduces VRAM usage slightly, but it comes at a performance cost.

ExLlamaV3:
--tp-backend TP_BACKEND The backend for tensor parallelism. Valid options: native, nccl. Default: native.
--cfg-cache Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader.

RoPE:
--alpha_value ALPHA_VALUE Positional embeddings alpha factor for NTK RoPE scaling. Use either this or compress_pos_emb, not both.
--rope_freq_base ROPE_FREQ_BASE If greater than 0, will be used instead of alpha_value. Those two are related by rope_freq_base = 10000 * alpha_value ^ (64 / 63).
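The relation stated above can be evaluated directly; this is just the help text's formula written out, nothing more:

```python
# rope_freq_base = 10000 * alpha_value ^ (64 / 63), as given in the help text.
def rope_freq_base(alpha_value: float) -> float:
    return 10000 * alpha_value ** (64 / 63)
```

For alpha_value = 1 this gives exactly 10000; for alpha_value = 2 the base slightly more than doubles, since the exponent 64/63 is just above 1.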

API:
--api-enable-ipv6 Enable IPv6 for the API
--api-disable-ipv4 Disable IPv4 for the API
--nowebui Do not launch the Gradio UI. Useful for launching the API in standalone mode.

API generation defaults:
--temperature N Temperature
--dynatemp-low N Dynamic temperature low
--dynatemp-high N Dynamic temperature high
--dynatemp-exponent N Dynamic temperature exponent
--smoothing-factor N Smoothing factor
--smoothing-curve N Smoothing curve
--min-p N Min P
--top-p N Top P
--top-k N Top K
--typical-p N Typical P
--xtc-threshold N XTC threshold
--xtc-probability N XTC probability
--epsilon-cutoff N Epsilon cutoff
--eta-cutoff N Eta cutoff
--tfs N TFS
--top-a N Top A
--top-n-sigma N Top N Sigma
--adaptive-target N Adaptive target
--adaptive-decay N Adaptive decay
--dry-multiplier N DRY multiplier
--dry-allowed-length N DRY allowed length
--dry-base N DRY base
--repetition-penalty N Repetition penalty
--frequency-penalty N Frequency penalty
--presence-penalty N Presence penalty
--encoder-repetition-penalty N Encoder repetition penalty
--no-repeat-ngram-size N No repeat ngram size
--repetition-penalty-range N Repetition penalty range
--penalty-alpha N Penalty alpha
--guidance-scale N Guidance scale
--mirostat-mode N Mirostat mode
--mirostat-tau N Mirostat tau
--mirostat-eta N Mirostat eta
--do-sample, --no-do-sample Do sample
--dynamic-temperature, --no-dynamic-temperature Dynamic temperature
--temperature-last, --no-temperature-last Temperature last
--chat-template-file CHAT_TEMPLATE_FILE Path to a chat template file (.jinja, .jinja2, or .yaml) to use as the default instruction template for API requests. Overrides the model's