|`-sp, --special`| special tokens output enabled (default: false) |
|`--spm-infill`| use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: disabled) |
|`--rope-scaling {none,linear,yarn}`| RoPE frequency scaling method, defaults to linear unless specified by the model<br/>(env: LLAMA_ARG_ROPE_SCALING_TYPE) |
|`--rope-scale N`| RoPE context scaling factor, expands context by a factor of N<br/>(env: LLAMA_ARG_ROPE_SCALE) |
|`--rope-freq-base N`| RoPE base frequency, used by NTK-aware scaling (default: loaded from model)<br/>(env: LLAMA_ARG_ROPE_FREQ_BASE) |
|`--rope-freq-scale N`| RoPE frequency scaling factor, expands context by a factor of 1/N<br/>(env: LLAMA_ARG_ROPE_FREQ_SCALE) |
|`--yarn-orig-ctx N`| YaRN: original context size of model (default: 0 = model training context size)<br/>(env: LLAMA_ARG_YARN_ORIG_CTX) |
|`-np, --parallel N`| number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
|`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
|`--no-mmap`| do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
|`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggerganov/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
|`-ngl, --gpu-layers, --n-gpu-layers N`| number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
|`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
|`-ts, --tensor-split N0,N1,N2,...`| fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
|`-mg, --main-gpu INDEX`| the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)<br/>(env: LLAMA_ARG_MAIN_GPU) |
|`--check-tensors`| check model tensor data for invalid values (default: false) |
|`--override-kv KEY=TYPE:VALUE`| advanced option to override model metadata by key. may be specified multiple times.<br/>types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false |
|`--lora FNAME`| path to LoRA adapter (can be repeated to use multiple adapters) |
|`--lora-scaled FNAME SCALE`| path to LoRA adapter with user defined scaling (can be repeated to use multiple adapters) |
|`--control-vector FNAME`| add a control vector<br/>note: this argument can be repeated to add multiple control vectors |
|`--control-vector-scaled FNAME SCALE`| add a control vector with user defined scaling SCALE<br/>note: this argument can be repeated to add multiple scaled control vectors |
|`--control-vector-layer-range START END`| layer range to apply the control vector(s) to, start and end inclusive |
|`-m, --model FNAME`| model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)<br/>(env: LLAMA_ARG_MODEL) |
|`-mu, --model-url MODEL_URL`| model download url (default: unused)<br/>(env: LLAMA_ARG_MODEL_URL) |
|`-hfr, --hf-repo REPO`| Hugging Face model repository (default: unused)<br/>(env: LLAMA_ARG_HF_REPO) |
|`-hff, --hf-file FILE`| Hugging Face model file (default: unused)<br/>(env: LLAMA_ARG_HF_FILE) |
|`-hft, --hf-token TOKEN`| Hugging Face access token (default: value from HF_TOKEN environment variable)<br/>(env: HF_TOKEN) |
|`-ld, --logdir LOGDIR`| path under which to save YAML logs (no logging if unset) |
|`-v, --verbose, --log-verbose`| Set verbosity level to infinity (i.e. log all messages, useful for debugging) |
|`-lv, --verbosity, --log-verbosity N`| Set the verbosity threshold. Messages with a higher verbosity will be ignored.<br/>(env: LLAMA_LOG_VERBOSITY) |
|`--log-prefix`| Enable prefix in log messages<br/>(env: LLAMA_LOG_PREFIX) |
|`--log-timestamps`| Enable timestamps in log messages<br/>(env: LLAMA_LOG_TIMESTAMPS) |
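
For illustration, here is a hedged launch sketch using a few of the model-loading and GPU-offload flags above; the binary name `llama-server` and the multi-GPU setup are assumptions, and the model path is simply the documented default.

```bash
# Load the default model, offload 33 layers to VRAM, and pin the weights in RAM
# (the binary name `llama-server` is an assumption; adjust to your build).
llama-server -m models/7B/ggml-model-f16.gguf -ngl 33 --mlock

# Split work across two GPUs in a 3:1 ratio, keeping intermediate results
# and the KV cache on GPU 0 (split-mode = row, as described in the table).
llama-server -m models/7B/ggml-model-f16.gguf -ngl 99 -sm row -ts 3,1 -mg 0

# Equivalent settings expressed through the environment variables listed above.
LLAMA_ARG_MODEL=models/7B/ggml-model-f16.gguf LLAMA_ARG_N_GPU_LAYERS=33 llama-server
```
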
**Sampling params**

| Argument | Explanation |
| -------- | ----------- |
|`--samplers SAMPLERS`| samplers that will be used for generation, in order, separated by ';'<br/>(default: top_k;tfs_z;typ_p;top_p;min_p;temperature) |
|`-s, --seed SEED`| RNG seed (default: 4294967295; the value 4294967295 selects a random seed) |
|`--sampling-seq SEQUENCE`| simplified sequence for samplers that will be used (default: kfypmt) |
|`--grammar GRAMMAR`| BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
|`--grammar-file FNAME`| file to read grammar from |
|`-j, --json-schema SCHEMA`| JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
|`-a, --alias STRING`| set alias for model name (to be used by REST API)<br/>(env: LLAMA_ARG_ALIAS) |
|`--host HOST`| ip address to listen on (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
|`--port PORT`| port to listen on (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
|`--path PATH`| path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
|`--embedding, --embeddings`| restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
|`--api-key KEY`| API key to use for authentication (default: none)<br/>(env: LLAMA_API_KEY) |
|`--api-key-file FNAME`| path to file containing API keys (default: none) |
|`--ssl-key-file FNAME`| path to a PEM-encoded SSL private key file<br/>(env: LLAMA_ARG_SSL_KEY_FILE) |
|`--ssl-cert-file FNAME`| path to a PEM-encoded SSL certificate file<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
|`-to, --timeout N`| server read/write timeout in seconds (default: 600)<br/>(env: LLAMA_ARG_TIMEOUT) |
|`--threads-http N`| number of threads used to process HTTP requests (default: -1)<br/>(env: LLAMA_ARG_THREADS_HTTP) |
|`-spf, --system-prompt-file FNAME`| set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications |
|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted:<br/>https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled) |
|`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
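
As a rough end-to-end sketch of the server-specific options, the commands below start a TLS endpoint with an API key and a LoRA adapter loaded but not yet applied, then apply it over the REST API. The binary name, file paths, auth scheme, and the JSON body of the `/lora-adapters` request are assumptions; only the flags and the `POST /lora-adapters` endpoint come from the table above.

```bash
# Placeholder paths throughout; `llama-server` as the binary name is an assumption.
llama-server -m models/7B/ggml-model-f16.gguf \
  --host 0.0.0.0 --port 8443 \
  --ssl-key-file server.key --ssl-cert-file server.crt \
  --api-key secret-token \
  --alias my-model \
  --lora my-adapter.gguf --lora-init-without-apply

# Apply the adapter later via POST /lora-adapters (Bearer auth and the
# request-body shape are assumptions, not confirmed by this table).
curl -k -X POST https://localhost:8443/lora-adapters \
  -H "Authorization: Bearer secret-token" \
  -H "Content-Type: application/json" \
  -d '[{"id": 0, "scale": 1.0}]'
```
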
Note: If a command line argument and an environment variable are both set for the same parameter, the argument takes precedence over the environment variable.
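
For example (same assumed binary name as above), the explicit argument wins:

```bash
# LLAMA_ARG_PORT is ignored in favor of --port, so the server listens on 8081.
LLAMA_ARG_PORT=8080 llama-server -m models/7B/ggml-model-f16.gguf --port 8081
```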