Conversation


@ThiloteE ThiloteE commented Oct 4, 2025

More often than not, one of the two commands I propose below is what I use by default when running inference.

  • The first one I use when trying a model for the first time, just to check that everything works: `llama-server 8080 --top-k 1 --n-predict 128 --reasoning-budget 0 --threads -1 --jinja --flash-attn auto --cache-type-k q8_0 --cache-type-v q8_0`. This command is meant to produce deterministic output and only a short run of generated tokens; a quick way to exercise it is sketched after this list.

  • The second one is geared more towards "normal" inference, with variety in the responses and no limit on the number of tokens generated: `llama-server 8080 --top-k 40 --n-predict -1 --reasoning-budget -1 --threads -1 --jinja --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0`.
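
As a minimal smoke test for the first command, something like the following should do, assuming the server ends up listening on localhost:8080 (llama-server's default port) and you hit its OpenAI-compatible `/v1/chat/completions` endpoint; the prompt text is just a placeholder:

```
# Send one chat request to the running server.
# Assumes the default host/port (localhost:8080); adjust if yours differs.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Reply with a single short sentence."}],
        "max_tokens": 32
      }'
```

With `--top-k 1` the reply should come out identical across repeated runs, which is the point of the deterministic setup.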

Here are all the arguments explained in detail:

```
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)

-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)

-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)

--reasoning-budget N                    controls the amount of thinking allowed; currently only one of: -1 for
                                        unrestricted thinking budget, or 0 to disable thinking (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)

--jinja                                 use jinja template for chat (default: disabled)
                                        (env: LLAMA_ARG_JINJA)

--top-k N                               top-k sampling (default: 40, 0 = disabled)

-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)

-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
```
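
As an aside, most of these flags can also be set through the environment variables listed above. A rough sketch of the first command expressed that way; the model path and the value format for the boolean `LLAMA_ARG_JINJA` flag are assumptions on my part:

```
# Sketch only: same settings as the first command, but via environment variables.
# The variable names come from the help text above; /path/to/model.gguf is a placeholder.
export LLAMA_ARG_THREADS=-1
export LLAMA_ARG_FLASH_ATTN=auto
export LLAMA_ARG_N_PREDICT=128
export LLAMA_ARG_THINK_BUDGET=0
export LLAMA_ARG_JINJA=1            # assuming boolean env flags accept "1"
export LLAMA_ARG_CACHE_TYPE_K=q8_0
export LLAMA_ARG_CACHE_TYPE_V=q8_0
llama-server -m /path/to/model.gguf --top-k 1   # --top-k has no env var listed above
```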

@ThiloteE ThiloteE requested a review from 3Simplex October 4, 2025 13:39