Conversation


@ThiloteE ThiloteE commented Oct 4, 2025

More often than not, one of the two commands I propose below is what I use by default when running inference.

  • The first one I use when trying a model for the first time, just to check that everything works: `llama-server 8080 --top-k 1 --n-predict 128 --reasoning-budget 0 --threads -1 --jinja --flash-attn auto --cache-type-k q8_0 --cache-type-v q8_0`. This command is meant to produce deterministic output and only a short run of generated tokens; a quick way to exercise it is sketched after this list.

  • The second one is geared more towards "normal" inference, with variety in the responses and no limit on the number of tokens generated: `llama-server 8080 --top-k 40 --n-predict -1 --reasoning-budget -1 --threads -1 --jinja --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0`.
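
As a minimal smoke test for the first command, something like the following should do, assuming the server ends up listening on localhost:8080 (llama-server's default port) and you hit its OpenAI-compatible `/v1/chat/completions` endpoint; the prompt text is just a placeholder:

```
# Send one chat request to the running server.
# Assumes the default host/port (localhost:8080); adjust if yours differs.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Reply with a single short sentence."}],
        "max_tokens": 32
      }'
```

With `--top-k 1` the reply should come out identical across repeated runs, which is the point of the deterministic setup.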

Here are all the arguments explained in detail:

```
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)

-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)

-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)

--reasoning-budget N                    controls the amount of thinking allowed; currently only one of: -1 for
                                        unrestricted thinking budget, or 0 to disable thinking (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)

--jinja                                 use jinja template for chat (default: disabled)
                                        (env: LLAMA_ARG_JINJA)

--top-k N                               top-k sampling (default: 40, 0 = disabled)

-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)

-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
```
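
As an aside, most of these flags can also be set through the environment variables listed above. A rough sketch of the first command expressed that way; the model path and the value format for the boolean `LLAMA_ARG_JINJA` flag are assumptions on my part:

```
# Sketch only: same settings as the first command, but via environment variables.
# The variable names come from the help text above; /path/to/model.gguf is a placeholder.
export LLAMA_ARG_THREADS=-1
export LLAMA_ARG_FLASH_ATTN=auto
export LLAMA_ARG_N_PREDICT=128
export LLAMA_ARG_THINK_BUDGET=0
export LLAMA_ARG_JINJA=1            # assuming boolean env flags accept "1"
export LLAMA_ARG_CACHE_TYPE_K=q8_0
export LLAMA_ARG_CACHE_TYPE_V=q8_0
llama-server -m /path/to/model.gguf --top-k 1   # --top-k has no env var listed above
```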

@ThiloteE ThiloteE requested a review from 3Simplex October 4, 2025 13:39