Add more commands to Toolbox-Config.psm1 #13
More often than not, one of the two commands I propose here is what I run by default for inference.
The first one I use when trying a model for the first time, just to check that everything works:
```
llama-server 8080 --top-k 1 --n-predict 128 --reasoning-budget 0 --threads -1 --jinja --flash-attn auto --cache-type-k q8_0 --cache-type-v q8_0
```

This command is meant to produce deterministic inference (greedy sampling via `--top-k 1`) and only a short generation of 128 tokens.
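Purely as an illustration, here is a minimal sketch of how this quick-test command could be exposed as a function from Toolbox-Config.psm1. The function name `Invoke-LlamaQuickTest` and the `-Port` parameter are placeholders I made up, not existing module code; the llama-server arguments are the ones from the command above, with the leading `8080` assumed to be the port:

```powershell
function Invoke-LlamaQuickTest {
    [CmdletBinding()]
    param(
        # Assumed parameter; the original command passes 8080 positionally.
        [int]$Port = 8080
    )

    # Greedy sampling (--top-k 1) plus a 128-token cap gives a fast, repeatable smoke test.
    $serverArgs = @(
        "$Port"
        '--top-k', '1'
        '--n-predict', '128'
        '--reasoning-budget', '0'
        '--threads', '-1'
        '--jinja'
        '--flash-attn', 'auto'
        '--cache-type-k', 'q8_0'
        '--cache-type-v', 'q8_0'
    )

    & llama-server @serverArgs
}
```

Calling `Invoke-LlamaQuickTest` would then start the server with the deterministic settings, and the same pattern would work for the second command below.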
The second one is geared more towards "normal" inference, with variety in the responses and no limit on the number of tokens generated:

```
llama-server 8080 --top-k 40 --n-predict -1 --reasoning-budget -1 --threads -1 --jinja --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
```

Here are all the arguments explained in detail: