doc : update documentation for --tensor-split (#15980)

rgerganov · JohannesGaessler · slaren · web-flow · commit d1c6f11f47ba · 2025-09-14T12:10:07.000+03:00
* doc : update documentation for --tensor-split

* Update tools/main/README.md

Co-authored-by: Johannes Gäßler &lt;johannesg@5d6.de&gt;

* Update tools/main/README.md

Co-authored-by: Diego Devesa &lt;slarengh@gmail.com&gt;

---------

Co-authored-by: Johannes Gäßler &lt;johannesg@5d6.de&gt;
Co-authored-by: Diego Devesa &lt;slarengh@gmail.com&gt;
diff --git a/tools/main/README.md b/tools/main/README.md
@@ -384,5 +384,5 @@ These options provide extra functionality and customization when running the LLa
 -   `--verbose-prompt`: Print the prompt before generating text.
 -   `--no-display-prompt`: Don't print prompt at generation.
 -   `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
--   `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
+-   `-ts SPLIT, --tensor-split SPLIT`: When using multiple devices this option controls how tensors should be split across devices. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each device should get in order. For example, "3,2" will assign 60% of the data to device 0 and 40% to device 1. By default, the data is split in proportion to VRAM, but this may not be optimal for performance. The list of the devices which are being used is printed on startup and can be different from the device list given by `--list-devices` or e.g. `nvidia-smi`.
 -   `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable  or in an OS-specific local cache.