
Commit 67e31c2

Add explanation to --no-mmap in llama server
1 parent: b253462


tools/server/README.md

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ The project is under active development, and we are [looking for feedback and co
 | `-dt, --defrag-thold N` | KV cache defragmentation threshold (default: 0.1, < 0 - disabled)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
 | `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
 | `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
-| `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
+| `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock; if all layers are offloaded to a GPU and RAM < VRAM, this actually makes loading faster)<br/>(env: LLAMA_ARG_NO_MMAP) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
 | `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
 | `--list-devices` | print list of available devices and exit |
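
For context, a minimal usage sketch of the option documented in the changed row. The binary location and model path are placeholders; `--no-mmap`, `-m`, and `LLAMA_ARG_NO_MMAP` come from the llama.cpp server options table above, and it is assumed here that setting the environment variable (e.g. to 1) has the same effect as passing the flag:

    # Disable memory-mapping via the CLI flag (model path is a placeholder)
    ./llama-server -m ./models/my-model.gguf --no-mmap

    # Assumed equivalent, using the environment variable from the table above
    LLAMA_ARG_NO_MMAP=1 ./llama-server -m ./models/my-model.gguf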
