Compare ik_llama.cpp, vLLM, llama.cpp, and ktransformers engines #11

@ubergarm

Description

The top runners for hybrid CPU/GPU inferencing of R1 671B currently seem to be:

  • ikawrakow/ik_llama.cpp - a custom llama.cpp fork with many optimizations for hybrid CPU/GPU inference.
  • vLLM - the best tensor/data parallelism for multi-GPU inference, but very limited CPU and GGUF support.
  • llama.cpp - good old llama.cpp has experimental branches slowly working their way along, but see ik_llama.cpp above for the fastest option currently available on some hardware configurations.
  • ktransformers - currently the best option, especially combined with an AMD Epyc in NPS0 mode, but it requires a CUDA GPU with at least 16GB VRAM. It might also be fine on a big dual Intel Xeon box, assuming enough RAM to hold the full model weights twice for data parallelism across the two sockets, but that's probably not worth the $$$ afaict so far.

Hey @zts9989, I saw your benchmark over on llama.cpp PR#11397. Could you share some more info about your setup? It looks like you have at least one CUDA GPU and probably an AMD processor in NPS0 mode, but that's just a guess. If you like, please compare your speed against ktransformers using this guide.
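
For a rough apples-to-apples throughput number across engines, one option is to hit whatever OpenAI-compatible endpoint the server exposes (llama.cpp's llama-server and vLLM both serve one, and ktransformers ships a similar server) and time the completion. This is just a sketch: the base URL, port, and model name below are placeholders for your setup, and the number it prints is end-to-end (prompt processing plus generation), so it's coarser than the separate pp/tg figures you get from something like llama-bench.

```python
# Rough throughput probe against an OpenAI-compatible /v1/completions endpoint.
# BASE_URL, port, and the model name are assumptions -- adjust for your server.
import time
import requests

BASE_URL = "http://localhost:8080/v1"   # placeholder: wherever your server listens
PROMPT = "Explain the difference between MoE and dense transformer layers."
MAX_TOKENS = 256

start = time.time()
resp = requests.post(
    f"{BASE_URL}/completions",
    json={
        "model": "placeholder-model-name",  # many local servers ignore or remap this
        "prompt": PROMPT,
        "max_tokens": MAX_TOKENS,
        "temperature": 0.0,
    },
    timeout=600,
)
resp.raise_for_status()
elapsed = time.time() - start

data = resp.json()
# usage.completion_tokens is part of the OpenAI response schema; fall back to a
# rough estimate if the server omits it.
generated = data.get("usage", {}).get("completion_tokens", MAX_TOKENS)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")
```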

Also, for anyone interested, there is a video screen-share of running ktransformers locally using this guide: https://www.reddit.com/r/LocalLLaMA/comments/1j329e9/ktransformers_troll_rig_r1_671b_udq2_k_xl_on_96gb/
