Compare ik_llama.cpp, vLLM, llama.cpp, and ktransformers engines #11

@ubergarm

Description

The top runners for hybrid CPU/GPU inferencing of R1 671B currently seem to be:

  • ikawrakow/ik_llama.cpp - a custom llama.cpp fork with many optimizations for hybrid CPU/GPU inference.
  • vLLM - the best tensor/data parallelism for multi-GPU inference, but very limited CPU and GGUF support.
  • llama.cpp - good old llama.cpp has experimental branches slowly working their way along, but see ik_llama.cpp above for the fastest option currently available on some hardware configurations.
  • ktransformers - currently the best option, especially combined with an AMD Epyc in NPS0 mode, but it requires a CUDA GPU with at least 16GB VRAM. It might also be fine on a big dual Intel Xeon box, assuming enough RAM to hold the full model weights twice for data parallelism across the two sockets, but that's probably not worth the $$$ afaict so far.

Hey @zts9989, I saw your benchmark over on llama.cpp PR#11397. Could you share some more info about your setup? It looks like you have at least one CUDA GPU and probably an AMD processor in NPS0 mode, but that's just a guess. If you like, please compare your speed against ktransformers using this guide.
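
For a rough apples-to-apples throughput number across engines, one option is to hit whatever OpenAI-compatible endpoint the server exposes (llama.cpp's llama-server and vLLM both serve one, and ktransformers ships a similar server) and time the completion. This is just a sketch: the base URL, port, and model name below are placeholders for your setup, and the number it prints is end-to-end (prompt processing plus generation), so it's coarser than the separate pp/tg figures you get from something like llama-bench.

```python
# Rough throughput probe against an OpenAI-compatible /v1/completions endpoint.
# BASE_URL, port, and the model name are assumptions -- adjust for your server.
import time
import requests

BASE_URL = "http://localhost:8080/v1"   # placeholder: wherever your server listens
PROMPT = "Explain the difference between MoE and dense transformer layers."
MAX_TOKENS = 256

start = time.time()
resp = requests.post(
    f"{BASE_URL}/completions",
    json={
        "model": "placeholder-model-name",  # many local servers ignore or remap this
        "prompt": PROMPT,
        "max_tokens": MAX_TOKENS,
        "temperature": 0.0,
    },
    timeout=600,
)
resp.raise_for_status()
elapsed = time.time() - start

data = resp.json()
# usage.completion_tokens is part of the OpenAI response schema; fall back to a
# rough estimate if the server omits it.
generated = data.get("usage", {}).get("completion_tokens", MAX_TOKENS)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")
```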

Also, for anyone interested, there is a video screen-share of running ktransformers locally using this guide: https://www.reddit.com/r/LocalLLaMA/comments/1j329e9/ktransformers_troll_rig_r1_671b_udq2_k_xl_on_96gb/
