The top runners for hybrid CPU/GPU inferencing of R1 671B currently seem to be:
- `ik_llama.cpp` - ikawrakow's custom `llama.cpp` fork with many optimizations
- `vLLM` - best tensor / data parallel stuff for multi-GPU inferencing, but very limited CPU / GGUF capabilities
- `llama.cpp` - good old `llama.cpp` has some experimental branches in the works moving slowly along, but refer above to `ik_llama.cpp` for the fastest stuff currently available on some hardware configurations
- `ktransformers` - currently the best option, especially combined with AMD Epyc in `NPS0`, but requires at least a 16GB VRAM CUDA GPU. Might also be okay for a big dual Intel Xeon box assuming enough RAM to hold the entire model weights twice for data parallel across the two sockets, but probably not worth the $$$ afaict so far today
Hey @zts9989, I saw your benchmark over on llama.cpp PR#11397; could you give some more info about your setup? It looks like you have at least one CUDA GPU and probably an AMD processor in NPS0, just guessing. If you like, please compare your speed against ktransformers using this guide.
Also, for anyone interested, there is a video screen-share of running ktransformers locally using this guide: https://www.reddit.com/r/LocalLLaMA/comments/1j329e9/ktransformers_troll_rig_r1_671b_udq2_k_xl_on_96gb/
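
For anyone who hasn't tried it yet, a typical ktransformers hybrid CPU/GPU run looks roughly like the sketch below. This is adapted from the upstream ktransformers `local_chat.py` example, not the exact command from the guide above: the model/GGUF paths are placeholders and the flag values depend on your hardware, so treat it as illustrative only.

```
# Illustrative sketch only: paths are placeholders and --cpu_infer should be
# tuned to your physical core count; see the linked guide for the exact
# command and quant used in the benchmarks.
python ./ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-R1 \
  --gguf_path ./DeepSeek-R1-GGUF \
  --cpu_infer 32 \
  --max_new_tokens 1024
```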