I came across the discussion "Performance of llama.cpp on Apple Silicon M-series", which has at least one (eye-watering) result for a dual (!) Xeon Platinum on Ubuntu 22.
OMG!
First, I will try to run the test CPU-only on the 2x Xeon (AVX-512) with 256 GB DDR5.
Second, I hope the model fits into the 40 GB of GDDR6 (X?) across the 2x 20 GB RTX 4000.
I guess the test can fall back to partial offload of layers if the model does not fit into the GPUs.
It would be nice if everything fit into ONE GPU, to avoid the overhead of sharding across the GPUs over PCIe 4.
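
For a rough feel of the "does it fit / how many layers can be offloaded" question, here is a back-of-the-envelope sketch. It assumes all layers are the same size and that some VRAM has to stay free for the KV cache and scratch buffers. The function name, the 2 GiB reserve, and the example model size and layer count are made-up illustration values, not measurements. The result is only a starting guess for llama.cpp's `--n-gpu-layers` (`-ngl`) option.

```python
def layers_that_fit(model_gib, n_layers, vram_gib, reserve_gib=2.0):
    """Estimate how many layers fit into VRAM.

    Assumption: the model file is split evenly across n_layers, and
    reserve_gib of VRAM is kept free for KV cache and scratch buffers.
    """
    if model_gib <= 0 or n_layers <= 0:
        raise ValueError("model_gib and n_layers must be positive")
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gib // per_layer_gib))


if __name__ == "__main__":
    # Hypothetical model: 40 GiB file, 80 layers.
    one_gpu = layers_that_fit(40.0, 80, vram_gib=20.0)
    # Two 20 GiB cards, keeping the reserve free on each of them.
    two_gpus = layers_that_fit(40.0, 80, vram_gib=40.0, reserve_gib=4.0)
    print("one GPU :", one_gpu, "layers")
    print("two GPUs:", two_gpus, "layers")
```

If the estimate for one GPU already equals the layer count, the model should fit on a single card and no split across PCIe is needed; otherwise the remaining layers stay on the CPU or go to the second GPU.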

Replies: 2 comments

Answer selected by ai-bits