Ideas for models & also distributed inference over LAN. #6
ghchris2021 started this conversation in Ideas
-
Congratulations on the great FOSS project, and thank you very much; I look forward to seeing what becomes of it!

Per the request for ideas and aspirations regarding features / model support, I'll share my own thoughts.

In terms of facilitating running larger models in general, my primary wish list for an inference system is:

A: Support distributed inference over an IP LAN, using any mix of GPU+VRAM and CPU+RAM resources across multiple Linux PCs, so that the available GPU, CPU, and RAM can be pooled effectively when a model needs more memory than the 16-24 GB of VRAM on a typical GPU (a rough sketch of what this could look like appears after the lists below).

B: Support heterogeneous GPUs -- NVIDIA, Intel Arc, AMD RDNA -- in any combination: alone, together, or distributed.

For model support, my main desires are (in no particular order):

Llama-3.1; DeepSeek-Coder-V2; DeepSeek-Chat (most recent); Mistral-Large; Codestral; Mixtral-8x22B; Gemma-2-27B; CodeGemma; Qwen2; CodeQwen.
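To make requests A and B concrete, here is a minimal, hypothetical sketch of what pipeline-style inference over a LAN could look like with plain torch.distributed. None of this is existing KTransformers API: the gloo backend, the pick_device() helper, the per-node layer shards, and the MASTER_ADDR / RANK / WORLD_SIZE environment variables are all assumptions, chosen because gloo runs over ordinary TCP/IP and therefore tolerates mixed GPU vendors.

```python
# Hypothetical sketch only -- not KTransformers code. Each LAN node runs
# this script with RANK/WORLD_SIZE set and loads just its shard of layers.
import os
import torch
import torch.distributed as dist

def pick_device() -> torch.device:
    """Per-node device pick, friendly to heterogeneous GPUs (request B)."""
    if torch.cuda.is_available():        # NVIDIA, or AMD via ROCm builds
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel Arc
        return torch.device("xpu")
    return torch.device("cpu")

def run_shard(layers: torch.nn.ModuleList, hidden: torch.Tensor) -> torch.Tensor:
    # gloo speaks plain TCP/IP, so a mixed-vendor LAN works; activations
    # are staged through CPU tensors for the send/recv hops.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:29500",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    rank, world = dist.get_rank(), dist.get_world_size()
    device = pick_device()
    layers = layers.to(device)

    if rank > 0:                         # receive activations from upstream
        dist.recv(hidden, src=rank - 1)
    hidden = hidden.to(device)
    with torch.no_grad():
        for layer in layers:             # run this node's slice of the model
            hidden = layer(hidden)
    hidden = hidden.cpu()
    if rank < world - 1:                 # forward activations downstream
        dist.send(hidden, dst=rank + 1)
    return hidden                        # final rank holds the output
```

Each node holds only its own layers, so a model larger than any single machine's VRAM could still run; a real implementation would additionally need KV-cache handling, a per-token generation loop, and shape negotiation before the recv.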
-
Replies: 1 comment

Thank you for the great suggestions! Merely supporting heterogeneous GPUs would not be a problem for KTransformers, because it is based on transformers/torch. It may not be as efficient as the highly optimized Marlin CUDA kernel, but it can still benefit from CPU offloading. We are also interested in implementing an Exo-like multi-machine operator.
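For reference, the CPU offloading mentioned in this reply is already reachable through the stock transformers/accelerate stack that KTransformers builds on. Below is a minimal sketch using that public API; the model ID and memory caps are illustrative assumptions, not KTransformers defaults, and none of the optimized Marlin-style kernels are involved.

```python
# Plain transformers/accelerate CPU offloading -- a baseline sketch, not
# the KTransformers-optimized path. Model ID and memory caps are examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # illustrative choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate place the layers
    max_memory={0: "16GiB", "cpu": "64GiB"},  # spill past VRAM into system RAM
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With device_map="auto" plus a max_memory budget, layers that do not fit under the 16 GiB GPU cap are kept in host RAM and streamed through the GPU at inference time, which is slower than fully resident weights but lets a 16-24 GB card serve much larger models.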