Distributed Llama

Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.

Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.

How to Run

News

🔥 Set Up the Root Node with a Single Command

Python 3 and a C++ compiler are required. The command downloads the model and the tokenizer.

| Model | Size | Command |
|-------|------|---------|
| Llama 3.1 8B Instruct Q40 | 6.32 GB | python launch.py llama3_1_8b_instruct_q40 |
| Llama 3.1 405B Instruct Q40 | 238 GB | python launch.py llama3_1_405b_instruct_q40 |
| Llama 3.2 1B Instruct Q40 | 1.7 GB | python launch.py llama3_2_1b_instruct_q40 |
| Llama 3.2 3B Instruct Q40 | 3.4 GB | python launch.py llama3_2_3b_instruct_q40 |
| Llama 3.3 70B Instruct Q40 | 40 GB | python launch.py llama3_3_70b_instruct_q40 |
| DeepSeek R1 Distill Llama 8B Q40 | 6.32 GB | python launch.py deepseek_r1_distill_llama_8b_q40 |
| Qwen 3 0.6B Q40 | 0.9 GB | python launch.py qwen3_0.6b_q40 |
| Qwen 3 1.7B Q40 | 2.2 GB | python launch.py qwen3_1.7b_q40 |
| Qwen 3 8B Q40 | 6.7 GB | python launch.py qwen3_8b_q40 |
| Qwen 3 14B Q40 | 10.9 GB | python launch.py qwen3_14b_q40 |
| Qwen 3 30B A3B Q40 | 17.0 GB | python launch.py qwen3_30b_a3b_q40 |
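
For example, to prepare the root node with the smallest model in the table, a minimal sketch (assuming the command is run from the repository root; the exact behavior of launch.py may differ between versions):

```sh
# Downloads the Qwen 3 0.6B Q40 model (0.9 GB) and its tokenizer.
python launch.py qwen3_0.6b_q40
```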

🛠️ Convert Model Manually

🚧 Known Limitations

  • You can run Distributed Llama only on 1, 2, 4, ... 2^n nodes.
  • The maximum number of nodes is equal to the number of KV heads in the model #70. For example, Llama 3 8B has 8 KV heads, so it can be split across at most 8 nodes (1 root + 7 workers).
  • Only the following quantizations are supported #183:
    • q40 model with q80 buffer-float-type
    • f32 model with f32 buffer-float-type

👷 Architecture

[🔀 SWITCH OR ROUTER]
      | | | |
      | | | |_______ 🔸 device1 (ROOT)     10.0.0.1
      | | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
      | |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
      |_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
                        ...

The project is split up into two parts:

  • 🔸 Root node - loads the model and weights and forwards them to the workers, and synchronizes the state of the neural network. The root node also acts as a worker and processes its own slice of the neural network.
  • 🔹 Worker node - processes its own slice of the neural network. It doesn't require any model-related configuration.

You always need the root node, and you can add 2^n - 1 worker nodes to speed up inference. The RAM usage of the neural network is split across all nodes; the root node requires slightly more RAM than the workers.
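
As a rough sketch of how the diagram above maps to commands (assumptions: the dllama binary has already been built on every device, and the model and tokenizer files are present on the root node; the commands and arguments are documented in the sections below):

```sh
# On each worker device (10.0.0.2, 10.0.0.3, 10.0.0.4):
./dllama worker --port 9999 --nthreads 4

# On the root device (10.0.0.1), pointing at the three workers:
./dllama inference \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --prompt "Hello World" --steps 256 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```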

🎹 Commands

  • dllama inference - run the inference with a simple benchmark,
  • dllama chat - run the CLI chat,
  • dllama worker - run the worker node,
  • dllama-api - run the API server (see the example below).
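
For example, the API server could be started and queried roughly like this. This is a sketch only: the flags are taken from the tables below and the binaries are assumed to be built in the current directory, while the request path and JSON body assume an OpenAI-compatible /v1/chat/completions endpoint, so check the project's API documentation for the exact contract:

```sh
# Start the API server on the root node, with one worker attached.
./dllama-api \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 --nthreads 4 --port 9999 \
  --workers 10.0.0.2:9999

# Query it from another machine (the endpoint path is an assumption).
curl http://10.0.0.1:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```
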
🎹 Supported Arguments


Inference, Chat, API

| Argument | Description | Example |
|----------|-------------|---------|
| --model <path> | Path to the model file. | dllama_model_meta-llama-3-8b_q40.m |
| --tokenizer <path> | Path to the tokenizer file. | dllama_tokenizer_llama3.t |
| --buffer-float-type <type> | Float precision used for synchronization. | q80 |
| --workers <workers> | Addresses of workers (ip:port), separated by spaces. | 10.0.0.1:9999 10.0.0.2:9999 |
| --max-seq-len <n> | Maximum sequence length; lowering it reduces RAM usage. | 4096 |

Inference, Chat, Worker, API

| Argument | Description | Example |
|----------|-------------|---------|
| --nthreads <n> | Number of threads. Do not set it higher than the number of CPU cores. | 4 |

Worker, API

| Argument | Description | Example |
|----------|-------------|---------|
| --port <port> | Binding port. | 9999 |

Inference

| Argument | Description | Example |
|----------|-------------|---------|
| --prompt <prompt> | Initial prompt. | "Hello World" |
| --steps <steps> | Number of tokens to generate. | 256 |
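
Putting the arguments together, a chat session on a single machine with no workers might look like this (a sketch; the file names are the example values from the tables above):

```sh
./dllama chat \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --max-seq-len 4096
```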

📊 Measurements

Please check the discussions section, where many measurements for different configurations have been published.

✋ Contribution

Feel free to contribute to this project. For small changes, simply create a new pull request. For larger changes, please create an issue to discuss your plans. Please follow these guidelines when contributing:

  • Make only minimal changes and avoid modifying files that are not necessary.
  • Ensure the code is compatible across all supported systems and CPUs.
  • This repository is maintained in English.

💡 License

This project is released under the MIT license.

📖 Citation

@misc{dllama,
  author = {Bartłomiej Tadych},
  title = {Distributed Llama},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
  commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}
