Commit 45aa241

update readme.md.
1 parent d7ef8b2 commit 45aa241

File tree: 2 files changed, +109 −2 lines

README.md

Lines changed: 15 additions & 2 deletions

```diff
@@ -8,6 +8,9 @@ Connect home devices into a powerful cluster to accelerate LLM inference. More d
 
 Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.
 
+**How to Run**
+- [🍓 How to Run on Raspberry Pi](./docs/HOW_TO_RUN_RASPBERRYPI.md)
+
 **News**
 - 5 Sep 2025 - Qwen 3 MoE models are now supported on CPU.
 - 3 Aug 2025 - Qwen 3 0.6B, 1.7B, 8B and 14B models are now supported.
```
`````diff
@@ -51,9 +54,19 @@ Supported architectures: Llama, Qwen3.
 
 ### 👷 Architecture
 
+````
+[🔀 SWITCH OR ROUTER]
+| | | |
+| | | |_______ 🔸 device1 (ROOT) 10.0.0.1
+| | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
+| |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
+|_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
+...
+````
+
 The project is split up into two parts:
-* **Root node** - it's responsible for loading the model and weights and forward them to workers. Also, it synchronizes the state of the neural network. The root node is also a worker, it processes own slice of the neural network.
-* **Worker node** - it processes own slice of the neural network. It doesn't require any configuration related to the model.
+* **🔸 Root node** - responsible for loading the model and weights and forwarding them to the workers; it also synchronizes the state of the neural network. The root node is also a worker: it processes its own slice of the neural network.
+* **🔹 Worker node** - processes its own slice of the neural network. It doesn't require any model-related configuration.
 
 You always need the root node, and you can add 2^n - 1 worker nodes to speed up inference. The RAM usage of the neural network is split across all nodes; the root node requires a bit more RAM than the worker nodes.
`````
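To make the node-count and RAM-split rule concrete, here is a small sketch (the 8 GB total and the root's overhead factor are illustrative assumptions, not measurements from Distributed Llama):

```python
# Sketch: how model RAM spreads across a root plus 2^n - 1 workers.
# The numbers below are hypothetical, for illustration only.

def per_node_ram_gb(total_gb: float, n_nodes: int, root_overhead: float = 1.1) -> list:
    """Return an estimated RAM share per node (root first)."""
    # Valid cluster sizes are powers of two: 1 root + (2^n - 1) workers.
    if n_nodes < 1 or n_nodes & (n_nodes - 1) != 0:
        raise ValueError("node count must be a power of two (1, 2, 4, 8, ...)")
    share = total_gb / n_nodes
    # The root holds a bit more than an even share, since it also loads the weights.
    return [round(share * root_overhead, 2)] + [round(share, 2)] * (n_nodes - 1)

print(per_node_ram_gb(8.0, 4))  # root + 3 workers
```

With 4 nodes, an 8 GB model costs each worker roughly a quarter of the total, with the root slightly above that.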

docs/HOW_TO_RUN_RASPBERRYPI.md

Lines changed: 94 additions & 0 deletions

@@ -0,0 +1,94 @@

# How to run Distributed Llama on 🍓 Raspberry Pi

This article describes how to run Distributed Llama on 4 Raspberry Pi devices, but you can also run it on 1, 2, 4, 8... devices. Please adjust the commands and the topology to match your configuration.

````
[🔀 SWITCH OR ROUTER]
| | | |
| | | |_______ 🔸 raspberrypi1 (ROOT) 10.0.0.1
| | |_________ 🔹 raspberrypi2 (WORKER 1) 10.0.0.2:9999
| |___________ 🔹 raspberrypi3 (WORKER 2) 10.0.0.3:9999
|_____________ 🔹 raspberrypi4 (WORKER 3) 10.0.0.4:9999
````

1. Install `Raspberry Pi OS Lite (64 bit)` on **🔸🔹 ALL** of your Raspberry Pi devices. This OS doesn't have a desktop environment, but you can easily connect to it via SSH to manage it.
2. Connect **🔸🔹 ALL** devices to your **🔀 SWITCH OR ROUTER** via Ethernet cables.
3. Connect to all devices via SSH from your computer:

```
ssh user@raspberrypi1.local
ssh user@raspberrypi2.local
ssh user@raspberrypi3.local
ssh user@raspberrypi4.local
```

4. Install Git on **🔸🔹 ALL** devices:

```sh
sudo apt install git
```

5. Clone this repository and compile Distributed Llama on **🔸🔹 ALL** devices:

```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```

6. Download the model to the **🔸 ROOT** device using the `launch.py` script. You don't need to download the model on the worker devices.

```sh
python3 launch.py                          # Prints a list of available models

python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
```

7. Assign static IP addresses on **🔸🔹 ALL** devices. Each device must have a unique IP address in the same subnet.

```sh
sudo ip addr add 10.0.0.1/24 dev eth0 # 🔸 ROOT
sudo ip addr add 10.0.0.2/24 dev eth0 # 🔹 WORKER 1
sudo ip addr add 10.0.0.3/24 dev eth0 # 🔹 WORKER 2
sudo ip addr add 10.0.0.4/24 dev eth0 # 🔹 WORKER 3
```
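Note that addresses added with `ip addr add` do not survive a reboot. As a hedged sketch of one way to make them persistent: on Raspberry Pi OS releases that use `dhcpcd`, you could append something like the fragment below to `/etc/dhcpcd.conf` (adjust the address per device); newer releases use NetworkManager, where `nmcli` is the equivalent tool.

```
# /etc/dhcpcd.conf (🔸 ROOT device shown; use 10.0.0.2-4 on the workers)
interface eth0
static ip_address=10.0.0.1/24
```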
8. Start workers on all **🔹 WORKER** devices:

```sh
sudo nice -n -20 ./dllama worker --port 9999 --nthreads 4
```

9. Run inference on the **🔸 ROOT** device to check that everything works:

```sh
sudo nice -n -20 ./dllama inference \
  --prompt "Hello world" \
  --steps 32 \
  --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
  --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --max-seq-len 4096 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```

10. To run the API server, start it on the **🔸 ROOT** device:

```sh
sudo nice -n -20 ./dllama-api \
  --port 9999 \
  --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
  --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --max-seq-len 4096 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```

Now you can connect to the API server from your computer:

```
http://raspberrypi1.local:9999/v1/models
```
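The `/v1/models` path above follows the OpenAI-style API layout. As a sketch of querying it programmatically (assuming the server from the previous step is reachable at `raspberrypi1.local:9999`), using only Python's standard library:

```python
import json
import urllib.request

API_BASE = "http://raspberrypi1.local:9999"  # hostname and port from the steps above


def list_models(base: str = API_BASE) -> dict:
    """Fetch the model list from the server's /v1/models endpoint."""
    with urllib.request.urlopen(f"{base}/v1/models") as resp:
        return json.load(resp)

# Example (requires the cluster to be running):
# print(list_models())
```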
