Replies: 7 comments 5 replies
-
Pretty sweet
-
You should go into detail on how you did this across multiple devices. I'd love to read/watch that!
-
Thanks for showing your results. Just started to use dllama and it looks very nice!
This is the command I used (v0.12.7):
-
Hello b4rtaz, could you tell me what the difference is between evaluation and prediction?
-
How did you run it successfully on 2 x Raspberry Pi 5 8GB? When I run this scenario, the worker node only shows Listening on 0.0.0.0:9999... My command on the root node is:
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --steps 16 --nthreads 4 --workers 192.168.1.100:9999
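For reference, a minimal two-node launch sequence for this setup might look like the sketch below. The worker must be started first, so it is already listening when the root node connects. The `worker` subcommand and `--port` flag are assumptions based on the distributed-llama README, not something confirmed in this thread; the root-node flags are the ones quoted above.

```shell
# On the worker node (192.168.1.100) -- start it first and leave it running.
# "Listening on 0.0.0.0:9999..." is the expected idle state until the root connects.
./dllama worker --port 9999 --nthreads 4

# On the root node -- --workers must point at the worker's reachable IP:port.
./dllama inference \
  --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m \
  --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t \
  --buffer-float-type q80 \
  --steps 16 \
  --nthreads 4 \
  --workers 192.168.1.100:9999
```

If the root node hangs while the worker keeps printing only the listening message, a common cause is that port 9999 is blocked by a firewall or the `--workers` address is not reachable from the root node.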
-
Do you know why it only shows blanks and dots?
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
👱 User
🤖 Assistant . . . . . . . .
-
Model: deepseek_r1_distill_llama_8b_q40
Version: 0.12.2
Hardware: 2 x Raspberry Pi 5 8GB, 4 x Raspberry Pi 5 8GB