Description
🏎️ Speed Report
Hi everyone,
I've been testing the community builds of MLC and llama.cpp on Jetson Thor and noticed a significant performance gap during prefill. I'd like to check whether this is expected behavior or whether there are optimization options I might have missed.
Model setup: Qwen3-30B-A3B (tested with Q4_K_M and q4bf16_1/q4f16_1 quantized variants)
Hardware: Jetson Thor
Performance comparison:
llama.cpp: Prefill for ~10k tokens takes about 30-40 seconds
MLC: the same setup takes 1.5-2× longer for prefill (a rough tokens/s conversion is below)
On the other hand, MLC is much faster during decode (roughly 3× faster than llama.cpp)
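For reference, here is a rough conversion of those times into prefill throughput. This is just back-of-the-envelope arithmetic from the numbers above, nothing measured separately:

```python
# Rough prefill throughput implied by the timings above (~10k-token prompt).
prompt_tokens = 10_000

# llama.cpp: 30-40 s of prefill  ->  ~250-330 tok/s
llama_cpp_tok_s = [prompt_tokens / t for t in (40, 30)]
# MLC: 1.5-2x longer on the same prompt  ->  ~125-220 tok/s
mlc_tok_s = [prompt_tokens / (40 * 2.0), prompt_tokens / (30 * 1.5)]

print(f"llama.cpp: {llama_cpp_tok_s[0]:.0f}-{llama_cpp_tok_s[1]:.0f} tok/s")
print(f"MLC:       {mlc_tok_s[0]:.0f}-{mlc_tok_s[1]:.0f} tok/s")
```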
Stability:
llama.cpp server sometimes reports illegal memory access errors
MLC is more stable, but the prefill speed gap is quite large
Additional note: As far as I can tell, MLC does not yet support FP8 activations.
My questions are:
Is this prefill slowdown mainly due to MLC's framework design, or a lack of optimization/adaptation for Jetson Thor?
Are there recommended build flags, runtime parameters, or configuration tweaks to improve prefill performance? (See the sketch below for the kind of tweak I mean.)
If there are known issues or a roadmap for improvements, I'd really appreciate any pointers.
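For question 2, here is the kind of tweak I mean: a minimal sketch using the Python engine API to time the prefill while overriding the prefill chunk size. The model path and the `prefill_chunk_size` field on `EngineConfig` are assumptions on my side, so please correct me if this option lives elsewhere or is not supported on Thor:

```python
# Minimal sketch (not verified on Jetson Thor): measure time-to-first-token
# with an overridden prefill chunk size. Path and field names are assumptions.
import time

from mlc_llm import MLCEngine
from mlc_llm.serve.config import EngineConfig

engine = MLCEngine(
    model="./dist/Qwen3-30B-A3B-q4f16_1-MLC",            # assumed local artifact path
    engine_config=EngineConfig(prefill_chunk_size=8192),  # assumed tunable field
)

long_prompt = "..."  # substitute the ~10k-token prompt used in the tests above

start = time.time()
first_token_time = None
for _ in engine.chat.completions.create(
    messages=[{"role": "user", "content": long_prompt}],
    stream=True,
):
    if first_token_time is None:
        first_token_time = time.time()

print(f"time to first token: {first_token_time - start:.1f} s")
engine.terminate()
```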
Thanks a lot!
- The model code: Qwen3-30B-A3B (MoE)
- The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1, q4bf16_1 (llama.cpp uses Q4_K_M)
- Device (e.g. MacBook Pro M2, PC+RTX 3080): Jetson Thor and Orin (64 GB)
- OS (if applicable):
- Encode speed (Token/s): for a 7,000-token context on Orin, 45-52 s to first token (llama.cpp Q4_K_M: 23-26 s to first token)
- Decode speed (Token/s): 2-3× faster than llama.cpp
- Memory usage (if applicable):