
[Speed] Prefill speed on MLC significantly slower than llama.cpp on Jetson Thor – any optimization suggestions? #3328

@Songyanfei


๐ŸŽ๏ธ Speed Report

Hi everyone,

I've been testing the community versions of MLC and llama.cpp on Jetson Thor, and noticed a significant performance gap during prefill. I'd like to check if this is expected behavior or if there are optimization options I might have missed.

Model setup: Qwen3-30B-A3B (tested with Q4_K_M and q4bf16_1/q4f16_1 quantized variants)

Hardware: Jetson Thor

Performance comparison:

llama.cpp: Prefill for ~10k tokens takes about 30–40 seconds

MLC: Same setup requires 1.5–2× longer for prefill

On the other hand, MLC seems much faster during decode (roughly 3× faster than llama.cpp)

Stability:

llama.cpp server sometimes reports illegal memory access errors

MLC is more stable, but the prefill speed gap is quite large

Additional note: As far as I can tell, MLC does not yet support FP8 activations.

My questions are:

Is this prefill slowdown mainly due to MLC's framework design, or lack of optimization/adaptation for Jetson Thor?

Are there recommended build flags, runtime parameters, or configuration tweaks to improve prefill performance?

If there are known issues or a roadmap for improvements, I'd really appreciate any pointers.

Thanks a lot!

  • The model code: Qwen3-30B-A3B (MoE)

  • The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1, q4bf16_1 (llama.cpp using Q4_K_M)

  • Device (e.g. MacBook Pro M2, PC+RTX 3080): Jetson Thor and Jetson Orin (64 GB)

  • OS (if applicable):

  • Encode speed (Token/s): for a 7,000-token context on Orin, 45–52 s to first token with MLC (llama.cpp with Q4_K_M: 23–26 s to first token)

  • Decode speed (Token/s): 2–3 times faster than llama.cpp

  • Memory usage (if applicable):
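Since the template asks for tokens/s and the figures above are wall-clock times, here is a rough conversion sketch. It assumes time to first token is dominated by prefill, so the results are only approximate prefill throughput. The MLC range on Thor is derived from the "1.5–2× longer" estimate rather than measured directly, and the helper name `throughput_range` is just illustrative.

```python
def throughput_range(num_tokens, fastest_s, slowest_s):
    """Return (low, high) tokens/s for a range of measured latencies."""
    return (num_tokens / slowest_s, num_tokens / fastest_s)


# (prompt tokens, fastest seconds, slowest seconds), taken from the report above.
reports = {
    "Thor, llama.cpp, ~10k tokens": (10_000, 30, 40),
    # Derived, not measured: 1.5x the fastest and 2x the slowest llama.cpp time.
    "Thor, MLC, ~10k tokens (derived)": (10_000, 30 * 1.5, 40 * 2),
    "Orin, MLC, 7k tokens": (7_000, 45, 52),
    "Orin, llama.cpp, 7k tokens": (7_000, 23, 26),
}

for label, (n, fast, slow) in reports.items():
    low, high = throughput_range(n, fast, slow)
    print(f"{label}: roughly {low:.0f}-{high:.0f} tok/s")
```

By this estimate, prefill on Orin is roughly 135–156 tok/s with MLC versus roughly 269–304 tok/s with llama.cpp, i.e. about a 2× gap.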
