Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU #2816

jrp2014 · 2025-11-21T23:29:05Z


Just a link to an Apple paper that finds that:

“In LLM inference, generating the first token is compute-bound, and takes full advantage of the Neural Accelerators. The M5 pushes the time-to-first-token generation under 10 seconds for a dense 14B architecture, and under 3 seconds for a 30B MoE, delivering strong performance for these architectures on a MacBook Pro.

Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability. On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher). Regarding memory footprint, the MacBook Pro 24GB can easily hold a 8B in BF16 precision or a 30B MoE 4-bit quantized, keeping the inference workload under 18GB for both of these architectures.”
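For anyone who wants to sanity-check the quoted numbers, here is a rough back-of-envelope sketch in Python. It is not from the paper. The function names are made up for illustration, GB is taken as 10^9 bytes, and the decode ceiling assumes every weight is read from memory once per generated token, ignoring the KV cache, activations, and quantization metadata. That last assumption only makes sense for a dense model, since an MoE reads just its active experts per token.

```python
# Back-of-envelope check of the figures in the quote above.
# Bandwidth values come from the quote; everything else is an assumption.

M4_BANDWIDTH_GBPS = 120.0  # GB/s, from the quote
M5_BANDWIDTH_GBPS = 153.0  # GB/s, from the quote


def bandwidth_gain_percent(old_gbps: float, new_gbps: float) -> float:
    """Relative bandwidth increase, in percent."""
    return (new_gbps / old_gbps - 1.0) * 100.0


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight storage only, in GB (10^9 bytes).

    Ignores KV cache, activations, and quantization scales (assumption).
    """
    return params_billions * bits_per_weight / 8.0


def decode_ceiling_tokens_per_s(bandwidth_gbps: float, weights_gb: float) -> float:
    """Crude upper bound on decode speed for a dense model.

    Assumes all weights are streamed from memory once per token (assumption).
    """
    return bandwidth_gbps / weights_gb


gain = bandwidth_gain_percent(M4_BANDWIDTH_GBPS, M5_BANDWIDTH_GBPS)
dense_8b_bf16_gb = weight_footprint_gb(8, 16)  # 8B parameters at 16 bits each
moe_30b_4bit_gb = weight_footprint_gb(30, 4)   # 30B parameters at 4 bits each

print(f"M5 vs M4 bandwidth gain: {gain:.1f}%")
print(f"8B BF16 weights: {dense_8b_bf16_gb:.1f} GB")
print(f"30B 4-bit weights: {moe_30b_4bit_gb:.1f} GB")
print(f"8B BF16 decode ceiling on M4: "
      f"{decode_ceiling_tokens_per_s(M4_BANDWIDTH_GBPS, dense_8b_bf16_gb):.2f} tok/s")
print(f"8B BF16 decode ceiling on M5: "
      f"{decode_ceiling_tokens_per_s(M5_BANDWIDTH_GBPS, dense_8b_bf16_gb):.2f} tok/s")
```

The bandwidth gain works out to 27.5%, which the quote rounds to 28%, and the weights alone come to 16 GB and 15 GB, consistent with the "under 18GB" claim once some overhead is added. Because the ceiling scales linearly with bandwidth, a purely bandwidth-bound decode could gain at most that 27.5%, which lines up with the upper end of the reported 19-27% range.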

link
