
Commit b755e12

Update 1-overview-and-build.md
1 parent e66bd01 commit b755e12

File tree

1 file changed (+1, −1)


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -36,7 +36,7 @@ vLLM’s performance on Arm servers is driven by both software optimization and
 Each component of this optimized build contributes to higher throughput and lower latency during inference:
 
 - Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
-- 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) and SME2 (Scalable Matrix Extension) instructions.
+- 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions.
 - Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks.
 - Optimized paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
 - System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
```
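To make the tuning items in the list above concrete, here is a minimal sketch of exercising an aarch64 CPU build of vLLM through its offline Python API. The model name, sampling settings, and core range are illustrative assumptions, not part of the commit; the two environment variables come from vLLM's CPU-backend documentation.

```python
# Minimal sketch (not part of the commit): running a model on a CPU build of
# vLLM. Model name and sampling settings are illustrative placeholders.
import os

# System tuning as described above: bind worker threads to specific cores and
# reserve space for the paged KV cache. Both variables are read by vLLM's CPU
# backend; the core range 0-31 is an assumption for a 32-core Arm server.
os.environ.setdefault("VLLM_CPU_OMP_THREADS_BIND", "0-31")
os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "8")  # GiB for the KV cache

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any model vLLM supports
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

Note that nothing model-specific is needed to reach the optimized kernels described above; per the overview, the aarch64 build selects the oneDNN/ACL and KleidiAI paths internally.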
