content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md (13 additions, 17 deletions)
```diff
@@ -1,28 +1,24 @@
 ---
-title: Optimized LLM Inference with vLLM on Arm-Based Servers
-
-draft: true
-cascade:
-  draft: true
+title: Accelerate vLLM inference on Azure Cobalt 100 virtual machines
 
 minutes_to_complete: 60
 
-who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, serve them efficiently through an OpenAI-compatible API, and benchmark model accuracy using the LM Evaluation Harness.
+who_is_this_for: This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness.
 
 learning_objectives:
-- Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL).
-- Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries.
-- Quantize an LLM (DeepSeek-V2-Lite) to 4-bit integer (INT4) precision.
-- Run and serve both quantized and BF16 (non-quantized) variants using vLLM.
-- Use OpenAI-compatible endpoints and understand sequence and batch limits.
-- Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM.
+- Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL)
+- Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries
+- Quantize an LLM (DeepSeek-V2-Lite) to 4-bit integer (INT4) precision
+- Run and serve both quantized and BF16 (non-quantized) variants using vLLM
+- Use OpenAI-compatible endpoints and understand sequence and batch limits
+- Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM
 
 prerequisites:
-- An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space.
-- Python 3.12 and basic familiarity with Hugging Face Transformers and quantization.
+- An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space
+- Python 3.12 and basic familiarity with Hugging Face Transformers and quantization
```
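The front matter above mentions serving models through vLLM's OpenAI-compatible API. As a minimal sketch of what such a client request looks like, the snippet below builds the JSON body for a chat-completion call to a locally served vLLM instance. The base URL, port, and model name are illustrative assumptions (vLLM's default serving port is 8000), not values taken from this diff:

```python
import json

# Assumed values -- adjust to match your vLLM launch command.
BASE_URL = "http://localhost:8000/v1"       # default vLLM OpenAI-compatible port
MODEL = "deepseek-ai/DeepSeek-V2-Lite"      # example model from this Learning Path

def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for POST {BASE_URL}/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Request length is bounded by the server's configured sequence limits.
        "max_tokens": max_tokens,
    }

body = build_chat_request("What is INT4 quantization?")
print(json.dumps(body, indent=2))
```

Any OpenAI-compatible client (for example, the `openai` Python package pointed at `BASE_URL`) can send this payload unchanged, which is what makes the vLLM endpoint a drop-in replacement.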