Commit a363c8f

Update vLLM learning path details and resources
1 parent f898cf3 commit a363c8f

File tree

1 file changed (+13 −17 lines)
  • content/learning-paths/servers-and-cloud-computing/vllm-acceleration


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md

Lines changed: 13 additions & 17 deletions
@@ -1,28 +1,24 @@
 ---
-title: Optimized LLM Inference with vLLM on Arm-Based Servers
-
-draft: true
-cascade:
-    draft: true
+title: Accelerate vLLM inference on Azure Cobalt 100 virtual machines
 
 minutes_to_complete: 60
 
-who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, serve them efficiently through an OpenAI-compatible API, and benchmark model accuracy using the LM Evaluation Harness.
+who_is_this_for: This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness.
 
 learning_objectives:
-    - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library(ACL).
-    - Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries.
-    - Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision.
-    - Run and serve both quantized and BF16 (non-quantized) variants using vLLM.
-    - Use OpenAI‑compatible endpoints and understand sequence and batch limits.
-    - Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM.
+    - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL)
+    - Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries
+    - Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision
+    - Run and serve both quantized and BF16 (non-quantized) variants using vLLM
+    - Use OpenAI‑compatible endpoints and understand sequence and batch limits
+    - Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM
 
 prerequisites:
-    - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space.
-    - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization.
+    - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space
+    - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization
 
 author:
-    - Nikhil Gupta, Pareena Verma
+    - Nikhil Gupta
 
 ### Tags
 skilllevels: Introductory
@@ -47,7 +43,7 @@ further_reading:
     - resource:
         title: vLLM GitHub Repository
        link: https://github.com/vllm-project/vllm
-        type: github
+        type: website
     - resource:
         title: Hugging Face Model Hub
        link: https://huggingface.co/models
@@ -59,7 +55,7 @@ further_reading:
     - resource:
         title: LM Evaluation Harness (GitHub)
        link: https://github.com/EleutherAI/lm-evaluation-harness
-        type: github
+        type: website
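The learning objectives in this diff center on quantizing an LLM to 4-bit integer (INT4) precision. As a toy, self-contained illustration of what that means for a weight tensor, here is a minimal sketch assuming a symmetric per-tensor scheme; the `quantize_int4`/`dequantize` helpers are hypothetical and much simpler than llmcompressor's actual recipes (which typically use group-wise scales):

```python
# Toy symmetric per-tensor INT4 quantization: map float weights onto the
# 16 signed levels [-8, 7] with a single scale, then dequantize.
# Illustrative only -- not llmcompressor's API or algorithm.

def quantize_int4(weights):
    """Quantize a list of floats to INT4 values; return (ints, scale)."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive INT4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT4 values and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.70, -0.07, 0.31]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

print(q)  # → [1, -5, 7, -1, 3], each value fits in 4 signed bits
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case error
```

The round-trip error is bounded by half the scale step, which is why INT4 models can stay close to BF16 accuracy; the learning path's LM Evaluation Harness step measures exactly how much accuracy the real quantization gives up.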
0 commit comments
