
Commit 137278a

Merge pull request #1859 from pareenaverma/content_review
Tech review of vLLM LP
2 parents 0e9b405 + 693c77f commit 137278a

File tree: 4 files changed, +32 −18 lines

content/learning-paths/servers-and-cloud-computing/vLLM-quant/1-overview.md

Lines changed: 6 additions & 7 deletions
@@ -40,8 +40,7 @@ These packages are needed to build libraries like OpenBLAS and manage system-lev

 ```bash
 sudo apt-get update -y
-sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-pip
-sudo apt install python-is-python3
+sudo apt-get install -y gcc-12 g++-12 libnuma-dev make
 ```
 Set the system default compilers to version 12:

@@ -82,12 +81,12 @@ Once the system libraries are in place, install the Python packages required for

 Before proceeding, make sure the following files are downloaded to your home directory:
 ```bash
-
+[PLACEHOLDER]
 ```
 These are required to complete the installation and model quantization steps.
 Now, navigate to your home directory:
 ```bash
-cd /home/ubuntu/
+cd $HOME
 ```

 Install the vLLM wheel. This wheel contains the CPU-optimized version of `vLLM`, built specifically for Arm architecture. Installing it from a local `.whl` file ensures compatibility with the rest of your environment and avoids potential conflicts from nightly or default pip installations.
@@ -101,12 +100,12 @@ pip install llmcompressor
 ```
 Install torchvision (nightly version for CPU):
 ```bash
-pip install --force-reinstall torchvision==0.22.0.dev20250213 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
-```
+pip install --force-reinstall torchvision==0.22.0.dev20250223 --extra-index-url https://download.pytorch.org/whl/nightly/cpu```
+
 Install the custom PyTorch CPU wheel:<br>
 This custom PyTorch wheel is prebuilt for Arm CPU architectures and includes the necessary optimizations for running inference. Installing it locally ensures compatibility with your environment and avoids conflicts with default pip packages.
 ```bash
 pip install torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl --force-reinstall --no-deps
 ```

-You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
+You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
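
Not part of the diff, but after the install steps above a quick sanity check helps confirm that the Arm CPU wheels resolved in the active Python environment. This is a minimal sketch; the exact version strings depend on the wheels you installed:

```python
# Confirm the CPU wheels are importable and report their versions
# (expected roughly: torch 2.7.0.dev*, torchvision 0.22.0.dev*).
import torch
import torchvision
import vllm

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("vllm:", vllm.__version__)
```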

content/learning-paths/servers-and-cloud-computing/vLLM-quant/2-quantize-model.md

Lines changed: 16 additions & 5 deletions
@@ -15,7 +15,7 @@ huggingface-cli login --token $hf_token
 ```
 ## Quantization Script Template

-Create the `vllm_quantize_model.py` script shown below to quantize the model :
+Using a file editor of your choice, create a file named `vllm_quantize_model.py` and copy the content shown below to quantize the model:
 ```bash
 import argparse
 import os
@@ -153,11 +153,10 @@ if __name__ == "__main__":
 Then run the quantization script using `vllm_quantize_model.py`. This generates an INT8 quantized version of the model using channelwise precision, which reduces memory usage while maintaining model accuracy:

 ```bash
-cd /home/ubuntu/
 python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
 ```
-The output model will be saved locally at:
-`/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.
+The quantized model will be saved at:
+`$HOME/Llama-3.1-8B-Instruct-w8a8-channelwise`.

 ## Launch the vLLM server

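The body of `vllm_quantize_model.py` is not included in this diff beyond its first lines. As context for the invocation above, the sketch below shows one way such a CLI could be wired up with llm-compressor's `oneshot` entry point; the recipe, calibration dataset, and sample counts are illustrative assumptions, not the script from the Learning Path:

```python
# Hypothetical skeleton matching the documented invocation:
#   python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
# Recipe and calibration settings below are assumptions for illustration only.
import argparse
import os

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

parser = argparse.ArgumentParser(description="Quantize a Hugging Face model for vLLM CPU serving")
parser.add_argument("model_id", help="Hugging Face model ID, e.g. meta-llama/Llama-3.1-8B-Instruct")
parser.add_argument("--mode", default="int8", help="Quantization mode")
parser.add_argument("--scheme", default="channelwise", help="Weight quantization scheme")
args = parser.parse_args()

# W8A8: INT8 weights and activations, per-channel weight scales, skipping the output head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

output_dir = os.path.join(
    os.path.expanduser("~"),
    f"{args.model_id.split('/')[-1]}-w8a8-{args.scheme}",
)

# One-shot post-training quantization over a small calibration set.
oneshot(
    model=args.model_id,
    dataset="open_platypus",
    recipe=recipe,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
print(f"Quantized model written to {output_dir}")
```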
@@ -171,8 +170,20 @@ ONEDNN_DEFAULT_FPMATH_MODE=BF16 \
 VLLM_TARGET_DEVICE=cpu \
 VLLM_CPU_KVCACHE_SPACE=32 \
 VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \
-vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \
+vllm serve $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise \
 --dtype float32 --swap-space 16
 ```
 This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is dynamically set based on the number of available cores to maximize parallelism on Arm CPUs.

+The output from launching the vLLM server with the quantized model should look like:
+
+```output
+INFO 04-23 21:13:59 launcher.py:31] Route: /rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /v1/rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /v2/rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /invocations, Methods: POST
+INFO: Started server process [77356]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+```
+
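Once the server reports "Application startup complete", a small request to the OpenAI-compatible model list endpoint is an easy way to confirm it is accepting connections. This is an illustrative check that assumes the default port 8000 on localhost:

```python
# Minimal readiness check against the vLLM OpenAI-compatible server.
# Assumes the server launched above is listening on localhost:8000.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    # The served model ID is the path that was passed to `vllm serve`.
    print(model["id"])
```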

content/learning-paths/servers-and-cloud-computing/vLLM-quant/3-run-benchmark.md

Lines changed: 5 additions & 5 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

 ## Run Single Inference

-Once the server is running, start by verifying it with a basic single-prompt request using `curl`. This confirms the server is running correctly and that the OpenAI-compatible /v1/chat/completions API is responding as expected:
+Once the server is running, open another terminal and verify it is running as expected with a basic single-prompt request using `curl`. This confirms the server is running correctly and that the OpenAI-compatible /v1/chat/completions API is responding as expected:

 ```bash
 curl http://localhost:8000/v1/chat/completions \
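
The hunk above cuts off the body of the `curl` request. For reference, an equivalent single-prompt chat completion can be sent from Python with `requests`; the prompt text and token limit here are illustrative, and the model name must match the path given to `vllm serve`:

```python
# Single-prompt request to the OpenAI-compatible /v1/chat/completions endpoint.
import os

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": os.path.expanduser("~/Llama-3.1-8B-Instruct-w8a8-channelwise"),
    "messages": [{"role": "user", "content": "Give me a one-line description of vLLM."}],
    "max_tokens": 64,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```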
@@ -51,8 +51,8 @@ After confirming single-prompt inference, run batch testing to simulate concurre

 Use the following Python script to simulate concurrent user interactions.

-Save it as `batch_test.py`:
-```bash
+Save the content shown below in a file named `batch_test.py`:
+```python
 import requests
 import json
 import os
@@ -151,10 +151,10 @@ if __name__ == "__main__":
 Then, run it using:

 ```bash
-python batch_test.py localhost 8000 --schema http --batch 16 -m /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise
+python3 batch_test.py localhost 8000 --schema http --batch 16 -m $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise
 ```
 This simulates multiple users interacting with the model in parallel and helps validate server-side performance under load.
-You can modify the number of requests using the --batch flag or review/edit batch_test.py to customize prompt content and concurrency logic.
+You can modify the number of requests using the --batch flag or review and edit `batch_test.py` to customize prompt content and concurrency logic.

 When the test completes, server logs will display a summary including average prompt throughput and generation throughput. This helps benchmark how well the model performs under concurrent load on your Arm-based system.
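
The diff shows only the first imports of `batch_test.py`, not its body. As a rough picture of the concurrent load it generates, the sketch below sends `--batch` parallel chat-completion requests from a thread pool; the argument names mirror the invocation above, but the implementation itself is an assumption, not the Learning Path's script:

```python
# Illustrative concurrent load generator mirroring the documented invocation:
#   python3 batch_test.py localhost 8000 --schema http --batch 16 -m $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise
# This is a sketch, not the batch_test.py shipped with the Learning Path.
import argparse
from concurrent.futures import ThreadPoolExecutor

import requests


def send_prompt(url: str, model: str, prompt: str) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    resp = requests.post(url, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("host")
    parser.add_argument("port", type=int)
    parser.add_argument("--schema", default="http")
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("-m", "--model", required=True)
    args = parser.parse_args()

    url = f"{args.schema}://{args.host}:{args.port}/v1/chat/completions"
    prompts = [f"Request {i}: summarize vLLM in one sentence." for i in range(args.batch)]

    # Fire all prompts in parallel to exercise the server under concurrent load.
    with ThreadPoolExecutor(max_workers=args.batch) as pool:
        results = pool.map(lambda p: send_prompt(url, args.model, p), prompts)
        for i, answer in enumerate(results):
            print(f"[{i}] {answer[:80]}")
```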

content/learning-paths/servers-and-cloud-computing/vLLM-quant/_index.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,9 @@
 ---
 title: Quantize and Run a Large Language Model using vLLM on Arm Servers

+draft: true
+cascade:
+    draft: true

 minutes_to_complete: 45

@@ -15,11 +18,12 @@ learning_objectives:


 prerequisites:
-- An Arm-based server or cloud instance running with at least 32 CPU cores, 64 GB RAM and 80 GB of available disk space.
+- An Arm-based server or cloud instance running with at least 32 CPU cores, 64 GB RAM and 32 GB of available disk space.
 - Familiarity with Python and machine learning concepts.
 - An active Hugging Face account with access to the target model.

 author: Rani Chowdary Mandepudi
+Phalani Paladugu

 ### Tags
 skilllevels: Introductory
