
Commit 7bd0ab8

docs: Update vLLM installation documentation and improve CPU configuration
1 parent 721accc commit 7bd0ab8


2 files changed (+83 -26 lines)


benchmark/llm_bench/README.md

Lines changed: 78 additions & 24 deletions
@@ -6,7 +6,7 @@ LLM benchmark module for heimdall. This module allows you to measure and compare
 
 - **PyTorch**: CPU inference using Meta's official Llama3 implementation
 - **Llama.cpp**: Efficient CPU inference with quantized models
-- **vLLM**: High-performance inference serving for both CPU and GPU
+- **vLLM**: High-performance inference serving for both CPU and GPU (v0.9.1)
 
 ## Prerequisites
 
@@ -20,13 +20,19 @@ LLM benchmark module for heimdall. This module allows you to measure and compare
 # Ubuntu/Debian
 sudo apt-get update
 sudo apt-get install -y numactl git-lfs
+
+# For vLLM CPU (additional requirements)
+sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
+sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 ```
 
+
+
 ## Quick Start
 
 ### 1. Installation
 
-Install each framework separately:
+Install each framework separately (automatically manages virtual environments):
 
 ```bash
 # Install PyTorch (includes Llama3-8B model)
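Since the vLLM CPU backend added above is compiled from source, a quick compiler sanity check before building can save a long failed build. A minimal sketch (editor's illustration, not part of this commit), assuming gcc 12 is the minimum the build accepts:

```shell
# Check that the compiler on PATH is new enough for the vLLM CPU
# source build (the steps above install and select gcc-12).
# check_gcc_major "12.3.0" -> "ok"; "11.4.0" -> "too old"
check_gcc_major() {
    major=${1%%.*}    # "12.3.0" -> "12"
    if [ "$major" -ge 12 ]; then
        echo "ok"
    else
        echo "too old"
    fi
}

if command -v gcc >/dev/null 2>&1; then
    echo "gcc $(gcc -dumpversion): $(check_gcc_major "$(gcc -dumpversion)")"
else
    echo "gcc not found; install it via the apt-get step above"
fi
```

If the check prints "too old", rerun the `update-alternatives` command from the prerequisites so `gcc` points at gcc-12.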
@@ -35,7 +41,7 @@ uv run heimdall bench install llm pytorch
 # Install Llama.cpp (includes quantized model)
 uv run heimdall bench install llm llamacpp
 
-# Install vLLM CPU
+# Install vLLM CPU (builds from source with v0.9.1)
 uv run heimdall bench install llm vllm_cpu
 
 # Install vLLM GPU
@@ -64,6 +70,24 @@ uv run heimdall bench run llm all
 uv run heimdall bench plot llm all
 ```
 
+## Virtual Environment Management
+
+The benchmark system automatically manages Python virtual environments:
+
+- **Automatic Creation**: Creates `.venv` in the project root if it doesn't exist
+- **Automatic Activation**: All commands run within the virtual environment
+- **Python 3.12**: Uses Python 3.12 with `uv` for optimal performance
+- **Isolation**: Each framework installation is isolated and doesn't conflict
+
+Manual virtual environment usage:
+```bash
+# Activate virtual environment manually
+source .venv/bin/activate
+
+# Check virtual environment status
+which python
+```
+
 ## Detailed Usage
 
 ### PyTorch
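The `which python` status check the commit documents can be made explicit. A sketch (editor's illustration, not part of this commit) that reports whether the interpreter on PATH actually lives inside the project's `.venv`:

```shell
# Confirm the interpreter on PATH lives inside the project's .venv.
# in_venv PYTHON_PATH PROJECT_ROOT -> "yes" or "no"
in_venv() {
    case "$1" in
        "$2"/.venv/*) echo "yes" ;;
        *)            echo "no"  ;;
    esac
}

# Example: check the current shell's python3 against the current directory.
echo "inside .venv: $(in_venv "$(command -v python3)" "$PWD")"
```

A "no" answer usually means the activation step was skipped and packages would install into the system interpreter instead.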
@@ -98,7 +122,7 @@ Features:
 
 ### vLLM
 
-#### CPU Mode
+#### CPU Mode (v0.9.1)
 ```bash
 # CPU-based inference benchmark
 bash benchmark/llm_bench/scripts/vllm_run_test.sh
@@ -107,6 +131,8 @@ bash benchmark/llm_bench/scripts/vllm_run_test.sh
 Environment variables:
 - `VLLM_CPU_KVCACHE_SPACE=30`: KV cache space configuration
 - `LD_PRELOAD`: Memory optimization using tcmalloc
+- `VLLM_TARGET_DEVICE=cpu`: Force CPU-only mode
+- `CUDA_VISIBLE_DEVICES=""`: Disable CUDA
 
 #### GPU Mode
 ```bash
@@ -118,25 +144,19 @@ Features:
 - Support for large models (70B)
 - Memory efficiency through CPU offloading
 
-## vLLM Independent Installation
+## vLLM Installation Details
 
-vLLM installation through heimdall may fail depending on individual environments. In such cases, you can install it independently:
+### CPU Mode (Source Build - v0.9.1)
 
-### CPU Mode
-```bash
-pip install vllm[cpu]
-# Or build from source
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -e .
-```
+vLLM CPU installation automatically builds from source with the following process:
+
+### Independent Installation (Alternative)
+
+If heimdall installation fails, you can install independently:
+
+> 🔗 **Reference**: For the latest installation methods, check the [Official vLLM CPU Installation Documentation](https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html).
 
-### GPU Mode
-```bash
-pip install vllm
-```
 
-After independent installation, you can run benchmarks by executing the scripts directly.
 
 ## NUMA Configuration
 
@@ -198,7 +218,7 @@ Used datasets:
 - **Format**: 4-bit quantized GGUF
 
 ### vLLM
-- **CPU**: Meta-Llama-3-8B
+- **CPU**: Meta-Llama-3-8B (v0.9.1)
 - **GPU**: Meta-Llama-3-70B
 - **Source**: Automatic download from Hugging Face Hub
 
@@ -212,7 +232,7 @@ Used datasets:
 ```
 
 2. **Memory Shortage**
-- Ensure sufficient RAM for model size
+- Ensure sufficient RAM for model size (32GB+ for 8B models)
 - Sufficient VRAM required for vLLM GPU mode
 
 3. **Check NUMA Configuration**
@@ -223,12 +243,35 @@ Used datasets:
 4. **Permission Issues**
 - Permission settings required for perf commands
 
-### vLLM Special Configuration
+5. **uv Command Not Found**
+```bash
+# uv should be available as part of heimdall setup
+# If missing, check heimdall installation
+export PATH="$HOME/.local/bin:$PATH"
+```
+
+### vLLM Specific Issues
+
+#### CPU Mode Build Errors
+```bash
+# Compiler issues
+sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
+sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
+
+# CMake version issues
+pip install "cmake>=3.26.1"
+
+# Memory issues during build
+export MAX_JOBS=4  # Limit parallel build jobs
+```
 
-For optimal performance in CPU mode, set the following environment:
+#### Environment Configuration
+For optimal performance in CPU mode:
 ```bash
 export VLLM_CPU_KVCACHE_SPACE=30
 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
+export VLLM_TARGET_DEVICE=cpu
+export CUDA_VISIBLE_DEVICES=""
 ```
 
 ## Additional Scripts
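The `MAX_JOBS=4` workaround the commit adds for out-of-memory builds can be sized rather than hard-coded. A sketch (editor's illustration; the ~4 GB-per-compile-job heuristic is an assumption, not from the commit):

```shell
# Choose MAX_JOBS for the vLLM CPU source build.
# Heuristic (an assumption): roughly 4 GB of free RAM per compile
# job, capped at the core count, and at least 1 job.
# max_jobs CORES FREE_GB -> job count
max_jobs() {
    jobs=$(( $2 / 4 ))
    if [ "$jobs" -gt "$1" ]; then jobs=$1; fi
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs"
}

# Example: 16 cores with 24 GB free picks 6 jobs.
MAX_JOBS=$(max_jobs "$(nproc 2>/dev/null || echo 4)" 32)
export MAX_JOBS
```

Exporting the result before `uv run heimdall bench install llm vllm_cpu` keeps the build within the machine's memory budget.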
@@ -243,4 +286,15 @@ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
 # Direct PyTorch test execution
 cd benchmark/llm_bench/llama
 python pytorch_run_test.py --cpu_bind 0 --mem_bind 0 --description "Local DIMM"
-```
+
+# Manual virtual environment usage
+source .venv/bin/activate
+python benchmark/llm_bench/src/vllm_run_test.py
+```
+
+## Version Information
+
+- **vLLM**: v0.9.1 (CPU source build)
+- **PyTorch**: 2.7.0+cpu (for vLLM CPU)
+- **Python**: 3.12 (recommended)
+- **uv**: Latest (for package management)

benchmark/llm_bench/scripts/vllm_run_test.sh

Lines changed: 5 additions & 2 deletions
@@ -7,6 +7,8 @@
 # Set environment variables for CPU KV cache space and LD_PRELOAD
 export VLLM_CPU_KVCACHE_SPACE=30
 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD
+export CUDA_VISIBLE_DEVICES=""
+export VLLM_DEVICE=cpu
 
 HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)
 export HUGGING_FACE_HUB_TOKEN
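Note that this hunk exports `VLLM_DEVICE=cpu` while the README side of the commit documents `VLLM_TARGET_DEVICE=cpu`. A defensive sketch for reproducing the script's environment in an interactive shell (editor's illustration; setting both variables is an assumption, not something the commit does):

```shell
# Mirror the script's CPU-only environment in an interactive shell.
# The script exports VLLM_DEVICE; the README documents
# VLLM_TARGET_DEVICE, so both are set here defensively.
export VLLM_CPU_KVCACHE_SPACE=30      # GB reserved for the KV cache
export CUDA_VISIBLE_DEVICES=""        # hide all CUDA devices
export VLLM_DEVICE=cpu
export VLLM_TARGET_DEVICE=cpu

# Preload tcmalloc only when it is actually present on this distro.
TCMALLOC=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
if [ -f "$TCMALLOC" ]; then
    export LD_PRELOAD="$TCMALLOC${LD_PRELOAD:+:$LD_PRELOAD}"
fi
```

Guarding the `LD_PRELOAD` line avoids loader warnings on systems where the tcmalloc library path differs.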
@@ -26,7 +28,7 @@ descriptions=(
   "CPU 1, Node 2 (Remote CXL)"
 )
 
-VLLM_PATH="benchmark/llm_bench/vllm"
+VLLM_PATH="benchmark/llm_bench/vllm_cpu"
 MODEL="meta-llama/Meta-Llama-3-8B"
 DATASET="benchmark/llm_bench/datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
 LOG_DIR="benchmark/llm_bench/logs/vllm"
@@ -68,7 +70,8 @@ vllm() {
   "${numa_cmd[@]}" python "$VLLM_PATH/benchmarks/benchmark_throughput.py" \
     --model "$MODEL" \
     --dataset "$DATASET" \
-    --output-json "$output_json"
+    --output-json "$output_json" \
+    --device cpu
 
   sleep 2
 }
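The `--output-json` files this function writes can be summarized after a run. A sketch (editor's illustration, not part of this commit) that simply dumps whatever metrics the result file contains, without assuming particular field names:

```shell
# Summarize one benchmark_throughput.py --output-json result file.
# Usage: summarize_result result.json
summarize_result() {
    python3 - "$1" <<'EOF'
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)
# Print every top-level metric the file happens to contain.
for key, value in sorted(data.items()):
    print(f"{key}: {value}")
EOF
}
```

Looping `summarize_result` over the files in `LOG_DIR` gives a quick side-by-side view of the NUMA configurations the script benchmarks.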
