Troubleshooting Guide for vLLM Playground

This guide covers common issues and solutions for vLLM Playground.

Quick Fixes

Issue                      Quick Fix
Port already in use        vllm-playground stop or python scripts/kill_playground.py
Container won't start      podman logs vllm-service
Image pull errors          vllm-playground pull --all
Tool calling not working   Restart the server with "Enable Tool Calling" checked

Container Issues

Container Won't Start

# Check if Podman is installed
podman --version

# Check Podman connectivity
podman ps

# View container logs
podman logs vllm-service

# Check container status
podman ps -a | grep vllm-service

"Address Already in Use" Error

If you lose connection to the Web UI and get ERROR: address already in use:

# Quick Fix: Auto-detect and kill old process
python run.py

# Alternative: Manual restart
./scripts/restart_playground.sh

# Or kill manually
python scripts/kill_playground.py
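
If neither script is available, you can clear the port by hand. A minimal sketch, assuming the stuck process is on port 8000; substitute whichever port your Web UI is bound to:

# Find the PID holding the port (8000 is an assumption)
lsof -ti :8000

# Kill it
lsof -ti :8000 | xargs kill -9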

vLLM Container Issues

# Check if container is running
podman ps -a | grep vllm-service

# View vLLM logs
podman logs -f vllm-service

# Stop and remove container
podman stop vllm-service && podman rm vllm-service

# Pull latest vLLM images
podman pull quay.io/rh_ee_micyang/vllm-mac:v0.11.0     # macOS ARM64
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0    # Linux x86_64
podman pull vllm/vllm-openai:v0.12.0                  # GPU (official, v0.12.0+ for Claude Code)

MCP Server Issues

"Command not found" (npx, uvx)

MCP servers using STDIO transport require specific runtimes:

Command   Required For               Installation
npx       Filesystem server          Install Node.js: brew install node (macOS) or https://nodejs.org/
uvx       Git, Fetch, Time servers   Install uv: brew install uv (macOS) or https://docs.astral.sh/uv/

Verify installation:

# Check if npx is available
npx --version

# Check if uvx is available
uvx --version
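
You can also check both runtimes in one pass with plain POSIX shell; nothing here is specific to this project:

# Report any missing MCP runtime
for cmd in npx uvx; do
  command -v "$cmd" >/dev/null 2>&1 || echo "$cmd is not installed"
done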

Tool Calling Issues

Tool Calling Not Working

Tool calling requires server-side configuration. If tools aren't being called:

  1. Verify server was started with tool calling enabled:

    • Check "Enable Tool Calling" in Server Configuration BEFORE starting
    • Look for this in startup logs: Tool calling enabled with parser: llama3_json
  2. Verify CLI args are passed (a manual-launch sketch follows this list):

    vLLM arguments: --model ... --enable-auto-tool-choice --tool-call-parser llama3_json
    
  3. If using container mode, ensure the container was started fresh after enabling tool calling (stop and restart if needed)
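
To rule out the Web UI entirely, you can launch vLLM by hand with the same flags. A minimal sketch: the model name is illustrative, and the parser must match your model family:

# Manual launch with tool calling enabled (model is an example)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json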

Tool Call Parse Errors

If you see Error in extracting tool call from response:

  1. Increase Max Tokens to 1024+ in Chat Settings
  2. Use a larger model (Qwen 2.5 7B+, Llama 3.1 8B+)
  3. Reduce the tool count: fewer tools means more reliable parsing (a direct-request sketch follows this list)
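
To take the chat UI out of the loop, you can also send a single tool-calling request straight to the OpenAI-compatible endpoint with max_tokens already raised. A sketch; the port, model name, and tool definition are all assumptions:

# Direct request with max_tokens at 1024 (port, model, and tool are illustrative)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_time",
        "description": "Get the current time in a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'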

OpenShift/Kubernetes Issues

GPU Mode Not Available

The Web UI automatically detects GPU availability by querying Kubernetes nodes for nvidia.com/gpu resources.

Check GPU availability in your cluster:

# List nodes with GPU capacity
oc get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu

# Or check all node details
oc describe nodes | grep nvidia.com/gpu

If GPUs exist but not detected:

  1. Verify RBAC permissions:
oc auth can-i list nodes --as=system:serviceaccount:vllm-playground:vllm-playground-sa
# Should return "yes"
  2. Reapply RBAC if needed:
oc apply -f openshift/manifests/02-rbac.yaml

Pod Not Starting

# Check pod status
oc get pods -n vllm-playground

# View pod logs
oc logs -f deployment/vllm-playground-gpu -n vllm-playground

# Describe pod for events
oc describe pod <pod-name> -n vllm-playground

Out of Memory (OOM) Issues

⚠️ Resource Requirements for GuideLLM Benchmarks

The Web UI pod requires sufficient memory for GuideLLM benchmarks.

Recommended Memory Limits:

  • GPU Mode: 16Gi minimum, 32Gi+ for intensive benchmarks
  • CPU Mode: 64Gi minimum, 128Gi+ for intensive benchmarks

To increase resources:

Edit openshift/manifests/04-webui-deployment.yaml:

resources:
  limits:
    memory: "32Gi"
    cpu: "8"

Then reapply:

oc apply -f openshift/manifests/04-webui-deployment.yaml
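
Alternatively, you can raise the limits on the live deployment without editing the manifest. A sketch using oc set resources, with the deployment and namespace names taken from the commands above; note that the next oc apply of the manifest will overwrite this change:

# Patch resource limits in place on the running deployment
oc set resources deployment/vllm-playground-gpu \
  --limits=memory=32Gi,cpu=8 \
  -n vllm-playground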

Image Pull Errors

All images are publicly accessible; no authentication is needed:

  • GPU: vllm/vllm-openai:v0.12.0
  • CPU: quay.io/rh_ee_micyang/vllm-cpu:v0.11.0

# Verify image accessibility
podman pull vllm/vllm-openai:v0.12.0
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0

Common vLLM Issues

1. Engine Core Initialization Failed with "Torch not compiled with CUDA enabled"

Error:

AssertionError: Torch not compiled with CUDA enabled
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

Root Cause: vLLM is trying to use CUDA/GPU mode on macOS where CUDA is not available. This happens when the --device cpu flag is not explicitly set.

Solution: The WebUI now handles this automatically by detecting macOS and adding the --device cpu flag. If you're running vLLM manually or still see this error:

  1. Ensure you're using CPU mode - The WebUI auto-detects macOS
  2. Verify the command includes --device cpu
  3. Check your vLLM version - Make sure you have vLLM with CPU backend support
  4. Set the required environment variables (combined into a full launch sketch after this list):
    export VLLM_CPU_KVCACHE_SPACE=4
    export VLLM_CPU_OMP_THREADS_BIND=auto
    export VLLM_CPU_MOE_PREPACK=0
    export VLLM_CPU_SGL_KERNEL=0
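
Putting those pieces together, a minimal manual CPU launch on macOS might look like this; facebook/opt-125m is just a small test model, and the flags and variables are the ones listed above:

# Manual CPU-mode launch on macOS (model is illustrative)
export VLLM_CPU_KVCACHE_SPACE=4
export VLLM_CPU_OMP_THREADS_BIND=auto
export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0
vllm serve facebook/opt-125m --device cpu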

2. Engine Core Initialization Failed (Memory Issues)

Error:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

(Without the CUDA error)

Solution:

  • Use a smaller model (e.g., facebook/opt-125m or facebook/opt-350m for testing)
  • Reduce max_model_len to 512, 1024, or 2048
  • Reduce KV cache space in settings (try 2-4 GB instead of 40 GB)

3. Model Not Compatible with CPU Backend

Some models may not work well with vLLM's CPU backend.

Solution:

  • Try a different model from the OPT family first: facebook/opt-125m, facebook/opt-350m
  • Check if the model supports CPU inference

4. CPU Optimization Issues on Apple Silicon

Some CPU optimizations may cause issues on M1/M2/M3 Macs.

Solution: The WebUI now automatically disables problematic optimizations. If running manually, add these to your environment or config/vllm_cpu.env:

export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0

5. max_num_batched_tokens Error

Error:

Value error, max_num_batched_tokens (2048) is smaller than max_model_len (131072)

Solution: This is now fixed automatically, but if you still see it:

  • Explicitly set max_model_len to a reasonable value (2048, 4096, or 8192); a manual sketch follows below
  • The WebUI will automatically set max_num_batched_tokens to match
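
If you're launching manually, pinning the context length looks like this; the model is illustrative and --max-model-len is vLLM's standard flag:

# Cap the context length so max_num_batched_tokens can cover it
vllm serve facebook/opt-125m --device cpu --max-model-len 4096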

6. Memory Issues (OOM)

Symptoms:

  • Process crashes
  • System becomes unresponsive
  • "Cannot allocate memory" errors

Solution:

  1. Reduce model size: Use smaller models
  2. Reduce max_model_len: Try 512, 1024, or 2048
  3. Reduce KV cache: Set to 2-4 GB
  4. Close other applications: Free up RAM

Conservative Settings for CPU:

{
  "model": "facebook/opt-125m",
  "max_model_len": 1024,
  "cpu_kvcache_space": 2,
  "dtype": "bfloat16"
}

7. Model Download Issues

Symptoms:

  • Timeout errors
  • Connection refused
  • Model not found

Solution:

  1. Check your internet connection
  2. For gated models (Llama 2, etc.), ensure you have:
    • Hugging Face account
    • Accepted model terms
    • Set the HF_TOKEN environment variable (sketched after this list)
  3. Pre-download models:
    python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('facebook/opt-125m')"

8. Server Won't Start

Solution:

  1. Check if port is already in use:
    lsof -i :8000
  2. Try a different port in settings
  3. Check logs in WebUI for specific errors

9. Slow Performance on CPU

Expected Behavior: CPU inference is inherently slower than GPU. Typical speeds:

  • Small models (125M-350M): 10-50 tokens/second
  • Medium models (1B-3B): 1-10 tokens/second
  • Large models (7B+): 0.1-2 tokens/second

Optimization:

  1. Increase the KV cache (if you have RAM): 10-40 GB (see the sketch after this list)
  2. Reduce max_tokens in generation
  3. Use smaller models
  4. Ensure no other heavy processes are running
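
The first item maps to a single environment variable. A sketch, assuming you have roughly 16 GB of RAM free for the cache:

# Give the CPU KV cache more headroom (value in GB; size it to your free RAM)
export VLLM_CPU_KVCACHE_SPACE=16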

Recommended Starting Configuration

For Testing (Minimal Resources)

{
  "model": "facebook/opt-125m",
  "max_model_len": 1024,
  "cpu_kvcache_space": 2,
  "dtype": "bfloat16"
}

For Development (Moderate Resources)

{
  "model": "facebook/opt-350m",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16"
}

For Production (High Resources)

{
  "model": "facebook/opt-1.3b",
  "max_model_len": 4096,
  "cpu_kvcache_space": 10,
  "dtype": "bfloat16"
}

Getting More Debug Information

To see detailed error messages:

  1. In WebUI: Check the Server Logs panel

  2. From command line:

    python app.py

    Then check the terminal output.

  3. Enable verbose logging:

    export VLLM_LOGGING_LEVEL=DEBUG
    python app.py

macOS-Specific Issues

Apple Silicon (M1/M2/M3) Compatibility

  • vLLM CPU backend works but may be slower
  • Some optimizations may need to be disabled
  • Intel-based Macs may have different behavior

Environment Setup

Make sure you have:

# Check Python version (3.8+)
python --version

# Check if vLLM is installed
python -c "import vllm; print(vllm.__version__)"

# Check available memory
sysctl hw.memsize

Still Having Issues?

  1. Check the full error log: the root cause is usually shown above the final error
  2. Try the smallest model first: facebook/opt-125m with minimal settings
  3. Monitor system resources: use Activity Monitor to check RAM usage
  4. Check vLLM compatibility: some features may not work on the CPU backend

Reporting Issues

When reporting issues, please include:

  1. Full error log from WebUI
  2. Your system specs (macOS version, RAM, CPU)
  3. Model and configuration you're trying to use
  4. Steps to reproduce the error