---
date: 2025-12-17
title: Running AI Inference on AMD EPYC Without a GPU in Sight
authors:
  - cloudnull
description: >
  Running AI Inference on AMD EPYC Without a GPU in Sight
categories:
  - OpenStack
  - Zen
  - AMD
  - ZenDNN
  - ZenTorch
  - Virtualization

---

# Running AI Inference on AMD EPYC Without a GPU in Sight

**Spoiler: You don't need a $40,000 GPU to run LLM inference. Sometimes 24 CPU cores and the right software stack will do just fine.**

The AI infrastructure conversation has become almost synonymous with GPU procurement battles, NVIDIA allocation queues, and eye-watering hardware costs. But here's a reality that doesn't get enough attention: for many inference workloads—especially during development, testing, and moderate-scale production—modern CPUs with optimized software can deliver surprisingly capable performance at a fraction of the cost.

<!-- more -->

I recently spent some time exploring AMD's ZenDNN optimization library paired with vLLM on Rackspace OpenStack Flex, and the results challenge the assumption that CPU inference is merely a curiosity. Let me walk through what I found.

## The Setup: AMD EPYC 9454 on OpenStack Flex

For this testing, I spun up a general-purpose VM in Rackspace OpenStack Flex's DFW3 environment using the `gp.5.24.96` flavor:

| Resource    | Specification         |
|-------------|-----------------------|
| vCPUs       | 24                    |
| RAM         | 96 GB                 |
| Root Disk   | 240 GB                |
| Ephemeral   | 128 GB                |
| Processor   | AMD EPYC 9454 (Genoa) |
| Hourly Cost | $0.79                 |

The AMD EPYC 9454 is a 4th-generation EPYC processor built on the Zen 4 microarchitecture, with AVX-512 support—including the BF16 and VNNI extensions that matter for inference workloads. These aren't just marketing checkboxes; they translate directly into the optimized matrix operations that LLMs depend on.
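
Before building anything, it's worth confirming that the flavor actually exposes those instructions to the guest. A quick check from inside the VM:

```bash
# List the AVX-512 feature flags the guest CPU advertises;
# avx512f, avx512_bf16, and avx512_vnni should all appear
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```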

!!! note "Containerization with Docker"

    This post won't cover installing [Docker](https://docs.docker.com/engine/install), but it needs to be installed before getting started.

## Getting vLLM

vLLM is an open-source library designed for efficient large language model inference. It supports CPU and GPU backends and features a pluggable architecture that allows integration with optimization libraries like ZenDNN. To get started, clone the vLLM repository and change into it, since the Docker builds below run from the repository root:

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
```

## Building vLLM with ZenTorch

AMD's ZenDNN library provides optimized deep learning primitives specifically tuned for Zen architecture processors. The ZenTorch plugin integrates these optimizations into PyTorch and, by extension, into vLLM's inference pipeline.

First, build the base Docker image for vLLM with CPU optimizations enabled:

```bash
docker build -f docker/Dockerfile.cpu \
    --build-arg VLLM_CPU_AVX512BF16=1 \
    --build-arg VLLM_CPU_AVX512VNNI=1 \
    --build-arg VLLM_CPU_DISABLE_AVX512=0 \
    --tag vllm-cpu \
    --target vllm-openai \
    .
```

With the base image built, add the layers that enable the ZenDNN optimizations. The Dockerfile below installs the ZenDNN-pytorch-plugin on top of vLLM's CPU-optimized base image; save it as `docker/Dockerfile.cpu-amd` so the build command that follows can find it.

```dockerfile
# Name the stage so --target vllm-openai in the build command resolves
FROM vllm-cpu:latest AS vllm-openai

# Toolchain and runtime libraries needed to compile the ZenDNN plugin
RUN apt-get update -y \
    && apt-get install -y --no-install-recommends make cmake ccache git curl wget ca-certificates \
    gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg \
    libsm6 libxext6 libgl1 jq lsof libjemalloc2 gfortran \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

# Build the ZenDNN-pytorch-plugin (zentorch) from source and install the wheel
RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
    cd ZenDNN-pytorch-plugin && \
    uv pip install -r requirements.txt && \
    CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
    uv pip install dist/*.whl

ENTRYPOINT ["vllm", "serve"]
```

Build the final image from that Dockerfile. The key build flags enable the AVX-512 extensions:

```bash
docker build -f docker/Dockerfile.cpu-amd \
    --build-arg VLLM_CPU_AVX512BF16=1 \
    --build-arg VLLM_CPU_AVX512VNNI=1 \
    --build-arg VLLM_CPU_DISABLE_AVX512=0 \
    --tag vllm-cpu-zentorch \
    --target vllm-openai \
    .
```

Runtime configuration binds vLLM to the available CPU cores and allocates substantial memory for the KV cache. For the test environment, I set the shared memory size to roughly 94 GB to accommodate larger models.

??? example "Computing SHM_SIZE"

    ```bash
    # Total system memory in MiB, minus 1 GiB of headroom
    export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))"
    ```

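The run command below assumes a few environment variables are already exported. The values here are illustrative placeholders, not requirements; set them to match your environment and model choice:

```bash
# Hypothetical example values for the placeholders used in the run command
export CORES="0-22"            # core list for OpenMP thread binding
export MODEL="Qwen/Qwen3-4B"   # any HuggingFace model ID vLLM can serve
export HF_TOKEN="<your-token>" # only required for gated models
```
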
Now run the vLLM container with ZenTorch enabled.

```bash
docker run --net=host \
    --ipc=host \
    --shm-size=${SHM_SIZE}m \
    --privileged=true \
    --detach \
    --volume /var/lib/huggingface:/root/.cache/huggingface \
    --env HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
    --env VLLM_PLUGINS="zentorch" \
    --env VLLM_CPU_KVCACHE_SPACE=50 \
    --env VLLM_CPU_OMP_THREADS_BIND=${CORES} \
    --env VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
    --name vllm-server \
    --rm \
    vllm-cpu-zentorch:latest --dtype=bfloat16 \
    --max-num-seqs=5 \
    --model=${MODEL}
```
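
First startup takes a while: the model downloads from HuggingFace and the engine warms up before the API comes online. Two quick ways to confirm the server is ready:

```bash
# Follow the container logs until vLLM reports the server is listening
docker logs -f vllm-server

# The health endpoint returns HTTP 200 once the engine is up
curl -i http://localhost:8000/health
```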

## Benchmark Results: What Can CPU Inference Actually Do?

I ran vLLM's built-in benchmark suite across several model families with 128-token input/output sequences and 4 concurrent requests. Here's what the numbers look like:

!!! example "Benchmark setup and command"

    ```bash
    # Install the benchmark client in a virtual environment
    apt install python3.12-venv
    python3 -m venv ~/.venvs/vllm
    ~/.venvs/vllm/bin/pip install vllm ijson

    # Run the benchmark against the local server
    HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-"None"} ~/.venvs/vllm/bin/python3 \
        -m vllm.entrypoints.cli.main bench serve --backend vllm \
        --base-url http://localhost:8000 \
        --model ${MODEL} \
        --tokenizer ${MODEL} \
        --random-input-len 128 \
        --random-output-len 128 \
        --num-prompts 20 \
        --max-concurrency 4 \
        --temperature 0.7
    ```

### Qwen3 Family

| Model      | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|------------|------------|-------------------|---------------|--------------------------------|
| Qwen3-0.6B | 0.6B       | 121.17            | 247ms         | 29.74ms                        |
| Qwen3-1.7B | 1.7B       | 69.00             | 542ms         | 52.55ms                        |
| Qwen3-4B   | 4B         | 35.77             | 1,366ms       | 99.59ms                        |
| Qwen3-8B   | 8B         | 20.65             | 2,156ms       | 176.40ms                       |

### Llama 3.2 Family

| Model        | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|--------------|------------|-------------------|---------------|--------------------------------|
| Llama-3.2-1B | 1B         | 93.89             | 385ms         | 38.46ms                        |
| Llama-3.2-3B | 3B         | 43.61             | 934ms         | 83.52ms                        |

### Gemma 3 Family

| Model          | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|----------------|------------|-------------------|---------------|--------------------------------|
| Gemma-3-1b-it  | 1B         | 83.81             | 337ms         | 43.66ms                        |
| Gemma-3-4b-it  | 4B         | 36.38             | 1,050ms       | 102.40ms                       |
| Gemma-3-12b-it | 12B        | 13.93             | 3,873ms       | 260.42ms                       |

## Resource Utilization: What the System Actually Does

Beyond throughput numbers, understanding resource consumption patterns matters for capacity planning. Here's what the system looked like under load during these benchmarks.

!!! info "Dashboard: System metrics showing CPU, memory, network, and load patterns during vLLM inference testing"

    * CPU load patterns (1-minute load spiking to 5-6 during inference)
    * Memory utilization bands (50-70% during active runs)
    * Network traffic spikes during HuggingFace model downloads (16 MB/s peak)
    * Process table data showing VLLM::EngineCore threads (50-2000% CPU, 106-151 threads)

### CPU Behavior

The load average tells the real story. During active inference, the 1-minute load spiked to 5-6 on this 24-vCPU system—significant but not saturated. The CPU usage percentage chart shows bursty patterns: idle between requests, then concentrated utilization during token generation.

The process table captures vLLM's multi-threaded architecture in action. Multiple `VLLM::EngineCore` processes consumed 50-2000% CPU (remember, 100% = one core, so 2000% means 20 cores active). Thread counts ranged from 106 to 151 per engine process, reflecting the parallelized inference pipeline.

### Memory Patterns

Memory utilization climbed to 50-70% during model loading and sustained inference—consuming roughly 48-67GB of the 96GB available. This tracks with model size plus KV cache allocation (configured at 50GB via `VLLM_CPU_KVCACHE_SPACE`).

Container-level metrics show memory consumption scaling with model complexity:

| Model Size Class | Memory Consumption |
|------------------|--------------------|
| Sub-1B models    | ~27-57 GB          |
| 3-4B models      | ~56-60 GB          |
| 8B+ models       | ~69-74 GB          |

The larger memory footprint relative to model parameter count reflects vLLM's continuous batching and KV cache management overhead—memory traded for throughput optimization.
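
If you want to watch this on a live system, `docker stats` shows the same container-level view these numbers came from:

```bash
# One-shot snapshot of container CPU, memory, and I/O counters
docker stats vllm-server --no-stream
```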

### Network and Storage I/O

Network traffic spiked dramatically during model downloads from HuggingFace Hub, reaching 16 MB/s receive rates. Once models were cached locally in `/var/lib/huggingface`, subsequent runs showed minimal network activity.

Disk I/O patterns were write-heavy during model caching (21GB+ written across test runs) with modest read activity. The root disk sat at 17% utilization—model weights and container layers fit comfortably within the 240GB allocation.
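
A couple of quick checks make these numbers easy to reproduce on your own instance:

```bash
# Size of the locally cached model weights
du -sh /var/lib/huggingface

# Root disk utilization
df -h /
```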

### Container Resource Summary

Across all benchmark runs, the vLLM containers exhibited these aggregate characteristics:

| Metric       | Range        | Notes                                   |
|--------------|--------------|-----------------------------------------|
| CPU %        | 44-873%      | Multi-core utilization during inference |
| Memory       | 682MB - 74GB | Scales with model size                  |
| Thread Count | 73-253       | Parallel inference workers              |
| Network Rx   | 46-97 GB     | Model downloads from HuggingFace        |

The key insight: CPU inference is memory-bandwidth bound more than compute-bound. The EPYC 9454's 12-channel DDR5 memory architecture matters as much as its core count for this workload class.
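
A useful corollary: check the core and NUMA topology before pinning threads, since a binding that straddles memory domains can waste that bandwidth. A virtualized flavor may present a single NUMA node to the guest, but it's worth confirming (assumes `numactl` is installed):

```bash
# Socket, core, and NUMA layout as seen by the guest
lscpu | grep -E 'Socket|Core|NUMA'

# Per-node memory sizes (apt install numactl if missing)
numactl --hardware
```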

## Reading the Results

Let's be direct about what these numbers mean for practical use cases.

**Sub-2B models are genuinely usable.** The Qwen3-0.6B and 1.7B models deliver 69-121 tokens per second with sub-second time-to-first-token. That's responsive enough for interactive applications—chatbots, code completion, document summarization. You're not waiting around.

**4B models hit a sweet spot for quality vs. speed.** At 35-43 tokens per second, models like Qwen3-4B and Llama-3.2-3B provide meaningfully better outputs than their smaller siblings while remaining practical for batch processing and near-real-time applications. A 1.3-second TTFT is noticeable but not painful.

**8B+ models work but require patience.** The Qwen3-8B at ~21 tokens/sec and Gemma-3-12b at ~14 tokens/sec are slower but absolutely functional for use cases where quality trumps latency—document analysis, async processing, development and testing workflows.

## The Economics: GPU-Free Doesn't Mean Value-Free

Here's where this gets interesting from an infrastructure planning perspective.

That `gp.5.24.96` flavor runs at $0.79/hour—roughly $575/month for continuous operation. Compare that to GPU instance pricing where you're looking at $2-4/hour for entry-level accelerator access, assuming availability.

For development teams iterating on prompts, testing model behavior, or running moderate inference loads, CPU-based instances provide a dramatically lower barrier to entry. You can spin up the infrastructure in minutes without joining a GPU allocation queue.

This isn't about replacing GPU infrastructure for training or high-throughput production inference. It's about recognizing that not every AI workload requires the same hardware profile—and that forcing GPU dependency on all AI workloads is both expensive and often unnecessary.

## Practical Applications

Where does CPU inference with ZenDNN actually make sense?

**Development and testing environments.** Every AI application needs a place to iterate that doesn't burn through GPU budget. CPU inference lets teams test model behavior, refine prompts, and validate integrations without competing for accelerator resources.

**Batch processing at moderate scale.** Processing thousands of documents overnight? Analyzing logs for anomalies? Generating embeddings for search indexing? These workloads often care more about cost-per-token than tokens-per-second.

**Edge and hybrid deployments.** Not every deployment location has GPU infrastructure. Branch offices, on-premise installations, and resource-constrained environments can still run inference workloads.

**Burst capacity.** When your GPU fleet is fully loaded, CPU instances can absorb overflow traffic rather than dropping requests or queuing indefinitely.

## Running This Yourself

The complete setup on Rackspace OpenStack Flex involves:

1. Launch an AMD EPYC instance (gp.5 flavor family)
2. Install Docker and clone the vLLM repository
3. Build the CPU-optimized image with ZenTorch
4. Configure CPU binding and memory allocation
5. Deploy and test

The vLLM server exposes an OpenAI-compatible API, so existing tooling and integrations work without modification:

```bash
curl http://localhost:8000/v1/models | jq
```

From there, your application code doesn't need to know whether inference is happening on a GPU or CPU—the API contract remains identical.
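
As a minimal sketch, the same chat-completions call your application would make against any OpenAI-compatible endpoint works here unchanged (`${MODEL}` must match the model the server was started with):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL}"'",
        "messages": [{"role": "user", "content": "Why might CPU inference make sense?"}],
        "max_tokens": 128
      }' | jq -r '.choices[0].message.content'
```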

## The Bigger Picture

The AI infrastructure narrative has over-indexed on GPU scarcity and the assumption that meaningful work requires accelerators. That's true for training and high-throughput production inference, but it misses a substantial category of workloads where CPU-based solutions deliver genuine value.

AMD's investment in ZenDNN, combined with vLLM's architecture that supports pluggable backends, creates a practical path for organizations to deploy AI capabilities without GPU dependency. Running this on OpenStack Flex demonstrates that cloud infrastructure doesn't need to be hyperscaler-specific to support modern AI workloads.

The 24-core EPYC VM running inference at 120 tokens per second for a 0.6B model—or 35 tokens per second for a 4B model—isn't a compromise. It's the right tool for a substantial portion of the AI workload landscape.

Sometimes the most expensive hardware isn't the most appropriate hardware. And sometimes, 24 CPU cores are exactly what you need.