---
date: 2025-12-17
title: Running AI Inference on AMD EPYC Without a GPU in Sight
authors:
  - cloudnull
description: >
  Running AI Inference on AMD EPYC Without a GPU in Sight
categories:
  - OpenStack
  - Zen
  - AMD
  - ZenDNN
  - ZenTorch
  - Virtualization
---

# Running AI Inference on AMD EPYC Without a GPU in Sight

**Spoiler: You don't need a $40,000 GPU to run LLM inference. Sometimes 24 CPU cores and the right software stack will do just fine.**

The AI infrastructure conversation has become almost synonymous with GPU procurement battles, NVIDIA allocation queues, and eye-watering hardware costs. But here's a reality that doesn't get enough attention: for many inference workloads, especially during development, testing, and moderate-scale production, modern CPUs with optimized software can deliver surprisingly capable performance at a fraction of the cost.

<!-- more -->

I recently spent some time exploring AMD's ZenDNN optimization library paired with vLLM on Rackspace OpenStack Flex, and the results challenge the assumption that CPU inference is merely a curiosity. Let me walk through what I found.

## The Setup: AMD EPYC 9454 on OpenStack Flex

For this testing, I spun up a general-purpose VM in Rackspace OpenStack Flex's DFW3 environment using the `gp.5.24.96` flavor.

| Resource | Specification |
|----------|---------------|
| vCPUs | 24 |
| RAM | 96 GB |
| Root Disk | 240 GB |
| Ephemeral | 128 GB |
| Processor | AMD EPYC 9454 (Genoa) |
| Hourly Cost | $0.79 |

The AMD EPYC 9454 is a 4th-generation EPYC "Genoa" processor built on Zen 4, with AVX-512 support including the BF16 and VNNI extensions that matter for inference workloads. These aren't just marketing checkboxes; they translate directly into the optimized matrix operations that LLMs depend on.

!!! note "Containerization with Docker"

    This post won't cover installing [Docker](https://docs.docker.com/engine/install), but it must be installed before getting started.

## Getting vLLM

vLLM is an open-source library designed for efficient large language model inference. It supports CPU and GPU backends and features a pluggable architecture that allows integration with optimization libraries like ZenDNN. To get started, clone the vLLM repository.

```bash
git clone https://github.com/vllm-project/vllm
```

## Building vLLM with ZenTorch

AMD's ZenDNN library provides optimized deep learning primitives specifically tuned for Zen architecture processors. The ZenTorch plugin integrates these optimizations into PyTorch, and by extension, into vLLM's inference pipeline.

Build the initial Docker image for vLLM with CPU optimizations and the AVX-512 extensions enabled.

```shell
docker build -f docker/Dockerfile.cpu \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu \
  --target vllm-openai \
  .
```

With the base container built, add the layers needed to leverage the ZenDNN optimizations. The build process creates a custom Docker image that layers the ZenDNN-pytorch-plugin on top of vLLM's CPU-optimized base image.

!!! example "Dockerfile for vLLM with ZenTorch at `docker/Dockerfile.cpu-amd`"

    ```dockerfile
    FROM vllm-cpu:latest
    RUN apt-get update -y \
        && apt-get install -y --no-install-recommends make cmake ccache git curl wget ca-certificates \
           gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg \
           libsm6 libxext6 libgl1 jq lsof libjemalloc2 gfortran \
        && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

    RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
        cd ZenDNN-pytorch-plugin && \
        uv pip install -r requirements.txt && \
        CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
        uv pip install dist/*.whl

    ENTRYPOINT ["vllm", "serve"]
    ```

Now build the final Docker image with ZenTorch enabled.

```bash
docker build -f docker/Dockerfile.cpu-amd \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu-zentorch \
  .
```

Runtime configuration binds vLLM to the available CPU cores and allocates substantial memory for the KV cache to maximize throughput. If you plan to use smaller instances, adjust these values accordingly.

For the test environment I set the shared memory size to 95G to accommodate larger models.

??? example "computing SHM_SIZE"

    ```bash
    export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))"
    ```

For the test environment I set the CPU core binding to use all but one core for vLLM processing.

??? example "computing CORES"

    ```bash
    export CORES="$(($(nproc) - 1))"
    ```

Now run the vLLM container with ZenTorch enabled.

!!! note "The HF_TOKEN variable should be set to a valid HuggingFace token with model access."

    If you intend to use a model with access restrictions, ensure your HuggingFace token is set in the `HF_TOKEN` environment variable. Models like Llama 3.2 require accepting their license terms as well as authentication using a read-only token.

```bash
docker run --net=host \
  --ipc=host \
  --shm-size=${SHM_SIZE}m \
  --privileged=true \
  --detach \
  --volume /var/lib/huggingface:/root/.cache/huggingface \
  --env HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
  --env VLLM_PLUGINS="zentorch" \
  --env VLLM_CPU_KVCACHE_SPACE=50 \
  --env VLLM_CPU_OMP_THREADS_BIND=${CORES} \
  --env VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
  --name vllm-server \
  --rm \
  vllm-cpu-zentorch:latest --dtype=bfloat16 \
  --max-num-seqs=5 \
  --model=${MODEL}
```
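
The container detaches immediately, but model loading can take minutes for larger weights. A small polling helper (my own sketch, not part of vLLM or Docker) makes scripted deployments less flaky:

```shell
# wait_until <attempts> <command...>: retry a command once per second until it
# succeeds or the attempt budget is exhausted (hypothetical helper name).
wait_until() {
  budget=$1; shift
  n=0
  until "$@" >/dev/null 2>&1; do
    n=$((n + 1))
    [ "$n" -ge "$budget" ] && return 1
    sleep 1
  done
  return 0
}

# Example: block until the vLLM API answers (up to ~10 minutes).
# wait_until 600 curl -sf http://localhost:8000/v1/models
```

Once the endpoint responds, the server is ready for benchmarking.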

## Benchmark Results: What Can CPU Inference Actually Do?

I ran vLLM's built-in benchmark suite across several model families with 128-token input/output sequences and 4 concurrent requests. Here's what the numbers look like.

!!! example "Benchmark setup and command"

    ```bash
    # Install
    apt install python3.12-venv
    python3 -m venv ~/.venvs/vllm
    ~/.venvs/vllm/bin/pip install vllm ijson

    # Run benchmark
    HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-"None"} ~/.venvs/vllm/bin/python3 \
      -m vllm.entrypoints.cli.main bench serve --backend vllm \
      --base-url http://localhost:8000 \
      --model ${MODEL} \
      --tokenizer ${MODEL} \
      --random-input-len 128 \
      --random-output-len 128 \
      --num-prompts 20 \
      --max-concurrency 4 \
      --temperature 0.7
    ```

### Qwen3 Family

| Model | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|-------|------------|-------------------|---------------|--------------------------------|
| Qwen3-0.6B | 0.6B | 121.17 | 247ms | 29.74ms |
| Qwen3-1.7B | 1.7B | 69.00 | 542ms | 52.55ms |
| Qwen3-4B | 4B | 35.77 | 1,366ms | 99.59ms |
| Qwen3-8B | 8B | 20.65 | 2,156ms | 176.40ms |

### Llama 3.2 Family

| Model | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|-------|------------|-------------------|---------------|--------------------------------|
| Llama-3.2-1B | 1B | 93.89 | 385ms | 38.46ms |
| Llama-3.2-3B | 3B | 43.61 | 934ms | 83.52ms |

### Gemma 3 Family

| Model | Parameters | Output Tokens/sec | TTFT (median) | Time per Output Token (median) |
|-------|------------|-------------------|---------------|--------------------------------|
| Gemma-3-1b-it | 1B | 83.81 | 337ms | 43.66ms |
| Gemma-3-4b-it | 4B | 36.38 | 1,050ms | 102.40ms |
| Gemma-3-12b-it | 12B | 13.93 | 3,873ms | 260.42ms |

## Resource Utilization: What the System Actually Does

Beyond throughput numbers, understanding resource consumption patterns matters for capacity planning. Here's what the system looked like under load during these benchmarks.

!!! info "Dashboard: System metrics showing CPU, memory, network, and load patterns during vLLM inference testing"

    ![NewRelic Dashboard](assets/images/2025-12-17/dashboard.png){ align=left style="max-width:512px;width:75%;" }

* CPU load patterns (1-minute load spiking to 5-6 during inference)
* Memory utilization bands (50-70% during active runs)
* Network traffic spikes during HuggingFace model downloads (16 MB/s peak)
* Process table data showing VLLM::EngineCore threads (50-2000% CPU, 106-151 threads)

### CPU Behavior

The load average tells the real story. During active inference, the 1-minute load spiked to 5-6 on this 24-vCPU system: significant but not saturated. The CPU usage percentage chart shows bursty patterns: idle between requests, then concentrated utilization during token generation.

The process table captures vLLM's multi-threaded architecture in action. Multiple `VLLM::EngineCore` processes consumed 50-2000% CPU (remember, 100% = one core, so 2000% means 20 cores active). Thread counts ranged from 106 to 151 per engine process, reflecting the parallelized inference pipeline.

### Memory Patterns

Memory utilization climbed to 50-70% during model loading and sustained inference, consuming roughly 48-67GB of the 96GB available. This tracks with model size plus KV cache allocation (configured at 50GB via `VLLM_CPU_KVCACHE_SPACE`).
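
As a rough sanity check on what that 50GB buys: the per-token KV footprint is `2 (K and V) x layers x kv_heads x head_dim x 2 bytes (bf16)`. Plugging in Qwen3-8B's published shape (36 layers, 8 KV heads, head dim 128; treat these figures as assumptions, not something measured here), the arithmetic sketches out like this:

```shell
# Back-of-envelope KV cache capacity; model dimensions are assumptions.
layers=36 kv_heads=8 head_dim=128
bytes_per_token=$((2 * layers * kv_heads * head_dim * 2))   # K+V in bf16
cache_bytes=$((50 * 1024 * 1024 * 1024))                    # VLLM_CPU_KVCACHE_SPACE=50 (GB)
echo "KV bytes per token: ${bytes_per_token}"
echo "Approx token capacity: $((cache_bytes / bytes_per_token))"
```

Hundreds of thousands of cached tokens is far more than 5 concurrent 128-token sequences need, which is why the KV cache was never the bottleneck in these runs.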

Container-level metrics show memory consumption scaling with model complexity.

| Model Size Class | Memory Consumption |
|-----------------|-------------------|
| Sub-1B models | ~27-57 GB |
| 3-4B models | ~56-60 GB |
| 8B+ models | ~69-74 GB |

The larger memory footprint relative to model parameter count reflects vLLM's continuous batching and KV cache management overhead: memory traded for throughput.

### Network and Storage I/O

Network traffic spiked dramatically during model downloads from HuggingFace Hub, reaching 16 MB/s receive rates. Once models were cached locally in `/var/lib/huggingface`, subsequent runs showed minimal network activity.

Disk I/O patterns were write-heavy during model caching (21GB+ written across test runs) with modest read activity. The root disk sat at 17% utilization; model weights and container layers fit comfortably within the 240GB allocation.

### Container Resource Summary

Across all benchmark runs, the vLLM containers exhibited these aggregate characteristics.

| Metric | Range | Notes |
|--------|-------|-------|
| CPU % | 44-873% | Multi-core utilization during inference |
| Memory | 682MB - 74GB | Scales with model size |
| Thread Count | 73-253 | Parallel inference workers |
| Network Rx | 46-97 GB | Model downloads from HuggingFace |

The key insight: CPU inference is memory-bandwidth bound more than compute-bound. The EPYC 9454's 12-channel DDR5 memory architecture matters as much as its core count for this workload class.
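
A quick way to see why: single-stream decode speed is roughly bounded by how fast the model weights can stream out of memory once per token. Twelve channels of DDR5-4800 give about 460 GB/s theoretical peak (an assumption; sustained bandwidth is lower), and an 8B-parameter bf16 model is ~16 GB of weights, so the ceiling lands near 28 tokens/sec, which neatly brackets the ~21 tokens/sec measured for Qwen3-8B above. The arithmetic, sketched in shell:

```shell
# Memory-bandwidth ceiling for single-stream decode (all figures approximate).
channels=12
dgbps_per_channel=384      # DDR5-4800: 38.4 GB/s per channel, scaled x10 for integer math
params_b=8                 # 8B parameters
bytes_per_param=2          # bf16
weights_gb=$((params_b * bytes_per_param))     # ~16 GB streamed per generated token
peak_dgbps=$((channels * dgbps_per_channel))   # 4608 -> 460.8 GB/s theoretical
echo "Theoretical ceiling: ~$((peak_dgbps / (weights_gb * 10))) tokens/sec"
```

Compute upgrades won't move that number much; faster or wider memory will.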

## Reading the Results

Let's be direct about what these numbers mean for practical use cases.

**Sub-2B models are genuinely usable.** The Qwen3-0.6B and 1.7B models deliver 69-121 tokens per second with sub-second time-to-first-token. That's responsive enough for interactive applications: chatbots, code completion, document summarization. You're not waiting around.

**4B models hit a sweet spot for quality vs. speed.** At 35-43 tokens per second, models like Qwen3-4B and Llama-3.2-3B provide meaningfully better outputs than their smaller siblings while remaining practical for batch processing and near-real-time applications. A 1.3-second TTFT is noticeable but not painful.

**8B+ models work but require patience.** The Qwen3-8B at ~21 tokens/sec and Gemma-3-12b at ~14 tokens/sec are slower but absolutely functional for use cases where quality trumps latency: document analysis, async processing, and development and testing workflows.

## The Economics: GPU-Free Doesn't Mean Value-Free

Here's where this gets interesting from an infrastructure planning perspective.

That `gp.5.24.96` flavor runs at $0.79/hour, roughly $575/month for continuous operation. Compare that to GPU instance pricing where you're looking at $2-4/hour for entry-level accelerator access, assuming availability.
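
The monthly figure is just the hourly rate across an average ~730-hour month, worth sanity-checking whenever you compare flavors:

```shell
# Hourly-to-monthly cost conversion (730 hours = average month).
hourly=0.79
monthly=$(awk -v h="$hourly" 'BEGIN { printf "%.2f", h * 730 }')
echo "gp.5.24.96: \$${monthly}/month"
```

Swap in a $3/hour GPU instance and the same math lands near $2,200/month before you've generated a single token.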

For development teams iterating on prompts, testing model behavior, or running moderate inference loads, CPU-based instances provide a dramatically lower barrier to entry. You can spin up the infrastructure in minutes without joining a GPU allocation queue.

This isn't about replacing GPU infrastructure for training or high-throughput production inference. It's about recognizing that not every AI workload requires the same hardware profile, and that forcing GPU dependency on all AI workloads is both expensive and often unnecessary.

## Practical Applications

Where does CPU inference with ZenDNN actually make sense?

**Development and testing environments.** Every AI application needs a place to iterate that doesn't burn through GPU budget. CPU inference lets teams test model behavior, refine prompts, and validate integrations without competing for accelerator resources.

**Batch processing at moderate scale.** Processing thousands of documents overnight? Analyzing logs for anomalies? Generating embeddings for search indexing? These workloads often care more about cost-per-token than tokens-per-second.

**Edge and hybrid deployments.** Not every deployment location has GPU infrastructure. Branch offices, on-premise installations, and resource-constrained environments can still run inference workloads.

**Burst capacity.** When your GPU fleet is fully loaded, CPU instances can absorb overflow traffic rather than dropping requests or queuing indefinitely.
279+
## Running This Yourself
280+
281+
The complete setup on Rackspace OpenStack Flex involves.
282+
283+
1. Launch an AMD EPYC instance (gp.5 flavor family)
284+
2. Install Docker and clone the vLLM repository
285+
3. Build the CPU-optimized image with ZenTorch
286+
4. Configure CPU binding and memory allocation
287+
5. Deploy and test
288+
289+
The vLLM server exposes an OpenAI-compatible API, so existing tooling and integrations work without modification:
290+
291+
```bash
292+
curl http://localhost:8000/v1/models | jq
293+
```

From there, your application code doesn't need to know whether inference is happening on a GPU or a CPU; the API contract remains identical.
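
For instance, a standard OpenAI-style chat completion works as-is. A sketch (the model name below is a placeholder; use whatever `${MODEL}` you served):

```shell
# Build an OpenAI-style chat completion request body.
# The model name is a placeholder for whatever the server is running.
PAYLOAD='{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
  "max_tokens": 64
}'
printf '%s\n' "$PAYLOAD"

# Send it to the local server (requires the container from earlier to be running):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD" | jq .
```

Any OpenAI SDK pointed at `http://localhost:8000/v1` should behave the same way.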

## The Bigger Picture

The AI infrastructure narrative has over-indexed on GPU scarcity and the assumption that meaningful work requires accelerators. That's true for training and high-throughput production inference, but it misses a substantial category of workloads where CPU-based solutions deliver genuine value.

AMD's investment in ZenDNN, combined with vLLM's pluggable backend architecture, creates a practical path for organizations to deploy AI capabilities without GPU dependency. Running this on OpenStack Flex demonstrates that cloud infrastructure doesn't need to be hyperscaler-specific to support modern AI workloads.

The 24-core EPYC VM running inference at 120 tokens per second for a 0.6B model, or 35 tokens per second for a 4B model, isn't a compromise. It's the right tool for a substantial portion of the AI workload landscape.

Sometimes the most expensive hardware isn't the most appropriate hardware. And sometimes, 24 CPU cores are exactly what you need.