PowerInfer with ROCm/HIP support for AMD Strix APUs.
| APU | Architecture | GPU Target | CUs |
|---|---|---|---|
| Strix Halo (Ryzen AI Max) | RDNA 3.5 | gfx1151 | 40 |
| Strix Point | RDNA 3.5 | gfx1150 | 16 |

Pull the image for your APU:

    # Strix Halo (gfx1151)
    docker pull ghcr.io/cecil-the-coder/powerinfer-strix-halo-rocm:latest

    # Strix Point (gfx1150)
    docker pull ghcr.io/cecil-the-coder/powerinfer-strix-halo-rocm:gfx1150-latest

Run a single prompt:

    docker run --device=/dev/kfd --device=/dev/dri \
      -v /path/to/models:/models \
      ghcr.io/cecil-the-coder/powerinfer-strix-halo-rocm:latest \
      ./main -m /models/your-model.gguf \
      -p "Hello, world!" \
      -n 128

Start the HTTP server:

    docker run --device=/dev/kfd --device=/dev/dri \
      -v /path/to/models:/models \
      -p 8080:8080 \
      ghcr.io/cecil-the-coder/powerinfer-strix-halo-rocm:latest \
      ./server -m /models/your-model.gguf \
      --host 0.0.0.0 --port 8080

Run with Docker Compose:

    # Download a model first
    mkdir -p models
    huggingface-cli download Tiiny/ReluLLaMA-7B-PowerInfer-GGUF \
      --local-dir ./models

    # Start the server
    docker compose up

Edit `docker-compose.yaml` to change the model path or use a different GPU target.

Deploy on Kubernetes:

    kubectl apply -f kubernetes/deployment.yaml

Edit `kubernetes/deployment.yaml` to configure the model path and GPU target.
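Once the server is listening on port 8080 (as in the `docker run` example with `-p 8080:8080`), it can be queried over HTTP. The following is a minimal sketch, not a documented interface of this image: it assumes the server keeps the `/completion` endpoint and the `prompt` / `n_predict` request fields of the upstream llama.cpp server that PowerInfer derives from. The helper names `json_escape`, `build_payload`, and `complete` are hypothetical.

```shell
# Sketch: query the PowerInfer server over HTTP.
# ASSUMPTION: the /completion endpoint and the "prompt"/"n_predict" fields
# follow the upstream llama.cpp server; verify against your build.

# Escape backslashes and double quotes so the prompt is a valid JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON request body from a prompt and a token count.
build_payload() {
  printf '{"prompt":"%s","n_predict":%d}' "$(json_escape "$1")" "$2"
}

# POST the payload to the server; host and port match the run command above.
# The token count defaults to 128 when omitted.
complete() {
  curl -s "http://localhost:8080/completion" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "${2:-128}")"
}
```

With the server running, `complete "Hello, world!" 64` sends the request and prints the raw JSON response.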

Build the image from source:

    git clone https://github.com/cecil-the-coder/powerinfer-strix-halo-rocm.git
    cd powerinfer-strix-halo-rocm

    # Build for Strix Halo (default)
    docker build -t powerinfer-rocm:latest .

    # Build for Strix Point
    docker build --build-arg AMDGPU_TARGETS=gfx1150 -t powerinfer-rocm:gfx1150 .

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `HSA_OVERRIDE_GFX_VERSION` | `11.5.1` | GPU version override for ROCm |
| `ROCBLAS_USE_HIPBLASLT` | `1` | Use hipBLASLt for better performance |

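The default override `11.5.1` corresponds to the Strix Halo target `gfx1151`. If you run on Strix Point, the override may need to match `gfx1150` instead. The sketch below maps a GPU target to an override value. The `11.5.0` value for `gfx1150` is an assumption derived from ROCm's major.minor.stepping naming convention, and the `gfx1150` image may already set it. The function name `gfx_override_for` is hypothetical.

```shell
# Sketch: pick an HSA_OVERRIDE_GFX_VERSION value for a GPU target.
# ASSUMPTION: 11.5.0 for gfx1150 is inferred from the gfxMmS naming scheme;
# only the 11.5.1 default is stated in this README.
gfx_override_for() {
  case "$1" in
    gfx1151) echo "11.5.1" ;;   # Strix Halo (documented default)
    gfx1150) echo "11.5.0" ;;   # Strix Point (assumed)
    *) echo "unsupported target: $1" >&2; return 1 ;;
  esac
}
```

The result can be passed to the container with Docker's `-e` flag, for example `docker run -e HSA_OVERRIDE_GFX_VERSION="$(gfx_override_for gfx1150)" ...` followed by the remaining arguments from the run examples.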

Runtime parameters:

| Parameter | Description |
|---|---|
| `--vram-budget N` | GB of GPU memory for hot neurons |
| `-t N` | CPU threads for cold neuron computation |
| `-ngl N` | GPU layers (999 = all) |
| `-c N` | Context size |

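As an illustration of how these parameters combine, the sketch below prints a `./main` argument list from a model path, VRAM budget, thread count, and context size. The helper `powerinfer_args` is hypothetical, and the values in the example call (8 GB, 8 threads, 2048 tokens) are placeholders rather than recommendations.

```shell
# Sketch: assemble a ./main command line from the parameters in the table.
# Arguments: <model path> <VRAM budget in GB> <CPU threads> <context size>.
# -ngl 999 requests that all layers be offloaded to the GPU.
# Paths containing spaces are not handled in this sketch.
powerinfer_args() {
  echo "./main -m $1 --vram-budget $2 -t $3 -ngl 999 -c $4"
}

# Example values are illustrative only.
powerinfer_args /models/your-model.gguf 8 8 2048
```

The printed command can be appended to the `docker run` invocation in place of the `./main` line, followed by `-p` and `-n` as in the single-prompt example.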
PowerInfer exploits activation sparsity in LLMs. During inference, ~70-90% of neurons are inactive. PowerInfer:
- Precomputes which neurons are "hot" (frequently active) vs "cold" (rarely active)
- Keeps hot neurons on GPU, cold neurons on CPU
- Skips calculations for inactive neurons
Note: PowerInfer requires models with ReLU/ReGLU activation and precomputed activation statistics. Standard models using SiLU/SwiGLU won't benefit from sparsity optimizations but will still run.
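The hot/cold split can be pictured with a toy example. This is a simplified sketch, not PowerInfer's actual placement logic: it takes a list of neurons with activation frequencies, sorts them from most to least active, and assigns the hottest ones to the GPU until a fixed number of slots is filled. The function name `split_hot_cold`, the input format, and the sample frequencies are all made up for illustration.

```shell
# Toy hot/cold split (illustrative only, not PowerInfer's real solver).
# Input on stdin: one "<neuron_id> <activation_frequency>" pair per line.
# Argument: number of neurons that fit on the GPU.
# Output: "<neuron_id> gpu" or "<neuron_id> cpu", hottest neuron first.
split_hot_cold() {
  LC_ALL=C sort -k2,2nr |
    awk -v slots="$1" '{ print $1, (NR <= slots ? "gpu" : "cpu") }'
}

# Four hypothetical neurons, two GPU slots.
printf '%s\n' "n0 0.02" "n1 0.91" "n2 0.40" "n3 0.75" | split_hot_cold 2
```

Here `n1` and `n3` are the most frequently active and go to the GPU, while `n2` and `n0` stay on the CPU. In the real system the `--vram-budget` parameter plays the role of the slot count.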
| Model | Size | HuggingFace |
|---|---|---|
| ReluLLaMA-7B | ~7GB | Tiiny/ReluLLaMA-7B-PowerInfer-GGUF |
| ReluLLaMA-13B | ~13GB | Tiiny/ReluLLaMA-13B-PowerInfer-GGUF |
| ReluLLaMA-70B | ~40GB | Tiiny/ReluLLaMA-70B-PowerInfer-GGUF |