A high-performance, fully local AI ecosystem optimized for AMD Radeon GPUs (specifically the RX 6700 XT / gfx1031). This setup integrates LLMs, Vision models, Text-to-Speech (TTS), Speech-to-Text (STT), and Document Parsing into a single, unified interface via Open WebUI.
- GPU: AMD Radeon RX 6700 XT (12GB VRAM)
- CPU: Ryzen 5 5600X
- RAM: 16GB
- Performance: ~22-25 t/s on `Ministral-3-14B-Instruct-Q5_K_XL` (16k context)
| Component | Technology | Description |
|---|---|---|
| Interface | Open WebUI | Docker-based frontend for all services. |
| Inference | llama.cpp | Custom built for ROCm 7.2 & gfx1031. |
| Model Swap | llama-swap | Dynamic model loading/unloading for VRAM efficiency. |
| TTS | Kokoro-ONNX | High-speed, high-quality local text-to-speech. |
| STT | Whisper.cpp | Vulkan-accelerated transcription with VAD. |
| Embedding | Qwen3-Embedding | Local vector generation for RAG. |
| Parsing | Docling | Advanced document-to-markdown conversion. |
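Open WebUI talks to the local backends over OpenAI-compatible HTTP endpoints. As a rough sketch (not the exact compose file used here; the image and the `OPENAI_API_BASE_URL` variable follow Open WebUI's standard quick-start, and the port matches the llama-swap endpoint described later), the frontend can be pointed at llama-swap like this:

```bat
:: Sketch only -- adjust ports and volumes to your own setup
docker run -d -p 3000:8080 --name open-webui -v open-webui:/app/backend/data ^
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 ^
  ghcr.io/open-webui/open-webui:main
```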
This setup combines ROCm 7.x with custom-built libraries for the gfx1031 architecture.
- ROCm Build: guinmoon/rocm7_builds
- Required Libraries: ROCmLibs-for-gfx1103-AMD780M-APU (provides a compatible rocBLAS build).
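The community library pack is typically dropped into the HIP SDK's rocBLAS directories. A minimal sketch, assuming the extracted archive sits in a local `ROCmLibs` folder and that ROCm 7.2 keeps the usual `bin\rocblas\library` layout (both paths are assumptions to adapt):

```bat
:: Assumed paths -- adjust to where you extracted ROCmLibs and where ROCm is installed
copy /Y ROCmLibs\rocblas.dll "C:\AMD\ROCm\7.2\bin\"
xcopy /E /Y ROCmLibs\library "C:\AMD\ROCm\7.2\bin\rocblas\library\"
```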
To utilize the RX 6700 XT, you must compile llama.cpp manually using Ninja and LLVM.
```bat
:: Set Compiler Paths
set CC=C:\AMD\ROCm\7.2\lib\llvm\bin\clang.exe
set CXX=C:\AMD\ROCm\7.2\lib\llvm\bin\clang++.exe
set HIP_DEVICE_LIB_PATH=C:\AMD\ROCm\7.2\lib\llvm\amdgcn\bitcode

:: Configure with gfx1031 target
cmake -B build -G "Ninja" -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1031 ^
  -DCMAKE_C_COMPILER="%CC%" -DCMAKE_CXX_COMPILER="%CXX%" ^
  -DCMAKE_PREFIX_PATH="C:\AMD\ROCm\7.2" -DCMAKE_BUILD_TYPE=Release
```
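The configure step only generates the Ninja build files; a standard CMake build invocation then compiles `llama-server.exe` and the other tools (binaries land in `build\bin`):

```bat
:: Build with Ninja
cmake --build build --parallel
```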
pip install "rocm-7.2.0.tar.gz" "rocm_sdk_libraries_custom-7.2.0-py3-none-win_amd64.whl"
pip install "torch-2.9.1+rocmsdk20251203-cp312-cp312-win_amd64.whl"We use llama-swap to handle multiple models (Vision, Instruct, Reasoning) on a single 12GB GPU without crashing.
We use llama-swap to handle multiple models (Vision, Instruct, Reasoning) on a single 12GB GPU without crashing.

```yaml
models:
  glm-4.6v:
    cmd: llama-server.exe --model GLM-4.6V-Flash-UD-Q6_K_XL.gguf --mmproj GLM_mmproj-F16.gguf --gpu-layers -1 -c 16384
  ministral-3b:
    cmd: llama-server.exe --model Ministral-3-3B-Instruct-2512-UD-Q6_K_XL.gguf --gpu-layers -1 -c 32768
  llama-3.3-8b:
    cmd: llama-server.exe --model Llama-3.3-8B-Instruct-Q6_K_L.gguf --gpu-layers -1 -c 16384
```

The stack is managed via a master batch file (`START_LlamaROCMCPP.bat`) that launches:
- Llama-Swap (LLM Port 8000)
- Qwen Embedding (Port 8181)
- Docling Service (Document processing)
- Whisper STT (Vulkan backend)
- Kokoro TTS (Python ONNX backend)
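The batch file itself is not reproduced here; the sketch below only shows its general shape under stated assumptions: the `--config`/`--listen` flags follow llama-swap's documentation, the embedding model filename is a placeholder, and the remaining services are assumed to be started the same way from their own scripts.

```bat
:: START_LlamaROCMCPP.bat -- sketch only; filenames and extra services are placeholders
cd /d C:\llamaROCM
start "llama-swap" llama-swap.exe --config config.yaml --listen 0.0.0.0:8000
start "embedding" llama-server.exe --model Qwen3-Embedding.gguf --embedding --port 8181
:: Whisper.cpp (Vulkan), Docling, and Kokoro TTS are launched the same way from their own scripts
```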
To keep the UI updated while preserving your local data:
```bash
# Update and Restart
docker compose pull
docker compose up -d

# Cleanup
docker image prune -f
```

- VRAM Management: Using `q8_0` for the K/V cache (as seen in `config.yaml`) is essential for maintaining 16k+ context windows on a 12GB card; see the example command after this list.
- Vision Models: Ensure the `--mmproj` flag points to the correct projector file in the `llama-swap` configuration to enable image analysis.
- STT Modification: To make Whisper.cpp compatible with OpenAI-style endpoints, modify `server.cpp` to change the `/inference` route to `/v1/audio/transcriptions` before compiling.
- Pathing: This configuration assumes a root directory of `C:\llamaROCM`.
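For reference, a sketch of a model command with a quantized K/V cache (the model filename is a placeholder; `--cache-type-k`/`--cache-type-v` are the upstream llama.cpp flag names, and depending on the build, flash attention may also need to be enabled for a quantized V cache):

```bat
llama-server.exe --model Ministral-3-14B-Instruct-Q5_K_XL.gguf --gpu-layers -1 -c 16384 ^
  --cache-type-k q8_0 --cache-type-v q8_0
```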
- ROCm Builds: guinmoon
- llama.cpp: ggml-org
- llama-swap: mostlygeek
- Kokoro-ONNX: thewh1teagle