Context: First real production use of DirectML with a full diffusers SDXL pipeline (Stable Diffusion XL image generation). Benchmark proved the GPU was active at 40.2x CPU speed. Production pipeline ran on CPU anyway. Root cause took ~30 min to identify.
Hardware: AMD Radeon RX 5700 XT, Windows 11, Python 3.11.9, torch-directml.
A diffusers StableDiffusionXLPipeline was loading the SDXL model and calling:
device_name = str(torch_directml.device()) # "privateuseone:0"
pipe = pipe.to(device_name)The pipeline silently fell back to CPU. The device badge in the UI showed cpu. Images took 10–15 minutes per panel instead of seconds. No error was raised. No warning was printed.
The GPU benchmark (benchmark_gpu.py) used:
device = torch_directml.device() # actual object
x = torch.randn(4096, 4096).to(device) # works fineThis passed the device object directly. The benchmark passed. GPU was confirmed at 40.2x speedup. But production code extracted the string with str() and passed that instead. The two code paths looked similar but behaved completely differently.
The diffusers pipeline's .to(device) method iterates all sub-modules (UNet, VAE, text encoders). When device is the string "privateuseone:0", PyTorch's device parsing fails internally. diffusers catches the exception in a try/except, falls back to CPU, and logs nothing visible.
This is NOT a bug in diffusers or torch-directml — it's a documentation gap. The PyTorch tensor.to(string) overload supports common string devices like "cuda:0", "cpu", "mps". Custom backends accessed via the privateuseone namespace are not parseable from strings.
Store and use the device object, never the string, for .to() calls:
dml_device = torch_directml.device() # object
pipe = pipe.to(dml_device) # works
# Also: Generator must use "cpu" device — DirectML doesn't support Generator on device
generator = torch.Generator(device="cpu")Three lines changed in _detect_device() and _load_pipeline(). The function now returns a 3-tuple (name_str, dtype, device_obj) so callers always have the object available without needing to re-instantiate it.
- Never pass
str(dml_device)to.to()— only the object works for pipelines - Use
variant="fp16"only on CUDA/ROCm — DirectML requires float32 throughout torch.Generatormust be on"cpu"— DirectML doesn't support device-side generators- Verify with
pipe.unet.device— should showprivateuseone:0after loading - Benchmark != production — simple tensor ops work with the string; full pipelines don't
After loading a pipeline, immediately check:
print(pipe.unet.device) # "cpu" = broken, "privateuseone:0" = workingAdd this check to any startup health monitor so silent CPU fallback is caught immediately.
Context: Real-world benchmarking of DirectML on AMD RX 5700 XT. Discovered critical Windows-only limitations and compatibility issues.
Hardware: AMD Radeon RX 5700 XT, Windows 11, Python 3.11.9.
1. Python 3.14 breaks torch-directml — must use Python 3.11
torch-directml is compiled against Python 3.11 ABI and does not support 3.12+. On Windows, you must maintain a separate Python 3.11 venv for any DirectML work. Always check py -3.11 --version first. This is a hard ceiling, not a soft requirement.
Workaround: Create .venv311 and keep it isolated. If your main project uses 3.12+, accept that DirectML is unavailable and use WSL2 + ROCm instead.
2. torch-directml 0.2.5 requires torch 2.4.1 — install DirectML first, not torch
Critical ordering mistake found during testing: Installing torch 2.3.0 first, then torch-directml, causes version conflicts. DirectML depends on torch 2.4.1, and pip's dependency resolver doesn't handle this gracefully.
Correct approach: Install torch-directml directly without pre-installing torch. Let it pull torch 2.4.1 automatically.
# WRONG (causes version hell)
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install torch-directml # Pulls torch 2.4.1, conflicts with 2.3.0
# RIGHT (clean resolution)
pip install torch-directml # Pulls torch 2.4.1 automatically3. DirectML device shows as privateuseone:0, not dml:0
This was confusing at first because the diagnostics CLI has special-case logic to detect DirectML. When you call str(device), PyTorch shows privateuseone:0 for DirectML. This is expected and normal — it's PyTorch's way of representing custom backend devices. The GPU acceleration is working correctly.
For logging and debugging, use torch_directml.device_name(0) instead to get the human-readable GPU name.
4. CTranslate2 (faster-whisper backend) does NOT support DirectML — Whisper stays on CPU
This is a hard architectural limitation. CTranslate2, the C++ library that powers faster-whisper, only supports CUDA, ROCm, and CPU backends. It has zero DirectML support.
Consequence: On Windows with DirectML, Whisper inference runs entirely on CPU, completely negating GPU acceleration for speech-to-text workloads. This is a showstopper for audio pipeline projects on Windows DirectML.
Only solution: Use WSL2 + ROCm for GPU-accelerated Whisper on Windows AMD. There is no Windows DirectML path for Whisper.
5. Ollama on Windows doesn't respect OLLAMA_MODELS env var if set after the parent process starts
Tested with Ollama desktop app. Even if you set OLLAMA_MODELS=/d/ollama_models before launching, if the parent process (the Ollama GUI or service) started before the env var was set, Ollama will still use the default C:\Users\User.ollama location.
Workaround: Use directory junctions instead of env vars. This forces Ollama to use the new location regardless of when env vars were set:
mklink /J C:\Users\User\.ollama D:\ollama_modelsThis works because the filesystem junction is real to the OS, not dependent on environment variables.
Date: 2026-03-23
Context: Extracting torch-amd-setup from a private AI audio pipeline project.
Hardware: AMD Radeon RX 5700 XT (gfx1010 / Navi 10), Windows 11, WSL2 Ubuntu 22.04.
This document is a raw account of every mistake made, every dependency wall hit, and every workaround discovered while getting AMD GPU acceleration working with PyTorch and Seamless M4T. Written so you don't have to spend the same time.
The single biggest source of confusion: the AMD RX 5700 XT is a capable GPU, it's supported by the AMD Adrenalin driver, and it works fine for gaming. But ROCm (AMD's GPU compute stack) has an explicit list of officially supported GPU architectures, and gfx1010 is not on it.
When you install the ROCm version of PyTorch and run torch.cuda.is_available(), it returns False. No error, no explanation — just False. This led to hours of assuming the ROCm install was broken, when the actual issue was a single missing environment variable:
export HSA_OVERRIDE_GFX_VERSION=10.3.0This has to be set before Python imports torch. Setting it after import torch does nothing. The reason: ROCm checks the GPU architecture at init time and caches the result. If the env var isn't present at that moment, the GPU is invisible for the rest of the process.
Lesson: If torch.cuda.is_available() returns False on ROCm, check the env var before anything else. Don't re-install ROCm.
Ubuntu 22.04's default apt repos include rocminfo 5.0.0-1. This package exists to provide stub implementations of ROCm tools. When you add AMD's official ROCm 6.1 repository and try to install rocm-hip-sdk, apt sees the conflict and fails:
rocm-hip-runtime: Depends: rocminfo (= 1.0.0.60100-82~22.04)
but 5.0.0-1 is to be installed
The version numbers look backwards (5.0.0 > 1.0.0) but they're not comparable — AMD's rocminfo uses a different versioning scheme entirely. 5.0.0-1 is Ubuntu's stub; 1.0.0.60100 is AMD's real package at ROCm 6.1.
Fix: Remove Ubuntu's ROCm stubs before installing from AMD's repo, then pin the AMD repo to priority 1001 so it always wins in future apt operations. See Troubleshooting.
Lesson: Always purge Ubuntu's ROCm stubs before adding AMD's ROCm repo. Add the apt pin immediately.
When writing the automated setup script (wsl2_rocm_setup.sh), the script was configured with set -euo pipefail for safety. However, certain commands that pipe through grep -v would cause the entire script to silently exit with no error message.
The cause: when apt-get -qq runs with nothing to output (the package is already installed, or there are no packages matching), the grep -v that follows gets empty input and returns exit code 1 — "no lines matched the invert pattern." With set -e, exit code 1 from any command is fatal. The script dies silently at the first line that runs | grep -v anything on empty input.
The debug session was confusing because there was no error — just a prompt returning after printing one progress message.
Fix: || true after any grep in a pipeline where empty output is possible. Also drop -u from set -euo pipefail if you have variables that might be unset legitimately.
Lesson: When a set -e script exits silently, check every pipe for commands that could return non-zero on "no results" — grep, awk, wc -l comparisons, etc.
Installing packages with PyPI dependencies that pin specific PyTorch versions will overwrite your ROCm build. This happened twice:
fairseq2==0.3.0pinstorch==2.5.1. pip fetched that version from PyPI, which is the standard CUDA build. ROCm build gone.- After reinstalling ROCm torch 2.6.0, torchaudio 2.2.2 was installed separately, causing a version mismatch (
libcudart.so.13error from the torchaudio build expecting torch 2.2.x).
Each iteration added 10–20 minutes of reinstall time and debugging.
Lesson: Install torch last, always. Use --no-deps for packages that try to pull their own torch. After any package install, verify torch.version.hip is still set. Consider using pip constraints or a lock file.
seamless_communication 1.0.0 requires fairseq2==0.2.*. fairseq2 0.2.1 requires fairseq2n==0.2.1 (a C extension binary). The fairseq2n package on PyPI ships a CUDA-linked binary — it needs libcudart.so.12 to import.
Meta provides a CPU build server: https://fair-src-fairseq2-build-publish.s3.amazonaws.com/whl/cpu/index.html — but it only has builds for fairseq2n 0.3.x, not 0.2.1. So the official CPU binary for the version required by seamless_communication simply does not exist.
The solution that worked: install nvidia-cuda-runtime-cu12 via pip, which provides libcudart.so.12 inside the venv's site-packages, then set LD_LIBRARY_PATH to point at it. This lets the CUDA-linked fairseq2n.so load correctly even on a machine with no NVIDIA GPU.
Lesson: When a package claims to need CUDA but you don't have CUDA, try installing the CUDA runtime stub wheel first before assuming you need to rebuild from source.
torch-directml is Microsoft's DirectML backend for PyTorch. It provides AMD (and any DirectX 12) GPU acceleration on Windows without needing ROCm. It's genuinely useful and easy to install — but it has a hard Python version ceiling of 3.11.
This is a significant limitation because many projects now target Python 3.12+. The workaround is to maintain a separate venv311 environment specifically for DirectML workloads. This is awkward but workable.
The underlying reason is that torch-directml contains compiled C extensions that were built against Python 3.11's ABI. Microsoft hasn't released 3.12 wheels as of the time of writing.
Lesson: Plan for a separate Python 3.11 venv on Windows if DirectML is on your path. Build your code to be venv-agnostic so switching is easy.
fairseq2 0.2.1 was compiled against NumPy 1.x. NumPy 2.0 introduced breaking C extension ABI changes. If pip installs NumPy 2.x (which it does by default now), importing fairseq2 crashes:
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.6
_ARRAY_API not found
Fix: pip install "numpy~=1.23" --force-reinstall.
Lesson: Any package with compiled C extensions and a numpy~=1.x pin is going to break if pip installs numpy 2.x before it. Add an explicit numpy pin to your requirements file before installing such packages.
Even with a recent AMD driver, /dev/kfd (the AMD GPU compute device node) may not appear in WSL2 if:
- The AMD Adrenalin driver version is below 22.20
- The Windows version is below 10 21H2
In our case, /dev/kfd was missing because the driver hadn't been verified yet. This caused rocminfo inside WSL2 to report no agents, even though ROCm was installed correctly.
Lesson: Verify /dev/kfd exists before troubleshooting anything else. If it doesn't exist, the fix is a driver update in Windows — nothing inside WSL2 will help.
ROCm PyTorch exposes AMD GPUs through a CUDA compatibility layer. From the Python API perspective, torch.cuda.is_available() returns True, torch.cuda.get_device_name(0) returns the AMD card name, and model.to("cuda:0") puts the model on the AMD GPU.
This is intentional and by design. The practical consequence: code written for NVIDIA CUDA often works on AMD ROCm with zero changes. The catch is that some CUDA-specific operations (torch.cuda.amp, certain custom CUDA kernels) may not be supported.
Lesson: Don't create separate "CUDA" and "ROCm" code paths. Use get_torch_device() which returns torch.device("cuda:0") for both — the ROCm PyTorch build handles the rest.
The SeamlessM4T v2 large model is ~8.5GB for the main checkpoint plus ~160MB for the vocoder. On first run, it downloads to ~/.cache/huggingface/hub/. This is inside the WSL2 virtual disk, which lives on the Windows C: drive.
On a machine with limited C: drive space, this is immediately a problem. The WSL2 virtual disk is also not easily inspectable from Windows Explorer, so users may not realize a 9GB file just appeared.
Lesson: Warn users about the model download size before first run. Consider setting HF_HOME to redirect the cache to a larger drive. On a machine with an external drive, this is essential.
Getting AMD ROCm + fairseq2 + seamless_communication working requires touching at least 8 separate failure points that are not documented together anywhere:
- Remove Ubuntu's conflicting ROCm stubs
- Pin the ROCm apt repo to priority 1001
- Set
HSA_OVERRIDE_GFX_VERSION=10.3.0before importing torch - Add user to
renderandvideogroups - Install PyTorch from the ROCm-specific index URL
- Install
nvidia-cuda-runtime-cu12for the CUDA stub - Set
LD_LIBRARY_PATHto the stub's lib directory - Pin numpy to
~=1.23before installing fairseq2
None of these steps are individually complex. But they're scattered across AMD documentation, Meta's fairseq2 GitHub issues, Ubuntu Launchpad bug reports, and Stack Overflow threads. The goal of torch-amd-setup is to encode as much of this as possible so future projects don't start from scratch.