AMD MIOpen
If you're running SD.Next on an AMD GPU, this guide will help you unlock speed improvements that make your image generation noticeably faster. It also explains what MIOpen is and how it works under the hood, so the tuning process makes sense rather than feeling like magic.
MIOpen is AMD's open-source library for deep learning primitives — AMD's equivalent of NVIDIA's cuDNN. It provides optimized implementations of the operations that neural networks rely on: convolutions, pooling, normalization, activations, and more.
When SD.Next generates an image, it doesn't talk to your GPU directly. The request travels through several layers:
┌─────────────────────────────────┐
│ SD.Next │ ← User Interface
└────────────┬────────────────────┘
│
┌────────────▼────────────────────┐
│ PyTorch (with ROCm backend) │ ← Deep Learning Framework
└────────────┬────────────────────┘
│
┌────────────▼────────────────────┐
│ MIOpen Library │ ← Optimized GPU Kernels
└────────────┬────────────────────┘
│
┌────────────▼────────────────────┐
│ AMD GPU (7900 XTX, 9070 XT…) │ ← Hardware
└─────────────────────────────────┘
MIOpen sits between PyTorch and your GPU. Every time a convolution runs in the U-Net or VAE, MIOpen is the one deciding how to execute it.
For any given operation, MIOpen has multiple solvers (algorithm implementations) that can produce the same result, but at very different speeds. For example, a single 3×3 convolution might be handled by:
- Winograd — fast for small kernels, low memory usage
- Implicit GEMM — very fast on modern AMD GPUs using matrix cores
- Direct GEMM — general purpose, moderate speed
- Naive Direct — always works, extremely slow (100× slower in some cases)
The "right" choice depends on your specific GPU, the tensor dimensions, and the data type. MIOpen's job is to pick the fastest one.
You generate an image
│
▼
┌────────────────────────────────┐
│ MIOpen receives GPU operation │
│ (e.g. U-Net convolution) │
└────────────────┬───────────────┘
│
▼
┌────────────────────────────────┐
│ Check local tuning cache │
│ (~/.miopen/ or AppData) │
└──────────┬─────────────────────┘
│
┌──────────┴──────────┐
│ │
FOUND ✓ NOT FOUND ✗
(warm run) (cold run)
│ │
▼ ▼
┌─────────────────┐ ┌────────────────────────────┐
│ Load optimal │ │ Benchmark all solvers: │
│ algorithm from │ │ │
│ cache instantly│ │ Winograd: 0.12 ms ✓ │
└────────┬────────┘ │ Implicit GEMM: 0.15 ms │
│ │ Direct GEMM: 2.15 ms │
│ │ Naive Direct: 156.30 ms │
│ │ │
│ │ → Save winner to cache │
│ └────────────────┬───────────┘
│ │
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ Execute optimal GPU kernel │
└─────────────────────────────┘
This is why the first run is slow. MIOpen has to benchmark every solver for every unique convolution it encounters — and a single image generation touches hundreds of different convolution configurations across the text encoder, U-Net, and VAE. Each benchmark takes a fraction of a second, but they add up.
Once the cache is built, every subsequent run skips benchmarking entirely and jumps straight to the winner.
- Your first generation will be much slower than usual — potentially several minutes
- The GPU will be active but no image progress is visible yet
- SD.Next may appear "stuck"
- Console messages like `Find Start`, `Evaluating Solver`, or `compiling kernels` are normal
This only happens once per unique operation shape. A "shape" is defined by the combination of:
- The model (SD 1.5, SDXL, Flux, etc.)
- The image resolution (512×512 and 768×768 each need their own tuning)
- The batch size
Once tuning is complete:
- Image generation is noticeably faster
- Performance is consistent
- The cache is reused automatically on every subsequent launch — you don't need to redo it
Set these two environment variables before launching SD.Next:
`MIOPEN_FIND_MODE` controls how MIOpen searches for the best solver:
| Value | Mode | Behavior |
|---|---|---|
| `2` | Fast | Uses heuristics — guesses a good solver without testing. Fast startup, suboptimal performance. |
| `3` | Hybrid | Checks the cache first; if not found, benchmarks all solvers and saves the winner. Recommended. |
`MIOPEN_FIND_ENFORCE` controls when MIOpen is allowed to update the cache:
| Value | Behavior |
|---|---|
| `1` | Never update the cache (read-only) |
| `3` | Update the cache when new results are found. Recommended for tuning. |
Linux — temporary (current session only):

```shell
export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3
```

Linux — permanent (add to `~/.bashrc` or `~/.profile`):

```shell
echo 'export MIOPEN_FIND_MODE=3' >> ~/.bashrc
echo 'export MIOPEN_FIND_ENFORCE=3' >> ~/.bashrc
source ~/.bashrc
```

Windows — Command Prompt (temporary):

```bat
set MIOPEN_FIND_MODE=3
set MIOPEN_FIND_ENFORCE=3
```

Windows — PowerShell (temporary):

```powershell
$env:MIOPEN_FIND_MODE = "3"
$env:MIOPEN_FIND_ENFORCE = "3"
```

Windows — Permanent (System Environment Variables):
- Open Start → search "Edit the system environment variables"
- Click Environment Variables
- Under User variables, click New
- Add `MIOPEN_FIND_MODE=3`, then repeat for `MIOPEN_FIND_ENFORCE=3`
Windows — Alternatively, edit `webui.bat` and add the `set` lines before the launch command:

```bat
set MIOPEN_FIND_MODE=3
set MIOPEN_FIND_ENFORCE=3
.\python_embeded\python.exe -m streamlit run ...
```

Then launch SD.Next as normal. On Windows with ROCm:

```bat
.\webui.bat --use-rocm
```

MIOpen stores its results in a local database so tuning survives across restarts.
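Before launching, it's worth confirming the variables are actually visible to the shell that will start SD.Next. A minimal check for Linux, using the recommended values from this guide:

```shell
# Set the recommended tuning values, then confirm the shell sees them
export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3
printenv MIOPEN_FIND_MODE MIOPEN_FIND_ENFORCE
```

If either variable prints nothing, it is not set in the shell you are launching from.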
Default cache location:

- Linux: `~/.config/miopen/`
- Windows: `%USERPROFILE%\.config\miopen\`
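To check whether any tuning results have been written yet, a small sketch for Linux (assumes the default cache path above; actual filenames vary by GPU and MIOpen version):

```shell
# Report whether MIOpen has written any tuning databases yet
if ls ~/.config/miopen/*.udb >/dev/null 2>&1; then
  CACHE_STATE="present"
else
  CACHE_STATE="empty"
fi
echo "tuning cache: $CACHE_STATE"
```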
The cache contains two file types:

- `.udb` (User Database) — stores the winning solver per operation. Used for fast lookup on every run.
- `.ufdb` (User Find Database) — stores full benchmark results for all tested solvers. Used for analysis.
You can override the cache location with:

```shell
export MIOPEN_USER_DB_PATH=/path/to/your/cache
```

This is useful if you want to share a pre-built cache across multiple installs, or keep it on a fast drive.
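For example, to keep the cache on a dedicated drive (the path below is illustrative, not a required location):

```shell
# Point MIOpen at a custom database directory and make sure it exists
export MIOPEN_USER_DB_PATH="$HOME/fast-ssd/miopen-cache"
mkdir -p "$MIOPEN_USER_DB_PATH"
```

Set this before every launch (or make it permanent the same way as the find-mode variables) so MIOpen keeps reading the same cache.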
To reset the cache (forces re-tuning from scratch), delete the `.udb` and `.ufdb` files in the cache directory.
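On Linux with the default cache path, a reset can look like the sketch below (adjust the directory if you've set `MIOPEN_USER_DB_PATH`):

```shell
# Remove MIOpen's tuning databases; the next run re-benchmarks from scratch.
# -f makes this safe to run even when no cache files exist yet.
CACHE_DIR=~/.config/miopen
rm -f "$CACHE_DIR"/*.udb "$CACHE_DIR"/*.ufdb
```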
MIOpen caches results per unique operation shape. You'll see a slow first run again when:
| Trigger | Why |
|---|---|
| New model (SD 1.5 → SDXL) | Different layer sizes = new convolution shapes |
| Different resolution | Tensor dimensions change |
| Different batch size | Affects solver selection |
| Cache was deleted | Fresh start |
| Model updated/modified | Some shapes may differ |
Switching between two resolutions you've already tuned (e.g. 512 and 1024) will not re-tune — results for both are already cached.
| | Find Mode 2 (Fast) | Find Mode 3 + Enforce 3 |
|---|---|---|
| First run speed | Normal | Slow (benchmarking) |
| Subsequent speed | Suboptimal | Optimal |
| Cache written | No | Yes |
| Best for | Debugging, stability issues | Production use |
Use Mode 2 if: you're seeing crashes, driver timeouts, or unusually long hangs during the first run. Mode 2 never benchmarks, so it can work around instability.
Use Mode 3 if: you want the best possible performance and can tolerate a slow first run per model/resolution.
- Be patient on first run. With larger models (SDXL, Flux), the cold start can take 5–20 minutes. As long as GPU activity is visible in Task Manager or `rocm-smi`, it's working.
- Tune at the resolution you actually use. If you mostly generate at 1024×1024, run your first image at that size. A cache built at 512×512 won't cover the larger shapes.
- Back up your cache. Once built, the cache is reusable across SD.Next updates. Copy the `miopen/` directory somewhere safe.
- Enable logging to watch progress. Add `MIOPEN_LOG_LEVEL=4` temporarily to see solver benchmarking in the console. Remove it afterward to reduce noise.
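The backup tip above is a plain file copy. A sketch for Linux, assuming the default cache path (the destination name is illustrative):

```shell
# Copy the MIOpen tuning databases somewhere safe; restore by copying back
SRC=~/.config/miopen
DEST=~/miopen-cache-backup
mkdir -p "$SRC" "$DEST"   # SRC may be empty on a fresh install
cp -a "$SRC"/. "$DEST"/
```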
"It's been 10+ minutes and nothing's happening!"
This is normal for the first run with large models. Check that your GPU is active (rocm-smi on Linux, Task Manager on Windows). As long as GPU load or memory usage is nonzero, MIOpen is working.
"Second run is still slow"
The cache may not have been written. Verify MIOPEN_FIND_ENFORCE=3 is set (not just MIOPEN_FIND_MODE). Check that the cache directory exists and is writable.
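A quick way to verify the directory condition on Linux (default path assumed):

```shell
# Check that the cache directory exists and is writable by the current user
DIR=~/.config/miopen
mkdir -p "$DIR"
if [ -w "$DIR" ]; then WRITABLE=yes; else WRITABLE=no; fi
echo "cache dir writable: $WRITABLE"
```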
"Performance got worse after a driver update"
Driver updates can invalidate cached kernels. Delete the .udb/.ufdb files to force a fresh tuning session.
"It crashes during the first run"
Switch to MIOPEN_FIND_MODE=2 to skip benchmarking entirely. This avoids the long solver search that can trigger driver timeouts on some systems.
"I switched resolutions and it's slow again"
Expected — each resolution has its own set of convolution shapes that need tuning. Run one generation at the new resolution to build the cache for it.