
AMD MIOpen

Vladimir Mandic edited this page Feb 18, 2026 · 1 revision

Making SD.Next Faster with MIOpen Tuning

If you're running SD.Next on an AMD GPU, this guide will help you unlock noticeably faster image generation. It also explains what MIOpen is and how it works under the hood, so the tuning process makes sense rather than feeling like magic.


What is MIOpen?

MIOpen is AMD's open-source library for deep learning primitives — AMD's equivalent of NVIDIA's cuDNN. It provides optimized implementations of the operations that neural networks rely on: convolutions, pooling, normalization, activations, and more.

When SD.Next generates an image, it doesn't talk to your GPU directly. The request travels through several layers:

┌─────────────────────────────────┐
│   SD.Next                       │  ← User Interface
└────────────┬────────────────────┘
             │
┌────────────▼────────────────────┐
│   PyTorch (with ROCm backend)   │  ← Deep Learning Framework
└────────────┬────────────────────┘
             │
┌────────────▼────────────────────┐
│   MIOpen Library                │  ← Optimized GPU Kernels
└────────────┬────────────────────┘
             │
┌────────────▼────────────────────┐
│   AMD GPU (7900 XTX, 9070 XT…)  │  ← Hardware
└─────────────────────────────────┘

MIOpen sits between PyTorch and your GPU. Every time a convolution runs in the U-Net or VAE, MIOpen is the one deciding how to execute it.


How MIOpen Selects an Algorithm

For any given operation, MIOpen has multiple solvers (algorithm implementations) that can produce the same result, but at very different speeds. For example, a single 3×3 convolution might be handled by:

  • Winograd — fast for small kernels, low memory usage
  • Implicit GEMM — very fast on modern AMD GPUs using matrix cores
  • Direct GEMM — general purpose, moderate speed
  • Naive Direct — always works, extremely slow (100× slower in some cases)

The "right" choice depends on your specific GPU, the tensor dimensions, and the data type. MIOpen's job is to pick the fastest one.

The Solver Selection Flow

                 You generate an image
                          │
                          ▼
         ┌────────────────────────────────┐
         │  MIOpen receives GPU operation │
         │  (e.g. U-Net convolution)      │
         └────────────────┬───────────────┘
                          │
                          ▼
         ┌────────────────────────────────┐
         │    Check local tuning cache    │
         │      (~/.config/miopen/)       │
         └──────────┬─────────────────────┘
                    │
         ┌──────────┴──────────┐
         │                     │
     FOUND ✓               NOT FOUND ✗
   (warm run)              (cold run)
         │                     │
         ▼                     ▼
┌─────────────────┐   ┌────────────────────────────┐
│  Load optimal   │   │  Benchmark all solvers:    │
│  algorithm from │   │                            │
│  cache instantly│   │  Winograd:      0.12 ms ✓  │
└────────┬────────┘   │  Implicit GEMM: 0.15 ms    │
         │            │  Direct GEMM:   2.15 ms    │
         │            │  Naive Direct: 156.30 ms   │
         │            │                            │
         │            │  → Save winner to cache    │
         │            └────────────────┬───────────┘
         │                             │
         └─────────────┬───────────────┘
                       │
                       ▼
         ┌─────────────────────────────┐
         │  Execute optimal GPU kernel │
         └─────────────────────────────┘

This is why the first run is slow. MIOpen has to benchmark every solver for every unique convolution it encounters — and a single image generation touches hundreds of different convolution configurations across the text encoder, U-Net, and VAE. Each benchmark takes a fraction of a second, but they add up.

Once the cache is built, every subsequent run skips benchmarking entirely and jumps straight to the winner.


What to Expect: Before and After

First Run (Cold Start — Slow)

  • Your first generation will be much slower than usual — potentially several minutes
  • The GPU will be active but no image progress is visible yet
  • SD.Next may appear "stuck"
  • Console messages like Find Start, Evaluating Solver, or compiling kernels are normal

This only happens once per unique operation shape. A "shape" is defined by the combination of:

  • The model (SD 1.5, SDXL, Flux, etc.)
  • The image resolution (512×512 and 768×768 each need their own tuning)
  • The batch size

After First Run (Warm Cache — Fast)

  • Image generation is noticeably faster
  • Performance is consistent
  • The cache is reused automatically on every subsequent launch — you don't need to redo it

How to Enable Tuning

Set these two environment variables before launching SD.Next:

MIOPEN_FIND_MODE

Controls how MIOpen searches for the best solver:

Value   Mode     Behavior
2       Fast     Uses heuristics to guess a good solver without benchmarking. Fast startup, suboptimal performance.
3       Hybrid   Checks the cache first; if not found, benchmarks all solvers and saves the winner. Recommended.

MIOPEN_FIND_ENFORCE

Controls when MIOpen is allowed to update the cache:

Value   Behavior
1       Never update the cache (read-only)
3       Update the cache when new results are found. Recommended for tuning.

Commands

Linux — temporary (current session only):

export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3

Linux — permanent (add to ~/.bashrc or ~/.profile):

echo 'export MIOPEN_FIND_MODE=3' >> ~/.bashrc
echo 'export MIOPEN_FIND_ENFORCE=3' >> ~/.bashrc
source ~/.bashrc
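A quick, self-contained way to confirm the variables actually took effect in the shell that will launch SD.Next (the `grep` pattern simply filters the `env` output):

```shell
# Sanity check: both variables must be exported, since SD.Next inherits
# the environment of the shell that launches it.
export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3
env | grep '^MIOPEN_FIND'
# Should print both MIOPEN_FIND_MODE=3 and MIOPEN_FIND_ENFORCE=3
```

If `grep` prints nothing, the variables are not set in this shell and MIOpen will fall back to its defaults.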

Windows — Command Prompt (temporary):

set MIOPEN_FIND_MODE=3
set MIOPEN_FIND_ENFORCE=3

Windows — PowerShell (temporary):

$env:MIOPEN_FIND_MODE = "3"
$env:MIOPEN_FIND_ENFORCE = "3"

Windows — Permanent (System Environment Variables):

  1. Open Start → search "Edit the system environment variables"
  2. Click Environment Variables
  3. Under User variables, click New
  4. Add MIOPEN_FIND_MODE = 3, then repeat for MIOPEN_FIND_ENFORCE = 3

Windows — Alternatively, edit webui.bat and add the set lines near the top, before the existing launch command:

set MIOPEN_FIND_MODE=3
set MIOPEN_FIND_ENFORCE=3
rem ...existing launch command stays unchanged below...

Then launch SD.Next as normal. On Windows with ROCm:

.\webui.bat --use-rocm

The Tuning Cache

MIOpen stores its results in a local database so tuning survives across restarts.

Default cache location:

  • Linux: ~/.config/miopen/
  • Windows: %USERPROFILE%\.config\miopen\

The cache contains two file types:

  • .ufdb (User Find Database): stores the winning solver for each operation, used for fast lookup on every run.
  • .udb (User Performance Database): stores the tuned kernel parameters for the chosen solvers.

You can override the cache location with:

export MIOPEN_USER_DB_PATH=/path/to/your/cache

This is useful if you want to share a pre-built cache across multiple installs, or keep it on a fast drive.

To reset the cache (forces re-tuning from scratch), delete the .udb and .ufdb files in the cache directory.
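A self-contained reset sketch, assuming the default Linux location from above (the `*.ufdb*` glob also catches the `.ufdb.txt` naming some MIOpen versions use):

```shell
# Wipe the MIOpen tuning databases; the next run re-benchmarks from scratch.
CACHE_DIR="${MIOPEN_USER_DB_PATH:-$HOME/.config/miopen}"
ls "$CACHE_DIR"                                # inspect before deleting
rm -f "$CACHE_DIR"/*.udb "$CACHE_DIR"/*.ufdb*  # remove tuning + find databases
```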


When Does Re-Tuning Happen?

MIOpen caches results per unique operation shape. You'll see a slow first run again when:

Trigger                       Why
New model (SD 1.5 → SDXL)     Different layer sizes = new convolution shapes
Different resolution          Tensor dimensions change
Different batch size          Affects solver selection
Cache was deleted             Fresh start
Model updated/modified        Some shapes may differ

Switching between two resolutions you've already tuned (e.g. 512 and 1024) will not re-tune — results for both are already cached.


Mode Comparison

                   Find Mode 2 (Fast)            Find Mode 3 + Enforce 3
First run speed    Normal                        Slow (benchmarking)
Subsequent speed   Suboptimal                    Optimal
Cache written      No                            Yes
Best for           Debugging, stability issues   Production use

Use Mode 2 if: you're seeing crashes, driver timeouts, or unusually long hangs during the first run. Mode 2 never benchmarks, so it can work around instability.

Use Mode 3 if: you want the best possible performance and can tolerate a slow first run per model/resolution.


Tips

  • Be patient on first run. With larger models (SDXL, Flux), the cold start can take 5–20 minutes. As long as GPU activity is visible in Task Manager or rocm-smi, it's working.
  • Tune at the resolution you actually use. If you mostly generate at 1024×1024, run your first image at that size. A cache built at 512×512 won't cover the larger shapes.
  • Back up your cache. Once built, the cache is reusable across SD.Next updates. Copy the miopen/ directory somewhere safe.
  • Enable logging to watch progress. Add MIOPEN_LOG_LEVEL=4 temporarily to see solver benchmarking in the console. Remove it afterward to reduce noise.
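The backup tip above is a plain copy; the destination path here is only an example, so pick any safe location:

```shell
# Back up the tuning cache so it survives reinstalls; restore by copying back.
CACHE_DIR="${MIOPEN_USER_DB_PATH:-$HOME/.config/miopen}"
BACKUP="$HOME/miopen-cache-backup"   # example destination, not a required path
mkdir -p "$BACKUP"
cp -a "$CACHE_DIR"/. "$BACKUP"/
# To restore later: cp -a "$BACKUP"/. "$CACHE_DIR"/
```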

Troubleshooting

"It's been 10+ minutes and nothing's happening!" This is normal for the first run with large models. Check that your GPU is active (rocm-smi on Linux, Task Manager on Windows). As long as GPU load or memory usage is nonzero, MIOpen is working.

"Second run is still slow" The cache may not have been written. Verify MIOPEN_FIND_ENFORCE=3 is set (not just MIOPEN_FIND_MODE). Check that the cache directory exists and is writable.

"Performance got worse after a driver update" Driver updates can invalidate cached kernels. Delete the .udb/.ufdb files to force a fresh tuning session.

"It crashes during the first run" Switch to MIOPEN_FIND_MODE=2 to skip benchmarking entirely. This avoids the long solver search that can trigger driver timeouts on some systems.

"I switched resolutions and it's slow again" Expected — each resolution has its own set of convolution shapes that need tuning. Run one generation at the new resolution to build the cache for it.
