Description
Problem Description
When using cudnn.benchmark=True (via PyTorch on ROCm), MIOpen's find-db cache does not prevent re-benchmarking across process restarts. Each new process takes ~15-16 seconds per convolution layer despite:
- Find-db entries existing in ~/.config/miopen/*.ufdb.txt
- Kernel binaries being cached in ~/.cache/miopen/*.ukdb
This results in ~280-300 second startup times for typical CNNs on every process launch.
Timing Evidence:
Within single process (cache works):
First forward: 15.96s
Second forward: 0.0023s
Third forward: 0.0022s
Across processes (cache fails):
Process 1: 15.74s
Process 2: 15.54s <- Should reuse cache, but doesn't
Process 3: 15.61s
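The cross-process numbers can be collected with a small driver that launches the same script several times in fresh interpreters. This is a minimal sketch, assuming the reproducer from this issue is saved as reproducer.py; `time_fresh_processes` is a hypothetical helper name. Wall-clock time here includes interpreter start-up and the PyTorch import, so it is only a rough proxy for the first-forward time printed by the script itself.

```python
import subprocess
import sys
import time


def time_fresh_processes(cmd, runs=3):
    """Run `cmd` in `runs` separate processes; return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # each call starts a brand-new process
        timings.append(time.perf_counter() - start)
    return timings


# Usage (assumes reproducer.py is in the current directory):
#   for i, t in enumerate(time_fresh_processes([sys.executable, "reproducer.py"]), 1):
#       print(f"Process {i}: {t:.2f}s")
```

If the find-db were honored across restarts, the second and third runs would be expected to be much shorter than the first.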
Root Cause (traced through source):
The TryLoad() function in src/include/miopen/find_db.hpp:165-172 calls Validate() which checks if kernel invokers exist via handle.GetInvoker(). However, InvokerCache (in src/include/miopen/invoker_cache.hpp) is an in-memory std::map<> with no disk persistence.
When a new process starts:
- Find-db loads successfully (in_sync = true)
- Validate() checks for invokers - none exist (they were in-memory only)
- Validate() returns TRUE ("rebuild needed")
- Condition in_sync && !TRUE = FALSE, falls through to full re-benchmark
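The sequence above can be illustrated with a toy model. This is not MIOpen code: `ToyHandle`, `needs_rebuild`, `try_load` and the solver name are hypothetical stand-ins that only mirror the control flow as described in this issue (a persistent find-db record combined with a per-process, in-memory invoker cache).

```python
class ToyHandle:
    """Stands in for a per-process handle; its invoker cache lives only in memory."""

    def __init__(self):
        self.invokers = {}  # empty at every process start


def needs_rebuild(handle, record):
    """Mirrors the described Validate(): True means at least one invoker is missing."""
    return any(name not in handle.invokers for name in record["solvers"])


def try_load(handle, record, benchmark):
    """Mirrors the described TryLoad() condition; returns (solvers, benchmarked)."""
    if record["in_sync"] and not needs_rebuild(handle, record):
        return list(record["solvers"]), False
    # Fall-through: full re-benchmark, which also populates the invoker cache.
    solvers = benchmark()
    for name in solvers:
        handle.invokers[name] = object()
    return solvers, True


def benchmark():
    """Placeholder for the expensive search; the solver name is made up."""
    return ["ToySolver"]


# "On-disk" find-db record that survives restarts.
record = {"in_sync": True, "solvers": ["ToySolver"]}

process_1 = ToyHandle()
first = try_load(process_1, record, benchmark)   # no invokers yet, so it benchmarks
second = try_load(process_1, record, benchmark)  # invoker now in memory, no benchmark
process_2 = ToyHandle()                          # new process: invoker cache is empty
third = try_load(process_2, record, benchmark)   # benchmarks again despite the record
```

In this model the record is always in sync, yet every fresh handle still triggers the benchmark path, which matches the timings reported above.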
The kernel binaries ARE cached in .ukdb, and code to rebuild invokers without benchmarking exists in Solution::Run() (src/solution.cpp:259-266), but TryLoad() doesn't use this path.
Workaround:
torch.backends.cudnn.benchmark = False  # 280s -> 0.05s startup, ~3% throughput cost
Operating System
Ubuntu 25.10 (Questing Quokka)
CPU
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics
GPU
AMD Radeon RX 7700S (gfx1102, RDNA3, 32 CUs)
ROCm Version
ROCm 7.1.0
ROCm Component
MIOpen
Steps to Reproduce
# reproducer.py - Run twice in separate Python processes
import torch
import torch.nn as nn
import time
torch.backends.cudnn.benchmark = True
model = nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 64, 64, device='cuda')
torch.cuda.synchronize()
start = time.perf_counter()
y = model(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"First conv forward: {elapsed:.2f}s")
# Run 1: python reproducer.py -> ~16s (expected, benchmarking)
# Run 2: python reproducer.py -> ~16s (BUG: should use cached results)
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
Agent 1: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (CPU)
Agent 2: gfx1102 / AMD Radeon RX 7700S (dGPU, 16 CUs)
Agent 3: gfx1103 / AMD Radeon Graphics (iGPU)
(Truncated for brevity - full output available on request)
Additional Information
Other version numbers:
- MIOpen: 3.5.1.70100-20~24.04 (miopen-hip package)
- PyTorch: 2.11.0.dev20251220+rocm7.1
Detailed root cause analysis:
TryLoad() at src/include/miopen/find_db.hpp:165:
if(record.in_sync && !record.Validate(handle, network_config))
{
// Only reaches here if invokers exist in memory
auto solutions = std::vector<Solution>{};
record.CopyTo(solutions);
return solutions;
}
MIOPEN_LOG_I("Find-db regenerating."); // Falls through to benchmark
Validate() at src/find_db.cpp:244:
if(!handle.GetInvoker(config, {{pair.first}}))
{
unbuilt = true; // No invoker in cache
break;
}
InvokerCache at src/include/miopen/invoker_cache.hpp:70:
std::map<std::string, Item> invokers; // In-memory only, no Save/Load
Potential fix direction:
When in_sync is true but Validate() fails (invokers missing), use the existing FindSolution() + PrepareInvoker() path (as seen in Solution::Run()) to rebuild invokers from cached .ukdb binaries without re-running the benchmark.
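The intended control flow can be sketched in the same toy style. This is only an illustration of the proposal, not MIOpen's API: `try_load_with_rebuild`, `kernel_cache`, `build_invoker` and the returned status strings are all hypothetical. Here `kernel_cache` plays the role of the .ukdb binaries and `build_invoker` plays the role of the FindSolution() + PrepareInvoker() path.

```python
def try_load_with_rebuild(record, invokers, kernel_cache, build_invoker, benchmark):
    """Toy version of the proposed flow; every parameter is a hypothetical stand-in.

    Returns (solvers, status) where status is "cached", "rebuilt" or "benchmarked".
    """
    if not record["in_sync"]:
        return benchmark(), "benchmarked"

    missing = [name for name in record["solvers"] if name not in invokers]
    if any(name not in kernel_cache for name in missing):
        # A needed binary is not cached, so fall back to the full benchmark.
        return benchmark(), "benchmarked"

    # Rebuild only the missing invokers from cached binaries; no benchmarking.
    for name in missing:
        invokers[name] = build_invoker(kernel_cache[name])

    status = "rebuilt" if missing else "cached"
    return list(record["solvers"]), status
```

With this shape, a fresh process whose find-db record and kernel binaries both exist would take the "rebuilt" branch instead of the benchmark. Whether the real invoker construction can be done this cheaply is the open question for maintainers.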
Questions:
- Is this behavior intentional?
- Would a PR implementing the fix be welcome?
Cache file evidence (local):
~/.config/miopen/gfx1102_16.HIP.3_5_1_*.ufdb.txt (find-db, ~900KB)
~/.cache/miopen/3.5.1.*/gfx1102_16.ukdb (kernel cache, ~800KB)
Both files exist and have valid content, but are not effectively used across processes.
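To check the same thing on another machine, the cache files can be listed with their sizes. This is a minimal sketch, assuming the default locations mentioned in this issue (~/.config/miopen for find-db and ~/.cache/miopen for kernel binaries); `list_miopen_caches` is a hypothetical helper name, and the `home` parameter exists only so a different root can be substituted.

```python
from pathlib import Path


def list_miopen_caches(home=None):
    """Return (relative_path, size_in_bytes) for find-db and kernel-cache files."""
    root = Path(home) if home is not None else Path.home()
    patterns = [
        ".config/miopen/*.ufdb.txt",  # user find-db
        ".cache/miopen/**/*.ukdb",    # kernel cache, possibly in a versioned subdirectory
    ]
    found = []
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            found.append((str(path.relative_to(root)), path.stat().st_size))
    return found


# Usage:
#   for name, size in list_miopen_caches():
#       print(f"{name}: {size / 1024:.0f} KB")
```

Non-empty results for both patterns before the second run would confirm that the slow start is not caused by missing cache files.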