[Issue]: Find-db cache ineffective across processes, invokers not rebuilt from cached kernels #3553

@FIM43-Redeye

Problem Description

When using cudnn.benchmark=True (via PyTorch on ROCm), MIOpen's find-db cache does not prevent re-benchmarking across process restarts. Each new process takes ~15-16 seconds per convolution layer despite:

  • Find-db entries existing in ~/.config/miopen/*.ufdb.txt
  • Kernel binaries being cached in ~/.cache/miopen/*.ukdb

This results in ~280-300 second startup times for typical CNNs on every process launch.

Timing Evidence:

Within single process (cache works):

First forward:  15.96s
Second forward: 0.0023s
Third forward:  0.0022s

Across processes (cache fails):

Process 1: 15.74s
Process 2: 15.54s  <- Should reuse cache, but doesn't
Process 3: 15.61s

Root Cause (traced through source):

The TryLoad() function in src/include/miopen/find_db.hpp:165-172 calls Validate() which checks if kernel invokers exist via handle.GetInvoker(). However, InvokerCache (in src/include/miopen/invoker_cache.hpp) is an in-memory std::map<> with no disk persistence.

When a new process starts:

  1. Find-db loads successfully (in_sync = true)
  2. Validate() checks for invokers - none exist (they were in-memory only)
  3. Validate() returns TRUE ("rebuild needed")
  4. Condition in_sync && !TRUE = FALSE, falls through to full re-benchmark
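The four steps above can be modelled with a small toy, given here as a sketch rather than MIOpen code. ToyProcess, the solver names, and the dict standing in for the on-disk find-db are all hypothetical; only the control flow (in-sync record, missing invokers, fall-through to benchmarking) follows the description in this issue.

```python
class ToyProcess:
    """Toy stand-in for one process. Hypothetical names, not MIOpen's API."""

    def __init__(self, disk_find_db):
        self.find_db = disk_find_db   # shared dict: persists "across processes" like *.ufdb.txt
        self.invokers = {}            # in-memory only: starts empty in every new process
        self.benchmarks_run = 0

    def validate(self, config):
        # Mirrors the described behaviour: True means "rebuild needed"
        # because at least one invoker is missing from the in-memory cache.
        return any((config, s) not in self.invokers for s in self.find_db[config])

    def find(self, config):
        in_sync = config in self.find_db
        if in_sync and not self.validate(config):
            return self.find_db[config]          # fast path: record and invokers both present
        # Fall through: full re-benchmark, which also populates the invoker cache.
        self.benchmarks_run += 1
        solvers = ["solver_a", "solver_b"]       # placeholder benchmark result
        self.find_db[config] = solvers
        for s in solvers:
            self.invokers[(config, s)] = object()
        return solvers


disk = {}
first = ToyProcess(disk)
first.find("conv")
first.find("conv")             # same process: fast path
second = ToyProcess(disk)      # "new process": find-db is there, invokers are not
second.find("conv")
print(first.benchmarks_run, second.benchmarks_run)
```

In this toy, the second process benchmarks again even though the find-db record survived, which is the same shape as the timings reported above.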

The kernel binaries ARE cached in .ukdb, and code to rebuild invokers without benchmarking exists in Solution::Run() (src/solution.cpp:259-266), but TryLoad() doesn't use this path.

Workaround:

torch.backends.cudnn.benchmark = False  # 280s -> 0.05s startup, ~3% throughput cost

Operating System

Ubuntu 25.10 (Questing Quokka)

CPU

AMD Ryzen 9 7940HS w/ Radeon 780M Graphics

GPU

AMD Radeon RX 7700S (gfx1102, RDNA3, 32 CUs)

ROCm Version

ROCm 7.1.0

ROCm Component

MIOpen

Steps to Reproduce

# reproducer.py - Run twice in separate Python processes
import torch
import torch.nn as nn
import time

torch.backends.cudnn.benchmark = True

model = nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 64, 64, device='cuda')

torch.cuda.synchronize()
start = time.perf_counter()
y = model(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"First conv forward: {elapsed:.2f}s")

# Run 1: python reproducer.py  -> ~16s (expected, benchmarking)
# Run 2: python reproducer.py  -> ~16s (BUG: should use cached results)
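To make the "separate processes" part repeatable, a small driver along these lines could launch the script several times and time each launch. This is a sketch; time_fresh_processes is a hypothetical helper name, and the wall-clock figure it reports includes interpreter startup and the torch import, so it will be somewhat larger than the number printed by the reproducer itself.

```python
import subprocess
import sys
import time


def time_fresh_processes(cmd, runs=3):
    """Run cmd in `runs` separate processes; return wall-clock seconds for each launch."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)   # each call starts a brand-new process
        timings.append(time.perf_counter() - start)
    return timings


# Example usage (assumes reproducer.py is in the current directory):
#   for i, s in enumerate(time_fresh_processes([sys.executable, "reproducer.py"]), 1):
#       print(f"Process {i}: {s:.2f}s")
```

If the find-db results were being reused, the second and later launches would be expected to finish much faster than the first.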

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.18
Runtime Ext Version:     1.14
System Timestamp Freq.:  1000.000000MHz
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
Agent 1: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (CPU)
Agent 2: gfx1102 / AMD Radeon RX 7700S (dGPU, 16 CUs)
Agent 3: gfx1103 / AMD Radeon Graphics (iGPU)

(Truncated for brevity - full output available on request)

Additional Information

Other version numbers:

  • MIOpen: 3.5.1.70100-20~24.04 (miopen-hip package)
  • PyTorch: 2.11.0.dev20251220+rocm7.1

Detailed root cause analysis:

  1. TryLoad() at src/include/miopen/find_db.hpp:165:
if(record.in_sync && !record.Validate(handle, network_config))
{
    // Only reaches here if invokers exist in memory
    auto solutions = std::vector<Solution>{};
    record.CopyTo(solutions);
    return solutions;
}
MIOPEN_LOG_I("Find-db regenerating.");  // Falls through to benchmark
  2. Validate() at src/find_db.cpp:244:
if(!handle.GetInvoker(config, {{pair.first}}))
{
    unbuilt = true;  // No invoker in cache
    break;
}
  3. InvokerCache at src/include/miopen/invoker_cache.hpp:70:
std::map<std::string, Item> invokers;  // In-memory only, no Save/Load

Potential fix direction:

When in_sync is true but Validate() fails (invokers missing), use the existing FindSolution() + PrepareInvoker() path (as seen in Solution::Run()) to rebuild invokers from cached .ukdb binaries without re-running the benchmark.
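The intended control flow can be illustrated with a toy Python sketch. It is not MIOpen code: try_load, build_invoker, and benchmark are hypothetical stand-ins, with build_invoker playing the role of the FindSolution() + PrepareInvoker() path and the dicts standing in for the find-db and the in-memory invoker cache.

```python
def try_load(find_db, invokers, config, build_invoker, benchmark):
    """Toy version of the proposed flow; every name here is hypothetical."""
    record = find_db.get(config)
    if record is None:
        # No find-db record at all: benchmarking is unavoidable.
        record = benchmark(config)
        find_db[config] = record
    # Record is available: rebuild only the missing invokers instead of
    # discarding the record and benchmarking again.
    for solver in record:
        key = (config, solver)
        if key not in invokers:
            invokers[key] = build_invoker(config, solver)
    return record


calls = {"benchmark": 0, "build": 0}

def fake_benchmark(config):
    calls["benchmark"] += 1
    return ["solver_a", "solver_b"]

def fake_build(config, solver):
    calls["build"] += 1
    return (config, solver)

disk = {}
try_load(disk, {}, "conv", fake_build, fake_benchmark)   # first "process"
try_load(disk, {}, "conv", fake_build, fake_benchmark)   # second "process", empty invoker cache
print(calls)
```

In this sketch the second call rebuilds its invokers but never reaches the benchmark, which is the behaviour the fix direction is aiming for.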

Questions:

  1. Is this behavior intentional?
  2. Would a PR implementing the fix be welcome?

Cache file evidence (local):

~/.config/miopen/gfx1102_16.HIP.3_5_1_*.ufdb.txt  (find-db, ~900KB)
~/.cache/miopen/3.5.1.*/gfx1102_16.ukdb           (kernel cache, ~800KB)

Both files exist and have valid content, but are not effectively used across processes.
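For anyone wanting to confirm the same state on their own machine, a short script like the following lists the two kinds of cache files and their sizes. It is a sketch that assumes the default locations shown above; list_miopen_caches is a hypothetical helper, and the directories may differ if MIOpen has been configured to use other paths.

```python
from pathlib import Path


def list_miopen_caches(config_dir=None, cache_dir=None):
    """Return (path, size_in_bytes) for find-db and kernel-cache files, if present."""
    config_dir = Path(config_dir) if config_dir else Path.home() / ".config" / "miopen"
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "miopen"
    found = []
    for path in sorted(config_dir.glob("*.ufdb.txt")):   # user find-db
        found.append((str(path), path.stat().st_size))
    for path in sorted(cache_dir.rglob("*.ukdb")):       # kernel cache, in versioned subdirectories
        found.append((str(path), path.stat().st_size))
    return found


for name, size in list_miopen_caches():
    print(f"{size:>10d}  {name}")
```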
