[Issue]: Find-db cache ineffective across processes, invokers not rebuilt from cached kernels #3553

### Problem Description

When using `cudnn.benchmark=True` (via PyTorch on ROCm), MIOpen's find-db cache does not prevent re-benchmarking across process restarts. Each new process takes ~15-16 seconds per convolution layer despite:
- Find-db entries existing in `~/.config/miopen/*.ufdb.txt`
- Kernel binaries being cached in `~/.cache/miopen/*/*.ukdb`

This results in ~280-300 second startup times for typical CNNs on every process launch.

**Timing Evidence:**

Within single process (cache works):
```
First forward:  15.96s
Second forward: 0.0023s
Third forward:  0.0022s
```

Across processes (cache fails):
```
Process 1: 15.74s
Process 2: 15.54s  <- Should reuse cache, but doesn't
Process 3: 15.61s
```
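The cross-process numbers above can be collected with a small driver that launches the same command in several fresh processes. This is a minimal sketch: `time_fresh_processes` is a hypothetical helper name, and it measures whole-process wall time (interpreter start and `import torch` included), so its figures will be somewhat higher than the per-convolution timings shown above.

```python
import subprocess
import sys
import time


def time_fresh_processes(cmd, runs=3):
    """Run `cmd` in `runs` separate processes; return wall-clock seconds for each."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # a new process, so no in-memory state survives
        timings.append(time.perf_counter() - start)
    return timings


# Example (assumes reproducer.py from "Steps to Reproduce" is in the current directory):
#   for i, s in enumerate(time_fresh_processes([sys.executable, "reproducer.py"]), 1):
#       print(f"Process {i}: {s:.2f}s")
```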

**Root Cause (traced through source):**

The `TryLoad()` function in `src/include/miopen/find_db.hpp:165-172` calls `Validate()` which checks if kernel invokers exist via `handle.GetInvoker()`. However, `InvokerCache` (in `src/include/miopen/invoker_cache.hpp`) is an in-memory `std::map<>` with no disk persistence.

When a new process starts:
1. Find-db loads successfully (`in_sync = true`)
2. `Validate()` checks for invokers - none exist (they were in-memory only)
3. `Validate()` returns `true`, meaning "rebuild needed"
4. The condition `in_sync && !Validate(...)` evaluates to `true && !true`, i.e. `false`, so `TryLoad()` falls through to a full re-benchmark
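The four steps above can be modelled in a few lines of Python. This is a toy illustration of the control flow only, not MIOpen code: `ToyProcess`, the solver ids, and the shared `dict` standing in for the on-disk find-db are all assumptions made for the sketch.

```python
class ToyProcess:
    """Toy model of one process: find-db lives on 'disk', invokers live in memory."""

    def __init__(self, disk_find_db):
        self.disk_find_db = disk_find_db  # shared dict, stands in for *.ufdb.txt
        self.invoker_cache = {}           # fresh per process, like InvokerCache
        self.benchmarks_run = 0

    def validate(self, config):
        """Return True when a rebuild is needed (at least one invoker is missing)."""
        solvers = self.disk_find_db[config]
        return any((config, s) not in self.invoker_cache for s in solvers)

    def try_load(self, config):
        in_sync = config in self.disk_find_db
        if in_sync and not self.validate(config):
            return self.disk_find_db[config]  # fast path: reuse stored results
        # Fall through: full re-benchmark, which also builds the invokers.
        self.benchmarks_run += 1
        solvers = ["solver_a", "solver_b"]    # hypothetical solver ids
        self.disk_find_db[config] = solvers
        for s in solvers:
            self.invoker_cache[(config, s)] = object()
        return solvers


disk = {}
first = ToyProcess(disk)
first.try_load("conv")   # benchmarks once
first.try_load("conv")   # fast path within the same process
second = ToyProcess(disk)
second.try_load("conv")  # find-db is present, yet the benchmark runs again
print(first.benchmarks_run, second.benchmarks_run)  # 1 1
```

In the model, the second process sees a populated find-db but an empty invoker cache, so it repeats the benchmark exactly as observed.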

The kernel binaries ARE cached in `.ukdb`, and code to rebuild invokers without benchmarking exists in `Solution::Run()` (`src/solution.cpp:259-266`), but `TryLoad()` doesn't use this path.

**Workaround:**
```python
torch.backends.cudnn.benchmark = False  # 280s -> 0.05s startup, ~3% throughput cost
```

### Operating System

Ubuntu 25.10 (Questing Quokka)

### CPU

AMD Ryzen 9 7940HS w/ Radeon 780M Graphics

### GPU

AMD Radeon RX 7700S (gfx1102, RDNA3, 32 CUs)

### ROCm Version

ROCm 7.1.0

### ROCm Component

MIOpen

### Steps to Reproduce

```python
# reproducer.py - Run twice in separate Python processes
import torch
import torch.nn as nn
import time

torch.backends.cudnn.benchmark = True

model = nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 64, 64, device='cuda')

torch.cuda.synchronize()
start = time.perf_counter()
y = model(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"First conv forward: {elapsed:.2f}s")

# Run 1: python reproducer.py  -> ~16s (expected, benchmarking)
# Run 2: python reproducer.py  -> ~16s (BUG: should use cached results)
```

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.18
Runtime Ext Version:     1.14
System Timestamp Freq.:  1000.000000MHz
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
Agent 1: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (CPU)
Agent 2: gfx1102 / AMD Radeon RX 7700S (dGPU, 16 CUs)
Agent 3: gfx1103 / AMD Radeon Graphics (iGPU)
```

(Truncated for brevity - full output available on request)

### Additional Information

**Other version numbers:**
- MIOpen: 3.5.1.70100-20~24.04 (miopen-hip package)
- PyTorch: 2.11.0.dev20251220+rocm7.1

**Detailed root cause analysis:**

1. `TryLoad()` at `src/include/miopen/find_db.hpp:165`:
```cpp
if(record.in_sync && !record.Validate(handle, network_config))
{
    // Only reaches here if invokers exist in memory
    auto solutions = std::vector<Solution>{};
    record.CopyTo(solutions);
    return solutions;
}
MIOPEN_LOG_I("Find-db regenerating.");  // Falls through to benchmark
```

2. `Validate()` at `src/find_db.cpp:244`:
```cpp
if(!handle.GetInvoker(config, {{pair.first}}))
{
    unbuilt = true;  // No invoker in cache
    break;
}
```

3. `InvokerCache` at `src/include/miopen/invoker_cache.hpp:70`:
```cpp
std::map<std::string, Item> invokers;  // In-memory only, no Save/Load
```
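To check which path a given run takes, the `Find-db regenerating.` message from step 1 can be searched for in MIOpen's log output. The sketch below assumes that `MIOPEN_ENABLE_LOGGING=1` with `MIOPEN_LOG_LEVEL=6` is verbose enough to emit `MIOPEN_LOG_I` messages; the helper names are hypothetical, and since the output stream is not established in this report, both stdout and stderr are searched.

```python
import os
import subprocess
import sys


def reproducer_cmd(script="reproducer.py"):
    """Command line for the reproducer; the script path is an assumption."""
    return [sys.executable, script]


def count_find_db_regenerations(cmd):
    """Run `cmd` with MIOpen logging enabled; count 'Find-db regenerating.' messages."""
    env = dict(os.environ)
    env["MIOPEN_ENABLE_LOGGING"] = "1"
    env["MIOPEN_LOG_LEVEL"] = "6"  # assumed to include info-level messages
    result = subprocess.run(cmd, env=env, capture_output=True, text=True)
    combined = result.stdout + result.stderr  # search both streams
    return combined.count("Find-db regenerating.")


# Example: a non-zero count on a second run would confirm the fall-through.
#   print(count_find_db_regenerations(reproducer_cmd()))
```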

**Potential fix direction:**

When `in_sync` is true but `Validate()` fails (invokers missing), use the existing `FindSolution()` + `PrepareInvoker()` path (as seen in `Solution::Run()`) to rebuild invokers from cached `.ukdb` binaries without re-running the benchmark.
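The proposed flow can be sketched in Python as a toy model. Nothing here is MIOpen API: the function name, the dictionaries standing in for the find-db, the `.ukdb` kernel cache, and the in-memory invoker cache, and the `benchmark` callback are all assumptions used to show the intended branching.

```python
def try_load_with_rebuild(config, find_db, kernel_db, invokers, benchmark):
    """Toy version of the proposed TryLoad() flow.

    find_db:   dict config -> list of solver ids   (stands in for *.ufdb.txt)
    kernel_db: dict (config, solver) -> binary     (stands in for *.ukdb)
    invokers:  dict (config, solver) -> callable   (per-process, in memory)
    benchmark: callable(config) -> list of solver ids (the slow path)
    """
    solvers = find_db.get(config)
    if solvers is None:
        # No find-db record at all: benchmarking is unavoidable.
        solvers = benchmark(config)
        find_db[config] = solvers
        return solvers

    for solver in solvers:
        key = (config, solver)
        if key in invokers:
            continue
        binary = kernel_db.get(key)
        if binary is None:
            # The kernel binary is missing too, so fall back to the benchmark.
            solvers = benchmark(config)
            find_db[config] = solvers
            return solvers
        # Stand-in for FindSolution() + PrepareInvoker(): wrap the cached binary.
        invokers[key] = lambda b=binary: b
    return solvers
```

With a populated find-db and kernel cache but an empty invoker cache (the fresh-process case), this path fills `invokers` without calling `benchmark`.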

**Questions:**
1. Is this behavior intentional?
2. Would a PR implementing the fix be welcome?

**Cache file evidence (local):**
```
~/.config/miopen/gfx1102_16.HIP.3_5_1_*.ufdb.txt  (find-db, ~900KB)
~/.cache/miopen/3.5.1.*/gfx1102_16.ukdb           (kernel cache, ~800KB)
```

Both files exist and have valid content, but are not effectively used across processes.
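A short script can list these files with their sizes and modification times, which makes it easy to compare the state before and after a second run. This is a sketch under the assumption that the default directories shown above are in use; `list_miopen_cache_files` is a hypothetical helper name.

```python
import glob
import os


def list_miopen_cache_files(config_dir="~/.config/miopen", cache_dir="~/.cache/miopen"):
    """Return (path, size_bytes, mtime) for find-db and kernel-cache files."""
    config_dir = os.path.expanduser(config_dir)
    cache_dir = os.path.expanduser(cache_dir)
    patterns = [
        os.path.join(config_dir, "*.ufdb.txt"),  # user find-db
        os.path.join(cache_dir, "*.ukdb"),       # kernel cache, flat layout
        os.path.join(cache_dir, "*", "*.ukdb"),  # kernel cache, per-version subdirectory
    ]
    entries = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            stat = os.stat(path)
            entries.append((path, stat.st_size, stat.st_mtime))
    return entries


for path, size, mtime in list_miopen_cache_files():
    print(f"{size:>10d}  {mtime:.0f}  {path}")
```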
