
Misc. bug: Vulkan test-opt test_gradient_accumulation failures on AMD #15491

@netrunnereve

Description

Name and Version

version: 6240 (54a241f)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./bin/test-opt

Problem description & steps to reproduce

Certain test_gradient_accumulation tests are failing on the Vulkan backend with my RX 470 and W8100 after the test was enabled in #13873. test-backend-ops passes fine, and I get the same failures on both RADV and AMDVLK.

  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL

If I print out the actual values being compared in the test, using the patch below, I see very small differences in the numbers, and these are what cause the failures (output follows the diff). This is most likely floating-point error: test-backend-ops allows a small tolerance for this, while test-opt currently expects an exact match.

diff --git a/tests/test-opt.cpp b/tests/test-opt.cpp
index f02b4cad8..6d96e0e12 100644
--- a/tests/test-opt.cpp
+++ b/tests/test-opt.cpp
@@ -690,6 +690,8 @@ static std::pair<int, int> test_gradient_accumulation(
             float weights;
             ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
             const bool subtest_ok = weights == (ndata/2) - epoch;
+            if (!subtest_ok)
+                printf("weights %.30f %ld\n", weights, (ndata/2) - epoch);
             helper_after_test_gradient_accumulation(optim, __func__, nbatch_physical, loss_type, epoch, "weights", subtest_ok, ntest, npass);
         }
         {
@@ -701,6 +703,8 @@ static std::pair<int, int> test_gradient_accumulation(
             ggml_opt_result_loss(cd.result, &loss, /*loss_unc =*/ nullptr);
             if (loss_type == GGML_OPT_LOSS_TYPE_SUM) {
                 subtest_ok = subtest_ok && loss == (39.0 - epoch*6.0);
+                if (!subtest_ok)
+                    printf("results %.30f, %.30f\n", loss, 39.0 - epoch*6.0);
             } else if (loss_type == GGML_OPT_LOSS_TYPE_MEAN) {
                 subtest_ok = subtest_ok && almost_equal(loss, (39.0 - epoch*6.0) / ndata, 1e-6);
             } else {
weights 0.000000059604644775390625000000 0
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
weights 0.000000059604644775390625000000 0
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
results 15.000000059604644775390625000000, 15.000000000000000000000000000000
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL
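
For what it's worth, the deltas above sit exactly at single-precision granularity: 0.000000059604644775390625 is 2^-24, one float ULP just below 1.0, and the -0.999999880790710449218750000000 case is off by 2^-23, two ULPs. A tiny standalone check (my own illustration, not part of the test) confirms the scale:

    #include <cstdio>
    #include <cmath>

    int main() {
        // 2^-24 is exactly the weights delta printed above
        printf("2^-24          = %.30f\n", std::ldexp(1.0, -24));
        // spacing of representable floats just below 1.0f, i.e. the smallest
        // error a single rounded fp32 operation can introduce at that scale
        printf("ulp below 1.0f = %.30f\n", (double)(1.0f - std::nextafterf(1.0f, 0.0f)));
        return 0;
    }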

Meanwhile, on Intel integrated graphics with Mesa, the same tests pass in both fp16 and fp32 modes, with no difference at all between the calculated and expected values.
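
If this really is just rounding, one option would be to relax the exact comparisons to a small tolerance, the way the mean-loss case already goes through almost_equal. A minimal self-contained sketch showing that a tolerance would absorb the observed deltas (the almost_equal signature is assumed from the call site in the diff above, and the 1e-6 epsilon is my guess, not a vetted choice):

    #include <cmath>
    #include <cstdio>

    // same idea as the almost_equal() helper test-opt already uses for the
    // mean-loss case; signature assumed from the call site shown above
    static bool almost_equal(const double a, const double b, const double atol) {
        return std::fabs(a - b) < atol;
    }

    int main() {
        // failing values taken from the log above
        printf("weights ok: %d\n", almost_equal(0.000000059604644775390625, 0.0, 1e-6));  // 1
        printf("loss ok:    %d\n", almost_equal(15.000000059604644775390625, 15.0, 1e-6)); // 1
        return 0;
    }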

First Bad Commit

No response
