
Misc. bug: Vulkan test-opt test_gradient_accumulation failures on AMD #15491

@netrunnereve

Description

Name and Version

version: 6240 (54a241f)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./bin/test-opt

Problem description & steps to reproduce

Certain test_gradient_accumulation tests are failing on the Vulkan backend with my RX 470 and W8100 after the test was enabled in #13873. test-backend-ops passes fine, and I get the same failures on both RADV and AMDVLK.

  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL

If I print out the actual values being compared in the test, using the patch below, I see very small differences in the numbers, and these are what cause the failures (output follows the diff). This is most likely floating-point error: test-backend-ops allows a small tolerance for this, while test-opt currently expects an exact match.

diff --git a/tests/test-opt.cpp b/tests/test-opt.cpp
index f02b4cad8..6d96e0e12 100644
--- a/tests/test-opt.cpp
+++ b/tests/test-opt.cpp
@@ -690,6 +690,8 @@ static std::pair<int, int> test_gradient_accumulation(
             float weights;
             ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
             const bool subtest_ok = weights == (ndata/2) - epoch;
+            if (!subtest_ok)
+                printf("weights %.30f %ld\n", weights, (ndata/2) - epoch);
             helper_after_test_gradient_accumulation(optim, __func__, nbatch_physical, loss_type, epoch, "weights", subtest_ok, ntest, npass);
         }
         {
@@ -701,6 +703,8 @@ static std::pair<int, int> test_gradient_accumulation(
             ggml_opt_result_loss(cd.result, &loss, /*loss_unc =*/ nullptr);
             if (loss_type == GGML_OPT_LOSS_TYPE_SUM) {
                 subtest_ok = subtest_ok && loss == (39.0 - epoch*6.0);
+                if (!subtest_ok)
+                    printf("results %.30f, %.30f\n", loss, 39.0 - epoch*6.0);
             } else if (loss_type == GGML_OPT_LOSS_TYPE_MEAN) {
                 subtest_ok = subtest_ok && almost_equal(loss, (39.0 - epoch*6.0) / ndata, 1e-6);
             } else {
weights 0.000000059604644775390625000000 0
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
  test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
weights 0.000000059604644775390625000000 0
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
results 15.000000059604644775390625000000, 15.000000000000000000000000000000
  test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL
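
For what it's worth, the deltas above sit exactly at single-precision granularity: 0.000000059604644775390625 is 2^-24, one float ULP just below 1.0, and the -0.999999880790710449218750000000 case is off by 2^-23, two ULPs. A tiny standalone check (my own illustration, not part of the test) confirms the scale:

    #include <cstdio>
    #include <cmath>

    int main() {
        // 2^-24 is exactly the weights delta printed above
        printf("2^-24          = %.30f\n", std::ldexp(1.0, -24));
        // spacing of representable floats just below 1.0f, i.e. the smallest
        // error a single rounded fp32 operation can introduce at that scale
        printf("ulp below 1.0f = %.30f\n", (double)(1.0f - std::nextafterf(1.0f, 0.0f)));
        return 0;
    }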

Meanwhile, on Intel integrated graphics with Mesa, the same tests pass in both fp16 and fp32 modes, with no difference at all between the calculated and expected values.
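
If this really is just rounding, one option would be to relax the exact comparisons to a small tolerance, the way the mean-loss case already goes through almost_equal. A minimal self-contained sketch showing that a tolerance would absorb the observed deltas (the almost_equal signature is assumed from the call site in the diff above, and the 1e-6 epsilon is my guess, not a vetted choice):

    #include <cmath>
    #include <cstdio>

    // same idea as the almost_equal() helper test-opt already uses for the
    // mean-loss case; signature assumed from the call site shown above
    static bool almost_equal(const double a, const double b, const double atol) {
        return std::fabs(a - b) < atol;
    }

    int main() {
        // failing values taken from the log above
        printf("weights ok: %d\n", almost_equal(0.000000059604644775390625, 0.0, 1e-6));  // 1
        printf("loss ok:    %d\n", almost_equal(15.000000059604644775390625, 15.0, 1e-6)); // 1
        return 0;
    }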

First Bad Commit

No response
