Name and Version
version: 6240 (54a241f)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./bin/test-opt
Problem description & steps to reproduce
Certain test_gradient_accumulation tests are failing on the Vulkan backend with my RX 470 and W8100 after the test was enabled in #13873. test-backend-ops passes fine, and I get the same failures on both RADV and AMDVLK.
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL
If I print out the actual values being compared in the test (using the patch below), I see very small differences in the numbers, and those are what cause the failures. This might be due to floating point rounding errors (test-backend-ops allows a small tolerance for this, but test-opt currently expects an exact match).
------------------------------ tests/test-opt.cpp ------------------------------
index f02b4cad8..6d96e0e12 100644
@@ -690,6 +690,8 @@ static std::pair<int, int> test_gradient_accumulation(
float weights;
ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
const bool subtest_ok = weights == (ndata/2) - epoch;
+ if (!subtest_ok)
+ printf("weights %.30f %ld\n", weights, (ndata/2) - epoch);
helper_after_test_gradient_accumulation(optim, __func__, nbatch_physical, loss_type, epoch, "weights", subtest_ok, ntest, npass);
}
{
@@ -701,6 +703,8 @@ static std::pair<int, int> test_gradient_accumulation(
ggml_opt_result_loss(cd.result, &loss, /*loss_unc =*/ nullptr);
if (loss_type == GGML_OPT_LOSS_TYPE_SUM) {
subtest_ok = subtest_ok && loss == (39.0 - epoch*6.0);
+ if (!subtest_ok)
+ printf("results %.30f, %.30f\n", loss, 39.0 - epoch*6.0);
} else if (loss_type == GGML_OPT_LOSS_TYPE_MEAN) {
subtest_ok = subtest_ok && almost_equal(loss, (39.0 - epoch*6.0) / ndata, 1e-6);
} else {
weights 0.000000059604644775390625000000 0
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
weights 0.000000059604644775390625000000 0
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=3, subtest=weights, optimizer=adamw): FAIL
weights -0.999999880790710449218750000000 -1
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=weights, optimizer=adamw): FAIL
results 15.000000059604644775390625000000, 15.000000000000000000000000000000
test_gradient_accumulation(high_level=no, nbatch_physical=1, loss_type=sum, epoch=4, subtest=results, optimizer=adamw): FAIL
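As a side note, the deltas printed above are exactly at single-precision rounding scale: 0.000000059604644775390625 is 2^-24 (half of FLT_EPSILON) and 0.99999988079071044921875 is 1 - FLT_EPSILON (1 - 2^-23), which fits the floating point explanation. A quick standalone check of those constants (my own illustration, not part of the test):

// Illustration only: the observed deltas correspond to float rounding units.
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    printf("2^-24           = %.30f\n", std::ldexp(1.0, -24));       // 0.000000059604644775390625 (weights/results delta)
    printf("FLT_EPSILON     = %.30f\n", (double) FLT_EPSILON);       // 2^-23
    printf("1 - FLT_EPSILON = %.30f\n", 1.0 - (double) FLT_EPSILON); // 0.99999988079071044921875, matches |weights| at epoch=4
    return 0;
}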
Meanwhile, on Intel integrated graphics with Mesa, the same tests pass in both fp16 and fp32 modes, with no difference at all between the calculated and expected values.
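If the root cause is indeed rounding, one possible direction (a sketch only, untested; it reuses the almost_equal helper that the GGML_OPT_LOSS_TYPE_MEAN branch already calls with (value, expected, tolerance), and 1e-6 is just an illustrative tolerance) would be to relax the two exact comparisons:

--- tests/test-opt.cpp (sketch, not a proposed patch)
-    const bool subtest_ok = weights == (ndata/2) - epoch;
+    const bool subtest_ok = almost_equal(weights, (ndata/2) - epoch, 1e-6);
...
-            subtest_ok = subtest_ok && loss == (39.0 - epoch*6.0);
+            subtest_ok = subtest_ok && almost_equal(loss, 39.0 - epoch*6.0, 1e-6);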
First Bad Commit
No response