Description
Hi,
The smaller reproductions below were provided by dzaima following
my initial issue report (see the end of this post).
Clang generates wrong code for the following 2 (3) reproductions, which
leaves squares[0] with a garbage value after a do-while loop.
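For reference, here is a minimal sketch of what the do-while loop in the reproductions is expected to do. The helper name fill_squares is hypothetical and does not appear in the reproductions; it only isolates the loop. Since b &= b - 1 clears the lowest set bit, the loop should run once per set bit (and at least once), so squares[0] should always be written.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the reproductions): runs the same
// do-while loop as do_table and returns how many entries it wrote.
int fill_squares(uint64_t b, int (&squares)[7]) {
    int size = 0;
    do {
        assert(size < 7);       // stay inside the array
        squares[size++] = 123;  // first iteration always writes squares[0]
        b &= b - 1;             // clear the lowest set bit of b
    } while (b);
    return size;
}
```

With the input used in the reproductions, 0x20202 (three set bits), the loop should write squares[0] through squares[2], so the check squares[0] < 0 should never be true.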
Reproduction 1
#include <cstdlib>
#include <cstdint>
#include <cstdio>
__attribute__((noinline,optnone))
void print_failure(int* f, int z) {
  printf("FAIL! ptr=%p, squares[0] =? %d\n", (void*)f, z);
  for (int i = 0; i < 7; i++) {
    printf("squares[%d] = %d\n", i, f[i]);
  }
  printf("\n");
}
__attribute__((noinline))
int do_table(uint64_t* board, int cond) {
  int squares[7];
  int size = 0;
  if (cond==123) {
    printf("the untaken if\n");
    uint64_t b = *board;
    do { squares[size++] = 123; b&=b-1; } while (b);
    squares[0] = squares[1];
  }
  uint64_t b = *board;
  do { squares[size++] = 123; b&=b-1; } while (b);
  if (squares[0] < 0) {
    print_failure(squares, squares[0]);
    exit(20);
  }
  for (int i = 0; i < size; ++i) squares[i]^= 123;
  return 0;
}
__attribute__((optnone))
int main(int argc, char* argv[]) {
  uint64_t v = 0x20202;
  int cond = -12345;
  do_table(&v, cond);
  printf("didn't fail");
}
// -O3 -march=znver4 -flto=full
Reproduction 2
#include <cstdlib>
#include <cstring>
#include <cstdint>
#include <cstdio>
__attribute__((noinline,optnone))
void print_failure(int* f, int z) {
  printf("FAIL! ptr=%p, squares[0] =? %d\n", (void*)f, z);
  for (int i = 0; i < 7; i++) {
    printf("squares[%d] = %d\n", i, f[i]);
  }
  printf("\n");
}
__attribute__((noinline))
int do_table(uint64_t* board, int cond) {
  int squares[7];
  int size = 0;
  if (cond==123) {
    printf("the untaken if\n");
    uint64_t b = *board;
    do { squares[size++] = 123; b&=b-1; } while (b);
    squares[0] = squares[1];
  }
  uint64_t b = *board;
  do { squares[size++] = 123; b&=b-1; } while (b);
  if (squares[0] < 0) {
    print_failure(squares, squares[0]);
    exit(20);
  }
  #pragma clang loop vectorize_width(16)
  for (int i = 0; i < size; ++i) squares[i]^= 123;
  return 0;
}
__attribute__((optnone))
int main(int argc, char* argv[]) {
  uint64_t v = 0x20202;
  int cond = -12345;
  do_table(&v, cond);
  printf("didn't fail");
}
// -O3 -flto=full
Initial Issue description
In advance, I'm sorry that I can't provide a smaller reproduction at the moment; it's hard for us to diagnose why this behaves the way it does.
Below I explain the issue that we ran into in official-stockfish/Stockfish#4450.
We later merged a temporary workaround, but ultimately we believe that we've come across a compiler bug.
Prerequisites:
- AVX512 CPU or Intel SDE
- clone https://github.com/Disservin/Stockfish.git and git checkout minimal-repo
Reproduction:
- cd src && make -j build ARCH=x86-64-avx512 COMP=clang CXX=clang++-18 EXTRACXXFLAGS="-g3 -fno-omit-frame-pointer" (or clang 17)
- Run
./stockfish
or when using SDE
sde -spr -- ./stockfish
➜ src git:(minimal-repo) ✗ sde -spr -- ./stockfish
info string Found 1 tablebases
[1] 708806 segmentation fault (core dumped) sde -spr -- ./stockfish
We are currently under the impression that this might be a compiler bug in clang.
What we have tested so far:
- does not crash with -O1
- does not crash with debug=yes or optimize=no
- does not crash if LTO (link-time optimization) is disabled.
- does not crash when compiled with gcc 12.2.0 (LTO enabled).
- does not reproduce under most sanitizers (excluding -fsanitize=nullability-assign)
Only the architectures below are problematic; others do not crash.
x86-64-vnni512
x86-64-avx512
What we have so far diagnosed:
diff --git a/src/syzygy/tbprobe.cpp b/src/syzygy/tbprobe.cpp
index ad15e751..95aefbfe 100644
--- a/src/syzygy/tbprobe.cpp
+++ b/src/syzygy/tbprobe.cpp
@@ -730,8 +730,8 @@ Ret do_probe_table(const Position& pos, T* entry, WDLScore wdl, ProbeState* resu
// THIS EXITS??!!
- // if (squares[0] < 0)
- // _exit(20);
+ if (squares[0] < 0)
+ _exit(20);
 d = entry->get(stm, tbFile);
- The exit here is triggered; squares[0] seems to have a garbage value at this point, but we are not sure why this would be the case. The do-while loop should have executed and set it to at least some non-negative, non-garbage value.
- Running sde with the align checker reported the following, though I'm not sure whether the two things are related.
TID: 0 executed instruction with an unaligned memory reference to address 0x7ffe90e6adad INSTR: 0x7f6ade5923dc: IFORM: VMOVDQU64_YMMu64_MASKmskw_MEMu64_AVX512 :: vmovdqu64 ymm17, ymmword ptr [rdi]
IMAGE: /lib/x86_64-linux-gnu/libc.so.6
FUNCTION: __strrchr_evex
FUNCTION ADDR: 0x7f6ade5923c0
# $eof
- Running sde as sde -spr -null_check 1 -ptr-check 1 -- ./stockfish returned
➜ src git:(minimal-repo) ✗ sde -spr -null_check 1 -ptr-check 1 -- ./stockfish
info string Found 1 tablebases
SDE ERROR: DEREFERENCING BAD MEMORY POINTER PC=0x559415e9f021 MEMEA=0x55931c9ec8f4 mov eax, dword ptr [r8+rax*4]
Image: /home/max/Documents/Github/Stockfish/src/stockfish+0x6021 (in multi-region image, region# 1)
Function: main
- The crash happens when using squares[0] as an index
Program received signal SIGSEGV, Segmentation fault.
0x000055b3c3aca021 in Stockfish::(anonymous namespace)::do_probe_table<Stockfish::(anonymous namespace)::TBTable<(Stockfish::(anonymous namespace)::TBType)0>, Stockfish::Tablebases::WDLScore> (pos=...,
wdl=Stockfish::Tablebases::WDLDraw, entry=<optimized out>, result=<optimized out>) at syzygy/tbprobe.cpp:825
825 idx = (MapA1D1D4[squares[0]] * 63 + (squares[1] - adjust1)) * 62 + squares[2] - adjust2;
- entry->hasPawns should be false, and it is false when inspected in the debugger (gdb).
- Making random changes to the body of the following if statement (e.g. commenting something out) also fixes it: if (entry->hasPawns). This branch is not taken.
- Initializing Square squares[TBPIECES] to some value, e.g. = {};, seems to fix it. However, squares shouldn't need this initialization since it's not accessed in an unsafe way; this also seems like a random change, similar to the one mentioned earlier.
- The max_element function mentioned in the issue is probably completely unrelated because this branch isn't taken, but perhaps this branch is causing weird optimizations.
- The repo mentioned above is a reduced version of the official one. Our master branch currently has a workaround that disables optimizations for that function: https://github.com/official-stockfish/Stockfish/blob/master/src/syzygy/tbprobe.cpp#L711
- We have a somewhat smaller reproduction on godbolt; however, this (unfortunately) still includes a null pointer dereference, which our code does not have... though maybe it is of help? https://godbolt.org/z/MqxcW671j
- With the trimmed-down repo, I was able to reproduce it with clang 16 too after removing -fno-omit-frame-pointer; the original issue mentioned clang 15 as well, though.
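To make the two workarounds mentioned above concrete, here is a rough sketch applied to a simplified stand-in for the affected function. The names NO_OPT_WORKAROUND, probe_zero_init and probe_no_opt are hypothetical and are not taken from the Stockfish source; the exact macro and its guard condition in tbprobe.cpp may differ. Both variants only hide the symptom and do not explain the miscompilation.

```cpp
#include <cstdint>

// Hypothetical macro: disable optimizations for one function under clang,
// and expand to nothing for other compilers.
#if defined(__clang__)
    #define NO_OPT_WORKAROUND __attribute__((optnone))
#else
    #define NO_OPT_WORKAROUND
#endif

// Workaround 1: value-initialize the array so every element starts at 0.
__attribute__((noinline))
int probe_zero_init(uint64_t b) {
    int squares[7] = {};
    int size = 0;
    do { squares[size++] = 123; b &= b - 1; } while (b);
    return squares[0];
}

// Workaround 2: leave the array uninitialized, but disable optimizations
// for this function (noinline is required alongside optnone in clang).
__attribute__((noinline)) NO_OPT_WORKAROUND
int probe_no_opt(uint64_t b) {
    int squares[7];
    int size = 0;
    do { squares[size++] = 123; b &= b - 1; } while (b);
    return squares[0];
}
```

In both variants squares[0] is written by the first loop iteration, so either function should return 123 for any input with at most seven set bits.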
I will continue trying to come up with a smaller reproduction if this is too vague for you.
Any help is much appreciated :D
You can also reproduce this on the master repository if you want, by doing the following:
- Get syzygy tablebases
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" -e robots=off https://tablebase.lichess.ovh/tables/standard/3-4-5/
- clone https://github.com/official-stockfish/Stockfish
Reproduction:
- Remove the current workaround CLANG_AVX512_BUG_FIX from do_probe_table inside src/syzygy/tbprobe.cpp L711.
- cd src && make -j build ARCH=x86-64-avx512 COMP=clang CXX=clang++-17 EXTRACXXFLAGS="-g3" (or clang 15-18)
- Create a text file with this content, replacing PATH with the syzygy tablebases directory which you got earlier
setoption name SyzygyPath value PATH
position fen 8/8/3K4/1r6/8/8/4k3/2R5 b - - 0 18
go
ucinewgame
- Run
./stockfish < input
or when using SDE
sde -spr -- ./stockfish < input
➜ src git:(master) ✗ ./stockfish < input
Stockfish dev-20240126-fcbb02ff by the Stockfish developers (see AUTHORS file)
info string Found 145 tablebases
info string NNUE evaluation using nn-baff1ede1f90.nnue
info string NNUE evaluation using nn-baff1edbea57.nnue
[1] 291189 segmentation fault (core dumped) ./stockfish < input