The following input example: https://gist.github.com/qedawkins/c620832f96a5c504295f9694cc8956e2
which was approximately compiled from:
#include <hip/hip_runtime.h>
#include <cstdint>

// int8_t a[144]; // [2 x 8 x 3 x 3]
// int8_t b[144]; // [2 x 8 x 3 x 3]
// int c[2];
__global__ void helloworld(int8_t* a, int8_t* b, int* c)
{
    int id = hipThreadIdx_x;
    if (id < 2) {
        // Loop variable renamed from `c` to `ch` to avoid shadowing the output pointer `c`.
        for (int ch = 0; ch < 8; ++ch) {
            for (int h = 0; h < 3; ++h) {
                for (int w = 0; w < 3; ++w) {
                    int x = a[id * 72 + ch * 9 + h * 3 + w];
                    int y = b[id * 72 + ch * 9 + h * 3 + w];
                    c[id] += x * y;
                }
            }
        }
    }
}
When compiled with
llc -O3 module_test_dispatch_0_rocm_hsaco_fb.optimized.ll -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100
this produces incorrect numerics. For the inputs
a = np.ones((2, 8, 3, 3), dtype=np.int8)
b = np.broadcast_to(np.array([1, 2, 1], dtype=np.int8), (2, 8, 3, 3))
In other words, only the values of b vary along the innermost loop. This gives c = [88, 88], whereas the correct result is (1 + 2 + 1) * 3 * 8 = 96 per element, i.e. c = [96, 96].
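For reference, here is a short numpy sketch (not part of the original report) that reproduces the kernel's per-id reduction and confirms the expected result for these inputs:

import numpy as np

a = np.ones((2, 8, 3, 3), dtype=np.int8)
b = np.broadcast_to(np.array([1, 2, 1], dtype=np.int8), (2, 8, 3, 3))

# Accumulate in int32, mirroring the kernel's `int` accumulator, to avoid int8 overflow.
expected = (a.astype(np.int32) * b.astype(np.int32)).reshape(2, -1).sum(axis=1)
print(expected)  # [96 96]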
Compiling with
llc -O3 module_test_dispatch_0_rocm_hsaco_fb.optimized.ll -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -amdgpu-load-store-vectorizer=false
(turning off the load/store vectorizer), or compiling with -O1 instead of -O3, gives the correct values. I am not sure exactly how the generated code is being miscompiled, and I don't know whether load/store vectorization is the root cause or whether disabling it merely masks the problem. Further information about the downstream issue is here: iree-org/iree#18798