Commit 60404a8

Always use tinyBLAS with AMD GPUs on Windows
When llamafile uses hipBLAS with ROCm SDK 5.7.1 on Windows 10, the process crashes shortly after tokens start getting printed. This is possibly the worst heisenbug I've ever seen in my career. It seems to crash in AMD code, in a separate thread, inside hipGraphicsUnregisterResource, while a vqmovdqu instruction is being executed. Meanwhile, cosmo's main thread is usually doing something like std::string and std::locale work that appears unrelated. It could possibly be related to C++ exceptions and thread-local storage. Using --tinyblas appears to make it go away, but I can't say for certain that hipBLAS has anything to do with it, since the bug might simply stop manifesting because the binary footprint, stack, or heap memory layout changed. Let's keep our fingers crossed that tinyBLAS will save us from this issue. Note also that no one else has reported the bug, even though it's been impacting me for months.
1 parent a28250b commit 60404a8

2 files changed (+16, -7 lines)

llama.cpp/ggml-cuda.cu

Lines changed: 4 additions & 2 deletions
@@ -10884,8 +10884,8 @@ static ggml_cuda_device_info ggml_cuda_init() {
     // Workaround for a rocBLAS bug when using multiple graphics cards:
     // https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1346
 #ifndef GGML_USE_TINYBLAS
-    rocblas_initialize();
-    CUDA_CHECK(cudaDeviceSynchronize());
+    // rocblas_initialize(); // already called
+    // CUDA_CHECK(cudaDeviceSynchronize());
 #endif
 #endif

@@ -13507,7 +13507,9 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
         GGML_ASSERT(stat == cudaSuccess);
     }
     // Launch graph
+    printf("cudaGraphLaunch begin\n");
     CUDA_CHECK(cudaGraphLaunch(cuda_ctx->cuda_graph->instance, cuda_ctx->stream()));
+    printf("cudaGraphLaunch done\n");
 #else
     graph_evaluated_or_captured = true;
 #endif // USE_CUDA_GRAPH

llamafile/cuda.c

Lines changed: 12 additions & 5 deletions
@@ -559,7 +559,14 @@ static bool compile_amd_windows(const char *clangxx, const char *dso, const char
         (char *)offload_arch,
         "-Wno-ignored-attributes",
         "-D_CRT_SECURE_NO_WARNINGS",
-        COMMON_FLAGS,
+        "-DGGML_BUILD=1",
+        "-DGGML_SHARED=1",
+        "-DGGML_MULTIPLATFORM",
+        "-DGGML_CUDA_DMMV_X=32",
+        "-DK_QUANTS_PER_ITERATION=2",
+        "-DGGML_CUDA_PEER_MAX_BATCH_SIZE=128",
+        "-DGGML_CUDA_MMV_Y=1",
+        "-DGGML_USE_TINYBLAS",
         "-o",
         (char *)tmpdso,
         (char *)src,
@@ -571,10 +578,10 @@
         "-amdgpu-early-inline-all=true",
         "-isystem",
         gc(xasprintf("%s/include", hip_path)),
-        BLAS_ONLY("-l"),
-        BLAS_ONLY(gc(xasprintf("%s/lib/hipblas.%s", hip_path, lib))),
-        BLAS_ONLY("-l"),
-        BLAS_ONLY(gc(xasprintf("%s/lib/rocblas.%s", hip_path, lib))),
+        /* BLAS_ONLY("-l"), */
+        /* BLAS_ONLY(gc(xasprintf("%s/lib/hipblas.%s", hip_path, lib))), */
+        /* BLAS_ONLY("-l"), */
+        /* BLAS_ONLY(gc(xasprintf("%s/lib/rocblas.%s", hip_path, lib))), */
         "-l",
         gc(xasprintf("%s/lib/amdhip64.%s", hip_path, lib)),
         "-lkernel32",
