
Commit 698de91

Disable GGML_CUDA_FORCE_MMQ
We enabled this option earlier because it is faster than tinyBLAS. It is, however, enormous. It has grown so much over the past month that it would cause the llamafile executable to balloon from 30 MB to 219 MB, which is unbelievable. Unfortunately, disabling it will not save us from that bloat. On top of adding all that bloat, the new MMQ code appears to have dropped support for my five-year-old GeForce RTX 2080 Ti, which is probably an unintended bug, but we need this change to unblock the release.
1 parent 65745b0 · commit 698de91

File tree: 1 file changed (+8 −1 lines)

llama.cpp/ggml-cuda.cu

Lines changed: 8 additions & 1 deletion
```diff
@@ -341,8 +341,15 @@ void ggml_abort(const char * file, int line, const char * fmt, ...) {
 // - 13B quantum model: +200-400 MB
 //
 // [jart] https://github.com/Mozilla-Ocho/llamafile/issues/403#issuecomment-2103687594
+//
+// TODO(jart): oops looks like we can't use this anymore, because my
+// five year old NVIDIA GeForce RTX 2080 Ti card stopped
+// working with "ggml-cuda.cu:11460: ERROR: CUDA kernel
+// mul_mat_q has no device code compatible with CUDA arch
+// 700. ggml-cuda.cu was compiled for: 600,700,800,900"!
+//
 #ifdef GGML_USE_TINYBLAS
-#define GGML_CUDA_FORCE_MMQ // [jart] want this
+// #define GGML_CUDA_FORCE_MMQ // [jart] want this
 #endif
 
 GGML_CALL bool ggml_cuda_link(const struct ggml_backend_api *backend_api) {
```
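
The TODO above describes the failure mode: at run time the fat binary contains no mul_mat_q device code matching the card's compute capability, so the kernel aborts. Below is a minimal sketch, not part of this commit, of how such a mismatch could be checked up front with the CUDA runtime API. The compiled-architecture list is assumed from the error message ("compiled for: 600,700,800,900", which reads as major*100 + minor*10), and mmq_device_code_available() is a hypothetical helper, not a ggml or llamafile function.

```cpp
// Minimal sketch, NOT from this commit: kCompiledArchs is assumed from the
// error text, and mmq_device_code_available() is a hypothetical helper.
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: architectures the binary was built for, encoded as
// major*100 + minor*10 (the "compiled for: 600,700,800,900" in the error).
static const int kCompiledArchs[] = {600, 700, 800, 900};

static bool mmq_device_code_available(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;  // could not even query the device
    }
    const int cc = 100 * prop.major + 10 * prop.minor;  // e.g. 750 for an RTX 2080 Ti
    for (int arch : kCompiledArchs) {
        if (arch == cc) {
            return true;  // exact device code exists for this card
        }
    }
    return false;  // e.g. cc == 750 is absent from {600, 700, 800, 900}
}

int main() {
    printf("MMQ device code for GPU 0: %s\n",
           mmq_device_code_available(0) ? "present" : "missing");
    return 0;
}
```

Real fatbin selection can also JIT-compile embedded PTX for newer architectures, so the exact-match test here is a deliberately conservative simplification of what the driver does.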
