
Commit 698de91

Disable GGML_CUDA_FORCE_MMQ
We enabled this option earlier because it is faster than tinyBLAS. It is, however, enormous. It has grown so much over the past month that it would cause the llamafile executable to balloon from 30 MB to 219 MB, which is unbelievable. Unfortunately, disabling it will not save us from that bloat. On top of adding all that bloat, the new MMQ code appears to have dropped support for my five-year-old GeForce RTX 2080 Ti, which is probably an unintended bug, but we need this change to unblock the release.
1 parent 65745b0 · commit 698de91

File tree: 1 file changed (+8 −1 lines)

llama.cpp/ggml-cuda.cu

Lines changed: 8 additions & 1 deletion
```diff
@@ -341,8 +341,15 @@ void ggml_abort(const char * file, int line, const char * fmt, ...) {
 // - 13B quantum model: +200-400 MB
 //
 // [jart] https://github.com/Mozilla-Ocho/llamafile/issues/403#issuecomment-2103687594
+//
+// TODO(jart): oops looks like we can't use this anymore, because my
+// five year old NVIDIA GeForce RTX 2080 Ti card stopped
+// working with "ggml-cuda.cu:11460: ERROR: CUDA kernel
+// mul_mat_q has no device code compatible with CUDA arch
+// 700. ggml-cuda.cu was compiled for: 600,700,800,900"!
+//
 #ifdef GGML_USE_TINYBLAS
-#define GGML_CUDA_FORCE_MMQ // [jart] want this
+// #define GGML_CUDA_FORCE_MMQ // [jart] want this
 #endif
 
 GGML_CALL bool ggml_cuda_link(const struct ggml_backend_api *backend_api) {
```
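
The TODO above describes the failure mode: at run time the fat binary contains no mul_mat_q device code matching the card's compute capability, so the kernel aborts. Below is a minimal sketch, not part of this commit, of how such a mismatch could be checked up front with the CUDA runtime API. The compiled-architecture list is assumed from the error message ("compiled for: 600,700,800,900", which reads as major*100 + minor*10), and mmq_device_code_available() is a hypothetical helper, not a ggml or llamafile function.

```cpp
// Minimal sketch, NOT from this commit: kCompiledArchs is assumed from the
// error text, and mmq_device_code_available() is a hypothetical helper.
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: architectures the binary was built for, encoded as
// major*100 + minor*10 (the "compiled for: 600,700,800,900" in the error).
static const int kCompiledArchs[] = {600, 700, 800, 900};

static bool mmq_device_code_available(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;  // could not even query the device
    }
    const int cc = 100 * prop.major + 10 * prop.minor;  // e.g. 750 for an RTX 2080 Ti
    for (int arch : kCompiledArchs) {
        if (arch == cc) {
            return true;  // exact device code exists for this card
        }
    }
    return false;  // e.g. cc == 750 is absent from {600, 700, 800, 900}
}

int main() {
    printf("MMQ device code for GPU 0: %s\n",
           mmq_device_code_available(0) ? "present" : "missing");
    return 0;
}
```

Real fatbin selection can also JIT-compile embedded PTX for newer architectures, so the exact-match test here is a deliberately conservative simplification of what the driver does.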
