llama : fix buffer checks for mamba and rwkv #10111
Conversation
```diff
-            op_tensor = ggml_ssm_conv(ctx, nullptr, w);
+            // FIXME
+            ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
+            op_tensor = ggml_ssm_conv(ctx, conv_x, w);
```
Should this be the other way around, with the convolution/filter as the third argument?

```c++
op_tensor = ggml_ssm_conv(ctx, w, conv_x);
```
No, it is important that the weight `w` is in the same position as when used during inference so that the backend `supports_op` can check it. This function is called as `ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d)` during inference, so the weight is the third argument.
Ah I did not realize that, thanks for clarifying!
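For context, here is a minimal sketch of why the operand order matters, assuming (as the reply above implies) that `ggml_ssm_conv(ctx, sx, c)` stores its inputs as `src[0] = sx` and `src[1] = c`, and that backends implement their `supports_op` check by inspecting the op's operands by source index. The function `example_supports_op` is hypothetical, not a real backend's implementation:

```c++
// Illustrative sketch only: a backend-style supports_op that inspects operands
// by src index. If the dummy op built for the buffer check put the weight in a
// different slot than the inference graph does, this check would look at the
// wrong tensor.
#include "ggml.h"

static bool example_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_SSM_CONV:
            // src[1] is assumed to be the convolution weight (ssm_conv1d);
            // a real backend might check its type, layout, or dimensions here.
            return op->src[1] != nullptr && op->src[1]->type == GGML_TYPE_F32;
        default:
            return true;
    }
}
```

With `ggml_ssm_conv(ctx, conv_x, w)` the weight lands in `src[1]`, the same slot it occupies in the inference-time call, so a check like this sees the right tensor.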
When running this, the error seems to happen when building the graph here:

```c++
19696             // reserve again with pp graph to avoid ggml-alloc reallocations during inference
19697             gf_pp = llama_build_graph(*ctx, ubatch_pp, false);
19698             if (!ggml_backend_sched_reserve(ctx->sched, gf_pp)) {
19699                 LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
19700                 llama_free(ctx);
19701                 return nullptr;
19702             }
```

Setting a breakpoint on this line and inspecting and stepping through:

```console
(gdb) p llm.n_kv
$71 = 0
```

And this will cause the following tensor to get a zero-sized dimension:

```c++
lctx.inp_s_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 1, n_kv);
```

```console
(gdb) p n_kv
$72 = 0
(gdb) p lctx.inp_s_mask->ne
$73 = {1, 0, 1, 1}
```

Could this be the cause of the error perhaps? I've tried adding the following to src/llama.cpp:

```diff
diff --git a/src/llama.cpp b/src/llama.cpp
index bedacfcb..517b1eb6 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -10257,7 +10257,7 @@ struct llm_build_context {
         norm_eps         (hparams.f_norm_eps),
         norm_rms_eps     (hparams.f_norm_rms_eps),
         n_tokens         (ubatch.n_tokens),
-        n_kv             (worst_case ? kv_self.size : kv_self.n),
+        n_kv             (worst_case ? kv_self.size : (kv_self.recurrent ? 1 : kv_self.n)),
         n_outputs        (worst_case ? n_tokens : lctx.n_outputs),
         n_outputs_enc    (worst_case ? n_tokens : lctx.embd_enc.size() / hparams.n_embd),
         kv_head          (worst_case ? (kv_self.recurrent ? 0 : kv_self.size - n_tokens) : kv_self.head),
```

With this I'm able to run inference. I'm not sure if this is a proper fix or not, but as I'm running out of time today I thought I'd let you know in case this sparks some ideas for you about this issue. I'd be happy to continue investigating this tomorrow if needed/wanted.
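To make the observation above concrete, here is a small standalone sketch that only uses the public ggml C API (not llama.cpp internals) and reuses the `inp_s_mask` name from the comment above for illustration. With `n_kv == 0`, as seen in the debugger, the mask tensor ends up with a zero-sized dimension and therefore no elements:

```c++
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_kv = 0;  // value observed in the debugger for the non-worst-case graph
    struct ggml_tensor * inp_s_mask = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1, n_kv);

    // prints ne = {1, 0, 1, 1}, nelements = 0 -- a tensor with nothing to allocate or copy
    printf("ne = {%lld, %lld, %lld, %lld}, nelements = %lld\n",
           (long long) inp_s_mask->ne[0], (long long) inp_s_mask->ne[1],
           (long long) inp_s_mask->ne[2], (long long) inp_s_mask->ne[3],
           (long long) ggml_nelements(inp_s_mask));

    ggml_free(ctx);
    return 0;
}
```

The diff proposed above sidesteps this by forcing `n_kv` to at least 1 for recurrent KV caches in the non-worst-case graph, so the state tensors keep a non-zero dimension.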
Thanks, the …
Similar to LCPP ggml-org#10111
* llama : fix buffer checks for mamba and rwkv
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
Added random values to pass the asserts, but I don't know if they make sense.
The models load, but I wasn't able to run them. I tried mamba and RWKV models that I found on HF, and both crash during inference in `llm_build_copy_mask_state`.

Fixes #10109
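As a rough illustration of the "random values" approach described above (a sketch under assumptions, not the exact code in this PR): the buffer check only needs an op tensor that is structurally valid enough for the backend's `supports_op` query to inspect the weight, so the non-weight input can be created with arbitrary placeholder sizes. The helper name `backend_can_run_ssm_conv` is hypothetical, and the context is assumed to have been created with `no_alloc = true` so the placeholder tensor never actually allocates data:

```c++
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper: build a throwaway GGML_OP_SSM_CONV op around the weight w
// so the backend can be asked whether it supports operating on w.
// The 12345/6789 extents are placeholders -- only w's properties matter here,
// and ctx is assumed to be a no_alloc context, so no tensor data is allocated.
static bool backend_can_run_ssm_conv(ggml_backend_t backend, ggml_context * ctx, ggml_tensor * w) {
    ggml_tensor * conv_x    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
    ggml_tensor * op_tensor = ggml_ssm_conv(ctx, conv_x, w);
    return ggml_backend_supports_op(backend, op_tensor);
}
```

The placeholder extents only have to satisfy the shape asserts inside `ggml_ssm_conv` (the second dimension of the input must match the weight's), which is why arbitrary values work for the check even though they would never appear in a real graph.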