- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
CUDA: always create events for split buffers #10185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA: always create events for split buffers #10185
Conversation
bde4116    to
    38d11f5      
    Compare
  
    | Qwen2.5-0.5B does not work with this change alone, it still crashes in the memcpy later:  | 
| It would also be possible to prevent using a split buffer entirely if the matrix is too small by returning  | 
38d11f5    to
    e151321      
    Compare
  
    LCPP PR by Johannes Gaessler.
        
          
                ggml/src/ggml-cuda.cu
              
                Outdated
          
        
      | // only use row split if the weight matrix is large enough for every GPU to get data (this solves some edge cases) | ||
| // also for small matrices the overhead is very large anyways so splitting is slow | ||
| if (a->buffer && ggml_backend_buft_is_cuda_split(a->buffer->buft)) { | ||
| ggml_backend_cuda_split_buffer_type_context * buft_ctx = (ggml_backend_cuda_split_buffer_type_context *) a->buffer->buft->context; | ||
| int64_t active_devices = 0; | ||
| for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) { | ||
| int64_t row_low; | ||
| int64_t row_high; | ||
| get_row_split(&row_low, &row_high, a, buft_ctx->tensor_split, id); | ||
| active_devices += row_low == row_high; | ||
| } | ||
| const int64_t rounding = get_row_rounding(buft_ctx->tensor_split); | ||
| if (rounding*active_devices < a->ne[1]) { | ||
| return false; | ||
| } | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too expensive to do in this function, since this is called many times during inference by ggml_backend_sched. I think it should be possible to compute the minimum tensor size in ggml_backend_cuda_split_buffer_type, and store it in ggml_backend_cuda_split_buffer_type_context, then this function would only need to compare the tensor size to this value.
e151321    to
    84bcad6      
    Compare
  
    
Fixes #10176 .
I think the correct way to fix it is to just create the events unconditionally. Regardless of how the data is split you always need the events on the currently active device for the other devices to wait on. You could maybe reduce the number of events by only initializing those that are actually needed but I don't think that would be worthwhile since for the vast majority of use cases all events are already being created and used anyways.