Open
Description
As far as I can tell, `block_reduce_max_f32` and `block_reduce_sum_f32` only give each thread the reduction result for its own block; no thread gets the global result across blocks. When the softmax reduction dimension is large enough that it has to be split across several blocks, a per-block reduction is not sufficient. Has a kernel that performs the complete (cross-block) reduction not been implemented? I would appreciate your thoughts on this.
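To make the concern concrete, here is a minimal host-side C++ sketch of what I mean by a complete reduction. It is not code from this repository, and the names `Partial`, `reduce_block`, `merge_partials` and `softmax_multi_block` are hypothetical. Each chunk of the input stands in for one CUDA block and produces only a block-local max and sum, which is what I assume the two block-level helpers provide. An extra merge step is then needed to get the global values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-block partial result: the block-local max and the sum of
// exp(x - max) over the elements handled by that block.
struct Partial {
    float max;
    float sum;
};

// Models what a single block can compute on its own (assumed to correspond to
// a block-level max reduction followed by a block-level sum reduction).
Partial reduce_block(const float* x, std::size_t n) {
    float m = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < n; ++i) m = std::max(m, x[i]);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += std::exp(x[i] - m);
    return {m, s};
}

// The cross-block step: rescale both sums to the shared max, then add them.
Partial merge_partials(Partial a, Partial b) {
    if (a.sum == 0.0f) return b;  // a is empty
    if (b.sum == 0.0f) return a;  // b is empty
    float m = std::max(a.max, b.max);
    float s = a.sum * std::exp(a.max - m) + b.sum * std::exp(b.max - m);
    return {m, s};
}

// Softmax over one row that is split into chunks of block_size elements.
// block_size must be greater than zero.
std::vector<float> softmax_multi_block(const std::vector<float>& x,
                                       std::size_t block_size) {
    Partial global{-std::numeric_limits<float>::infinity(), 0.0f};
    for (std::size_t start = 0; start < x.size(); start += block_size) {
        std::size_t n = std::min(block_size, x.size() - start);
        global = merge_partials(global, reduce_block(x.data() + start, n));
    }
    // Normalisation pass using the global max and sum.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - global.max) / global.sum;
    }
    return y;
}
```

Calling `softmax_multi_block(x, 3)` on a 7-element row should match the single-chunk result `softmax_multi_block(x, 7)` up to floating-point rounding. On the GPU, the merge step would presumably need the partials written to global memory and combined in a second kernel launch or after a grid-wide synchronisation; that is the part I could not find.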