Hi, thank you for your wonderful work!
I’ve been exploring how sparsity is implemented in the codebase and noticed that it is achieved through masks such as attn_weight_mask, mlp_weight_mask, and token_select. However, during the forward pass, these masks hold binary values (0s and 1s), and the tensor dimensions remain unchanged, so the computation still appears to run on full-size dense tensors.
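To make the concern concrete, here is a minimal pure-Python sketch of what I mean. It is not code from the repository: masked_matvec, to_sparse_rows, and sparse_matvec are hypothetical helpers written only for illustration, and the multiply-add counter is a rough stand-in for runtime cost. As far as I can tell, applying a 0/1 mask leaves the amount of work unchanged, while a compact representation that drops the masked entries does less work.

```python
# Hypothetical illustration only; none of these helpers exist in the repository.

def masked_matvec(weight, mask, x):
    """Dense mat-vec with a 0/1 mask applied elementwise to the weights.

    Returns (output, number of multiply-adds performed).
    """
    out, macs = [], 0
    for w_row, m_row in zip(weight, mask):
        acc = 0.0
        for w, m, xi in zip(w_row, m_row, x):
            acc += (w * m) * xi  # masked entries still cost a multiply-add
            macs += 1
        out.append(acc)
    return out, macs


def to_sparse_rows(weight, mask):
    """Keep only the unmasked weights, stored as (column, value) pairs per row."""
    return [
        [(j, w) for j, (w, m) in enumerate(zip(w_row, m_row)) if m]
        for w_row, m_row in zip(weight, mask)
    ]


def sparse_matvec(sparse_rows, x):
    """Mat-vec that only touches the stored (unmasked) weights."""
    out, macs = [], 0
    for row in sparse_rows:
        acc = 0.0
        for j, w in row:
            acc += w * x[j]
            macs += 1
        out.append(acc)
    return out, macs


weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask = [[1, 0, 1], [0, 0, 1]]  # assumed example mask, half of the weights removed
x = [1.0, 1.0, 1.0]

print(masked_matvec(weight, mask, x))                  # ([4.0, 6.0], 6)
print(sparse_matvec(to_sparse_rows(weight, mask), x))  # ([4.0, 6.0], 3)
```

Both paths give the same output, but only the second one performs fewer multiply-adds. The masked path is what the forward pass looks like to me.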
Could you kindly clarify how the actual speedup is achieved at runtime?