Shape parameters:
- B: batch size
- T: sequence length
- C: model dimension/embedding size
- n_heads: number of attention heads
- n_kv_heads: number of key-value heads (used in grouped query attention)
- head_dim: dimension of each attention head (= C / n_heads)
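
To make the relationships between these parameters concrete, here is a minimal sketch in plain Python that derives head_dim and lists the tensor shapes these parameters would typically imply. The `ShapeParams` class, the `attention_shapes` helper, and the (B, heads, T, head_dim) axis order are assumptions for illustration only; the actual layout depends on the implementation being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeParams:
    """Hypothetical container for the shape parameters listed above."""
    B: int           # batch size
    T: int           # sequence length
    C: int           # model dimension / embedding size
    n_heads: int     # number of attention (query) heads
    n_kv_heads: int  # number of key-value heads (grouped query attention)

    def __post_init__(self):
        # head_dim = C / n_heads only makes sense if the division is exact.
        if self.C % self.n_heads != 0:
            raise ValueError("C must be divisible by n_heads")
        # Assumption: each key-value head serves an equal-sized group of query heads.
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads")

    @property
    def head_dim(self) -> int:
        return self.C // self.n_heads

    @property
    def queries_per_kv_head(self) -> int:
        return self.n_heads // self.n_kv_heads


def attention_shapes(p: ShapeParams) -> dict:
    """Return assumed tensor shapes, using a (B, heads, T, head_dim) layout."""
    return {
        "x": (p.B, p.T, p.C),                        # input activations
        "q": (p.B, p.n_heads, p.T, p.head_dim),      # one query per attention head
        "k": (p.B, p.n_kv_heads, p.T, p.head_dim),   # fewer key heads under GQA
        "v": (p.B, p.n_kv_heads, p.T, p.head_dim),   # fewer value heads under GQA
        "scores": (p.B, p.n_heads, p.T, p.T),        # attention weights per query head
    }


params = ShapeParams(B=2, T=16, C=512, n_heads=8, n_kv_heads=2)
shapes = attention_shapes(params)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

When n_kv_heads equals n_heads, this reduces to standard multi-head attention; when n_kv_heads is 1, every query head shares a single key-value head (multi-query attention).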