In the BERT source, `create_attention_mask_from_input_mask` contains this comment: "We don't assume that `from_tensor` is a mask (although it could be). We don't actually care if we attend *from* padding tokens (only *to* padding tokens), so we create a tensor of all ones." This means padding positions on the query side still receive meaningless attention scores. Does anything later in the model handle them? This has puzzled me for a long time. Thanks!
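
For reference, here is a minimal NumPy sketch of the broadcasting that the original TF code performs (the real function operates on TF tensors via `get_shape_list`, `tf.reshape`, and `tf.ones`; the NumPy version below is only an illustration of the shapes involved):

```python
import numpy as np

def create_attention_mask_from_input_mask(from_tensor, to_mask):
    """Mirrors the BERT helper: builds a [batch, from_seq_len, to_seq_len]
    mask from a [batch, to_seq_len] padding mask. Rows corresponding to
    padded *query* positions are deliberately left as all ones."""
    batch_size, from_seq_length = from_tensor.shape[:2]
    to_seq_length = to_mask.shape[1]

    # [batch, 1, to_seq_len]: masks out *keys* that are padding.
    to_mask = to_mask.reshape(batch_size, 1, to_seq_length).astype(np.float32)

    # [batch, from_seq_len, 1]: all ones, so padded *queries* still attend.
    broadcast_ones = np.ones((batch_size, from_seq_length, 1), dtype=np.float32)

    # Broadcasting yields [batch, from_seq_len, to_seq_len].
    return broadcast_ones * to_mask

# Example: batch of 1, sequence length 4, last two tokens are padding.
input_mask = np.array([[1, 1, 0, 0]])
dummy_input = np.zeros((1, 4, 8))  # stands in for from_tensor embeddings
mask = create_attention_mask_from_input_mask(dummy_input, input_mask)
print(mask[0])
# [[1. 1. 0. 0.]   <- real query token: padded keys are masked out
#  [1. 1. 0. 0.]
#  [1. 1. 0. 0.]   <- padded query token: its scores are still computed
#  [1. 1. 0. 0.]]     but meaningless (the subject of this question)
```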