videos
- Why do today's large language models all use the pre-norm architecture?
- SGD, Newton's method, momentum, Nesterov, AdaGrad, RMSprop, Adam
- 3b1b: Essence of Linear Algebra

blogs
- Illustrated-transformer
- The Illustrated Transformer (Chinese translation)
- Normalization
- Attention
- Additive Attention
- RoPE
- RoPE
- MHA
- GRU
- LSTM
- LSTM seq2seq addition

material
- D2L
- dive-into-llms
- DataWhale-NLP-with-Transformers
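As a companion to the optimizer video listed above, here is a minimal sketch of the Adam update rule (combining momentum-style first-moment and RMSprop-style second-moment estimates with bias correction). The function `adam_step` and all hyperparameter values are illustrative choices for a scalar parameter, not any library's API.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: momentum-style running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: RMSprop-style running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.01)
print(w)
```

Because the update is scaled by the second-moment estimate, the effective per-parameter step size is roughly `lr` regardless of the raw gradient magnitude, which is why Adam needs far less learning-rate tuning than plain SGD.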