I appreciate your excellent work, especially the example at https://juliamltools.github.io/shakespeare-gpt. There are some existing implementations of MultiHeadAttention and Transformer:

- https://github.com/FluxML/Flux.jl/pull/2146
- https://github.com/chengchingwen/NeuralAttentionlib.jl
- https://github.com/chengchingwen/Transformers.jl

Could you compare this with the existing implementations? Why do you want to implement this again?