woongjoonchoi
diff --git a/_posts/DeepLearning/Kernel Fusion/2025-03-07-fused.md
Lines changed: 19 additions & 2 deletions b/_posts/DeepLearning/Kernel Fusion/2025-03-07-fused.md
diff --git a/assets/images/DeepLearning/KernelFusion/country-capital.png
76 KB b/assets/images/DeepLearning/KernelFusion/country-capital.png
diff --git a/assets/images/DeepLearning/KernelFusion/vectordiff.png
23.7 KB b/assets/images/DeepLearning/KernelFusion/vectordiff.png
@@ -23,8 +23,25 @@ implementing self-attention in a mathematically equivalent but memory-efficient way
 
 ## BackGround
 
-### Self-Attention
-
+### Attention
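+Attention was introduced to overcome the limitations of RNN-, GRU-, and LSTM-style sequence encoding.
+Attention was first proposed by Professor Kyunghyun Cho of New York University, together with Dzmitry Bahdanau and Yoshua Bengio. They argued that compressing an entire sequence into a single fixed-size vector is a bottleneck.
+Therefore, instead of compressing the sequence into one fixed-size vector and feeding that to the decoder, they gave the decoder a set of vectors whose number is proportional to the token length.
+The information the decoder needs must then be built from the embedding vectors of the tokenized input sequence.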
+![vector-diff](\assets\images\DeepLearning\KernelFusion\vectordiff.png)  
+![country-capital](\assets\images\DeepLearning\KernelFusion\country-capital.png)
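+It has been shown that relations between embedded tokens follow a basic property of vectors: two vectors are equal when they have the same direction and magnitude. For example, the offset from a country to its capital is roughly the same vector for every country-capital pair.
+This means that adding embeddings can produce a vector that carries a particular meaning, regardless of where it sits in the space.
+A tokenized sequence can therefore be viewed as a sequence of embedding vectors, and summing those embedding vectors yields a vector that represents some meaning.
+However, the input embedding vectors do not all contribute equally when a given output vector is produced.
+So, at each decoder time step, we compute a vector describing which embedding is needed, measure its similarity to each input embedding vector, and take a weighted sum of the input vectors. A softmax function is used to adjust the range of the similarities.
+The softmax function turns these similarities into non-negative weights that sum to 1.
+As a result, the decoder can draw on every input embedding at each step, so information from the input sequence is not lost to a single fixed-size vector.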
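The weighting step described in the added Attention text can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the post's implementation: the helper names `softmax` and `attend` are hypothetical, the toy vectors are made up, and a dot product stands in for the similarity (the original attention formulation scored similarity with a small learned network instead).

```python
import math


def softmax(scores):
    """Turn raw similarity scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, inputs):
    """Return (weights, context): a softmax-weighted sum of the input vectors.

    Assumption: similarity is the dot product between `query` and each input.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in inputs]
    weights = softmax(scores)
    dim = len(inputs[0])
    context = [
        sum(w * vec[i] for w, vec in zip(weights, inputs)) for i in range(dim)
    ]
    return weights, context


# Toy values, made up for illustration: three input embeddings, one decoder query.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, inputs)
```

Here the first and third inputs have the same dot product with the query, so they receive equal weight, and the second input, which is orthogonal to the query, receives the smallest weight. The context vector has the same dimension as one input embedding, however long the input sequence is.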
+
+
+
+#### Self Attention
+Self-Attention, relative to the conventional Attention
 
 ### Efficient Attention