-
This page mentions an "Attention Fusion" optimization; however, I can't seem to trigger it with the CUDA execution provider despite setting the graph optimization level. Attention is implemented without masking in PyTorch as:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_head: int):
        super().__init__()
        # Scale factor sqrt(d_head) from the scaled dot-product attention formula
        self.scale = torch.tensor(math.sqrt(d_head), dtype=torch.float)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Attention scores: QK^T / sqrt(d_head), softmax over the key dimension
        attn = (q @ k.transpose(-2, -1)) / self.scale
        attn = F.softmax(attn, dim=-1)
        return attn @ v
```
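For context, this is roughly how the session is being created (a minimal sketch assuming the standard onnxruntime Python API; the model path and the optimized-model dump path are placeholders, not taken from my actual setup):

```python
import onnxruntime as ort

# Hypothetical model path, for illustration only
model_path = "attention_model.onnx"

sess_options = ort.SessionOptions()
# Enable all graph optimizations, including extended fusions
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Dump the optimized graph so the presence/absence of fused Attention nodes can be inspected
sess_options.optimized_model_filepath = "attention_model_optimized.onnx"

session = ort.InferenceSession(
    model_path,
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```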
-
I guess Attention Fusion is supposed to be done using the transformer optimization tool rather than through ONNX Runtime's built-in optimizer.
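If that's the case, the fusion would be applied offline with the onnxruntime.transformers optimizer, roughly like this (a minimal sketch; the model paths, num_heads, and hidden_size values are placeholders, not taken from this thread):

```python
from onnxruntime.transformers import optimizer

# Placeholder paths and hyperparameters, for illustration only
optimized = optimizer.optimize_model(
    "attention_model.onnx",
    model_type="bert",   # fusion patterns are selected per model type
    num_heads=12,
    hidden_size=768,
    use_gpu=True,
)
optimized.save_model_to_file("attention_model_fused.onnx")
```

The fused model can then be loaded with a regular InferenceSession and the CUDA execution provider.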