-
This page mentions an "Attention Fusion" optimization; however, I can't seem to trigger it with the CUDA execution provider despite setting the graph optimization level. Attention is implemented without masking in PyTorch as:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_head: int):
        super().__init__()
        # Scale factor sqrt(d_head) from the scaled dot-product attention formula
        self.scale = torch.tensor(math.sqrt(d_head), dtype=torch.float)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Attention scores: QK^T / sqrt(d_head), softmax over the key dimension
        attn = (q @ k.transpose(-2, -1)) / self.scale
        attn = F.softmax(attn, dim=-1)
        return attn @ v
```
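For context, this is roughly how the session is being created (a minimal sketch assuming the standard onnxruntime Python API; the model path and the optimized-model dump path are placeholders, not taken from my actual setup):

```python
import onnxruntime as ort

# Hypothetical model path, for illustration only
model_path = "attention_model.onnx"

sess_options = ort.SessionOptions()
# Enable all graph optimizations, including extended fusions
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Dump the optimized graph so the presence/absence of fused Attention nodes can be inspected
sess_options.optimized_model_filepath = "attention_model_optimized.onnx"

session = ort.InferenceSession(
    model_path,
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```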
-
I guess Attention Fusion is supposed to be done using the transformer optimization tool rather than through ONNX Runtime's built-in optimizer.
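If that's the case, the fusion would be applied offline with the onnxruntime.transformers optimizer, roughly like this (a minimal sketch; the model paths, num_heads, and hidden_size values are placeholders, not taken from this thread):

```python
from onnxruntime.transformers import optimizer

# Placeholder paths and hyperparameters, for illustration only
optimized = optimizer.optimize_model(
    "attention_model.onnx",
    model_type="bert",   # fusion patterns are selected per model type
    num_heads=12,
    hidden_size=768,
    use_gpu=True,
)
optimized.save_model_to_file("attention_model_fused.onnx")
```

The fused model can then be loaded with a regular InferenceSession and the CUDA execution provider.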