
Feature Request: Add lowering support for aten._scaled_dot_product_efficient_attention #827

@lordlugo

Description of the bug:
When attempting to convert a PyTorch model that uses a modern Hugging Face transformers backbone (e.g., facebook/dinov3-vitb16-pretrain-lvd1689m), the conversion process fails. This appears to be because recent versions of the transformers library default to an efficient scaled-dot-product-attention backend (e.g., memory-efficient/Flash attention), which surfaces in the exported graph as the aten._scaled_dot_product_efficient_attention.default operator.

The ai-edge-torch converter does not currently have a lowering rule for this operator, so the graph cannot be translated into a TFLite-compatible form. This prevents many state-of-the-art vision models from being converted out of the box.
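
As a point of comparison (and a possible interim workaround, which I have not verified for this particular model), the efficient-attention op should not appear in the exported graph if the model is loaded with transformers' eager attention implementation:

```python
from transformers import AutoModel

# Untested workaround sketch: request the eager (pure-PyTorch) attention
# implementation so the traced graph contains plain matmul/softmax ops instead
# of aten._scaled_dot_product_efficient_attention.
model = AutoModel.from_pretrained(
    "facebook/dinov3-vitb16-pretrain-lvd1689m",
    attn_implementation="eager",
).eval()
```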

Minimal Code to Reproduce:

```python
import torch
import ai_edge_torch
from transformers import AutoModel

# 1. Define device and model name
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = 'facebook/dinov3-vitb16-pretrain-lvd1689m'

# 2. Load the model using the default attention implementation
#    (this configuration fails during conversion)
model = AutoModel.from_pretrained(model_name).to(device).eval()

# 3. Create a sample input
sample_input = (torch.randn(1, 3, 400, 400).to(device),)

# 4. Attempt to convert the model; this line raises the RuntimeError
try:
    edge_model = ai_edge_torch.convert(model, sample_input)
    print("Conversion successful!")
except Exception as e:
    print(f"Conversion failed with error:\n{e}")
```

Actual vs expected behavior:

Actual behavior:
The ai_edge_torch.convert() call fails with the following RuntimeError:

```
RuntimeError: Lowering not found: aten._scaled_dot_product_efficient_attention.default
While executing %_scaled_dot_product_efficient_attention : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_efficient_attention.default](args = (...))
```
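
For reference, the offending op can also be confirmed in the torch.export graph before ai-edge-torch is involved. This is only a diagnostic sketch; which aten SDPA variant actually gets recorded depends on the device and the SDPA backends available at trace time:

```python
import torch

# Export the model and list any scaled-dot-product-attention ops in the graph.
ep = torch.export.export(model, sample_input)
sdpa_ops = [
    node.target
    for node in ep.graph.nodes
    if node.op == "call_function" and "scaled_dot_product" in str(node.target)
]
# Expected (on this setup) to include
# aten._scaled_dot_product_efficient_attention.default.
print(sdpa_ops)
```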

Expected behavior:
The model should be converted successfully into a TFLite model. Ideally, ai-edge-torch would recognize this operator and lower it to a standard, TFLite-compatible multi-head attention implementation.
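
For what it's worth, the fused op is mathematically just softmax(Q Kᵀ / √d) · V, so a lowering could in principle decompose it into ops the converter already handles (matmul, add, softmax). Below is a minimal reference sketch of that decomposition, ignoring dropout and the auxiliary outputs (logsumexp, RNG state) that the fused aten op also returns; it is not the converter's actual lowering:

```python
import math
import torch

def sdpa_reference(query, key, value, attn_bias=None, scale=None):
    """Reference decomposition of scaled dot-product attention into plain
    matmul/softmax ops. Sketch only: dropout and the fused op's auxiliary
    outputs are omitted."""
    head_dim = query.shape[-1]
    scale = 1.0 / math.sqrt(head_dim) if scale is None else scale
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    if attn_bias is not None:
        scores = scores + attn_bias
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, value)
```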

Any other information you'd like to share?

No response
