Conversation

lauri9 (Owner) commented Oct 24, 2025

What does this PR do?

AITER is AMD's centralized repository of high-performance AI operators, such as attention kernels, for AMD ROCm-enabled accelerators. This PR adds support for FlashAttention through AITER by introducing a new attention backend.

Test code for Flux inference is below. It requires aiter>=0.15.0 and a supported ROCm-enabled accelerator.

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, attention_backend

model_id = "black-forest-labs/FLUX.1-dev"

# Load the transformer in bf16 and switch its attention implementation to the AITER backend.
transformer = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16, device_map="cuda")
transformer.set_attention_backend("aiter")

# Build the pipeline around the patched transformer and move the remaining modules to the accelerator.
pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.text_encoder.to("cuda")
pipe.text_encoder_2.to("cuda")
pipe.vae.to("cuda")

prompt = "A cat holding a sign that says 'hello world'"

image = pipe(prompt, num_inference_steps=28, guidance_scale=4.0).images[0]
image.save("output.png")
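
The attention_backend context manager imported above can also be used to enable the backend only for the duration of a call instead of setting it on the model. A minimal sketch, reusing the "aiter" registry name from the example:

# Scope the AITER backend to a single pipeline call.
with attention_backend("aiter"):
    image = pipe(prompt, num_inference_steps=28, guidance_scale=4.0).images[0]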

We are interested in following up on this PR by eventually also enabling support for context parallelism across multiple devices.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

lauri9 changed the title from "add aiter attention backend" to "Add support for AITER attention backend" on Oct 24, 2025

@_AttentionBackendRegistry.register(
AttentionBackendName.AITER,
constraints=[_check_device, _check_qkv_dtype_bf16_or_fp16, _check_shape],
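
For context, a minimal sketch of the kind of function such a registration would wrap, dispatching to AITER's FlashAttention kernel. The function name, signature, and the aiter.flash_attn_func entry point below are assumptions for illustration, not the exact code in this PR:

# Hypothetical sketch; the real implementation lives in diffusers' attention dispatch module.
def _aiter_flash_attention(query, key, value, dropout_p=0.0, is_causal=False, scale=None):
    import aiter  # assumed to expose a FlashAttention-style API on ROCm

    # aiter.flash_attn_func is assumed to accept (batch, seq_len, num_heads, head_dim)
    # tensors, mirroring the flash-attn package's interface.
    return aiter.flash_attn_func(
        query,
        key,
        value,
        dropout_p=dropout_p,
        softmax_scale=scale,
        causal=is_causal,
    )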
Reviewer commented:

Should we have _check_device_rocm? Not sure if one can accidentally install aiter on NV, but AFAIK nothing here would then prevent trying to run it.

lauri9 (Owner, Author) replied:

Hmm, perhaps we should change this to _check_device_cuda instead to guarantee that the tensors are on an accelerator device. CPU is not meaningful I suppose.

I tried installing aiter in an nvcr container, but it gives an error:

No ROCm runtime is found, using ROCM_HOME='None'
Traceback (most recent call last):
  File "/diffusers_aiter_backend/aiter/aiter/jit/utils/cpp_extension.py", line 90, in get_hip_version
    hipconfig = executable_path("hipconfig")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/diffusers_aiter_backend/aiter/aiter/jit/utils/cpp_extension.py", line 83, in executable_path
    path is not None
AssertionError: Could not find hipconfig in PATH or ROCM_HOME(None)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/diffusers_aiter_backend/aiter/setup.py", line 16, in <module>
    from jit import core
  File "/diffusers_aiter_backend/aiter/aiter/jit/core.py", line 23, in <module>
    from chip_info import get_gfx
  File "/diffusers_aiter_backend/aiter/aiter/jit/utils/chip_info.py", line 9, in <module>
    from cpp_extension import executable_path
  File "/diffusers_aiter_backend/aiter/aiter/jit/utils/cpp_extension.py", line 173, in <module>
    HIP_VERSION = get_hip_version()
                  ^^^^^^^^^^^^^^^^^
  File "/diffusers_aiter_backend/aiter/aiter/jit/utils/cpp_extension.py", line 94, in get_hip_version
    raise RuntimeError("ROCm version file not found")
RuntimeError: ROCm version file not found

IMO, this is a corner case that is probably not worth adding a separate function to check for - a user that has managed to install ROCm and CUDA on the same platform but only has one type of accelerator has already taken many wrong turns 😁
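
For reference, a minimal sketch of what a _check_device_cuda-style constraint could look like; the signature and error type are assumptions for illustration, not the actual diffusers helpers:

# Hypothetical sketch of a constraint that rejects CPU tensors before dispatch.
# Note that ROCm builds of PyTorch also report the device type as "cuda",
# so this single check would cover both NVIDIA and AMD accelerators.
def _check_device_cuda(query, key, value, **kwargs):
    for name, tensor in (("query", query), ("key", key), ("value", value)):
        if tensor.device.type != "cuda":
            raise ValueError(f"{name} must be on an accelerator device, got {tensor.device.type}.")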

Reviewer replied:

Yeah, makes sense 😆

),
(
"aiter",
torch.tensor([0.0781, 0.0820, 0.0879, 0.0957, 0.0898, 0.0938, 0.0957, 0.0957, 0.2285, 0.2363, 0.2461, 0.2637, 0.2695, 0.2617, 0.2617, 0.2891], dtype=torch.bfloat16),
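
For reference, a minimal sketch of the kind of comparison these parametrized expected slices typically feed; the helper name, slicing convention, and tolerances are assumptions, not the actual test code:

# Hypothetical sketch: compare a fixed slice of the backend's output against the recorded values.
output = run_pipeline_with_backend("aiter")  # assumed helper returning a bf16 output tensor
flat = output.flatten()
actual_slice = torch.cat([flat[:8], flat[-8:]]).cpu()
assert torch.allclose(actual_slice, expected_slice, atol=1e-3, rtol=1e-3)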
Reviewer commented:

Is there some way to not run these on NV devices? I don't see any flags here that would skip them.

lauri9 (Owner, Author) replied:

Running without aiter installed leads to xfail on these tests. Trying in an MI355X environment without aiter xfails everything except native (which fails, probably due to the brittleness of tests that compare numerical values) and cudnn (which fails because the kernel is not available).

=============================================================================================================================== short test summary info ================================================================================================================================
XFAIL tests/others/test_attention_backends.py::test_forward[flash_hub] - Backend 'flash_hub' not supported in this environment.
XFAIL tests/others/test_attention_backends.py::test_forward[_flash_3_hub] - Backend '_flash_3_hub' not supported in this environment.
XFAIL tests/others/test_attention_backends.py::test_forward[aiter] - Backend 'aiter' not supported in this environment.
XFAIL tests/others/test_attention_backends.py::test_forward_with_compile[flash_hub] - Backend 'flash_hub' not supported in this environment.
XFAIL tests/others/test_attention_backends.py::test_forward_with_compile[_flash_3_hub] - Backend '_flash_3_hub' not supported in this environment.
XFAIL tests/others/test_attention_backends.py::test_forward_with_compile[aiter] - Backend 'aiter' not supported in this environment.
FAILED tests/others/test_attention_backends.py::test_forward[native] - assert False
FAILED tests/others/test_attention_backends.py::test_forward[_native_cudnn] - RuntimeError: No available kernel. Aborting execution.
FAILED tests/others/test_attention_backends.py::test_forward_with_compile[native] - assert False
FAILED tests/others/test_attention_backends.py::test_forward_with_compile[_native_cudnn] - torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function scaled_dot_product_attention>(*(), **{'query': FakeTensor(..., device='cuda:0', size=(1, 24, 384, 128), dtype=torch.bfloat16), 'key': FakeTensor(..., device=...
====================================================================================================================== 4 failed, 6 xfailed, 17 warnings in 45.12s ======================================================================================================================
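
For reference, a minimal sketch of how such environment-dependent xfails can be expressed in pytest; the availability helper is hypothetical, not the actual diffusers test code:

import pytest

@pytest.mark.parametrize("backend", ["native", "_native_cudnn", "flash_hub", "aiter"])
def test_forward(backend):
    # Hypothetical availability check; unsupported backends are marked as expected failures.
    if not backend_is_supported(backend):
        pytest.xfail(f"Backend {backend!r} not supported in this environment.")
    ...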

Reviewer replied:

Ah, I see, then that's not an issue :)

avjves commented Oct 24, 2025

Left a couple of questions, but overall LGTM!

lauri9 force-pushed the add-aiter-backend branch from 7482105 to 89903c3 on October 27, 2025 at 09:52
josephrocca and others added 3 commits on October 27, 2025 at 16:25
…/Chroma1-HD` (huggingface#12508)

* [Fix] Move attention mask padding after T5 embedding

* [Fix] Move attention mask padding after T5 embedding

* Clean up whitespace in pipeline_chroma.py

Removed unnecessary blank lines for cleaner code.

* Fix

* Fix

* Update model to final Chroma1-HD checkpoint

* Update to Chroma1-HD

* Update model to Chroma1-HD

* Update model to Chroma1-HD

* Update Chroma model links to Chroma1-HD

* Add comment about padding/masking

* Fix checkpoint/repo references

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dhruv Nair <[email protected]>