[BUG] Non-flash attention on Vulkan (Intel Iris Xe, Mesa anv) produces structured noise; --diffusion-fa fixes it #1449

@arkwise

Description

Summary

On Intel Iris Xe (Raptor Lake-P) with Mesa anv Vulkan driver, the generic (non-flash) attention path in the Vulkan backend produces structured noise instead of correct outputs. Any model — tested with SD 1.5 and Wan 2.2 TI2V-5B Turbo — outputs the same characteristic horizontal teal/pink stripe pattern when --diffusion-fa is omitted. Adding --diffusion-fa flips the exact same config to correct, prompt-accurate output.

This is the same symptom as discussion #1243, where Green-Sky wrote "Using --diffusion-fa with ROCm is absolutely necessary to get viable, non-scrambled output." It appears Mesa anv on Intel Iris Xe has the same issue.

May be related to #748 (Vulkan blank image) and #1031 (ZImage + Vulkan blank image), which could both be downstream symptoms of a broken generic Vulkan attention kernel.

Environment

  • GPU: Intel Iris Xe Graphics (Raptor Lake-P), PCI 8086:a7a0
  • Driver: Mesa anv 25.2.8, Vulkan 1.4.318
  • OS: Ubuntu 24.04.1 LTS, kernel 6.8.0-49-generic
  • CPU: Intel Core i9-13900HK
  • RAM: 62 GiB
  • sd-cli build: master-540-f16a110-3-g6e5fa00+ (branch wan2.2_5B_flf2v, commit 6e5fa00c4f0b), also reproduced on master-585-44cca3d
  • Build flags: -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
vulkaninfo --summary | grep -E 'deviceName|driverVersion|apiVersion' | head -6
  apiVersion         = 1.4.318
  driverVersion      = 25.2.8
  deviceName         = Intel(R) Iris(R) Xe Graphics (RPL-P)

Minimal reproducer — SD 1.5

Identical config, single flag difference.

Broken (no --diffusion-fa):

./sd-cli \
    --model v1-5-pruned-emaonly-fp16.safetensors \
    --mode img_gen \
    --prompt "A cinematic shot of a lighthouse at dusk, warm amber light, ocean waves" \
    --negative-prompt "blurry, low quality, deformed" \
    --height 512 --width 512 --steps 20 \
    --cfg-scale 7.0 --seed 42 \
    --sampling-method euler_a \
    --output broken.png

Output: structured horizontal stripes in teal/pink, no prompt content. Reproducible across seeds, steps, resolutions, samplers.

Working (add --diffusion-fa):

./sd-cli \
    --model v1-5-pruned-emaonly-fp16.safetensors \
    --mode img_gen \
    --prompt "A cinematic shot of a lighthouse at dusk, warm amber light, ocean waves" \
    --negative-prompt "blurry, low quality, deformed" \
    --height 512 --width 512 --steps 20 \
    --cfg-scale 7.0 --seed 42 \
    --sampling-method euler_a \
    --diffusion-fa \
    --output working.png

Output: cinematic lighthouse at dusk, prompt-accurate, clean composition. 321 s total on this hardware.

Same model: Comfy-Org/stable-diffusion-v1-5-archive/v1-5-pruned-emaonly-fp16.safetensors (2.0 GB fp16).

Also reproduced with Wan 2.2 TI2V-5B Turbo

Using Kijai/WanVideo_comfy/Wan22-Turbo/Wan2_2-TI2V-5B-Turbo_fp16.safetensors + QuantStack/Wan2.2-TI2V-5B-GGUF/VAE/Wan2.2_VAE.safetensors + city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q5_K_M.gguf:

  • Without --diffusion-fa: noise
  • With --diffusion-fa AND --flow-shift 8.0: coherent output (smallest tested = 320×256 × 5 frames, --scheduler simple, --cfg-scale 1.0)

(--flow-shift 8.0 appears to be a separate Wan-2.2-specific fix — the auto default seems to mis-detect for TI2V-5B. Not the focus of this bug report, but noting in case it's a related upstream concern.)

Observations

  • None of the loader paths (native gguf_init_from_file_ptr, the GGUFReader fallback, safetensors) affects the outcome; the bug is downstream of loading.
  • Sampler (euler_a, euler) and scheduler (discrete, simple) don't affect the outcome.
  • Model version (SD 1.5, Wan 2.2) doesn't affect the outcome.
  • The single determining factor is --diffusion-fa. Without it: noise. With it: clean output.
  • The noise pattern is deterministic for a given seed and visually distinctive (horizontal teal/pink stripes for SD / base Wan, horizontal color bands for Turbo Wan).
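The determinism claim can be checked by rendering the broken config twice with the same seed and comparing bytes. This is a sketch, not part of the report's original runs; the `SD_CLI` path and model filename are assumptions to adjust for your setup.

```shell
# Render the same non-FA config twice and compare the outputs byte-for-byte.
# Identical files confirm the noise pattern is deterministic for a given seed.
check_deterministic() {
    SD_CLI="${SD_CLI:-./sd-cli}"   # assumed binary location
    for run in a b; do
        "$SD_CLI" --model v1-5-pruned-emaonly-fp16.safetensors \
            --mode img_gen --prompt "lighthouse" \
            --height 512 --width 512 --steps 20 --seed 42 \
            --output "noise_$run.png"
    done
    cmp -s noise_a.png noise_b.png && echo identical || echo differs
}
```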

Suggested fix directions

Not sure whether the bug is in ggml's Vulkan attention kernel, in the Vulkan shader codegen, or in sd.cpp's pre-attention tensor layout for the non-FA path. Starting points for investigation:

  1. Compare the non-FA vs FA attention kernel on Vulkan against the CUDA reference — is there a known numerical discrepancy?
  2. Is there an assumption about tensor layout / stride that holds on NVIDIA but breaks on Mesa anv?
  3. If the non-FA Vulkan attention path is deprecated or known-broken, it might be worth making --diffusion-fa the Vulkan default (or warning loudly when attention is invoked without it).
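For point 1, ggml's test-backend-ops harness (built alongside llama.cpp) compares each op's result on a backend against the CPU reference. A sketch, assuming the harness is built at the path below and that the op and backend names (MUL_MAT, SOFT_MAX, FLASH_ATTN_EXT, Vulkan0) match your ggml version:

```shell
# Compare the attention building blocks on Vulkan against the CPU reference.
# If the generic (non-fused) path is broken on anv, MUL_MAT or SOFT_MAX
# should report failures while FLASH_ATTN_EXT passes. Binary path and op
# names are assumptions; check test-backend-ops --help on your build.
run_attention_op_checks() {
    TBO="${TBO:-./build/bin/test-backend-ops}"   # assumed build location
    for op in MUL_MAT SOFT_MAX FLASH_ATTN_EXT; do
        "$TBO" test -b Vulkan0 -o "$op"
    done
}
```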

Happy to run more targeted diagnostics on this hardware if it'd help narrow it down.

Workaround

Until fixed, on Intel Iris Xe Vulkan: always pass --diffusion-fa. It's the only known-working configuration for correct attention output on this backend.
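Until then, a thin wrapper keeps the flag from being forgotten. A minimal sketch; the sd-cli location is an assumption, overridable via SD_CLI:

```shell
# Forward every argument to sd-cli and append --diffusion-fa unconditionally,
# so the only known-good attention path is always selected on this backend.
SD_CLI="${SD_CLI:-./sd-cli}"   # assumed binary location

sd_cli_fa() {
    "$SD_CLI" "$@" --diffusion-fa
}
```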
