[BUG] Non-flash attention on Vulkan (Intel Iris Xe, Mesa anv) produces structured noise; --diffusion-fa fixes it #1449

@arkwise

Description

Summary

On Intel Iris Xe (Raptor Lake-P) with Mesa anv Vulkan driver, the generic (non-flash) attention path in the Vulkan backend produces structured noise instead of correct outputs. Any model — tested with SD 1.5 and Wan 2.2 TI2V-5B Turbo — outputs the same characteristic horizontal teal/pink stripe pattern when --diffusion-fa is omitted. Adding --diffusion-fa flips the exact same config to correct, prompt-accurate output.

This is the same symptom as discussion #1243, where Green-Sky wrote "Using --diffusion-fa with ROCm is absolutely necessary to get viable, non-scrambled output." It appears Mesa anv on Intel Iris Xe has the same issue.

May be related to #748 (Vulkan blank image) and #1031 (ZImage + Vulkan blank image), which could both be downstream symptoms of a broken generic Vulkan attention kernel.

Environment

  • GPU: Intel Iris Xe Graphics (Raptor Lake-P), PCI 8086:a7a0
  • Driver: Mesa anv 25.2.8, Vulkan 1.4.318
  • OS: Ubuntu 24.04.1 LTS, kernel 6.8.0-49-generic
  • CPU: Intel Core i9-13900HK
  • RAM: 62 GiB
  • sd-cli build: master-540-f16a110-3-g6e5fa00+ (branch wan2.2_5B_flf2v, commit 6e5fa00c4f0b), also reproduced on master-585-44cca3d
  • Build flags: -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
vulkaninfo --summary | grep -E 'deviceName|driverVersion|apiVersion' | head -6
  apiVersion         = 1.4.318
  driverVersion      = 25.2.8
  deviceName         = Intel(R) Iris(R) Xe Graphics (RPL-P)

Minimal reproducer — SD 1.5

Identical config, single flag difference.

Broken (no --diffusion-fa):

./sd-cli \
    --model v1-5-pruned-emaonly-fp16.safetensors \
    --mode img_gen \
    --prompt "A cinematic shot of a lighthouse at dusk, warm amber light, ocean waves" \
    --negative-prompt "blurry, low quality, deformed" \
    --height 512 --width 512 --steps 20 \
    --cfg-scale 7.0 --seed 42 \
    --sampling-method euler_a \
    --output broken.png

Output: structured horizontal stripes in teal/pink, no prompt content. Reproducible across seeds, steps, resolutions, samplers.

Working (add --diffusion-fa):

./sd-cli \
    --model v1-5-pruned-emaonly-fp16.safetensors \
    --mode img_gen \
    --prompt "A cinematic shot of a lighthouse at dusk, warm amber light, ocean waves" \
    --negative-prompt "blurry, low quality, deformed" \
    --height 512 --width 512 --steps 20 \
    --cfg-scale 7.0 --seed 42 \
    --sampling-method euler_a \
    --diffusion-fa \
    --output working.png

Output: cinematic lighthouse at dusk, prompt-accurate, clean composition. 321 s total on this hardware.

Same model: Comfy-Org/stable-diffusion-v1-5-archive/v1-5-pruned-emaonly-fp16.safetensors (2.0 GB fp16).

Also reproduced with Wan 2.2 TI2V-5B Turbo

Using Kijai/WanVideo_comfy/Wan22-Turbo/Wan2_2-TI2V-5B-Turbo_fp16.safetensors + QuantStack/Wan2.2-TI2V-5B-GGUF/VAE/Wan2.2_VAE.safetensors + city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q5_K_M.gguf:

  • Without --diffusion-fa: noise
  • With --diffusion-fa AND --flow-shift 8.0: coherent output (smallest tested = 320×256 × 5 frames, --scheduler simple, --cfg-scale 1.0)

(--flow-shift 8.0 appears to be a separate Wan-2.2-specific fix — the auto default seems to mis-detect for TI2V-5B. Not the focus of this bug report, but noting in case it's a related upstream concern.)

Observations

  • None of the loader paths (native gguf_init_from_file_ptr, the GGUFReader fallback, safetensors) affects the outcome; the bug is downstream of loading.
  • Sampler (euler_a, euler) and scheduler (discrete, simple) don't affect the outcome.
  • Model version (SD 1.5, Wan 2.2) doesn't affect the outcome.
  • The single determining factor is --diffusion-fa. Without it: noise. With it: clean output.
  • The noise pattern is deterministic for a given seed and visually distinctive (horizontal teal/pink stripes for SD / base Wan, horizontal color bands for Turbo Wan).
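The determinism claim can be checked by rendering the broken config twice with the same seed and comparing bytes. This is a sketch, not part of the report's original runs; the `SD_CLI` path and model filename are assumptions to adjust for your setup.

```shell
# Render the same non-FA config twice and compare the outputs byte-for-byte.
# Identical files confirm the noise pattern is deterministic for a given seed.
check_deterministic() {
    SD_CLI="${SD_CLI:-./sd-cli}"   # assumed binary location
    for run in a b; do
        "$SD_CLI" --model v1-5-pruned-emaonly-fp16.safetensors \
            --mode img_gen --prompt "lighthouse" \
            --height 512 --width 512 --steps 20 --seed 42 \
            --output "noise_$run.png"
    done
    cmp -s noise_a.png noise_b.png && echo identical || echo differs
}
```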

Suggested fix directions

Not sure whether the bug is in ggml's Vulkan attention kernel, in the Vulkan shader codegen, or in sd.cpp's pre-attention tensor layout for the non-FA path. Starting points for investigation:

  1. Compare the non-FA vs FA attention kernel on Vulkan against the CUDA reference — is there a known numerical discrepancy?
  2. Is there an assumption about tensor layout / stride that holds on NVIDIA but breaks on Mesa anv?
  3. If the non-FA Vulkan attention path is deprecated or known-broken, it might be worth making --diffusion-fa the Vulkan default (or warning loudly when attention is invoked without it).
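For point 1, ggml's test-backend-ops harness (built alongside llama.cpp) compares each op's result on a backend against the CPU reference. A sketch, assuming the harness is built at the path below and that the op and backend names (MUL_MAT, SOFT_MAX, FLASH_ATTN_EXT, Vulkan0) match your ggml version:

```shell
# Compare the attention building blocks on Vulkan against the CPU reference.
# If the generic (non-fused) path is broken on anv, MUL_MAT or SOFT_MAX
# should report failures while FLASH_ATTN_EXT passes. Binary path and op
# names are assumptions; check test-backend-ops --help on your build.
run_attention_op_checks() {
    TBO="${TBO:-./build/bin/test-backend-ops}"   # assumed build location
    for op in MUL_MAT SOFT_MAX FLASH_ATTN_EXT; do
        "$TBO" test -b Vulkan0 -o "$op"
    done
}
```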

Happy to run more targeted diagnostics on this hardware if it'd help narrow it down.

Workaround

Until fixed, on Intel Iris Xe Vulkan: always pass --diffusion-fa. It's the only known-working configuration for correct attention output on this backend.
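Until then, a thin wrapper keeps the flag from being forgotten. A minimal sketch; the sd-cli location is an assumption, overridable via SD_CLI:

```shell
# Forward every argument to sd-cli and append --diffusion-fa unconditionally,
# so the only known-good attention path is always selected on this backend.
SD_CLI="${SD_CLI:-./sd-cli}"   # assumed binary location

sd_cli_fa() {
    "$SD_CLI" "$@" --diffusion-fa
}
```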
