wgpu/Metal: naive_attention panics with shared memory overflow on Apple M3 Max #4530

@antimora

Description

Describe the bug

Running a Depth-Anything-v2 (DINOv2 ViT-S/14) model with the wgpu backend on macOS/Metal panics in naive_attention -> matmul with:

called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err:
Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.

The model runs correctly on the ndarray backend. The panic comes from cubek_matmul::launch::strategy::Strategy::launch_ref at cubek-matmul/src/launch/strategy.rs:454:22.

This is likely related to #4492 (Q4S quantized matmul hits the same 32KB Metal shared memory limit), but this reproduces with standard f32 attention, not quantized matmul.

To Reproduce

Using burn-onnx's model-check for Depth-Anything-v2:

cd crates/model-checks/depth-anything-v2
uv run get_model.py                                    # download model
cargo run --release --no-default-features --features wgpu  # panics

The model uses DINOv2 ViT-S/14 with 6 attention heads, 64 dims per head, and 1370 tokens (a 37x37 grid of 14x14 patches plus the CLS token). The attention call is burn::tensor::module::attention(q, k, v, mask, scale).

Repository: https://github.com/tracel-ai/burn-onnx (branch add-depth-anything-v2-check, crates/model-checks/depth-anything-v2/)

Expected behavior

The attention operation should complete without panicking. The Metal device reports 32KB (32768 bytes) of threadgroup (shared) memory; the matmul strategy should fall back to a kernel configuration that fits within this limit instead of unwrapping the launch error.
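The fallback described above could be sketched as follows. This is a hypothetical illustration, not the actual cubek-matmul API: `shared_mem_bytes` and `select_config` are made-up names, and the double-buffered f32 tiling model is an assumption about how the stage memory is sized.

```rust
/// Shared memory needed by a hypothetical double-buffered f32 tiling:
/// `stages` copies of an lhs tile (tile_m x tile_k) plus an rhs tile
/// (tile_k x tile_n), 4 bytes per f32 element.
fn shared_mem_bytes(tile_m: usize, tile_n: usize, tile_k: usize, stages: usize) -> usize {
    let elems = tile_m * tile_k + tile_k * tile_n;
    elems * std::mem::size_of::<f32>() * stages
}

/// Pick the largest candidate configuration whose footprint fits the
/// device limit, instead of failing the launch outright.
fn select_config(limit_bytes: usize) -> Option<(usize, usize, usize)> {
    // Candidates ordered from most to least shared-memory hungry.
    let candidates = [(128, 128, 32), (64, 64, 32), (32, 32, 32), (16, 16, 16)];
    candidates
        .into_iter()
        .find(|&(m, n, k)| shared_mem_bytes(m, n, k, 2) <= limit_bytes)
}

fn main() {
    let metal_limit = 32 * 1024; // 32768 bytes, as reported in the panic
    // (128,128,32): (128*32 + 32*128) * 4 * 2 = 65536 bytes -> too big
    // (64,64,32):   (64*32  + 32*64)  * 4 * 2 = 32768 bytes -> fits exactly
    println!("{:?}", select_config(metal_limit)); // prints Some((64, 64, 32))
}
```

The point is only that a strategy which probes configurations against the device limit degrades gracefully, where the current code path reaches `unwrap()` on the launch result.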

Backtrace

thread 'main' panicked at cubek-matmul/src/launch/strategy.rs:454:22:
called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err: Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.

stack backtrace:
   ...
   cubek_matmul::launch::strategy::Strategy::launch_ref
   burn_cubecl::kernel::matmul::base::matmul
   burn_cubecl::kernel::attention::naive_attention
   burn_cubecl::kernel::attention::attention
   <burn_cubecl::CubeBackend<R,F,I,BT> as burn_tensor::ops::ModuleOps<...>>::attention
   burn_tensor::tensor::module::attention
   depth_anything_v2::Model::forward
   depth_anything_v2_check::main

Environment

  • OS: macOS 26.2 (Tahoe)
  • GPU: Apple M3 Max (40 cores, Metal 4)
  • Burn rev: 8bfa8f753bcccbbe59c489c9885a19ab4dd0c8f0
  • Backend: wgpu

Additional context

The attention dimensions are: Q/K/V shape [1, 6, 1370, 64] (batch=1, heads=6, seq_len=1370, head_dim=64). The 40960 bytes requested (40KB) exceeds Metal's 32KB shared memory limit by 25%. A fallback to a smaller tile size or the naive path should handle this gracefully rather than panicking via unwrap().
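The overshoot figure quoted above can be checked with integer arithmetic (the element count assumes f32, which is what the non-quantized attention path uses here):

```rust
fn main() {
    let requested: u32 = 40960; // bytes, from the panic message
    let limit: u32 = 32768;     // Metal threadgroup memory limit reported

    // Overflow as a percentage of the limit: (40960 - 32768) / 32768 = 25%.
    let overflow_pct = (requested - limit) * 100 / limit;

    // In f32 elements the request is 10240 floats, i.e. the stage would
    // need to shrink by a fifth of its size to fit the 8192-float budget.
    let requested_f32 = requested / 4;

    println!("requested {requested} B exceeds {limit} B by {overflow_pct}%");
    println!("requested elements: {requested_f32} f32");
}
```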

Labels: bug (Something isn't working)