wgpu/Metal: naive_attention panics with shared memory overflow on Apple M3 Max #4530

@antimora

Description

Describe the bug

Running a Depth-Anything-v2 (DINOv2 ViT-S/14) model with the wgpu backend on macOS/Metal panics in naive_attention -> matmul with:

called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err:
Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.

The model runs correctly on the ndarray backend. The panic comes from cubek_matmul::launch::strategy::Strategy::launch_ref at cubek-matmul/src/launch/strategy.rs:454:22.

This is likely related to #4492 (Q4S quantized matmul hits the same 32KB Metal shared memory limit), but this reproduces with standard f32 attention, not quantized matmul.

To Reproduce

Using burn-onnx's model-check for Depth-Anything-v2:

cd crates/model-checks/depth-anything-v2
uv run get_model.py                                    # download model
cargo run --release --no-default-features --features wgpu  # panics

The model uses DINOv2 ViT-S/14 with 6 attention heads, 64 dims per head, and 1370 tokens (a 37x37 grid of 14x14 patches plus the CLS token). The attention call is burn::tensor::module::attention(q, k, v, mask, scale).

Repository: https://github.com/tracel-ai/burn-onnx (branch add-depth-anything-v2-check, crates/model-checks/depth-anything-v2/)

Expected behavior

The attention operation should complete without panicking. The Metal device reports 32KB (32768 bytes) of threadgroup (shared) memory; the matmul strategy should fall back to a kernel configuration that fits within this limit instead of unwrapping the launch error.
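The fallback described above could be sketched as follows. This is a hypothetical illustration, not the actual cubek-matmul API: `shared_mem_bytes` and `select_config` are made-up names, and the double-buffered f32 tiling model is an assumption about how the stage memory is sized.

```rust
/// Shared memory needed by a hypothetical double-buffered f32 tiling:
/// `stages` copies of an lhs tile (tile_m x tile_k) plus an rhs tile
/// (tile_k x tile_n), 4 bytes per f32 element.
fn shared_mem_bytes(tile_m: usize, tile_n: usize, tile_k: usize, stages: usize) -> usize {
    let elems = tile_m * tile_k + tile_k * tile_n;
    elems * std::mem::size_of::<f32>() * stages
}

/// Pick the largest candidate configuration whose footprint fits the
/// device limit, instead of failing the launch outright.
fn select_config(limit_bytes: usize) -> Option<(usize, usize, usize)> {
    // Candidates ordered from most to least shared-memory hungry.
    let candidates = [(128, 128, 32), (64, 64, 32), (32, 32, 32), (16, 16, 16)];
    candidates
        .into_iter()
        .find(|&(m, n, k)| shared_mem_bytes(m, n, k, 2) <= limit_bytes)
}

fn main() {
    let metal_limit = 32 * 1024; // 32768 bytes, as reported in the panic
    // (128,128,32): (128*32 + 32*128) * 4 * 2 = 65536 bytes -> too big
    // (64,64,32):   (64*32  + 32*64)  * 4 * 2 = 32768 bytes -> fits exactly
    println!("{:?}", select_config(metal_limit)); // prints Some((64, 64, 32))
}
```

The point is only that a strategy which probes configurations against the device limit degrades gracefully, where the current code path reaches `unwrap()` on the launch result.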

Backtrace

thread 'main' panicked at cubek-matmul/src/launch/strategy.rs:454:22:
called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err: Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.

stack backtrace:
   ...
   cubek_matmul::launch::strategy::Strategy::launch_ref
   burn_cubecl::kernel::matmul::base::matmul
   burn_cubecl::kernel::attention::naive_attention
   burn_cubecl::kernel::attention::attention
   <burn_cubecl::CubeBackend<R,F,I,BT> as burn_tensor::ops::ModuleOps<...>>::attention
   burn_tensor::tensor::module::attention
   depth_anything_v2::Model::forward
   depth_anything_v2_check::main

Environment

  • OS: macOS 26.2 (Tahoe)
  • GPU: Apple M3 Max (40 cores, Metal 4)
  • Burn rev: 8bfa8f753bcccbbe59c489c9885a19ab4dd0c8f0
  • Backend: wgpu

Additional context

The attention dimensions are: Q/K/V shape [1, 6, 1370, 64] (batch=1, heads=6, seq_len=1370, head_dim=64). The 40960 bytes requested (40KB) exceeds Metal's 32KB shared memory limit by 25%. A fallback to a smaller tile size or the naive path should handle this gracefully rather than panicking via unwrap().
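The overshoot figure quoted above can be checked with integer arithmetic (the element count assumes f32, which is what the non-quantized attention path uses here):

```rust
fn main() {
    let requested: u32 = 40960; // bytes, from the panic message
    let limit: u32 = 32768;     // Metal threadgroup memory limit reported

    // Overflow as a percentage of the limit: (40960 - 32768) / 32768 = 25%.
    let overflow_pct = (requested - limit) * 100 / limit;

    // In f32 elements the request is 10240 floats, i.e. the stage would
    // need to shrink by a fifth of its size to fit the 8192-float budget.
    let requested_f32 = requested / 4;

    println!("requested {requested} B exceeds {limit} B by {overflow_pct}%");
    println!("requested elements: {requested_f32} f32");
}
```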

Labels: bug (Something isn't working)