Describe the bug
Running a Depth-Anything-v2 (DINOv2 ViT-S/14) model with the wgpu backend on macOS/Metal panics in naive_attention -> matmul with:
called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err:
Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.
The model runs correctly on the ndarray backend. The panic comes from cubek_matmul::launch::strategy::Strategy::launch_ref at cubek-matmul/src/launch/strategy.rs:454:22.
This is likely related to #4492 (Q4S quantized matmul hits the same 32KB Metal shared memory limit), but this reproduces with standard f32 attention, not quantized matmul.
To Reproduce
Using burn-onnx's model-check for Depth-Anything-v2:
cd crates/model-checks/depth-anything-v2
uv run get_model.py # download model
cargo run --release --no-default-features --features wgpu # panics

The model uses DINOv2 ViT-S/14 with 6 attention heads, 64 dimensions per head, and 1370 tokens (a 37x37 grid of 14x14-pixel patches plus the CLS token). The attention call is burn::tensor::module::attention(q, k, v, mask, scale).
Repository: https://github.com/tracel-ai/burn-onnx (branch add-depth-anything-v2-check, crates/model-checks/depth-anything-v2/)
Expected behavior
The attention operation completes without panicking. The Metal GPU has 32KB shared memory; the matmul strategy should fall back to a kernel configuration that fits within this limit.
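A minimal sketch of the expected fallback behavior. This is not the real cubek-matmul API (`TileConfig`, `shared_mem_bytes`, and `select_config` are hypothetical names): it only illustrates picking the largest tiling whose shared-memory footprint fits the device limit, and signaling a fallback to a shared-memory-free path instead of unwrapping the launch error.

```rust
/// Hypothetical tiling descriptor (not the real cubek-matmul type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TileConfig {
    tile_m: usize,
    tile_n: usize,
    tile_k: usize,
}

impl TileConfig {
    /// Shared memory for double-buffered f32 lhs/rhs stages
    /// (an assumed cost model for illustration).
    fn shared_mem_bytes(&self) -> usize {
        2 * (self.tile_m * self.tile_k + self.tile_k * self.tile_n)
            * core::mem::size_of::<f32>()
    }
}

/// Return the first (most aggressive) config that fits, or None to tell
/// the caller to use a naive path that needs no shared memory.
fn select_config(candidates: &[TileConfig], max_shared_bytes: usize) -> Option<TileConfig> {
    candidates
        .iter()
        .copied()
        .find(|c| c.shared_mem_bytes() <= max_shared_bytes)
}

fn main() {
    // Candidates ordered from most to least aggressive.
    let candidates = [
        TileConfig { tile_m: 128, tile_n: 128, tile_k: 32 }, // 65536 B
        TileConfig { tile_m: 128, tile_n: 64, tile_k: 32 },  // 49152 B
        TileConfig { tile_m: 64, tile_n: 64, tile_k: 32 },   // 32768 B
    ];
    let metal_limit = 32 * 1024; // 32KB threadgroup memory reported by the error
    println!("{:?}", select_config(&candidates, metal_limit));
}
```

With a 32KB limit this selects the 64x64x32 candidate; with an empty or all-too-large candidate list it returns None, which is where a graceful naive-path fallback (or a Result, rather than a panic) would go.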
Backtrace
thread 'main' panicked at cubek-matmul/src/launch/strategy.rs:454:22:
called `Result::unwrap()` on an `Err` value: Unable to launch matmul with err: Too many resources were requested during launch
Too much shared memory requested.
Requested 40960 bytes, maximum 32768 bytes available.
stack backtrace:
...
cubek_matmul::launch::strategy::Strategy::launch_ref
burn_cubecl::kernel::matmul::base::matmul
burn_cubecl::kernel::attention::naive_attention
burn_cubecl::kernel::attention::attention
<burn_cubecl::CubeBackend<R,F,I,BT> as burn_tensor::ops::ModuleOps<...>>::attention
burn_tensor::tensor::module::attention
depth_anything_v2::Model::forward
depth_anything_v2_check::main
Environment
- OS: macOS 26.2 (Tahoe)
- GPU: Apple M3 Max (40 cores, Metal 4)
- Burn rev: 8bfa8f753bcccbbe59c489c9885a19ab4dd0c8f0
- Backend: wgpu
Additional context
The attention dimensions are: Q/K/V shape [1, 6, 1370, 64] (batch=1, heads=6, seq_len=1370, head_dim=64). The 40960 bytes requested (40KB) exceeds Metal's 32KB shared memory limit by 25%. A fallback to a smaller tile size or the naive path should handle this gracefully rather than panicking via unwrap().
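For reference, the overrun in the panic message is easy to sanity-check; this small stdlib-only snippet just verifies the arithmetic (40960 bytes of f32 tile storage vs. the 32768-byte limit):

```rust
fn main() {
    let requested: usize = 40_960; // bytes reported by the launch error
    let limit: usize = 32_768;     // Metal threadgroup memory limit (32KB)

    // 40960 bytes correspond to 10240 f32 values of tile storage.
    assert_eq!(requested / core::mem::size_of::<f32>(), 10_240);

    // The request overshoots the limit by 8192 bytes, i.e. exactly 25%.
    let overshoot = requested - limit;
    assert_eq!(overshoot, 8_192);
    assert_eq!(requested * 100 / limit, 125);

    println!("over by {overshoot} bytes ({}%)", requested * 100 / limit - 100);
}
```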