Problem
nsys profiling on Jetson Orin (Mar 6, after the GH-173 optimization) shows that `dp4a_q6k_gemv` accounts for 28% of total GPU time and 35% of decode time, dominated by the LM head GEMV (n=151936, k=1536).
nsys data (`SKIP_CUDA_GRAPH=1`)
| Kernel | Time % | Instances | Avg µs | Med µs |
|---|---|---|---|---|
| mwv_dp4a_q4k_gemv | 47% | 11,424 | 179 | 69 |
| dp4a_q6k_gemv | 28% | 2,280 | 544 | 26 |
| batched_q4k_gemv | 13% | 504 | 1,125 | 393 |
The Q6K GEMV timing is bimodal: per-layer calls take ~26µs (tiny), while LM head calls take ~17ms (huge). The median of 26µs vs the average of 544µs confirms the LM head dominates the total.
Root cause
The LM head launches 151,936 thread blocks (1 per output row) with only 6 super-blocks (k=1536) per row. Each block does minimal work:
- 3 warps × 2 super-block iterations = trivial compute
- Block launch overhead dominates
- Theoretical BW-limited minimum: 0.94ms vs actual 17ms (18x gap)
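The bandwidth-limited minimum above can be reproduced with back-of-envelope arithmetic. Assumptions not stated in the issue: AGX Orin's ~204.8 GB/s peak DRAM bandwidth, and the standard Q6_K layout of 210 bytes per 256-weight super-block.

```python
# Bandwidth-limited lower bound for one LM head GEMV (weights must be read once).
# Assumed: ~204.8 GB/s peak DRAM bandwidth, Q6_K = 210 bytes / 256 weights.
N, K = 151_936, 1536                      # LM head rows, reduction dim
SUPER_BLOCK_WEIGHTS = 256
Q6K_BYTES_PER_SUPER_BLOCK = 210
BW_BYTES_PER_S = 204.8e9                  # assumed peak bandwidth

super_blocks = N * (K // SUPER_BLOCK_WEIGHTS)   # 6 super-blocks per row
weight_bytes = super_blocks * Q6K_BYTES_PER_SUPER_BLOCK
min_ms = weight_bytes / BW_BYTES_PER_S * 1e3
print(f"{weight_bytes / 1e6:.1f} MB of weights -> {min_ms:.2f} ms minimum")
```

This lands at ~0.93 ms, in line with the 0.94 ms figure above; the measured ~17 ms is roughly 18x that bound.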
Proposed fix
Multi-row blocking: process 4-8 output rows per thread block in the LM head case. This cuts the block count from 151,936 to ~19,000-38,000, improving SM utilization and amortizing launch overhead.
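The grid-size effect of multi-row blocking can be sketched as follows (row count from the issue; the rows-per-block values are the proposed 4-8 range):

```python
# Grid size as a function of rows processed per thread block.
import math

N = 151_936  # LM head output rows

for rows_per_block in (1, 4, 8):
    blocks = math.ceil(N / rows_per_block)
    # Each block also reads rows_per_block * 6 super-blocks, so per-block
    # work grows proportionally as the grid shrinks.
    print(f"{rows_per_block} rows/block -> {blocks:,} blocks")
# 1 -> 151,936; 4 -> 37,984; 8 -> 18,992
```

At 4-8 rows per block the grid still comfortably oversubscribes Orin's SMs while each block does 4-8x more useful work per launch.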
Per-token decode budget (current)
| Component | Time (ms) | % |
|---|---|---|
| Q4K GEMVs | 30.0 | 60% |
| Q6K LM head | 17.4 | 35% |
| Q6K layers | 0.7 | 1.5% |
| Flash attention | 1.8 | 3.6% |
| Total | ~50 | 100% |
Current: 19.8 tok/s. Target: 32.3 tok/s (llama.cpp parity).
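A quick cross-check of the budget table: the listed components sum to ~49.9 ms, implying ~20 tok/s, close to the measured 19.8 tok/s (the small remainder is unlisted overhead).

```python
# Cross-check: component times vs measured decode throughput.
budget_ms = {
    "Q4K GEMVs": 30.0,
    "Q6K LM head": 17.4,
    "Q6K layers": 0.7,
    "Flash attention": 1.8,
}
total_ms = sum(budget_ms.values())            # ~49.9 ms
implied_tok_s = 1000 / total_ms               # ~20 tok/s
target_ms = 1000 / 32.3                       # ~31 ms/token for llama.cpp parity
print(f"sum {total_ms:.1f} ms -> {implied_tok_s:.1f} tok/s; parity budget {target_ms:.1f} ms/token")
```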
Impact
Halving LM head time would save ~8.7ms → 41.3ms/token → 24.2 tok/s (+22%).
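The projected gain follows directly from the numbers above:

```python
# Arithmetic behind the projected speedup from halving LM head time.
current_ms = 50.0
lm_head_ms = 17.4
saved_ms = lm_head_ms / 2                 # ~8.7 ms saved
new_ms = current_ms - saved_ms            # ~41.3 ms/token
new_tok_s = 1000 / new_ms                 # ~24.2 tok/s
gain = new_tok_s / 19.8 - 1               # ~+22% vs current 19.8 tok/s
print(f"{new_ms:.1f} ms/token -> {new_tok_s:.1f} tok/s ({gain:+.0%})")
```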
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>