Problem
nsys profiling on Jetson Orin (Mar 6, after the GH-173 optimization) shows that `dp4a_q6k_gemv` accounts for 28% of total GPU time and 35% of decode time, dominated by the LM head GEMV (n=151936, k=1536).
nsys data (`SKIP_CUDA_GRAPH=1`)
| Kernel | Time % | Instances | Avg µs | Med µs |
|---|---|---|---|---|
| mwv_dp4a_q4k_gemv | 47% | 11,424 | 179 | 69 |
| dp4a_q6k_gemv | 28% | 2,280 | 544 | 26 |
| batched_q4k_gemv | 13% | 504 | 1,125 | 393 |
The Q6K GEMV timing is bimodal: per-layer calls take ~26µs (tiny), while LM head calls take ~17ms (huge). The median of 26µs vs the average of 544µs confirms the LM head dominates the total.
Root cause
The LM head launches 151,936 thread blocks (1 per output row) with only 6 super-blocks (k=1536) per row. Each block does minimal work:
- 3 warps × 2 super-block iterations = trivial compute
- Block launch overhead dominates
- Theoretical BW-limited minimum: 0.94ms vs actual 17ms (18x gap)
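The bandwidth-limited minimum above can be reproduced with back-of-envelope arithmetic. Assumptions not stated in the issue: AGX Orin's ~204.8 GB/s peak DRAM bandwidth, and the standard Q6_K layout of 210 bytes per 256-weight super-block.

```python
# Bandwidth-limited lower bound for one LM head GEMV (weights must be read once).
# Assumed: ~204.8 GB/s peak DRAM bandwidth, Q6_K = 210 bytes / 256 weights.
N, K = 151_936, 1536                      # LM head rows, reduction dim
SUPER_BLOCK_WEIGHTS = 256
Q6K_BYTES_PER_SUPER_BLOCK = 210
BW_BYTES_PER_S = 204.8e9                  # assumed peak bandwidth

super_blocks = N * (K // SUPER_BLOCK_WEIGHTS)   # 6 super-blocks per row
weight_bytes = super_blocks * Q6K_BYTES_PER_SUPER_BLOCK
min_ms = weight_bytes / BW_BYTES_PER_S * 1e3
print(f"{weight_bytes / 1e6:.1f} MB of weights -> {min_ms:.2f} ms minimum")
```

This lands at ~0.93 ms, in line with the 0.94 ms figure above; the measured ~17 ms is roughly 18x that bound.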
Proposed fix
Multi-row blocking: process 4-8 output rows per thread block in the LM head case. This cuts the block count from 151,936 to ~19,000-38,000, improving SM utilization and amortizing launch overhead.
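The grid-size effect of multi-row blocking can be sketched as follows (row count from the issue; the rows-per-block values are the proposed 4-8 range):

```python
# Grid size as a function of rows processed per thread block.
import math

N = 151_936  # LM head output rows

for rows_per_block in (1, 4, 8):
    blocks = math.ceil(N / rows_per_block)
    # Each block also reads rows_per_block * 6 super-blocks, so per-block
    # work grows proportionally as the grid shrinks.
    print(f"{rows_per_block} rows/block -> {blocks:,} blocks")
# 1 -> 151,936; 4 -> 37,984; 8 -> 18,992
```

At 4-8 rows per block the grid still comfortably oversubscribes Orin's SMs while each block does 4-8x more useful work per launch.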
Per-token decode budget (current)
| Component | Time (ms) | % |
|---|---|---|
| Q4K GEMVs | 30.0 | 60% |
| Q6K LM head | 17.4 | 35% |
| Q6K layers | 0.7 | 1.5% |
| Flash attention | 1.8 | 3.6% |
| Total | ~50 | 100% |
Current: 19.8 tok/s. Target: 32.3 tok/s (llama.cpp parity).
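A quick cross-check of the budget table: the listed components sum to ~49.9 ms, implying ~20 tok/s, close to the measured 19.8 tok/s (the small remainder is unlisted overhead).

```python
# Cross-check: component times vs measured decode throughput.
budget_ms = {
    "Q4K GEMVs": 30.0,
    "Q6K LM head": 17.4,
    "Q6K layers": 0.7,
    "Flash attention": 1.8,
}
total_ms = sum(budget_ms.values())            # ~49.9 ms
implied_tok_s = 1000 / total_ms               # ~20 tok/s
target_ms = 1000 / 32.3                       # ~31 ms/token for llama.cpp parity
print(f"sum {total_ms:.1f} ms -> {implied_tok_s:.1f} tok/s; parity budget {target_ms:.1f} ms/token")
```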
Impact
Halving LM head time would save ~8.7ms → 41.3ms/token → 24.2 tok/s (+22%).
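The projected gain follows directly from the numbers above:

```python
# Arithmetic behind the projected speedup from halving LM head time.
current_ms = 50.0
lm_head_ms = 17.4
saved_ms = lm_head_ms / 2                 # ~8.7 ms saved
new_ms = current_ms - saved_ms            # ~41.3 ms/token
new_tok_s = 1000 / new_ms                 # ~24.2 tok/s
gain = new_tok_s / 19.8 - 1               # ~+22% vs current 19.8 tok/s
print(f"{new_ms:.1f} ms/token -> {new_tok_s:.1f} tok/s ({gain:+.0%})")
```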
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>