GH-174: LM head Q6K GEMV takes 35% of decode time — multi-row blocking needed #174

@noahgift

Description

Problem

nsys profiling on Jetson Orin (Mar 6, after the GH-173 optimization) shows `dp4a_q6k_gemv` taking 28% of total GPU time and 35% of decode time, dominated by the LM head GEMV (n=151936, k=1536).

nsys data (SKIP_CUDA_GRAPH=1)

| Kernel | Time % | Instances | Avg µs | Med µs |
|---|---|---|---|---|
| `mwv_dp4a_q4k_gemv` | 47% | 11,424 | 179 | 69 |
| `dp4a_q6k_gemv` | 28% | 2,280 | 544 | 26 |
| `batched_q4k_gemv` | 13% | 504 | 1,125 | 393 |

The Q6K GEMV is bimodal: per-layer calls are 26µs (tiny), but LM head calls are ~17ms (huge). Median 26µs vs avg 544µs confirms the LM head dominates.
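A quick consistency check on the bimodal split. The ~70 LM-head call count below is inferred from the averages, not taken from the profile:

```python
# Sanity check: is the 544 µs average consistent with a bimodal mix of
# ~26 µs per-layer calls plus a handful of ~17 ms LM-head calls?
total_calls = 2_280
lm_head_calls = 70          # assumption: ~70 decoded tokens in this capture
layer_calls = total_calls - lm_head_calls
avg_us = (layer_calls * 26 + lm_head_calls * 17_000) / total_calls
print(f"weighted average: {avg_us:.0f} µs")   # ~547 µs, close to the 544 µs nsys reports
```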

Root cause

The LM head launches 151,936 thread blocks (1 per output row) with only 6 super-blocks (k=1536) per row. Each block does minimal work:

  • 3 warps × 2 super-block iterations = trivial compute
  • Block launch overhead dominates
  • Theoretical BW-limited minimum: 0.94ms vs actual 17ms (18x gap)
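The 0.94ms floor follows from the Q6K storage footprint: every weight is read once per token, so time ≥ bytes / DRAM bandwidth. A sketch of the arithmetic (the 204.8 GB/s Orin bandwidth figure is an assumption here):

```python
# Bandwidth-limited floor for the LM head GEMV.
n, k = 151_936, 1536
q6k_block_bytes = 210            # Q6K: 256 weights -> 210 bytes (6.5625 bits/weight)
weights = n * k
bytes_read = weights // 256 * q6k_block_bytes
dram_bw = 204.8e9                # assumed Jetson Orin DRAM bandwidth, bytes/s
t_min_ms = bytes_read / dram_bw * 1e3
print(f"{bytes_read / 1e6:.0f} MB -> {t_min_ms:.2f} ms floor")
print(f"gap vs 17 ms actual: {17 / t_min_ms:.0f}x")
```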

Proposed fix

Multi-row blocking: process 4-8 rows per thread block for the LM head case. This reduces block count from 151,936 to ~19,000-38,000, improving SM utilization and amortizing launch overhead.
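The grid-size arithmetic behind those block counts:

```python
# Block-count reduction from multi-row blocking: one thread block
# covers `rows` output rows instead of one.
n_rows = 151_936
for rows in (1, 4, 8):
    blocks = (n_rows + rows - 1) // rows   # ceil-divide
    print(f"{rows} row(s)/block -> {blocks:,} blocks")
```

At 4 rows/block the grid shrinks to 37,984 blocks; at 8 rows/block, 18,992 — the ~19,000-38,000 range above.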

Per-token decode budget (current)

| Component | Time (ms) | % |
|---|---|---|
| Q4K GEMVs | 30.0 | 60% |
| Q6K LM head | 17.4 | 35% |
| Q6K layers | 0.7 | 1.5% |
| Flash attention | 1.8 | 3.6% |
| **Total** | **~50** | |

Current: 19.8 tok/s. Target: 32.3 tok/s (llama.cpp parity).

Impact

Halving LM head time would save ~8.7ms, cutting the budget to ~41.3ms/token → 24.2 tok/s (+22%).
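The throughput arithmetic, using the ~50ms budget from the table above:

```python
# Impact of halving the LM head time on per-token throughput.
lm_head_ms = 17.4
total_ms = 50.0                  # per-token decode budget (approximate)
saved_ms = lm_head_ms / 2        # 8.7 ms
new_total_ms = total_ms - saved_ms
new_tok_s = 1000 / new_total_ms
gain_pct = (new_tok_s / 19.8 - 1) * 100   # vs the current 19.8 tok/s
print(f"{new_total_ms:.1f} ms/token -> {new_tok_s:.1f} tok/s (+{gain_pct:.0f}%)")
```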

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
