### Description
Support 8 bits in the MatMulNBits CUDA kernel.
The `MatMulFloat8bKernel` CUDA kernel performs a matrix multiplication
(GEMV-style: one dot product per warp) in which matrix B is quantized per
block using 8-bit integers.
The kernel computes $Output = A \times B$, where:
* $A$ is the input activation matrix (shape `[M, K]`) of type `T` (`float` or `half`), processed one row at a time.
* $B$ is a matrix (shape `[K, N]`) quantized using 8-bit unsigned
integers (`uint8_t`) with a block structure. It's stored as `[N,
K/block_size, block_size]`.
* `scales_data` contains the dequantization scales (shape `[N,
K/block_size]`).
* `zero_points` contains the dequantization zero points (shape `[N,
K/block_size]`), if used (`has_zero_point` is true).
* `output` is the resulting matrix (shape `[M, N]`); a dequantization sketch for this layout follows the list.
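Per block, the dequantization implied by these shapes is `dequant = (q - zero_point) * scale`. Below is a minimal sketch of that step; the function name, the float math, and the default zero point of 128 when `has_zero_point` is false are assumptions for illustration, not necessarily what the kernel does:

```cuda
#include <cstdint>

// Illustrative only: dequantize one block_size chunk of column n_id of B.
// quant_B     : [N, K/block_size, block_size] uint8_t quantized weights
// scales      : [N, K/block_size]             per-block scales
// zero_points : [N, K/block_size]             per-block zero points (optional)
template <typename T>
__device__ void DequantizeBlock8b(const uint8_t* quant_B, const T* scales,
                                  const uint8_t* zero_points, int n_id,
                                  int block_id, int block_size, int blocks_per_K,
                                  float* out) {
  const float scale = static_cast<float>(scales[n_id * blocks_per_K + block_id]);
  // Assumed default zero point of 128 (2^(bits-1)) when no zero-point tensor is given.
  const int zp = zero_points ? zero_points[n_id * blocks_per_K + block_id] : 128;
  const uint8_t* block =
      quant_B + (static_cast<int64_t>(n_id) * blocks_per_K + block_id) * block_size;
  for (int i = 0; i < block_size; ++i) {
    out[i] = (static_cast<float>(block[i]) - static_cast<float>(zp)) * scale;
  }
}
```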
The kernel uses a thread block structure of `(kWarpSize,
kColsPerThreadBlock)`, meaning each block handles `kColsPerThreadBlock`
(which is 8) columns of the output. Each warp within the block is
responsible for one output element (`[m_id, n_id]`). Threads within a
warp cooperate to compute the dot product along the K dimension. Each
thread (`lane_id`) handles `kElementsPerThreadPerIteration` (which is 8)
elements of the K dimension in each step.
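A sketch of this index mapping (the kernel name, the use of `blockIdx.y` for `m_id`, and the warp-reduction detail are assumptions for illustration):

```cuda
constexpr int kWarpSize = 32;
constexpr int kColsPerThreadBlock = 8;
constexpr int kElementsPerThreadPerIteration = 8;

// blockDim = (kWarpSize, kColsPerThreadBlock): one warp per output column.
__global__ void MatMulFloat8bKernelSketch() {
  const int lane_id = threadIdx.x;                              // 0..31 within the warp
  const int warp_id = threadIdx.y;                              // which warp/column in this thread block
  const int n_id = blockIdx.x * kColsPerThreadBlock + warp_id;  // output column owned by this warp
  const int m_id = blockIdx.y;                                  // output row (assumed mapping)
  // Each thread starts at its own 8-element slice of the current 256-wide K tile.
  const int lane_offset = lane_id * kElementsPerThreadPerIteration;
  // ... per-thread partial dot products along K, then a warp-level reduction
  //     (e.g. __shfl_down_sync) writes output[m_id * N + n_id].
  (void)n_id; (void)m_id; (void)lane_offset;  // unused in this structural sketch
}
```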
Here's a breakdown of the three algorithms (`kKernelAlgo`):
1. **`kKernelAlgo = 0` (Unrolling):**
* **Strategy:** This algorithm processes the K dimension in large steps
(`k_per_iter = kWarpSize * kElementsPerThreadPerIteration = 32 * 8 = 256`).
Inside the main loop, it uses a macro (`UnRollReduction`) with `#pragma unroll`
directives to aggressively unroll the innermost computations, trying unrolling
factors of 16, 4, and 1 in sequence so that as much of the K dimension as
possible is covered by unrolled code (a simplified sketch follows this item's
bullets).
* **Pros:** Can significantly reduce loop overhead (branching
instructions, counter updates) and expose more instruction-level
parallelism, potentially hiding memory latency.
* **Cons:** Can lead to a large increase in compiled code size (register
pressure, potential instruction cache misses). The effectiveness heavily
depends on the compiler and the specific GPU architecture. The
multi-stage unrolling adds complexity. It requires `k_per_iter` to be a
multiple of `block_size` for correct scale/zp indexing within the
unrolled loop.
* **Performance Expectation:** Potentially the highest performance *if*
the unrolling is effective on the target hardware and doesn't cause
resource issues (registers, cache). Often good for compute-bound or
latency-bound scenarios where loop overhead is a bottleneck.
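A simplified sketch of that staged-unroll control flow; it works on an already-dequantized `b_dequant` buffer in plain `float` purely to show the structure, whereas the real `UnRollReduction` macro folds the dequantization and scale/zero-point indexing into the loop body:

```cuda
// Illustration of Algorithm 0: cover K with unroll factors 16, 4, then 1, so
// most of the range runs as fully unrolled code (assumes k is a multiple of k_per_iter).
__device__ float UnrolledDotSketch(const float* a, const float* b_dequant,
                                   int lane_offset, int k, int k_per_iter) {
  float sum = 0.0f;
  int k_id = 0;
#define UNROLL_STAGE(FACTOR)                                                  \
  for (; k_id + (FACTOR) * k_per_iter <= k; k_id += (FACTOR) * k_per_iter) {  \
    _Pragma("unroll")                                                         \
    for (int u = 0; u < (FACTOR); ++u) {                                      \
      const int base = k_id + u * k_per_iter + lane_offset;                   \
      _Pragma("unroll")                                                       \
      for (int e = 0; e < 8; ++e) sum += a[base + e] * b_dequant[base + e];   \
    }                                                                         \
  }
  UNROLL_STAGE(16)
  UNROLL_STAGE(4)
  UNROLL_STAGE(1)
#undef UNROLL_STAGE
  return sum;
}
```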
2. **`kKernelAlgo = 1` (Simple Loop):**
* **Strategy:** This algorithm also iterates along the K dimension in steps of
`k_per_iter` (256), but uses a plain `for` loop with no explicit `#pragma
unroll`, relying on the compiler's default loop optimizations (sketched after
this item's bullets).
* **Pros:** Simpler code, smaller code size compared to Algorithm 0.
Less likely to cause register pressure or instruction cache issues.
Easier for the compiler to reason about.
* **Cons:** May incur higher loop overhead compared to effective
unrolling. Performance might be lower if loop overhead is significant.
* **Performance Expectation:** A solid baseline. Might be close to
Algorithm 0 if the compiler performs implicit unrolling effectively, or
faster if Algorithm 0 suffers from code bloat penalties.
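For contrast, the same partial dot product written in Algorithm 1's style, with one plain loop and no explicit unroll directives (same illustrative simplifications as the sketch above):

```cuda
// Illustration of Algorithm 1: a simple loop over K in steps of k_per_iter,
// leaving any unrolling to the compiler.
__device__ float SimpleLoopDotSketch(const float* a, const float* b_dequant,
                                     int lane_offset, int k, int k_per_iter) {
  float sum = 0.0f;
  for (int k_id = 0; k_id < k; k_id += k_per_iter) {
    const int base = k_id + lane_offset;
    for (int e = 0; e < 8; ++e) {
      sum += a[base + e] * b_dequant[base + e];
    }
  }
  return sum;
}
```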
3. **`kKernelAlgo = 2` (Block Size Iteration):**
* **Strategy:** This algorithm changes the iteration strategy fundamentally.
Instead of iterating in fixed steps of `k_per_iter`, it iterates over the
quantization blocks: the outer loop runs `blocks_per_K` (`K / block_size`)
times. Inside this loop, the scale and zero point for the *entire block* are
fetched once per warp; each thread then checks whether its assigned K-elements
(`lane_offset`) fall within the current `block_size` chunk and processes them
with the fetched scale/zp (a simplified sketch follows this item's bullets).
* **Pros:** Directly aligns with the block quantization data structure.
Fetches scale/zero-point values less frequently (once per `block_size`
chunk per warp), potentially reducing shared memory bank conflicts or
register usage compared to calculating the index (`current_meta_k`) in
every inner step as in Algo 0/1. Might have better memory access
patterns for scale/zp data.
* **Cons:** The outer loop iterates `K / block_size` times; if `block_size` is
small (e.g., 16 or 32), that is many iterations (for `K = 4096` and
`block_size = 32`, 128 outer iterations versus the 16 iterations of
`k_per_iter = 256` in Algorithms 0/1). The per-block check (`if
(current_k_base < k_end_block ...)`) also adds conditional execution.
* **Performance Expectation:** Performance depends heavily on the
`block_size`. If `block_size` is large (e.g., 128, 256), the number of
outer loop iterations is small, and the efficiency gain from fetching
scale/zp once per block might outweigh the overhead. If `block_size` is
small, the overhead of the outer loop might dominate.
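A simplified sketch of Algorithm 2's structure; scanning the whole block per thread is for clarity only (the real kernel computes the overlapping range directly), and the names and the default zero point of 128 are assumptions:

```cuda
#include <cstdint>

// Illustration of Algorithm 2: one scale/zero-point fetch per quantization block;
// each lane only accumulates the block elements that fall in its 8-element slice
// of the 256-wide K tile, so all lanes together cover the whole block.
__device__ float BlockIterDotSketch(const float* a, const uint8_t* quant_B,
                                    const float* scales, const uint8_t* zero_points,
                                    int n_id, int lane_offset, int block_size,
                                    int blocks_per_K, int k_per_iter) {
  float sum = 0.0f;
  for (int block_id = 0; block_id < blocks_per_K; ++block_id) {
    const float scale = scales[n_id * blocks_per_K + block_id];
    const int zp = zero_points ? zero_points[n_id * blocks_per_K + block_id] : 128;
    const int k_start = block_id * block_size;
    const uint8_t* b_block =
        quant_B + (static_cast<int64_t>(n_id) * blocks_per_K + block_id) * block_size;
    for (int i = 0; i < block_size; ++i) {
      const int k_idx = k_start + i;
      const int pos_in_tile = k_idx % k_per_iter;  // position within the 256-wide tile
      if (pos_in_tile >= lane_offset && pos_in_tile < lane_offset + 8) {
        sum += a[k_idx] * (static_cast<float>(b_block[i]) - static_cast<float>(zp)) * scale;
      }
    }
  }
  return sum;
}
```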
**Next Step:**
1. **Profile:** The most reliable way is to benchmark all three algorithms
(`kKernelAlgo = 0, 1, 2`) on your target GPU hardware with representative
input sizes (`N`, `K`), data types (`T`), and `block_size` values; a minimal
CUDA-event timing harness is sketched after this list. Use profiling tools
like NVIDIA Nsight Compute to analyze performance metrics (execution time,
occupancy, instruction throughput, memory bandwidth, cache hit rates, register
spills).
2. **Hypothesize based on `block_size`:**
* For **large `block_size`** (e.g., 128, 256), Algorithm 2 might be
competitive or even the best due to efficient scale/ZP handling.
Algorithm 0 could also be very fast.
* For **small `block_size`** (e.g., 16, 32), Algorithm 0 (unroll) or
Algorithm 1 (simple loop) might outperform Algorithm 2 due to lower loop
overhead in the K dimension.
3. Compare performance with TensorRT-LLM's FpA IntB GEMM.
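For step 1, a minimal host-side timing harness using CUDA events; the `launch_candidate` callable and the default warmup/iteration counts are placeholders (the real benchmark would launch MatMulNBits built with each `kKernelAlgo`, e.g. through an onnxruntime micro-benchmark):

```cuda
#include <cuda_runtime.h>

// Times repeated launches of a candidate kernel configuration and returns the
// average milliseconds per launch. launch_candidate is any callable that
// enqueues one kernel launch on the given stream.
template <typename LaunchFn>
float TimeKernelMs(LaunchFn launch_candidate, cudaStream_t stream,
                   int warmup = 10, int iters = 100) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  for (int i = 0; i < warmup; ++i) launch_candidate(stream);  // warm up
  cudaEventRecord(start, stream);
  for (int i = 0; i < iters; ++i) launch_candidate(stream);
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / iters;
}
```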
### Motivation and Context
4-bit quantization has accuracy loss for some LLMs, so more bits are needed for some layers.