Backlog list #2

@vvchernov

This is a list of possible tasks related to the development of LLM compression. It will be extended from time to time.

  1. Study SOTA approaches and modern papers (SmoothQuant, AWQ, GPTQ and so on)
  2. Theoretical analysis of different cases of data distribution in activations and weights. Base parameters: dispersion of the context values and of the outliers, the distance between them, matrix size, and the number of outliers. General considerations: the number of outliers is smaller than the number of context values; the context dispersion can be of the order of the distance or less; the outlier dispersion is much smaller than the distance.
  • As a first step, the following can be assumed: matrices are square; weight data is distributed around zero with no outliers; the number of (activation) outliers is much smaller than the total number of values, but can be of the order of the matrix size (a toy sketch of this data model is given after the list).
  3. Practical collection of data during calibration of a model (e.g. llama2)
  • Comparison with the theoretical cases
  • Fake quantization: determine why quantization reduces accuracy (see the fake-quantization sketch after the list)
  • Determine the error bias (it can be compensated for afterwards)
  • Use the SmoothQuant alpha per-op rather than per-model. Optimization idea: minimize the matmul error (suitable error metrics for matrices need to be studied; see the per-op alpha sketch after the list)
  4. Improve existing compression (quantization) algorithms:
  • SmoothQuant:
    • Optimize the alpha parameter
    • Smooth the remaining linear layers
    • Separate smoothing for query, key and value
    • The reverse of the previous item: fuse Dense operations
    • Update the padding procedure
    • Test with different batch sizes to find the optimal batch size for a specified model and target
    • Reduce the number of transpositions
      • flash attention
      • faster transformer
      • and so on
    • Use other kernels
    • Optimize the rounding type
    • Use other quantization types
  • Support kernels for T4
  5. Develop an algorithm that automatically quantizes a given model depending on features determined during calibration
  • Use one quantization mechanism for the full model (Ivan has done this for the SmoothQuant paper approach)
  • Per-operation quantization
  • Do not quantize weak ops
  6. Tests:
  • Check the accuracy of the customized mlc-llm quantization depending on group size (A. Peskov); see the group-size sketch after the list
  • Use smoothing from SmoothQuant together with q?f16_? from mlc-llm and check accuracy (an improvement is expected) (I. Sidorenko)
  • Compensate for the error bias of the quantized matmul
  • Analyze accuracy for the SmoothQuant alpha in the interval (eps, 1 - eps)
  • Use the SmoothQuant alpha per-op rather than per-model
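
Below are a few illustrative sketches for the items above. First, a minimal sketch of the synthetic data model from item 2, under the simplifications listed in its sub-bullet; all function names, parameter names, and default values here are illustrative, not taken from the issue:

```python
import numpy as np

def synthetic_activations(n=512, ctx_sigma=1.0, distance=20.0,
                          out_sigma=0.5, n_outliers=None, seed=0):
    """Square activation matrix: 'context' values ~ N(0, ctx_sigma^2),
    plus a few outliers ~ N(distance, out_sigma^2).

    Mirrors the assumptions above: out_sigma << distance, ctx_sigma of
    the order of the distance or less, and the number of outliers much
    smaller than n*n but possibly of the order of n.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, ctx_sigma, size=(n, n))
    if n_outliers is None:
        n_outliers = n  # "of the order of the matrix size"
    idx = rng.choice(n * n, size=n_outliers, replace=False)
    x.flat[idx] = rng.normal(distance, out_sigma, size=n_outliers)
    return x
```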
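For the fake-quantization and error-bias bullets of item 3, a toy experiment; symmetric per-tensor int8 is an assumption here, since the issue does not fix the scheme. It quantizes and dequantizes both operands, then inspects the mean error of the matmul, which is the bias that could be compensated afterwards:

```python
import numpy as np

def fake_quant_int8(t):
    """Symmetric per-tensor int8 fake quantization:
    quantize, then dequantize back to float."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def matmul_error_stats(x, w):
    """Compare the fp32 matmul with its fake-quantized version.
    Returns the mean error (the compensable 'bias') and the
    relative Frobenius-norm error."""
    y_ref = x @ w
    y_q = fake_quant_int8(x) @ fake_quant_int8(w)
    err = y_q - y_ref
    return err.mean(), np.linalg.norm(err) / np.linalg.norm(y_ref)
```

On outlier-heavy activations such as those from the previous sketch, the per-tensor scale is dominated by the outliers, so the context values collapse into a few quantization bins; this is one concrete mechanism by which quantization can reduce accuracy.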
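The smoothing formula below is the one from the SmoothQuant paper, s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha); the per-op grid search around it is only a sketch of the "alpha per-op, not per-model" idea from items 3 and 4, not an implementation from the issue:

```python
import numpy as np

def fake_quant_int8(t):
    """Symmetric per-tensor int8 fake quantization (as in the sketch above)."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def smooth(x, w, alpha=0.5, eps=1e-8):
    """SmoothQuant smoothing: migrate quantization difficulty from
    activations to weights. Mathematically
    x @ w == (x / s) @ (s[:, None] * w), so only the quantization
    error changes, not the exact result."""
    x_max = np.abs(x).max(axis=0).clip(min=eps)
    w_max = np.abs(w).max(axis=1).clip(min=eps)
    s = x_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None]

def best_alpha_per_op(x, w, grid=np.linspace(0.1, 0.9, 17)):
    """Pick, for this single op, the alpha that minimizes the relative
    error of the fake-quantized matmul."""
    def rel_err(alpha):
        xs, ws = smooth(x, w, alpha=alpha)
        y_ref = x @ w
        y_q = fake_quant_int8(xs) @ fake_quant_int8(ws)
        return np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
    return min(grid, key=rel_err)
```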
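For the group-size accuracy test in item 6, a self-contained toy; group-wise symmetric int4 over consecutive values is an assumed stand-in for the mlc-llm scheme, and the q?f16_? details are not reproduced here:

```python
import numpy as np

def fake_quant_grouped(w, group_size=32, bits=4):
    """Group-wise symmetric fake quantization: each group of
    `group_size` consecutive values shares one scale.
    Requires w.size to be divisible by group_size."""
    qmax = 2 ** (bits - 1) - 1
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.clip(scales, 1e-8, None)
    q = np.clip(np.round(flat / scales), -qmax, qmax) * scales
    return q.reshape(w.shape)

# Smaller groups track local magnitudes better, at the cost of more scales.
w = np.random.default_rng(0).normal(size=(512, 512))
for g in (32, 64, 128, 256):
    rel = np.linalg.norm(fake_quant_grouped(w, g) - w) / np.linalg.norm(w)
    print(f"group_size={g}: relative weight error {rel:.4f}")
```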
