[Question] What are the current KV cache quantization strategies available? #1779

@liye0626

Description

Hello, I would like to ask which KV cache quantization strategies are currently supported. For example: "tensor", "channel", "group", "block", "token", "tensor_group"?

When I set the strategy to "tensor", it runs successfully, but setting it to "group" or "channel" fails.

```python
from llmcompressor.modifiers.quantization import GPTQModifier

kv_cache_dict = {
    "num_bits": 8,
    "type": "float",
    "symmetric": True,
    "strategy": "channel",
    "dynamic": False,
}
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        kv_cache_scheme=kv_cache_dict,
    )
]
```
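As a minimal sketch of why the strategies behave differently, the snippet below computes FP8-style scales from a mock key tensor (the tensor shape and the 448.0 fp8-e4m3 max value are assumptions for illustration, not llm-compressor internals). A per-"tensor" strategy yields a single scalar scale, while a per-"channel"-style reduction keeps one scale per head dimension, producing a multi-dimensional scale tensor:

```python
import torch

# Hypothetical K projection output: (batch, heads, seq_len, head_dim)
k = torch.randn(1, 4, 16, 8)

FP8_E4M3_MAX = 448.0  # max representable value of fp8 e4m3

# "tensor" strategy: one scale for the whole tensor -> 0-dim scalar
tensor_scale = k.abs().amax() / FP8_E4M3_MAX

# "channel"-style strategy: one scale per head -> shape (1, 4, 1, 1)
channel_scale = k.abs().amax(dim=(0, 2, 3), keepdim=True) / FP8_E4M3_MAX

print(tensor_scale.shape)   # torch.Size([])
print(channel_scale.shape)  # torch.Size([1, 4, 1, 1])
```

The per-tensor scale fits in a single-element parameter, whereas the per-channel scale does not, which matches the shape mismatch in the traceback below.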

The run fails with the following traceback:

```
  File "/home/user/workspace/pythonlab/llmcompressor/pipelines/sequential/helpers.py", line 72, in forward
    outputs = forward_fn(*args, **kwargs)

  ...

  File "/home/user/workspace/pythonlab/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
    return inner()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1818, in inner
    hook_result = hook(self, args, result)
  File "/home/user/workspace/pythonlab/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
    return hook(*args, **kwargs)
  File "/home/user/workspace/pythonlab/llmcompressor/modifiers/quantization/calibration.py", line 249, in calibrate_kv_cache_output_hook
    update_parameter_data(module, k_scale, KVCacheScaleType.KEY.value)
  File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 166, in update_parameter_data
    update_offload_parameter(module, param_name, new_param_data)
  File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 257, in update_offload_parameter
    param.data.copy_(data)
RuntimeError: output with shape [1] doesn't match the broadcast shape [1, 1, 1, 1]
```
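The error itself can be reproduced with plain PyTorch. The sketch below assumes (based on the traceback, not on llm-compressor's source) that the `k_scale` parameter is pre-allocated as a single-element tensor sized for the per-tensor strategy, so copying a channel-shaped scale into it fails:

```python
import torch

# Assumption: the destination scale parameter has shape [1],
# as expected by the per-"tensor" strategy.
k_scale = torch.nn.Parameter(torch.zeros(1), requires_grad=False)

# A channel-style calibration observer would hand back a
# multi-dimensional scale, e.g. shape (1, 1, 1, 1).
observed_scale = torch.ones(1, 1, 1, 1)

try:
    # copy_ tries to broadcast the source against the destination,
    # and the shapes are incompatible.
    k_scale.data.copy_(observed_scale)
except RuntimeError as e:
    print(e)
```

This suggests the failure is a shape mismatch between the pre-allocated scale parameter and the scales that the "channel" or "group" strategies produce, rather than a problem with the recipe itself.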
