Hello, I would like to ask which KV cache quantization strategies are currently supported. For example: "tensor", "channel", "group", "block", "token", "tensor_group"?
When I set the strategy to "tensor", the run completes successfully. When I set it to "group" or "channel", it fails.
- The docstring of QuantizedKVParameterCache says: "Quantization strategy (tensor, group, channel) set from Quantization arg's strategy." Link: https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/modifiers/quantization/cache.py#L14
- For example, the following code uses the "channel" strategy:
from llmcompressor.modifiers.quantization import GPTQModifier

# FP8 KV cache scheme with per-channel scales (this is the configuration that fails)
kv_cache_dict = {'num_bits': 8, 'type': 'float', 'symmetric': True, 'strategy': 'channel', 'dynamic': False}
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], kv_cache_scheme=kv_cache_dict)
]
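For comparison, the same recipe runs to completion when the strategy is "tensor". A minimal sketch of that working variant is below; the oneshot call, model name, dataset, and calibration arguments are placeholders for my actual setup and may differ across llm-compressor versions:

```python
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Identical scheme, but with per-tensor KV cache scales -- this variant works.
kv_cache_dict = {'num_bits': 8, 'type': 'float', 'symmetric': True, 'strategy': 'tensor', 'dynamic': False}
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], kv_cache_scheme=kv_cache_dict)
]

# Placeholder model and dataset; the real run uses my own checkpoint and calibration data.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```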
- Error traceback (from the "channel" run):
File "/home/user/workspace/pythonlab/llmcompressor/pipelines/sequential/helpers.py", line 72, in forward
outputs = forward_fn(*args, **kwargs)
...
File "/home/user/workspace/pythonlab/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
return inner()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1818, in inner
hook_result = hook(self, args, result)
File "/home/user/workspace/pythonlab/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
return hook(*args, **kwargs)
File "/home/user/workspace/pythonlab/llmcompressor/modifiers/quantization/calibration.py", line 249, in calibrate_kv_cache_output_hook
update_parameter_data(module, k_scale, KVCacheScaleType.KEY.value)
File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 166, in update_parameter_data
update_offload_parameter(module, param_name, new_param_data)
File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 257, in update_offload_parameter
param.data.copy_(data)
RuntimeError: output with shape [1] doesn't match the broadcast shape [1, 1, 1, 1]
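The failure looks like a shape mismatch between the pre-registered k_scale parameter (a single element, i.e. per-tensor) and the scale produced under the "channel" strategy. The snippet below only illustrates the final copy_ call in the traceback, not llm-compressor code; the shapes are taken from the error message:

```python
import torch

# k_scale appears to be registered with a single element (a per-tensor scale).
k_scale = torch.zeros(1)

# Channel-wise calibration produces a scale with extra dimensions,
# matching the [1, 1, 1, 1] shape reported in the error above.
new_scale = torch.ones(1, 1, 1, 1)

try:
    # Same kind of copy_ that update_offload_parameter performs.
    k_scale.copy_(new_scale)
except RuntimeError as err:
    print(err)  # output with shape [1] doesn't match the broadcast shape [1, 1, 1, 1]
```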