Hello, I would like to ask which KV cache quantization strategies are currently supported. For example: "tensor", "channel", "group", "block", "token", "tensor_group"?
When I set the strategy to "tensor", the run completes successfully. When I set it to "group" or "channel", it fails.
- The docstring of QuantizedKVParameterCache says: "Quantization strategy (tensor, group, channel) set from Quantization arg's strategy." Link: https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/modifiers/quantization/cache.py#L14
- For example, the following code uses the "channel" strategy:
from llmcompressor.modifiers.quantization import GPTQModifier

# FP8 KV cache scheme with per-channel scales (this is the configuration that fails)
kv_cache_dict = {'num_bits': 8, 'type': 'float', 'symmetric': True, 'strategy': 'channel', 'dynamic': False}
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], kv_cache_scheme=kv_cache_dict)
]
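For comparison, the same recipe runs to completion when the strategy is "tensor". A minimal sketch of that working variant is below; the oneshot call, model name, dataset, and calibration arguments are placeholders for my actual setup and may differ across llm-compressor versions:

```python
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Identical scheme, but with per-tensor KV cache scales -- this variant works.
kv_cache_dict = {'num_bits': 8, 'type': 'float', 'symmetric': True, 'strategy': 'tensor', 'dynamic': False}
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], kv_cache_scheme=kv_cache_dict)
]

# Placeholder model and dataset; the real run uses my own checkpoint and calibration data.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```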
- Error traceback (from the "channel" run):
File "/home/user/workspace/pythonlab/llmcompressor/pipelines/sequential/helpers.py", line 72, in forward
outputs = forward_fn(*args, **kwargs)
...
File "/home/user/workspace/pythonlab/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
return inner()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1818, in inner
hook_result = hook(self, args, result)
File "/home/user/workspace/pythonlab/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
return hook(*args, **kwargs)
File "/home/user/workspace/pythonlab/llmcompressor/modifiers/quantization/calibration.py", line 249, in calibrate_kv_cache_output_hook
update_parameter_data(module, k_scale, KVCacheScaleType.KEY.value)
File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 166, in update_parameter_data
update_offload_parameter(module, param_name, new_param_data)
File "/home/user/workspace/pythonlab/compressed_tensors/utils/offload.py", line 257, in update_offload_parameter
param.data.copy_(data)
RuntimeError: output with shape [1] doesn't match the broadcast shape [1, 1, 1, 1]
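The failure looks like a shape mismatch between the pre-registered k_scale parameter (a single element, i.e. per-tensor) and the scale produced under the "channel" strategy. The snippet below only illustrates the final copy_ call in the traceback, not llm-compressor code; the shapes are taken from the error message:

```python
import torch

# k_scale appears to be registered with a single element (a per-tensor scale).
k_scale = torch.zeros(1)

# Channel-wise calibration produces a scale with extra dimensions,
# matching the [1, 1, 1, 1] shape reported in the error above.
new_scale = torch.ones(1, 1, 1, 1)

try:
    # Same kind of copy_ that update_offload_parameter performs.
    k_scale.copy_(new_scale)
except RuntimeError as err:
    print(err)  # output with shape [1] doesn't match the broadcast shape [1, 1, 1, 1]
```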