System Info
- transformers version: 4.56.0.dev0
- Platform: Linux-6.14.0-27-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 0.34.3
- Safetensors version: 0.6.1
- Accelerate version: 1.9.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?: Yes
- GPU type: NVIDIA RTX A6000
Who can help?
@manueldeprada @SunMarc @MekkCyber
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I'm using the same script as in the QuantoQuantizedCache docstring:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoQuantizedCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

past_key_values = QuantoQuantizedCache(config=model.config, nbits=4)  # raises TypeError (see traceback)
outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
outputs.past_key_values
```
which yields:
```
Traceback (most recent call last):
  File "/home/mjeblick/.config/JetBrains/PyCharm2025.2/scratches/scratch_41.py", line 8, in <module>
    past_key_values = QuantoQuantizedCache(config=model.config, nbits=4)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/2tb/PyCharmProjects/kvpress/.venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1305, in __init__
    super().__init__("quanto", config, nbits, axis_key, axis_value, q_group_size, residual_length)
  File "/mnt/2tb/PyCharmProjects/kvpress/.venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 1257, in __init__
    layer_class(nbits, axis_key, axis_value, q_group_size, residual_length)
  File "/mnt/2tb/PyCharmProjects/kvpress/.venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 587, in __init__
    super().__init__(
  File "/mnt/2tb/PyCharmProjects/kvpress/.venv/lib/python3.12/site-packages/transformers/cache_utils.py", line 515, in __init__
    super().__init__(self)
TypeError: CacheLayerMixin.__init__() takes 1 positional argument but 2 were given
```
(I'm using optimum-quanto 0.2.7; I don't think the version matters here, though, as this appears to be an inheritance issue.)
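For reference, the failing call in the traceback is `super().__init__(self)` at cache_utils.py line 515: `super()` already binds `self`, so passing it again explicitly becomes a second positional argument. A minimal sketch that reproduces the same TypeError (`QuantizedLayer` is a hypothetical stand-in for the intermediate cache layer class):

```python
class CacheLayerMixin:
    def __init__(self):  # accepts only the implicit self
        pass

class QuantizedLayer(CacheLayerMixin):
    def __init__(self):
        # Bug pattern from the traceback: super() already binds self,
        # so the explicit argument is passed as a second positional arg.
        super().__init__(self)

QuantizedLayer()
# TypeError: CacheLayerMixin.__init__() takes 1 positional argument but 2 were given
```

The fix would presumably be to call `super().__init__()` without the explicit `self`.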
Expected behavior
QuantoQuantizedCache works as expected.