Commit 5b3ddff
Fix kv cache issue (vllm-project#1797)
SUMMARY:
With the newest transformers change,
`test_kv_cache_gptq_model_state_dict_attr` is failing because it
initializes empty weights on the meta device and then attempts to
decompress on the meta device. I don't think this is the expected
usage: by the time model_decompress is called, the weights should
already be fully loaded.
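A minimal sketch (not the actual llm-compressor code) of why decompression cannot run on the meta device: tensors on torch's "meta" device carry only shape and dtype metadata with no backing storage, so any step that must read real weight values fails until the weights are materialized on a real device.

```python
import torch

# A "weight" on the meta device: shape/dtype only, no data.
meta_w = torch.empty(4, 4, device="meta")
print(meta_w.is_meta)  # True

# Reading values out of a meta tensor (as decompression would need to)
# raises, because there is no underlying storage to copy from.
try:
    meta_w.cpu()
except NotImplementedError as e:
    print("cannot materialize meta tensor:", e)

# Once weights are loaded onto a real device, the same operation works.
real_w = torch.empty(4, 4, device="cpu")
print(real_w.cpu().shape)  # torch.Size([4, 4])
```

This is why the fix defers decompression until after weight loading finishes, rather than calling it while the model still lives on the meta device.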
TEST PLAN:
Tested locally with the following command, which passed:
pytest tests/llmcompressor/transformers/kv_cache/test_kv_cache.py::test_kv_cache_gptq_model_state_dict_attr
---------
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

1 parent 983bcc6 commit 5b3ddff
1 file changed: 4 additions & 8 deletions