Hi,
When running e2e.py on an A100 (40GB) GPU, I encountered the following error:
```
v_cache_cpu memory size: 180.0 GB
num_hidden_layers: 32, batch_size: 48, num_key_value_heads: 8, max_length: 61440, chunk_size: 8, hidden_size: 4096, num_attention_heads: 32
Traceback (most recent call last):
  File "/home/wsgwak/ShadowKV/test/e2e.py", line 162, in <module>
    llm = LLM(model_name=model_name, device='cuda:0', batch_size=shadowkv_bsz, max_length=min_prompt_len, attn_mode='shadowkv_cpu', sparse_budget=sparse_budget)
  File "/home/wsgwak/ShadowKV/models/llama.py", line 122, in __init__
    self.init_kv_cache(sparse_budget, rank, chunk_size, self.config)
  File "/home/wsgwak/ShadowKV/models/base.py", line 43, in init_kv_cache
    self.kv_cache = ShadowKVCache_CPU(config, max_length=self.max_length, device=self.device, dtype=self.dtype, batch_size=self.batch_size, sparse_budget=sparse_budget, rank=rank, chunk_size=chunk_size)
  File "/home/wsgwak/ShadowKV/models/kv_cache.py", line 403, in __init__
    self.v_cache_cpu = torch.zeros(
RuntimeError: CUDA error: out of memory
```
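For reference, the 180 GB figure is consistent with the logged configuration, assuming a bfloat16 cache and head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128 (these are my assumptions, not from the log):

```python
# Back-of-the-envelope check of the logged v_cache_cpu size.
# bfloat16 (2 bytes) and head_dim = 4096 / 32 = 128 are assumed.
batch_size, num_layers, num_kv_heads = 48, 32, 8
max_length, head_dim, dtype_bytes = 61440, 128, 2

size_bytes = batch_size * num_layers * num_kv_heads * max_length * head_dim * dtype_bytes
print(f"{size_bytes / 2**30:.1f} GiB")  # -> 180.0 GiB
```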
From my log, the v_cache_cpu memory requirement (180 GB) is far larger than the available GPU memory. Since the maximum amount of pinned CPU memory appears to be limited by the GPU memory size (40 GB), it is not possible to allocate such a large CPU tensor.
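As far as I can tell, the failure reduces to allocating one huge pinned host tensor. A minimal sketch of what I believe is happening (the shape and layout are my guesses, not the actual code in kv_cache.py):

```python
import torch

# Hypothetical repro: a pinned host buffer of the logged size.
# pin_memory allocations go through the CUDA driver (cudaHostAlloc),
# so they can raise "CUDA error: out of memory" even though the
# tensor lives in CPU RAM.
v_cache_cpu = torch.zeros(
    48, 32, 8, 61440, 128,  # batch, layers, kv heads, max_length, head_dim (assumed layout)
    dtype=torch.bfloat16,
    pin_memory=True,
)
```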
Could you please clarify how you tested the e2e throughput benchmark? I’d like some guidance on reproducing your results.
Thanks