Efficient Memory Management for Large Language Model #27280
anusonawane started this conversation in General
The main consumers of GPU memory during LLM inference are the model parameters (weights), the key-value (KV) cache, activations, and temporary buffers plus runtime overheads.
Model Parameters (Weights):
The memory required to store model weights depends on the number of parameters and the precision format (FP16 uses 2 bytes per parameter).
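As a rough illustration, the weight footprint is simply parameter count times bytes per parameter. This is a minimal sketch; the 7B-parameter example size is an assumption for illustration, not a figure from the post.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to store model weights (decimal GB)."""
    return num_params * bytes_per_param / 1e9

# Example: a 7B-parameter model stored in FP16 (2 bytes per parameter)
print(f"{weight_memory_gb(7e9):.1f} GB")  # ~14.0 GB
```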
KV Cache Memory:
The KV cache stores key and value vectors for each token during text generation. Memory usage depends on the number of layers, hidden size, and token count.
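A minimal sketch of the usual back-of-the-envelope estimate: keys and values (a factor of 2) are stored per layer for every token. The layer count, hidden size, and token count below are illustrative assumptions, not values from the post.

```python
def kv_cache_gb(num_layers: int, hidden_size: int, num_tokens: int,
                bytes_per_element: int = 2) -> float:
    """Approximate KV cache size: 2 (keys + values) x layers x hidden size
    x tokens x bytes per element, in decimal GB."""
    return 2 * num_layers * hidden_size * num_tokens * bytes_per_element / 1e9

# Example: 32 layers, hidden size 4096, 2048 tokens, FP16 elements
print(f"{kv_cache_gb(32, 4096, 2048):.2f} GB")  # ~1.07 GB
```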
Activations and Buffers:
Activations are temporary tensors produced during the forward pass, typically consuming 5-10% of total GPU memory. For a 40 GB GPU, that is roughly 2-4 GB.
Memory Overheads (Fragmentation):
Fragmentation occurs when memory is allocated and freed in variable-sized chunks, leaving gaps that cannot be reused efficiently.
If 20% of a 40 GB GPU is lost to fragmentation, 8 GB is wasted, leaving only 32 GB for computation.
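Putting the pieces together, here is a rough budget sketch of how much memory remains for the KV cache after weights, activations, and fragmentation are accounted for. All fractions and sizes are illustrative assumptions, not measurements.

```python
def effective_budget_gb(total_gb: float,
                        weights_gb: float,
                        activation_frac: float = 0.10,
                        fragmentation_frac: float = 0.20) -> float:
    """Rough GPU memory left for the KV cache after subtracting weights,
    activations/buffers, and fragmentation overhead."""
    activations_gb = activation_frac * total_gb
    fragmentation_gb = fragmentation_frac * total_gb
    return total_gb - weights_gb - activations_gb - fragmentation_gb

# Example: 40 GB GPU, 14 GB of FP16 weights, 10% activations, 20% fragmentation
print(f"{effective_budget_gb(40, 14):.1f} GB left for the KV cache")  # ~14.0 GB
```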