Efficient Memory Management for Large Language Model #27280
anusonawane started this conversation in General
The main consumers of GPU memory during LLM inference are the model parameters (weights), the key-value (KV) cache, activations, and temporary buffers plus runtime overheads.
Model Parameters (Weights):
The memory required to store model weights depends on the number of parameters and the precision format (FP16 uses 2 bytes per parameter).
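As a rough illustration, the weight footprint is simply parameter count times bytes per parameter. This is a minimal sketch; the 7B-parameter example size is an assumption for illustration, not a figure from the post.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to store model weights (decimal GB)."""
    return num_params * bytes_per_param / 1e9

# Example: a 7B-parameter model stored in FP16 (2 bytes per parameter)
print(f"{weight_memory_gb(7e9):.1f} GB")  # ~14.0 GB
```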
KV Cache Memory:
The KV cache stores key and value vectors for each token during text generation. Memory usage depends on the number of layers, hidden size, and token count.
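A minimal sketch of the usual back-of-the-envelope estimate: keys and values (a factor of 2) are stored per layer for every token. The layer count, hidden size, and token count below are illustrative assumptions, not values from the post.

```python
def kv_cache_gb(num_layers: int, hidden_size: int, num_tokens: int,
                bytes_per_element: int = 2) -> float:
    """Approximate KV cache size: 2 (keys + values) x layers x hidden size
    x tokens x bytes per element, in decimal GB."""
    return 2 * num_layers * hidden_size * num_tokens * bytes_per_element / 1e9

# Example: 32 layers, hidden size 4096, 2048 tokens, FP16 elements
print(f"{kv_cache_gb(32, 4096, 2048):.2f} GB")  # ~1.07 GB
```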
Activations and Buffers:
Activations are temporary tensors produced during the forward pass, typically consuming 5-10% of total GPU memory. For a 40 GB GPU, that is roughly 2-4 GB.
Memory Overheads (Fragmentation):
Fragmentation occurs when memory is allocated and freed in variable-sized chunks, leaving gaps that cannot be reused efficiently.
If 20% of a 40 GB GPU is lost to fragmentation, 8 GB is wasted, leaving only 32 GB for computation.
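Putting the pieces together, here is a rough budget sketch of how much memory remains for the KV cache after weights, activations, and fragmentation are accounted for. All fractions and sizes are illustrative assumptions, not measurements.

```python
def effective_budget_gb(total_gb: float,
                        weights_gb: float,
                        activation_frac: float = 0.10,
                        fragmentation_frac: float = 0.20) -> float:
    """Rough GPU memory left for the KV cache after subtracting weights,
    activations/buffers, and fragmentation overhead."""
    activations_gb = activation_frac * total_gb
    fragmentation_gb = fragmentation_frac * total_gb
    return total_gb - weights_gb - activations_gb - fragmentation_gb

# Example: 40 GB GPU, 14 GB of FP16 weights, 10% activations, 20% fragmentation
print(f"{effective_budget_gb(40, 14):.1f} GB left for the KV cache")  # ~14.0 GB
```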