
Commit dbc4bc5

DonaghBr and dsikka authored
[DOCS] INFERENG-1093 memory requirements for LLM Compressor (#1720)
SUMMARY: Adding to the Quantization Method Table in Getting Started > Compress your model to create a list of algorithm-specific memory requirements for a handful of given models. INFERENG-1093

TEST PLAN: Covered by CI tests

---------

Signed-off-by: Donagh Brennan <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
1 parent 29f4d56 commit dbc4bc5


docs/getting-started/compress.md

Lines changed: 55 additions & 1 deletion
@@ -59,4 +59,58 @@ oneshot(
)
```

When you run the above code, the compressed model is saved to the specified output directory: `TinyLlama-1.1B-Chat-v1.0-INT8`. You can then load this model using the Hugging Face Transformers library or vLLM for inference and testing.
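
For example, here is a minimal sketch of loading the saved checkpoint with vLLM for a quick sanity check (assuming vLLM is installed and the path matches the `output_dir` used above):

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint saved by oneshot() above.
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-INT8")

# Run a quick generation to sanity-check the compressed model.
outputs = llm.generate(["What is quantization?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```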

## Memory requirements for LLM Compressor

When compressing a model, be aware that the memory requirements depend on the size of the model and on the algorithm used, such as GPTQ/SparseGPT.

This section shows how to calculate the CPU and GPU memory requirements for each algorithm, using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.

GPTQ/SparseGPT requires a large amount of auxiliary memory because it allocates an auxiliary Hessian matrix for any layer that is onloaded to the GPU, and these Hessian matrices are almost as large as the weights they represent.

In addition, larger models, such as DeepSeek R1, use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.

### Things to note when calculating memory requirements for LLM Compressor

1. A 1B model uses 2GB of memory to load:

    ```
    mem(1B parameters) ~= (1B parameters) * (2 bytes / parameter) = 2B bytes ~= 2GB
    ```

2. How text decoder layers and vision tower layers are loaded onto the GPU differs significantly.

    In the case of text decoder layers, LLM Compressor dynamically loads one layer at a time into the GPU for computation. The rest of the model remains in CPU memory.

    However, vision tower layers are loaded onto the GPU all at once. Unlike the text model, vision towers are not split up into individual layers before onloading to the GPU. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers (see the sketch after this list).

    At this time, LLM Compressor does not quantize the vision tower, because the latency/throughput gains are generally not worth the accuracy loss.

3. LLM Compressor does not currently support tensor parallelism for compression. Supporting this feature would allow layers to be sharded across GPUs, reducing memory usage per GPU and speeding up compression.
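
The figures in the tables below follow from these rules of thumb. As a rough, illustrative sketch (this is not LLM Compressor code, and the layer count used in the example is an assumption), you can estimate CPU and per-layer GPU memory like this:

```python
# Illustrative memory estimate only; the layer count below is an assumption for the example.
BYTES_PER_PARAM = 2  # BF16/FP16 weights use 2 bytes per parameter


def cpu_gb(total_params: float) -> float:
    """Whole model held in CPU memory while compressing."""
    return total_params * BYTES_PER_PARAM / 1e9


def gpu_gb_per_layer(total_params: float, num_layers: int, hessian: bool = False) -> float:
    """One decoder layer onloaded to the GPU.

    GPTQ/SparseGPT roughly doubles this because the auxiliary Hessian
    is approximately the same size as the layer's weights.
    """
    layer_gb = total_params / num_layers * BYTES_PER_PARAM / 1e9
    return layer_gb * 2 if hessian else layer_gb


# Example: an 8B text model with an assumed 32 decoder layers
print(cpu_gb(8e9))                              # ~16 GB of CPU memory
print(gpu_gb_per_layer(8e9, 32))                # ~0.5 GB of GPU memory (RTN)
print(gpu_gb_per_layer(8e9, 32, hessian=True))  # ~1 GB of GPU memory (GPTQ/SparseGPT)
```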

### QuantizationModifier or Round-To-Nearest (RTN)

The QuantizationModifier (RTN) does not require any additional memory beyond the storage needed for its quantization parameters (scales and zero points).

If we exclude these scales and zero points from the calculation, we can estimate the following memory requirements:

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16GB | mem(1 Layer) ~= 0.5GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368GB | mem(1 Layer) ~= 22.4GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14GB | max(mem(1 Text Layer) ~= 0.4GB, mem(Vision tower) ~= 1.3GB) ~= 1.3GB |
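
As a usage sketch, a data-free (RTN-style) run with the `QuantizationModifier` might look like the following. The import paths and the `FP8_DYNAMIC` scheme follow recent LLM Compressor examples, and the model and output directory are illustrative only:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# RTN-style, data-free quantization: no Hessians are accumulated, so GPU memory
# is bounded by the largest onloaded layer plus the quantization scales/zero points.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```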

### GPT Quantization (GPTQ) / SparseGPT

The GPTQ/SparseGPT algorithms differ from RTN in that they must also allocate an auxiliary Hessian matrix for any layer that is onloaded to the GPU.

This Hessian matrix is used to improve the accuracy recovery of the algorithm, and it is approximately the same size as the original weights.

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16GB | mem(1 Layer) * 2 ~= 1GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368GB | mem(1 Layer) * 2 ~= 44.8GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14GB | max(mem(1 Text Layer) ~= 0.4GB, mem(Vision tower) ~= 1.3GB) * 2 ~= 2.6GB |
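
As a worked example of the doubling, the DeepSeek-R1 row breaks down roughly as follows, assuming about 61 decoder layers (the layer count is an illustrative assumption, not taken from the table):

```
mem(1 Layer) ~= (684B params / 61 layers) * (2 bytes / parameter) ~= 22.4GB
mem(1 Layer) * 2 ~= 44.8GB
```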

0 commit comments
