[DOCS] INFERENG-1093 memory requirements for LLM Compressor (#1720)
SUMMARY:
Adds a list of algorithm-specific memory requirements for a handful of example models to the Quantization Method Table in Getting Started > Compress your model.
INFERENG-1093
TEST PLAN:
Covered by CI tests
---------
Signed-off-by: Donagh Brennan <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
docs/getting-started/compress.md (55 additions, 1 deletion)

@@ -59,4 +59,58 @@ oneshot(

When you run the above code, the compressed model is saved to the specified output directory: `TinyLlama-1.1B-Chat-v1.0-INT8`. You can then load this model using the Hugging Face Transformers library or vLLM for inference and testing.
## Memory requirements for LLM Compressor
When compressing a model, be aware that the memory requirements depend on both the model size and the algorithm used, such as GPTQ or SparseGPT.

This section shows how to calculate the CPU and GPU memory requirements for each algorithm, using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.

GPTQ and SparseGPT require a large amount of auxiliary memory. These algorithms allocate an auxiliary Hessian matrix for each layer that is onloaded to the GPU, and these Hessian matrices are almost as large as the weights they represent.

In addition, larger models such as DeepSeek R1 use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.
### Things to note when calculating memory requirements for LLM Compressor:
2. How text decoder layers and vision tower layers are loaded onto the GPU differs significantly.

    In the case of text decoder layers, LLM Compressor dynamically loads one layer at a time onto the GPU for computation, while the rest of the model remains in CPU memory (see the sketch after this list).

    However, vision tower layers are loaded onto the GPU all at once. Unlike the text model, vision towers are not split into individual layers before onloading. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.

    At this time, LLM Compressor does not quantize the vision tower, as quantizing it is generally not worth the tradeoff between the latency/throughput gains and the accuracy loss.

3. LLM Compressor does not currently support tensor parallelism for compression. Support for this feature would allow layers to be sharded across GPUs, reducing per-GPU memory usage and speeding up compression.
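
To make the onloading behaviour in point 2 concrete, here is a minimal sketch (illustrative only, not LLM Compressor's actual implementation), assuming a Llama-style model whose decoder layers live under `model.model.layers` and whose optional vision tower is exposed as `model.vision_tower`:

```python
import torch

def compress_layer(layer: torch.nn.Module) -> None:
    """Placeholder for the per-layer calibration/compression step."""
    ...

def sequential_compress(model: torch.nn.Module, device: str = "cuda") -> None:
    # Text decoder layers are onloaded one at a time, so peak GPU usage is
    # roughly one decoder layer plus any auxiliary structures (e.g. Hessians).
    for layer in model.model.layers:
        layer.to(device)         # move a single decoder layer to the GPU
        compress_layer(layer)    # run calibration / weight updates
        layer.to("cpu")          # offload again to free GPU memory
        torch.cuda.empty_cache()

    # A vision tower, by contrast, is moved to the GPU as a single block,
    # so the whole tower must fit in GPU memory at once.
    if hasattr(model, "vision_tower"):
        model.vision_tower.to(device)
```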
### QuantizationModifier or Round-To-Nearest (RTN)
The QuantizationModifier, which applies round-to-nearest (RTN) quantization, does not require any additional memory beyond the storage needed for its quantization parameters (scales and zero points).
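
For reference, a minimal RTN recipe follows the same oneshot pattern as the example earlier in this guide. This is a sketch only: the W4A16 scheme and output directory are illustrative choices, and the exact import paths, scheme names, and arguments may differ between LLM Compressor versions:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# RTN weight-only quantization: no calibration data or auxiliary matrices are
# required, only per-group scales and zero points for each quantized weight.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
)
```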
If we exclude these scales and zero points from our calculation, we can estimate the following memory requirements:
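
As a rough, back-of-the-envelope illustration of how such an estimate is put together (assumed figures for an 8B-parameter model held in bf16 with group-size-128 INT4 quantization parameters, not values taken from the published table):

```python
# Illustrative estimate only: an 8B-parameter model held in bf16, quantized
# with RTN using one scale and one zero point per group of 128 weights.
params = 8e9
bytes_per_param = 2  # bf16/fp16

weight_mem_gb = params * bytes_per_param / 1e9               # ~16 GB of weights
qparam_mem_gb = (params / 128) * 2 * bytes_per_param / 1e9   # ~0.25 GB of scales/zeros

print(f"weights: {weight_mem_gb:.1f} GB, quantization parameters: {qparam_mem_gb:.2f} GB")
```

Because RTN allocates no auxiliary matrices, the quantization parameters are the only overhead, and they are small enough to ignore for a first-order estimate.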
### GPTQ and SparseGPT

The GPTQ and SparseGPT algorithms differ from RTN in that they must also allocate an auxiliary Hessian matrix for each layer that is onloaded to the GPU.

This Hessian matrix is used to improve the accuracy recovery of the algorithm, and it is approximately the same size as the original weights.
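
As a minimal sketch of how this auxiliary memory can be estimated, assume one fp32 Hessian of shape `(in_features, in_features)` per `Linear` weight, as in the standard GPTQ formulation (the exact layout inside LLM Compressor may differ):

```python
import torch

def layer_hessian_bytes(layer: torch.nn.Module) -> int:
    """Estimate the auxiliary Hessian memory for one onloaded decoder layer."""
    total = 0
    for module in layer.modules():
        if isinstance(module, torch.nn.Linear):
            # One (in_features x in_features) fp32 Hessian per Linear weight.
            total += module.in_features ** 2 * 4
    return total
```

Summing this over the Linear modules in the layer currently onloaded to the GPU gives the extra GPU memory that GPTQ/SparseGPT need on top of that layer's own weights.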