overview-quantization-transformers.md: 4 additions & 4 deletions
@@ -86,7 +86,7 @@ We will use the following setup:
 ### Inference speed (forward pass only)
-This benchmark measures only the prefill step, which corresponds to the foward pass during training. It was run on a single NVIDIA A100-SXM4-80GB GPU with a prompt length of 512. The model we used was `meta-llama/Llama-2-13b-hf`.
+This benchmark measures only the prefill step, which corresponds to the forward pass during training. It was run on a single NVIDIA A100-SXM4-80GB GPU with a prompt length of 512. The model we used was `meta-llama/Llama-2-13b-hf`.
 with batch size = 1:
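For readers skimming this hunk, a minimal sketch of how such a prefill (forward-pass) measurement could be set up with `transformers` is shown below. The dummy 512-token prompt, fp16 dtype, and CUDA-event timing are illustrative assumptions, not the blog's actual benchmark script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Dummy prompt of exactly 512 tokens, batch size = 1 (illustrative assumption).
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 512), device=model.device)

with torch.no_grad():
    model(input_ids)  # warm-up pass
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    model(input_ids)  # the prefill step: one forward pass over the full prompt
    end.record()
    torch.cuda.synchronize()

print(f"prefill latency: {start.elapsed_time(end):.1f} ms")
```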
@@ -113,7 +113,7 @@ The following benchmarks measure the generation speed of the model during infere
 #### use_cache
 Let's test `use_cache` to better understand the impact of caching the hidden state during the generation.
-The benchmark was run on a A100 with a prompt length of 30 and we generated exactly 30 tokens. The model we used was `meta-llama/Llama-2-7b-hf`.
+The benchmark was run on an A100 with a prompt length of 30 and we generated exactly 30 tokens. The model we used was `meta-llama/Llama-2-7b-hf`.
 with `use_cache=True`
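A rough sketch of what such a `use_cache` comparison might look like is shown below; the random 30-token prompt, greedy decoding, and timing loop are assumptions rather than the script behind the reported numbers.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 30-token dummy prompt (illustrative assumption).
prompt_ids = torch.randint(0, tokenizer.vocab_size, (1, 30), device=model.device)

for use_cache in (True, False):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(
        prompt_ids,
        max_new_tokens=30,
        min_new_tokens=30,    # force exactly 30 generated tokens
        do_sample=False,
        use_cache=use_cache,  # toggle key/value caching of past hidden states
    )
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f} s")
```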
@@ -169,7 +169,7 @@ From the result, we conclude that bitsandbytes is faster than GPTQ for fine-tuni
 ### Performance degradation
-Quantization is great for reducing memory comsumption. However, it does come with performance degradation. Let's compare the performance using the [Open-LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) !
+Quantization is great for reducing memory consumption. However, it does come with performance degradation. Let's compare the performance using the [Open-LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) !
 with 7b model:
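To make the memory-consumption point concrete, here is a small sketch (an illustration, not part of the original post) comparing the footprint of an fp16 load against a 4-bit bitsandbytes load of the same 7b model; the NF4 setting is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

# Half-precision baseline.
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16:  {fp16_model.get_memory_footprint() / 1e9:.1f} GB")

# 4-bit quantization with bitsandbytes (zero-shot, no calibration data needed).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
int4_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
print(f"4-bit: {int4_model.get_memory_footprint() / 1e9:.1f} GB")
```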
@@ -192,7 +192,7 @@ From the results above, we conclude that there is less degradation in bigger mod
 ## Conclusion and final words
-In this blogpost, we compared bitsandbytes and GPTQ quantization across mutliple setups. We saw that bitsandbytes is better suited for fine-tuning while GPTQ is better for generation. From this observation, one way to get better merged models would be to:
+In this blogpost, we compared bitsandbytes and GPTQ quantization across multiple setups. We saw that bitsandbytes is better suited for fine-tuning while GPTQ is better for generation. From this observation, one way to get better merged models would be to:
 - (1) quantize the base model using bitsandbytes (zero-shot quantization)