Commit 7399f80

[docs] fix bugs in the bitsandbytes documentation (#35868)
* fix doc
* update model
1 parent 0a1a8e3 commit 7399f80

File tree

1 file changed: +5 −4 lines changed


docs/source/en/quantization/bitsandbytes.md

Lines changed: 5 additions & 4 deletions
@@ -208,7 +208,8 @@ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
 model_id = "bigscience/bloom-1b7"
 
 quantization_config = BitsAndBytesConfig(
-    llm_int8_threshold=10,
+    llm_int8_threshold=10.0,
+    llm_int8_enable_fp32_cpu_offload=True
 )
 
 model_8bit = AutoModelForCausalLM.from_pretrained(
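
For context, the corrected 8-bit example reads roughly as in the sketch below. It is only an approximation of the surrounding documentation: `load_in_8bit=True` and the `device_map` argument are assumptions, not part of this diff, and `llm_int8_enable_fp32_cpu_offload=True` only matters when the device map actually places some modules on the CPU.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,                      # assumption: 8-bit loading, as in the surrounding docs
    llm_int8_threshold=10.0,                # outlier threshold, given as a float
    llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-offloaded modules in fp32
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",                      # assumption: the docs define a custom device_map here
)
```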
@@ -285,7 +286,7 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa
 
 ### Nested quantization
 
-Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
+Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
 
 ```py
 from transformers import BitsAndBytesConfig
@@ -295,7 +296,7 @@ double_quant_config = BitsAndBytesConfig(
     bnb_4bit_use_double_quant=True,
 )
 
-model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", torch_dtype="auto", quantization_config=double_quant_config)
+model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf", torch_dtype="auto", quantization_config=double_quant_config)
 
 ```
 
 ## Dequantizing `bitsandbytes` models
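
Pieced together, the nested-quantization example that this hunk and the previous one touch looks roughly like the sketch below; `load_in_4bit=True` is an assumption based on the surrounding documentation rather than something shown in the diff.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,               # assumption: 4-bit loading, as in the surrounding docs
    bnb_4bit_use_double_quant=True,  # second quantization of the quantization constants (~0.4 bits/param saved)
)

model_double_quant = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    torch_dtype="auto",
    quantization_config=double_quant_config,
)
```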
@@ -307,7 +308,7 @@ from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
 
 model_id = "facebook/opt-125m"
 
-model = AutoModelForCausalLM.from_pretrained(model_id, BitsAndBytesConfig(load_in_4bit=True))
+model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 model.dequantize()
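
The dequantization fix passes the config through the `quantization_config` keyword instead of positionally. A sketch of the corrected snippet, with comments added for clarity:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-125m"

# The config has to be passed via the quantization_config keyword for the
# quantization settings to be applied when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Convert the quantized weights back to their full-precision representation.
model.dequantize()
```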
