Commit abf9343

docs: improve quickstart with bitsandbytes usage and examples
1 parent 7bfe923 commit abf9343

docs/source/quickstart.mdx

Lines changed: 155 additions & 7 deletions
@@ -1,15 +1,163 @@
11
# Quickstart
22

3-
## How does it work?
3+
## What is BitsAndBytes?
44

5-
... work in progress ...
5+
`bitsandbytes` is a lightweight, open-source library that makes it possible to train and run **very large models** on consumer GPUs or limited hardware by using **8-bit and 4-bit quantization** techniques.
66

7-
(Community contributions would we very welcome!)
7+
👉 Put simply:
88

9-
## Minimal examples
9+
* Most deep learning models normally store weights in 16-bit (`float16`) or 32-bit (`float32`) numbers.
10+
* `bitsandbytes` compresses those into 8-bit or even 4-bit representations.
11+
* This reduces the **memory footprint** by roughly 2–4× compared with 16-bit weights, can make models **faster to run** when memory bandwidth is the bottleneck, and usually preserves nearly the same accuracy.
1012

11-
The following code illustrates the steps above.
13+
This unlocks the ability to run models like **LLaMA, Mistral, Falcon, or GPT-style LLMs** on GPUs with as little as **8–16 GB VRAM**.
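
To make the memory claim concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter count is an illustrative assumption, and the estimate covers weights only: activations, the KV cache, and the small overhead of quantization scales are ignored.

```python
# Rough weight-memory estimate; weights only, ignoring activations,
# the KV cache, and the small overhead of quantization scales.
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(num_params: int, bits: int) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
estimates = {
    name: weight_memory_gb(num_params, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}
for name, gb in estimates.items():
    print(f"{name:>8}: {gb:5.1f} GB")
```

Under these assumptions, only the 8-bit and 4-bit variants fit comfortably within the 8–16 GB range mentioned above.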
1214

13-
```py
14-
code examples will soon follow
15+
---
16+
17+
## How does it work? (Beginner-friendly)
18+
19+
Let’s break it down with an analogy:
20+
21+
* Imagine you have a library of books. Each book is written in **fancy calligraphy (32-bit precision)** — beautiful but heavy.
22+
* Now, you rewrite the same books in **compact handwriting (8-bit)** — still readable, much lighter to carry.
23+
* That’s what `bitsandbytes` does for machine learning weights: it stores the same information in a compressed but efficient format.
24+
25+
**Key benefits for beginners:**
26+
27+
* **Memory savings** → Run bigger models on smaller GPUs.
28+
* **Speedups** → Smaller weights mean less data to move, which can speed up memory-bound workloads.
29+
* **Plug-and-play** → Works with PyTorch and Hugging Face Transformers without huge code changes.
30+
31+
So, as a beginner, you don’t need to understand all the math under the hood. Just know that it makes models lighter, and often faster, while keeping them nearly as accurate.
32+
33+
---
34+
35+
## How does it work? (Nerd edition)
36+
37+
Now let’s peek under the hood 🔬:
38+
39+
* **Quantization**:
40+
41+
  * Floating-point weights (e.g., `float32`) are mapped to lower-precision representations: 8-bit integers (`int8`) or 4-bit data types such as NF4 and FP4.
42+
  * Each group of values is stored together with a scaling factor, so the reduced representation doesn’t lose too much information.
43+
44+
* **Custom CUDA kernels**:
45+
46+
  * `bitsandbytes` provides hand-optimized CUDA kernels that handle low-precision matrix multiplications efficiently.
47+
  * These kernels apply **dynamic range scaling** to reduce quantization error.
48+
49+
* **8-bit Optimizers**:
50+
51+
  * Optimizers like Adam, AdamW, RMSProp, etc., are reimplemented with 8-bit optimizer states.
52+
  * Instead of storing optimizer states in 32-bit (for Adam, two states per parameter, which takes *more memory than the weights themselves*), these states are stored in 8-bit with block-wise scaling.
53+
54+
* **Block-wise quantization**:
55+
56+
  * Instead of using one scale for the entire tensor, `bitsandbytes` stores one scale per block of values (e.g., per 64 values), so a single outlier only affects its own block. This improves accuracy significantly.
57+
58+
* **Integrations**:
59+
60+
  * Hugging Face Transformers can load models in 4-bit or 8-bit precision by passing a `BitsAndBytesConfig` with `load_in_4bit=True` or `load_in_8bit=True`.
61+
  * Compatible with FSDP (Fully Sharded Data Parallel) and QLoRA fine-tuning techniques.
62+
63+
In short: it’s not *just smaller numbers*. It’s **mathematically smart quantization + GPU-optimized code** that makes it production-ready.
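
The per-block scaling idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses linear absmax rounding to `int8`, whereas the library's actual kernels use their own quantization maps, and the helper names (`quantize_blockwise`, `dequantize_blockwise`, `max_error`) are hypothetical, not part of the `bitsandbytes` API.

```python
def quantize_blockwise(values, block_size=64):
    """Absmax-quantize a list of floats to int8 codes, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=64):
    """Map int8 codes back to approximate floats using each block's scale."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Small weights, plus a single large outlier that sits in the second block
weights = [0.01 * (i % 7 - 3) for i in range(128)]
weights[100] = 8.0


def max_error(block_size):
    """Largest round-trip error over the first 64 values (no outlier there)."""
    codes, scales = quantize_blockwise(weights, block_size)
    restored = dequantize_blockwise(codes, scales, block_size)
    return max(abs(w - r) for w, r in zip(weights[:64], restored[:64]))


print("one scale for the whole tensor:", max_error(block_size=128))
print("one scale per 64 values:       ", max_error(block_size=64))
```

With a single scale, the outlier stretches the range so far that the small weights all round to zero. With one scale per block, only the block containing the outlier pays that price.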
64+
65+
---
66+
67+
## Minimal Examples
68+
69+
### 1. Using a `bitsandbytes` embedding layer directly
70+
71+
```python
import torch
import bitsandbytes as bnb

# Drop-in replacement for torch.nn.Embedding. In current releases the weights
# of this layer stay in full precision, and bitsandbytes keeps 32-bit
# optimizer state for it.
embedding = bnb.nn.Embedding(num_embeddings=1000, embedding_dim=128)
x = torch.randint(0, 1000, (4,))  # four random token indices
y = embedding(x)
print(y.shape)  # torch.Size([4, 128])
```
81+
82+
This shows that you can drop in `bitsandbytes` layers just like PyTorch ones.
83+
84+
---
85+
86+
### 2. Loading a 4-bit model with Hugging Face Transformers
87+
88+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "HuggingFaceTB/SmolLM3-3B"  # replace with a model you have access to
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Request 4-bit weights. Passing load_in_4bit directly to from_pretrained
# is deprecated in recent Transformers releases.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# device_map="auto" lets Accelerate place the layers on the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

# Verify quantized layers
print(model)

# Generate text (move the inputs to wherever the model was placed)
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
109+
110+
When you print the model, you’ll see `Linear4bit` layers, confirming that the linear weights are stored in **4-bit precision**.
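
Printing a large model produces a lot of output, so a summary by layer type can be easier to read. The sketch below is one possible way to do that. The helper names `count_layer_types` and `summarize_layers` are hypothetical, and the code relies only on the standard `modules()` method of `torch.nn.Module`.

```python
from collections import Counter


def count_layer_types(model):
    """Count submodules by class name; works for any torch.nn.Module."""
    return Counter(type(module).__name__ for module in model.modules())


def summarize_layers(model, top=5):
    """Return 'ClassName: count' lines for the most common layer types."""
    return [
        f"{name}: {count}"
        for name, count in count_layer_types(model).most_common(top)
    ]
```

With the model loaded above, `print("\n".join(summarize_layers(model)))` should list `Linear4bit` among the most common types. `model.get_memory_footprint()` from Transformers reports how many bytes the loaded weights occupy.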
111+
112+
---
113+
114+
### 3. Training with 8-bit optimizers (and verifying)
115+
116+
```python
import torch
import bitsandbytes as bnb

# Small model. By default, bitsandbytes only quantizes optimizer state for
# tensors with at least 4096 elements (min_8bit_size), so the first layer
# (128 * 64 = 8192 weights) is large enough to get 8-bit state.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
).cuda()
criterion = torch.nn.CrossEntropyLoss()

# Use 8-bit Adam optimizer
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

x = torch.randn(16, 128).cuda()
y = torch.randint(0, 2, (16,)).cuda()

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")

# --- Inspect optimizer state to confirm 8-bit usage ---
print("Optimizer type:", type(optimizer))
for name, p in model.named_parameters():
    state = optimizer.state[p]
    # Show the dtype of every tensor kept in this parameter's optimizer state
    dtypes = {k: v.dtype for k, v in state.items() if torch.is_tensor(v)}
    print(f"{name}: {dtypes}")
```
144+
145+
The optimizer type will be `<class 'bitsandbytes.optim.adam.Adam8bit'>`. State tensors for large parameters (at least 4096 elements by default) are stored as `torch.uint8`, confirming **8-bit** optimizer state; smaller parameters keep 32-bit state.
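
To get a feel for how much memory 8-bit optimizer state saves, here is a rough estimate. It is a sketch under stated assumptions: Adam keeps two states per parameter, quantized states carry one 32-bit scale per block, the block size of 256 is an assumed value, and the 1-billion-parameter count is purely illustrative. The function `adam_state_bytes` is hypothetical and not part of `bitsandbytes`.

```python
def adam_state_bytes(num_params, bits, block_size=256):
    """Estimate Adam state size in bytes: two states per parameter, plus one
    32-bit scale per block and per state when the states are quantized."""
    state_bytes = 2 * num_params * bits // 8
    if bits < 32:
        num_blocks = -(-num_params // block_size)  # ceiling division
        state_bytes += 2 * num_blocks * 4
    return state_bytes


num_params = 1_000_000_000  # hypothetical 1B-parameter model (assumption)
full = adam_state_bytes(num_params, bits=32)
small = adam_state_bytes(num_params, bits=8)
print(f"32-bit Adam state: {full / 1e9:.2f} GB")
print(f" 8-bit Adam state: {small / 1e9:.2f} GB")
print(f"saving: {1 - small / full:.0%}")
```

The per-block scales add only a small overhead, so the 8-bit state ends up at roughly a quarter of the 32-bit size.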
146+
147+
---
148+
149+
## What’s next?
150+
151+
* [Get started](./introduction)
152+
* [Installation](./installation)
153+
* [Quickstart](./quickstart)
154+
* [Usage Guides](./usage)
155+
* [8-bit optimizers](./optimizers/overview)
156+
157+
---
158+
159+
**In summary:**
160+
161+
* Beginners → `bitsandbytes` makes big models smaller and faster.
162+
* Nerds → It achieves this through clever quantization, CUDA kernels, and 8-bit optimizer implementations.
163+
* Everyone → Can benefit by dropping it into their PyTorch or Hugging Face workflows with minimal code changes, and can **verify** the bit precision being used.
