|
1 | 1 | # Quickstart |
2 | 2 |
|
3 | | -## How does it work? |
| 3 | +## What is BitsAndBytes? |
4 | 4 |
|
5 | | -... work in progress ... |
| 5 | +`bitsandbytes` is a lightweight, open-source library that makes it possible to train and run **very large models** on consumer GPUs or limited hardware by using **8-bit and 4-bit quantization** techniques. |
6 | 6 |
|
7 | | -(Community contributions would we very welcome!) |
| 7 | +👉 Put simply: |
8 | 8 |
|
9 | | -## Minimal examples |
| 9 | +* Most deep learning models normally store weights in 16-bit (`float16`) or 32-bit (`float32`) numbers. |
| 10 | +* `bitsandbytes` compresses those into 8-bit or even 4-bit representations. |
| 11 | +* This reduces the **memory footprint** and usually keeps accuracy close to the original model. Whether the model also **runs faster** depends on the hardware and the workload.
10 | 12 |
|
11 | | -The following code illustrates the steps above. |
| 13 | +This unlocks the ability to run models like **LLaMA, Mistral, Falcon, or GPT-style LLMs** on GPUs with as little as **8–16 GB VRAM**. |
12 | 14 |
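To get a rough feel for these numbers, weight memory can be estimated as parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic, not a `bitsandbytes` API: the helper name `weight_memory_gb` and the 7-billion-parameter model are assumptions for illustration, and the estimate ignores activations, the KV cache, and quantization metadata such as scaling factors.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from 14 GB in `float16` to 3.5 GB in 4-bit, which is why such a model can fit on a GPU in the 8–16 GB range.
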
|
13 | | -```py |
14 | | -code examples will soon follow |
| 15 | +--- |
| 16 | + |
| 17 | +## How does it work? (Beginner-friendly) |
| 18 | + |
| 19 | +Let’s break it down with an analogy: |
| 20 | + |
| 21 | +* Imagine you have a library of books. Each book is written in **fancy calligraphy (32-bit precision)** — beautiful but heavy. |
| 22 | +* Now, you rewrite the same books in **compact handwriting (8-bit)** — still readable, much lighter to carry. |
| 23 | +* That’s what `bitsandbytes` does for machine learning weights: it stores nearly the same information in a much more compact format.
| 24 | + |
| 25 | +**Key benefits for beginners:** |
| 26 | + |
| 27 | +* ✅ **Memory savings** → Run bigger models on smaller GPUs. |
| 28 | +* ✅ **Speedups** → Smaller weights mean less data to move through GPU memory, which can speed up some workloads.
| 29 | +* ✅ **Plug-and-play** → Works with PyTorch and Hugging Face Transformers without huge code changes. |
| 30 | + |
| 31 | +So, as a beginner, you don’t need to understand all the math under the hood. Just know: it makes models lighter, and often faster, while keeping accuracy close to the original.
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## How does it work? (Nerd edition) |
| 36 | + |
| 37 | +Now let’s peek under the hood 🔬: |
| 38 | + |
| 39 | +* **Quantization**: |
| 40 | + |
| 41 | + * Floating-point weights (e.g., `float32` or `float16`) are mapped to lower-precision representations: `int8` for 8-bit, or 4-bit data types such as `nf4` and `fp4`.
| 42 | + * Each group of values is stored together with a scaling factor, so that the original range can be approximately recovered and little information is lost.
| 43 | + |
| 44 | +* **Custom CUDA kernels**: |
| 45 | + |
| 46 | + * `bitsandbytes` provides hand-optimized CUDA kernels that handle low-precision matrix multiplications efficiently. |
| 47 | + * These kernels apply **dynamic range scaling** to reduce quantization error. |
| 48 | + |
| 49 | +* **8-bit Optimizers**: |
| 50 | + |
| 51 | + * Optimizers such as Adam, AdamW, and RMSprop have 8-bit variants (for example `Adam8bit`, `AdamW8bit`, `RMSprop8bit`).
| 52 | + * Adam keeps two state values per parameter, so 32-bit optimizer states take about *twice the memory of the 32-bit weights themselves*. The 8-bit variants store these states in 8 bits with per-block scaling.
| 53 | + |
| 54 | +* **Block-wise quantization**:
| 55 | +
| 56 | + * Instead of using one scale for the entire tensor, `bitsandbytes` quantizes in blocks (e.g., one scale per 64 values). An outlier then only affects the precision of its own block, which improves accuracy significantly.
| 57 | + |
| 58 | +* **Integrations**: |
| 59 | + |
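| 60 | + * Hugging Face Transformers can load models in 4-bit or 8-bit precision by passing a `BitsAndBytesConfig` with `load_in_4bit=True` or `load_in_8bit=True` as `quantization_config`. Passing these flags directly to `from_pretrained` is deprecated in recent releases.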
| 61 | + * Compatible with FSDP (Fully Sharded Data Parallel) and QLoRA fine-tuning techniques. |
| 62 | + |
| 63 | +In short: it’s not *just smaller numbers*. It’s **mathematically smart quantization + GPU-optimized code** that makes it production-ready. |
| 64 | + |
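The effect of per-block scales can be illustrated without a GPU. The following is a minimal sketch in plain Python that uses linear absmax rounding to signed 8-bit codes; this is a simplification, since the real `bitsandbytes` kernels and 4-bit data types use different quantization maps. The function names and the toy weights (small values plus one outlier) are assumptions made for this example.

```python
def quantize_absmax(values, block_size):
    """Quantize floats to int8 codes, with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_absmax(codes, scales, block_size):
    """Map int8 codes back to approximate floats using the per-block scales."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy data: 127 small weights plus a single large outlier in the last position.
weights = [0.01 * ((i % 7) - 3) for i in range(127)] + [8.0]

# One scale for the whole tensor: the outlier stretches the range for every value.
codes, scales = quantize_absmax(weights, block_size=len(weights))
per_tensor_error = mean_abs_error(weights, dequantize_absmax(codes, scales, len(weights)))

# One scale per 64 values: only the block containing the outlier is affected.
codes, scales = quantize_absmax(weights, block_size=64)
per_block_error = mean_abs_error(weights, dequantize_absmax(codes, scales, 64))

print(f"per-tensor error: {per_tensor_error:.5f}")
print(f"per-block error:  {per_block_error:.5f}")
```

With a single scale, every small weight rounds to zero because the outlier dominates the range. With blocks of 64, the first block keeps its own small scale, so its values survive the round trip almost unchanged.
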
| 65 | +--- |
| 66 | + |
| 67 | +## Minimal Examples |
| 68 | + |
| 69 | +### 1. Using a `bitsandbytes` embedding layer directly
| 70 | +
| 71 | +```python
| 72 | +import torch
| 73 | +import bitsandbytes as bnb
| 74 | +
| 75 | +# Drop-in replacement for torch.nn.Embedding provided by bitsandbytes
| 76 | +embedding = bnb.nn.Embedding(num_embeddings=1000, embedding_dim=128)
| 77 | +x = torch.randint(0, 1000, (4,))
| 78 | +y = embedding(x)
| 79 | +print(y.shape) # torch.Size([4, 128])
| 80 | +```
| 81 | +
| 82 | +This shows that you can drop in `bitsandbytes` layers just like PyTorch ones: the call signature and the output shape match `torch.nn.Embedding`.
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +### 2. Loading a 4-bit model with Hugging Face Transformers |
| 87 | + |
| 88 | +```python |
| 89 | +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
| 90 | +
| 91 | +model_id = "HuggingFaceTB/SmolLM3-3B" # replace with a model you have access to
| 92 | +tokenizer = AutoTokenizer.from_pretrained(model_id)
| 93 | +
| 94 | +# Load in 4-bit precision; device_map="auto" places layers on the available devices
| 95 | +model = AutoModelForCausalLM.from_pretrained(
| 96 | +    model_id,
| 97 | +    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
| 98 | +    device_map="auto",
| 99 | +)
| 100 | + |
| 101 | +# Verify quantized layers |
| 102 | +print(model) |
| 103 | + |
| 104 | +# Generate text |
| 105 | +inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
| 106 | +outputs = model.generate(**inputs, max_new_tokens=50) |
| 107 | +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| 108 | +``` |
| 109 | + |
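| 110 | +When you print the model, you’ll see `Linear4bit` layers, confirming that the linear weights are stored in **4-bit precision**. The matrix multiplications themselves still run in a higher-precision compute dtype.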
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +### 3. Training with 8-bit optimizers (and verifying) |
| 115 | + |
| 116 | +```python |
| 117 | +import torch |
| 118 | +import bitsandbytes as bnb |
| 119 | + |
| 120 | +# Small model; the first layer has 128 * 64 = 8192 weights, above the default 4096-element threshold for 8-bit states
| 121 | +model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)).cuda()
| 122 | +criterion = torch.nn.CrossEntropyLoss()
| 123 | +
| 124 | +# Use 8-bit Adam optimizer
| 125 | +optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
| 126 | +
| 127 | +x = torch.randn(16, 128).cuda()
| 128 | +y = torch.randint(0, 2, (16,)).cuda()
| 129 | +
| 130 | +optimizer.zero_grad()
| 131 | +loss = criterion(model(x), y)
| 132 | +loss.backward()
| 133 | +optimizer.step()
| 134 | +
| 135 | +print(f"Loss: {loss.item():.4f}")
| 136 | +
| 137 | +# --- Inspect optimizer state to confirm 8-bit usage ---
| 138 | +print("Optimizer type:", type(optimizer))
| 139 | +for name, p in model.named_parameters():
| 140 | +    for key, value in optimizer.state[p].items():
| 141 | +        if torch.is_tensor(value):
| 142 | +            print(f"{name} ({p.numel()} elements): {key} dtype={value.dtype}")
15 | 143 | ``` |
| 144 | + |
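| 145 | +The optimizer type will be `<class 'bitsandbytes.optim.adam.Adam8bit'>`. Parameters with at least 4096 elements (the default `min_8bit_size`) keep their optimizer states as quantized `torch.uint8` tensors, confirming training with **8-bit optimizer states**. Smaller tensors, such as biases, fall back to 32-bit states.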
| 146 | + |
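The memory saved by 8-bit optimizer states can be estimated with simple arithmetic. This is a rough sketch, not part of the `bitsandbytes` API: the helper name `adam_state_memory_mb` and the 1-billion-parameter model are assumptions for illustration, and the small overhead of the per-block scaling factors is ignored.

```python
def adam_state_memory_mb(num_params: int, bits_per_state: int) -> float:
    """Memory for Adam's two per-parameter states, in megabytes (1e6 bytes)."""
    states_per_param = 2  # first and second moment estimates
    return num_params * states_per_param * bits_per_state / 8 / 1e6


num_params = 1_000_000_000  # hypothetical 1B-parameter model (assumption)
print(f"32-bit states: {adam_state_memory_mb(num_params, 32):.0f} MB")
print(f" 8-bit states: {adam_state_memory_mb(num_params, 8):.0f} MB")
```

Under these assumptions the optimizer states drop from 8000 MB to 2000 MB, a fourfold reduction.
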
| 147 | +--- |
| 148 | + |
| 149 | +## What’s next? |
| 150 | + |
| 151 | +* [Get started](./introduction) |
| 152 | +* [Installation](./installation) |
| 153 | +* [Quickstart](./quickstart) |
| 154 | +* [Usage Guides](./usage) |
| 155 | +* [8-bit optimizers](./optimizers/overview) |
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +✨ **In summary:** |
| 160 | + |
| 161 | +* Beginners → `bitsandbytes` makes big models smaller and, in many cases, faster.
| 162 | +* Nerds → It achieves this through clever quantization, CUDA kernels, and 8-bit optimizer implementations. |
| 163 | +* Everyone → Can benefit by dropping it into their PyTorch or Hugging Face workflows with minimal code changes, and can **verify** the bit precision being used. |