Commit ea162e8

quantized and full SmolLM3
1 parent 9ede204 commit ea162e8

File tree

8 files changed: +1914, −0 lines changed

candle-examples/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ candle-flash-attn = { workspace = true, optional = true }
 candle-onnx = { workspace = true, optional = true }

 csv = "1.3.0"
+chrono = "0.4.0"
 cudarc = { workspace = true, optional = true }
 half = { workspace = true, optional = true }
 hf-hub = { workspace = true, features = ["tokio"] }

Lines changed: 120 additions & 0 deletions (new file)
# SmolLM3 Unified Inference

A unified Rust implementation for running SmolLM3 models using the Candle ML framework. Supports both quantized (GGUF) and full precision (safetensors) models with a single codebase.

## Features

- **Dual Model Support**: Run either quantized or full precision models
- **Multiple Quantization Levels**: Q4_K_M (1.9 GB), Q8_0 (3.3 GB), F16 (6.2 GB)
- **Chat Template Support**: Automatic formatting for instruction-tuned models
- **Thinking Mode**: Enable reasoning traces with `/think` mode
- **NoPE Architecture**: Supports SmolLM3's mixed RoPE/NoPE layer configuration
- **Auto-download**: Automatically fetches models from HuggingFace Hub

## Quick Start

### Quantized Model (Recommended)
```bash
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"
```

### Full Precision Model
```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Write a sorting algorithm in Rust"
```

## Command Line Options

### Model Selection
- `--model-type <TYPE>`: Choose `quantized` or `full` (default: quantized)
- `--model <VARIANT>`: Choose `3b` (instruct) or `3b-base` (default: 3b)
- `--quantization <LEVEL>`: For quantized models - `q4_k_m`, `q8_0`, or `f16` (default: q8_0)
- `--dtype <TYPE>`: For full models - `f32`, `f16`, `bf16`, or `auto` (default: auto)

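For example, to run the instruct variant with the smallest GGUF quantization, the flags above combine like this:

```bash
cargo run --release --example smollm3 -- \
  --model 3b \
  --model-type quantized \
  --quantization q4_k_m \
  --prompt "What does Rust's borrow checker guarantee?"
```
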
### Generation Parameters
- `--prompt <TEXT>`: The prompt to generate from
- `-n, --sample-len <NUM>`: Number of tokens to generate (default: 1000)
- `--temperature <FLOAT>`: Sampling temperature, 0 for greedy (default: 0.8)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--top-k <NUM>`: Only sample among top K tokens
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <NUM>`: Context size for repeat penalty (default: 64)

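These sampling flags compose; a longer, more exploratory completion might look like this (the values are illustrative, not tuned recommendations):

```bash
cargo run --release --example smollm3 -- \
  --temperature 0.7 \
  --top-p 0.9 \
  --top-k 50 \
  --repeat-penalty 1.1 \
  -n 512 \
  --prompt "List three idiomatic uses of Rust enums"
```
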
### Advanced Options
- `--no-chat-template`: Disable chat template formatting (use for base models)
- `--thinking`: Enable thinking/reasoning mode with `/think` tags
- `--split-prompt`: Process prompt tokens individually (for debugging)
- `--tracing`: Enable performance tracing (generates trace JSON)
- `--model-path <PATH>`: Use local model file instead of auto-download
- `--tokenizer <PATH>`: Use local tokenizer instead of auto-download

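For fully offline runs, the auto-download can be skipped by pointing the example at local files (the paths below are placeholders for wherever your GGUF file and tokenizer live):

```bash
# Placeholder paths; adjust to your local files
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --model-path /path/to/model.gguf \
  --tokenizer /path/to/tokenizer.json \
  --prompt "Hello"
```
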
## Quantization Comparison

| Level  | Size   | Quality | Use Case                                 |
|--------|--------|---------|------------------------------------------|
| Q4_K_M | 1.9 GB | Good    | Fast inference, constrained environments |
| Q8_0   | 3.3 GB | Better  | Balanced quality and speed               |
| F16    | 6.2 GB | Best    | Maximum quality in GGUF format           |

## Examples

### Creative Writing with Thinking Mode
```bash
cargo run --release --example smollm3 -- \
  --thinking \
  --temperature 0.9 \
  --prompt "Write a short sci-fi story about AI"
```

### Code Generation (Base Model)
```bash
cargo run --release --example smollm3 -- \
  --model 3b-base \
  --no-chat-template \
  --temperature 0.2 \
  --prompt "def fibonacci(n):"
```

### High Quality Output
```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --temperature 0.7 \
  --prompt "Explain quantum entanglement"
```

## Model Architecture

SmolLM3 uses a hybrid RoPE/NoPE architecture:
- **RoPE layers**: Standard rotary position embeddings (75% of layers)
- **NoPE layers**: No position embeddings (25% of layers - every 4th layer)

This configuration is automatically detected and handled by the implementation.

## Hardware Requirements

- **Quantized Q4_K_M**: ~2.5 GB RAM
- **Quantized Q8_0**: ~4 GB RAM
- **Full F16**: ~7 GB RAM
- **Full F32**: ~13 GB RAM

GPU acceleration is supported via CUDA (with the `cuda` feature) or Metal (on macOS).

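As a sketch, assuming the standard candle-examples cargo feature flags, a GPU build looks like this:

```bash
# NVIDIA GPU: build with the `cuda` feature
cargo run --release --features cuda --example smollm3 -- \
  --prompt "Explain quantum entanglement"

# Apple Silicon: build with the `metal` feature
cargo run --release --features metal --example smollm3 -- \
  --prompt "Explain quantum entanglement"
```
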
## Troubleshooting

**Model download fails**: Check your internet connection and HuggingFace Hub access.

**Out of memory**: Try a smaller quantization level, or use `--sample-len` to reduce the generation length.

**Compilation errors**: Ensure you're using the latest version of Candle.

## License

This implementation follows the Candle framework license. SmolLM3 models are released under Apache 2.0.
