# SmolLM3 Unified Inference

A unified Rust implementation for running SmolLM3 models using the Candle ML framework. Supports both quantized (GGUF) and full precision (safetensors) models with a single codebase.

## Features

- **Dual Model Support**: Run either quantized or full precision models
- **Multiple Quantization Levels**: Q4_K_M (1.9GB), Q8_0 (3.3GB), F16 (6.2GB)
- **Chat Template Support**: Automatic formatting for instruction-tuned models
- **Thinking Mode**: Enable reasoning traces with `/think` mode
- **NoPE Architecture**: Supports SmolLM3's mixed RoPE/NoPE layer configuration
- **Auto-download**: Automatically fetches models from the HuggingFace Hub

## Quick Start

### Quantized Model (Recommended)
```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"
```

### Full Precision Model
```bash
cargo run --release --example smollm3 -- \
    --model-type full \
    --dtype f16 \
    --prompt "Write a sorting algorithm in Rust"
```

## Command Line Options

### Model Selection
- `--model-type <TYPE>`: Choose `quantized` or `full` (default: quantized)
- `--model <VARIANT>`: Choose `3b` (instruct) or `3b-base` (default: 3b)
- `--quantization <LEVEL>`: For quantized models - `q4_k_m`, `q8_0`, or `f16` (default: q8_0)
- `--dtype <TYPE>`: For full models - `f32`, `f16`, `bf16`, or `auto` (default: auto)

### Generation Parameters
- `--prompt <TEXT>`: The prompt to generate from
- `-n, --sample-len <NUM>`: Number of tokens to generate (default: 1000)
- `--temperature <FLOAT>`: Sampling temperature, 0 for greedy (default: 0.8)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--top-k <NUM>`: Only sample among the top K tokens
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <NUM>`: Context size for the repeat penalty (default: 64)
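
The sampling flags above can be combined in a single invocation; for example, nucleus sampling with a top-k cap and a mild repetition penalty (the values here are illustrative, not tuned recommendations):

```bash
cargo run --release --example smollm3 -- \
    --temperature 0.7 \
    --top-p 0.9 \
    --top-k 50 \
    --repeat-penalty 1.1 \
    --repeat-last-n 64 \
    --prompt "List three uses of Rust enums"
```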

### Advanced Options
- `--no-chat-template`: Disable chat template formatting (use for base models)
- `--thinking`: Enable thinking/reasoning mode with `/think` tags
- `--split-prompt`: Process prompt tokens individually (for debugging)
- `--tracing`: Enable performance tracing (generates trace JSON)
- `--model-path <PATH>`: Use a local model file instead of auto-download
- `--tokenizer <PATH>`: Use a local tokenizer instead of auto-download
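
For fully offline runs, the download step can be skipped by pointing both flags at files already on disk (the paths below are placeholders for wherever your GGUF and tokenizer files live):

```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --model-path /path/to/SmolLM3-3B-Q8_0.gguf \
    --tokenizer /path/to/tokenizer.json \
    --prompt "Summarize the borrow checker in one paragraph"
```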

## Quantization Comparison

| Level  | Size  | Quality | Use Case                                  |
|--------|-------|---------|-------------------------------------------|
| Q4_K_M | 1.9GB | Good    | Fast inference, constrained environments  |
| Q8_0   | 3.3GB | Better  | Balanced quality and speed                |
| F16    | 6.2GB | Best    | Maximum quality in GGUF format            |
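
The trade-off is the usual one: fewer bits per weight means a smaller file and lower memory use at some cost in output quality. For a memory-constrained machine, the smallest variant from the table can be requested explicitly:

```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --quantization q4_k_m \
    --prompt "Explain Rust's ownership system"
```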

## Examples

### Creative Writing with Thinking Mode
```bash
cargo run --release --example smollm3 -- \
    --thinking \
    --temperature 0.9 \
    --prompt "Write a short sci-fi story about AI"
```

### Code Generation (Base Model)
```bash
cargo run --release --example smollm3 -- \
    --model 3b-base \
    --no-chat-template \
    --temperature 0.2 \
    --prompt "def fibonacci(n):"
```

### High Quality Output
```bash
cargo run --release --example smollm3 -- \
    --model-type full \
    --dtype f16 \
    --temperature 0.7 \
    --prompt "Explain quantum entanglement"
```

## Model Architecture

SmolLM3 uses a hybrid RoPE/NoPE architecture:
- **RoPE layers**: Standard rotary position embeddings (75% of layers)
- **NoPE layers**: No position embeddings (25% of layers; every 4th layer)

This configuration is automatically detected and handled by the implementation.
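
As a rough illustration (this is a sketch, not the actual Candle code), the per-layer choice reduces to an index check following the every-4th-layer rule above; the 36-layer count is assumed only for the sake of the example:

```rust
/// Sketch of the "every 4th layer skips RoPE" rule described above,
/// using 0-based layer indices: layers 3, 7, 11, ... are NoPE layers.
fn layer_uses_rope(layer_idx: usize) -> bool {
    (layer_idx + 1) % 4 != 0
}

fn main() {
    // Assuming 36 decoder layers for illustration: 27 RoPE + 9 NoPE (75% / 25%).
    let num_layers: usize = 36;
    let rope = (0..num_layers).filter(|&i| layer_uses_rope(i)).count();
    println!("RoPE layers: {rope}, NoPE layers: {}", num_layers - rope);
}
```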

## Hardware Requirements

- **Quantized Q4_K_M**: ~2.5GB RAM
- **Quantized Q8_0**: ~4GB RAM
- **Full F16**: ~7GB RAM
- **Full F32**: ~13GB RAM

GPU acceleration is supported via CUDA (with the `cuda` feature) or Metal (on macOS).
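
Assuming the example is built with Candle's usual feature flags, a GPU run looks like the CPU invocation with an extra `--features` argument:

```bash
# NVIDIA GPU (requires a CUDA toolchain)
cargo run --release --features cuda --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"

# Apple Silicon (feature name assumed to be `metal`, per Candle convention)
cargo run --release --features metal --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"
```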

## Troubleshooting

**Model download fails**: Check your internet connection and HuggingFace Hub access.

**Out of memory**: Try a smaller quantization level, or reduce the generation length with `--sample-len`.

**Compilation errors**: Ensure you're using the latest version of the Candle crate.

## License

This implementation follows the Candle framework license. SmolLM3 models are available under Apache 2.0.