# SmolLM3 Unified Inference

A unified Rust implementation for running SmolLM3 models using the Candle ML framework. Supports both quantized (GGUF) and full precision (safetensors) models with a single codebase.

## Features

- **Dual Model Support**: Run either quantized or full precision models
- **Multiple Quantization Levels**: Q4_K_M (1.9GB), Q8_0 (3.3GB), F16 (6.2GB)
- **Chat Template Support**: Automatic formatting for instruction-tuned models
- **Thinking Mode**: Enable reasoning traces with `/think` mode
- **NoPE Architecture**: Supports SmolLM3's mixed RoPE/NoPE layer configuration
- **Auto-download**: Automatically fetches models from the HuggingFace Hub

## Quick Start

### Quantized Model (Recommended)
```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"
```

### Full Precision Model
```bash
cargo run --release --example smollm3 -- \
    --model-type full \
    --dtype f16 \
    --prompt "Write a sorting algorithm in Rust"
```

## Command Line Options

### Model Selection
- `--model-type <TYPE>`: Choose `quantized` or `full` (default: quantized)
- `--model <VARIANT>`: Choose `3b` (instruct) or `3b-base` (default: 3b)
- `--quantization <LEVEL>`: For quantized models - `q4_k_m`, `q8_0`, or `f16` (default: q8_0)
- `--dtype <TYPE>`: For full models - `f32`, `f16`, `bf16`, or `auto` (default: auto)

### Generation Parameters
- `--prompt <TEXT>`: The prompt to generate from
- `-n, --sample-len <NUM>`: Number of tokens to generate (default: 1000)
- `--temperature <FLOAT>`: Sampling temperature, 0 for greedy (default: 0.8)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--top-k <NUM>`: Only sample among the top K tokens
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <NUM>`: Context size for the repeat penalty (default: 64)
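
The sampling flags above can be combined in a single invocation; for example, nucleus sampling with a top-k cap and a mild repetition penalty (the values here are illustrative, not tuned recommendations):

```bash
cargo run --release --example smollm3 -- \
    --temperature 0.7 \
    --top-p 0.9 \
    --top-k 50 \
    --repeat-penalty 1.1 \
    --repeat-last-n 64 \
    --prompt "List three uses of Rust enums"
```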

### Advanced Options
- `--no-chat-template`: Disable chat template formatting (use for base models)
- `--thinking`: Enable thinking/reasoning mode with `/think` tags
- `--split-prompt`: Process prompt tokens individually (for debugging)
- `--tracing`: Enable performance tracing (generates trace JSON)
- `--model-path <PATH>`: Use a local model file instead of auto-download
- `--tokenizer <PATH>`: Use a local tokenizer instead of auto-download
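
For fully offline runs, the download step can be skipped by pointing both flags at files already on disk (the paths below are placeholders for wherever your GGUF and tokenizer files live):

```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --model-path /path/to/SmolLM3-3B-Q8_0.gguf \
    --tokenizer /path/to/tokenizer.json \
    --prompt "Summarize the borrow checker in one paragraph"
```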

## Quantization Comparison

| Level  | Size  | Quality | Use Case                                  |
|--------|-------|---------|-------------------------------------------|
| Q4_K_M | 1.9GB | Good    | Fast inference, constrained environments  |
| Q8_0   | 3.3GB | Better  | Balanced quality and speed                |
| F16    | 6.2GB | Best    | Maximum quality in GGUF format            |
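
The trade-off is the usual one: fewer bits per weight means a smaller file and lower memory use at some cost in output quality. For a memory-constrained machine, the smallest variant from the table can be requested explicitly:

```bash
cargo run --release --example smollm3 -- \
    --model-type quantized \
    --quantization q4_k_m \
    --prompt "Explain Rust's ownership system"
```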

## Examples

### Creative Writing with Thinking Mode
```bash
cargo run --release --example smollm3 -- \
    --thinking \
    --temperature 0.9 \
    --prompt "Write a short sci-fi story about AI"
```

### Code Generation (Base Model)
```bash
cargo run --release --example smollm3 -- \
    --model 3b-base \
    --no-chat-template \
    --temperature 0.2 \
    --prompt "def fibonacci(n):"
```

### High Quality Output
```bash
cargo run --release --example smollm3 -- \
    --model-type full \
    --dtype f16 \
    --temperature 0.7 \
    --prompt "Explain quantum entanglement"
```

## Model Architecture

SmolLM3 uses a hybrid RoPE/NoPE architecture:
- **RoPE layers**: Standard rotary position embeddings (75% of layers)
- **NoPE layers**: No position embeddings (25% of layers; every 4th layer)

This configuration is automatically detected and handled by the implementation.
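
As a rough illustration (this is a sketch, not the actual Candle code), the per-layer choice reduces to an index check following the every-4th-layer rule above; the 36-layer count is assumed only for the sake of the example:

```rust
/// Sketch of the "every 4th layer skips RoPE" rule described above,
/// using 0-based layer indices: layers 3, 7, 11, ... are NoPE layers.
fn layer_uses_rope(layer_idx: usize) -> bool {
    (layer_idx + 1) % 4 != 0
}

fn main() {
    // Assuming 36 decoder layers for illustration: 27 RoPE + 9 NoPE (75% / 25%).
    let num_layers: usize = 36;
    let rope = (0..num_layers).filter(|&i| layer_uses_rope(i)).count();
    println!("RoPE layers: {rope}, NoPE layers: {}", num_layers - rope);
}
```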

## Hardware Requirements

- **Quantized Q4_K_M**: ~2.5GB RAM
- **Quantized Q8_0**: ~4GB RAM
- **Full F16**: ~7GB RAM
- **Full F32**: ~13GB RAM

GPU acceleration is supported via CUDA (with the `cuda` feature) or Metal (on macOS).
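
Assuming the example is built with Candle's usual feature flags, a GPU run looks like the CPU invocation with an extra `--features` argument:

```bash
# NVIDIA GPU (requires a CUDA toolchain)
cargo run --release --features cuda --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"

# Apple Silicon (feature name assumed to be `metal`, per Candle convention)
cargo run --release --features metal --example smollm3 -- \
    --model-type quantized \
    --quantization q8_0 \
    --prompt "Explain Rust's ownership system"
```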

## Troubleshooting

**Model download fails**: Check your internet connection and HuggingFace Hub access.

**Out of memory**: Try a smaller quantization level, or reduce the generation length with `--sample-len`.

**Compilation errors**: Ensure you're using the latest version of the Candle crate.

## License

This implementation follows the Candle framework license. SmolLM3 models are available under Apache 2.0.