feat(llm-katan): add CPU quantization for faster inference (#556)
* feat(llm-katan): add CPU quantization for faster inference
Add int8 dynamic quantization support for CPU inference to improve
performance of llm-katan in testing scenarios.
Changes:
- Add --quantize/--no-quantize CLI flag (enabled by default)
- Implement int8 quantization in TransformersBackend for CPU
- Gracefully fall back on platforms without quantization support
- Add comprehensive documentation in README
Performance improvements:
- 2-4x faster inference on supported platforms (Linux x86_64)
- 4x memory reduction with quantization
- Minimal quality impact (acceptable for testing)
Platform notes:
- Works best on Linux with x86_64 CPUs
- Gracefully falls back on unsupported platforms (e.g., Mac)
- Users can disable with --no-quantize for full precision
Closes: #552
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
* fix(llm-katan): apply black formatting with line-length=88
Apply root project black configuration (line-length=88) to match CI
formatting requirements.
Signed-off-by: Yossi Ovadia <[email protected]>
---------
Signed-off-by: Yossi Ovadia <[email protected]>
Co-authored-by: Claude <[email protected]>
LLM Katan includes **automatic int8 quantization** for CPU inference, providing significant performance improvements:

### Performance Gains

- **2-4x faster inference** on CPU (on supported platforms)
- **4x memory reduction**
- **Enabled by default** for the best testing experience
- **Minimal quality impact** (acceptable for testing scenarios)
- **Platform support**: Works best on Linux x86_64; may not be available on all platforms (e.g., Mac)
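For reference, the dynamic int8 quantization described above can be sketched roughly as follows. This is a hedged, illustrative example only, not llm-katan's exact implementation: the model name, prompt, and fallback message are assumptions, and it relies on `torch.quantization.quantize_dynamic` being available on the host platform.

```python
# Illustrative sketch (not llm-katan's exact code): dynamic int8 quantization
# of a Hugging Face causal LM for CPU inference, with a graceful fallback.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # same tiny model used in the examples below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

try:
    # Convert Linear layer weights to int8 ahead of time; activations are
    # quantized on the fly at inference time (dynamic quantization).
    model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
except Exception as exc:
    # Some platforms (e.g., certain Macs) lack a quantized backend; keep fp32.
    print(f"Quantization unavailable, using full precision: {exc}")

inputs = tokenizer("Hello, world", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because only weights are pre-converted and activations are quantized at runtime, no calibration dataset is needed, which is why this approach suits a lightweight testing server.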
### When to Use Quantization

✅ **Enabled (default)** - Recommended for:

- Fast E2E testing
- Development environments
- CI/CD pipelines
- Resource-constrained environments

❌ **Disabled (--no-quantize)** - Use when you need:

- Maximum accuracy (though tiny models have limited accuracy anyway)
- Debugging precision-sensitive issues
- Comparing with full-precision baselines
### Example Performance

```bash
# Default: fast with quantization (~50-100s per inference)
llm-katan --model Qwen/Qwen3-0.6B

# Slower but more accurate (~200s per inference)
llm-katan --model Qwen/Qwen3-0.6B --no-quantize
```

> **Note**: Even with quantization, llm-katan is slower than production tools like LM Studio (which uses llama.cpp with extensive optimizations). For production workloads, use vLLM, Ollama, or similar solutions.
## Use Cases

### Strengths

- **Fastest time-to-test**: 30 seconds from install to running
- **Optimized for CPU**: Automatic int8 quantization for 2-4x speedup
- **Minimal resource footprint**: Designed for tiny models and efficient testing
- **No GPU required**: Runs on laptops, Macs, and any CPU-only environment
- **CI/CD integration friendly**: Lightweight and automation-ready
Optional:

  --max, --max-tokens INTEGER    Maximum tokens to generate (default: 512)
  -t, --temperature FLOAT        Sampling temperature (default: 0.7)
  -d, --device [auto|cpu|cuda]   Device to use (default: auto)
  --quantize/--no-quantize       Enable int8 quantization for faster CPU inference (default: enabled)
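A paired on/off flag like `--quantize/--no-quantize` is the standard Click pattern for a boolean toggle that defaults to enabled. The snippet below is a hypothetical sketch of how such a flag is usually declared, not the project's actual CLI code; the function and option names besides the flag itself are assumptions.

```python
# Hypothetical sketch: wiring up a --quantize/--no-quantize toggle with Click.
# llm-katan's real CLI may differ in structure and option names.
import click


@click.command()
@click.option("--model", "-m", required=True, help="Model name, e.g. Qwen/Qwen3-0.6B")
@click.option(
    "--quantize/--no-quantize",
    default=True,
    show_default=True,
    help="Enable int8 quantization for faster CPU inference",
)
def main(model: str, quantize: bool) -> None:
    """Toy entry point that just echoes the parsed options."""
    click.echo(f"model={model}, quantize={quantize}")


if __name__ == "__main__":
    main()
```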