Releases: codewithdark-git/QuantLLM
QuantLLM v2.0
🚀 QuantLLM v2.0
The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Finetune → Export — All in One Line
🎉 What's New in v2.0
We're excited to announce QuantLLM v2.0 — a complete redesign focused on simplicity, performance, and developer experience. This release transforms LLM quantization from a complex multi-step process into a single, intuitive workflow.
✨ Key Features
🔥 TurboModel: One API to Rule Them All
Gone are the days of juggling multiple libraries. TurboModel unifies everything:
```python
from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text instantly
response = model.generate("Explain quantum computing in simple terms")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")  # → llama.cpp, Ollama, LM Studio
model.export("onnx")                         # → ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")     # → Apple Silicon
```
📦 Multi-Format Export
Export your models to any deployment target:
| Format | Use Case | Platforms |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | Windows, Linux, macOS |
| ONNX | ONNX Runtime, TensorRT | Cross-platform |
| MLX | Apple Silicon optimized | macOS (M1/M2/M3/M4) |
| SafeTensors | HuggingFace Transformers | Cross-platform |
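For illustration, the same loaded model can be fanned out to several of these targets by reusing the export calls shown above; this is only a sketch, and the output filenames are placeholders:

```python
from quantllm import turbo

# Load once, then export to several deployment targets.
# File names below are illustrative placeholders.
model = turbo("meta-llama/Llama-3.2-3B")

model.export("gguf", "llama-3.2-3b.Q4_K_M.gguf", quantization="Q4_K_M")  # llama.cpp / Ollama / LM Studio
model.export("onnx")                                                     # ONNX Runtime / TensorRT
model.export("mlx", quantization="4bit")                                 # Apple Silicon (macOS only)
```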
🎯 Native GGUF Export — No C++ Required!
Forget compiling llama.cpp or wrestling with C++ toolchains:
```python
# Just works™ — on any platform
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
```
- ✅ Pure Python implementation
- ✅ All quantization types: `Q2_K` → `Q8_0`, `F16`, `F32`
- ✅ Windows, Linux, macOS — zero configuration
- ✅ Automatic llama.cpp installation when needed
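To sanity-check an exported file outside QuantLLM, one option is the third-party llama-cpp-python bindings (not part of this release; shown here as a sketch assuming `pip install llama-cpp-python`):

```python
# Quick sanity check of an exported GGUF file using llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=2048)
out = llm("Explain quantum computing in simple terms", max_tokens=64)
print(out["choices"][0]["text"])
```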
🤗 One-Click Hub Publishing
Push directly to HuggingFace with auto-generated model cards:
```python
model.push(
    "your-username/my-awesome-model",
    format="gguf",
    quantization="Q4_K_M"
)
```
Auto-generated features:
- 📋 Proper YAML frontmatter (`library_name`, `tags`, `base_model`)
- 📖 Format-specific usage examples
- 🔘 "Use this model" button compatibility
- 📊 Quantization details and benchmarks
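Once pushed, the artifact can be pulled back like any other Hub file. A minimal sketch using `huggingface_hub`; the repo id and filename below are hypothetical, matching the example above:

```python
from huggingface_hub import hf_hub_download

# Fetch the quantized file uploaded by model.push().
# Repo id and filename are placeholders for this example.
path = hf_hub_download(
    repo_id="your-username/my-awesome-model",
    filename="model.Q4_K_M.gguf",
)
print("Downloaded to:", path)
```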
🎨 Beautiful Developer Experience
Themed Progress & Logging
A cohesive orange theme across all interactions:
```
╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                     🚀 QuantLLM v2.0.0                     ║
║             Ultra-fast LLM Quantization & Export           ║
║                                                            ║
║          ✓ GGUF   ✓ ONNX   ✓ MLX   ✓ SafeTensors           ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝
```
SmartConfig Auto-Detection
See exactly what's happening before loading:
```
┌─────────────────────────────────────────┐
│ 📊 Model Analysis                       │
├─────────────────────────────────────────┤
│ Parameters    7.24B                     │
│ Original      14.5 GB                   │
│ Quantized     4.2 GB (71% saved)        │
│ GPU Memory    Available: 24 GB ✓        │
└─────────────────────────────────────────┘
```
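The figures in the panel follow from simple arithmetic; here is a rough back-of-envelope version of the estimate (the bits-per-weight value for the quantized case is approximate):

```python
# Back-of-envelope size estimate behind the panel above.
params = 7.24e9                  # model parameters
fp16_gb = params * 2 / 1e9       # 2 bytes per weight in FP16      -> ~14.5 GB
q4_gb = params * 4.6 / 8 / 1e9   # ~4.6 bits/weight (Q4_K_M-style) -> ~4.2 GB

print(f"Original:  {fp16_gb:.1f} GB")
print(f"Quantized: {q4_gb:.1f} GB ({1 - q4_gb / fp16_gb:.0%} saved)")
```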
Clean Console Output
- 🔇 Suppressed HuggingFace/Datasets noise
- 📊 Rich progress bars with ETA
- ✅ Clear success/error indicators
- 🎯 Actionable error messages
⚡ Performance Optimizations
| Feature | Improvement |
|---|---|
| `torch.compile` | Up to 2x faster training |
| Dynamic Padding | 30-50% less VRAM usage |
| Flash Attention 2 | Auto-enabled when available |
| Gradient Checkpointing | Automatic for large models |
| Memory Optimization | expandable_segments prevents OOM |
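QuantLLM applies these internally; for reference, the same knobs can be set by hand in plain PyTorch/Transformers. This is a sketch under those assumptions, not QuantLLM's own code:

```python
import os

# Must be set before CUDA memory is first allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # only if flash-attn is installed
)
model.gradient_checkpointing_enable()  # trade compute for memory on large models
model = torch.compile(model)           # kernel fusion; actual speedup depends on workload
```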
🐛 Bug Fixes
- FIXED: `TypeError: object of type 'generator' has no len()` during GGUF export
- FIXED: `ValueError: model did not return a loss` with proper `DataCollatorForLanguageModeling`
- FIXED: `AttributeError` when using `SmartConfig` with `torch.dtype` objects
- FIXED: BitsAndBytes models now properly dequantize before GGUF conversion
- FIXED: ONNX export now uses Optimum for correct graph conversion
- CHANGED: WandB disabled by default (enable with `WANDB_DISABLED="false"`)
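To turn experiment tracking back on, set the environment variable before training starts, for example:

```python
import os

# Re-enable Weights & Biases logging (disabled by default in v2.0).
os.environ["WANDB_DISABLED"] = "false"
```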
📦 Installation
```bash
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With ONNX support
pip install "quantllm[onnx]"

# With MLX support (Apple Silicon)
pip install "quantllm[mlx]"

# Full installation (all features)
pip install "quantllm[full]"
```
🚀 Quick Start
```python
from quantllm import turbo

# Load with automatic 4-bit quantization
model = turbo("meta-llama/Llama-3.2-3B")

# Chat
print(model.generate("What is machine learning?"))

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Or push directly to HuggingFace
model.push("username/my-model", format="gguf", quantization="Q4_K_M")
```
📚 Documentation
🙏 Acknowledgments
Special thanks to the open-source community and all contributors who made this release possible.
v1.0.0
What's Changed
- Refactor: Improve Quantization Suite & Benchmarking by @codewithdark-git in #3
- Feat: Introduce QuantizerFactory API and Refactor Quantization Workflow by @codewithdark-git in #4
- Fix/awq quantized linear device issue by @codewithdark-git in #6
- Fix: Unify nn.Module device placement across all quantizers and base … by @codewithdark-git in #7
- Fix: Handle nn.Module in move_to_device by @codewithdark-git in #8
- Hi there! I've made some improvements to the AWQ quantization impleme… by @codewithdark-git in #9
- feat: Isolate and optimize GGUF quantization by @codewithdark-git in #10
Full Changelog: https://github.com/codewithdark-git/QuantLLM/commits/v1.0.0