
Releases: codewithdark-git/QuantLLM

QuantLLM v2.0

21 Dec 15:04
26230d0


🚀 QuantLLM v2.0

The Ultra-Fast LLM Quantization & Export Library


Load → Quantize → Finetune → Export — All in One Line


🎉 What's New in v2.0

We're excited to announce QuantLLM v2.0 — a complete redesign focused on simplicity, performance, and developer experience. This release transforms LLM quantization from a complex multi-step process into a single, intuitive workflow.


✨ Key Features

🔥 TurboModel: One API to Rule Them All

Gone are the days of juggling multiple libraries. TurboModel unifies everything:

from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text instantly
response = model.generate("Explain quantum computing in simple terms")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")    # → llama.cpp, Ollama, LM Studio
model.export("onnx")                            # → ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")        # → Apple Silicon

📦 Multi-Format Export

Export your models to any deployment target:

| Format | Use Case | Platforms |
| --- | --- | --- |
| GGUF | llama.cpp, Ollama, LM Studio | Windows, Linux, macOS |
| ONNX | ONNX Runtime, TensorRT | Cross-platform |
| MLX | Apple Silicon optimized | macOS (M1/M2/M3/M4) |
| SafeTensors | HuggingFace Transformers | Cross-platform |

🎯 Native GGUF Export — No C++ Required!

Forget compiling llama.cpp or wrestling with C++ toolchains:

# Just works™ — on any platform
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
  • ✅ Pure Python implementation
  • ✅ All quantization types: Q2_K through Q8_0, F16, F32
  • ✅ Windows, Linux, macOS — zero configuration
  • ✅ Automatic llama.cpp installation when needed
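
Because the export call shown above takes the quantization type as a string, it can be looped over several presets to produce a family of GGUF files. The sketch below is illustrative only: the list of type names mirrors common llama.cpp presets and the output paths are made up.

from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Produce several GGUF variants in one go (file names are illustrative)
for qtype in ["Q2_K", "Q4_K_M", "Q8_0", "F16"]:
    model.export("gguf", f"model.{qtype}.gguf", quantization=qtype)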

🤗 One-Click Hub Publishing

Push directly to HuggingFace with auto-generated model cards:

model.push(
    "your-username/my-awesome-model",
    format="gguf",
    quantization="Q4_K_M"
)

The auto-generated model card includes:

  • 📋 Proper YAML frontmatter (library_name, tags, base_model)
  • 📖 Format-specific usage examples
  • 🔘 "Use this model" button compatibility
  • 📊 Quantization details and benchmarks

🎨 Beautiful Developer Experience

Themed Progress & Logging

A cohesive orange theme across all interactions:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║   🚀 QuantLLM v2.0.0                                       ║
║   Ultra-fast LLM Quantization & Export                     ║
║                                                            ║
║   ✓ GGUF  ✓ ONNX  ✓ MLX  ✓ SafeTensors                     ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

SmartConfig Auto-Detection

See exactly what's happening before loading:

┌─────────────────────────────────────────┐
│  📊 Model Analysis                      │
├─────────────────────────────────────────┤
│  Parameters    7.24B                    │
│  Original      14.5 GB                  │
│  Quantized     4.2 GB (71% saved)       │
│  GPU Memory    Available: 24 GB ✓       │
└─────────────────────────────────────────┘

Clean Console Output

  • 🔇 Suppressed HuggingFace/Datasets noise
  • 📊 Rich progress bars with ETA
  • ✅ Clear success/error indicators
  • 🎯 Actionable error messages

⚡ Performance Optimizations

| Feature | Improvement |
| --- | --- |
| torch.compile | Up to 2x faster training |
| Dynamic Padding | 30-50% less VRAM usage |
| Flash Attention 2 | Auto-enabled when available |
| Gradient Checkpointing | Automatic for large models |
| Memory Optimization | expandable_segments prevents OOM |
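
QuantLLM applies these automatically. For readers curious what each row corresponds to in plain PyTorch and Transformers, the following sketch shows the standard calls involved; it is illustrative, not QuantLLM's internal code:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling

# Memory optimization: expandable segments reduce fragmentation-related OOMs
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Flash Attention 2 (used only when the package is installed and the GPU supports it)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Gradient checkpointing trades extra compute for lower memory on large models
model.gradient_checkpointing_enable()

# torch.compile speeds up training and inference steps
model = torch.compile(model)

# Dynamic padding: each batch is padded only to its longest sequence
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)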

🐛 Bug Fixes

  • FIXED: TypeError: object of type 'generator' has no len() during GGUF export
  • FIXED: ValueError: model did not return a loss with proper DataCollatorForLanguageModeling
  • FIXED: AttributeError when using SmartConfig with torch.dtype objects
  • FIXED: BitsAndBytes models now properly dequantize before GGUF conversion
  • FIXED: ONNX export now uses Optimum for correct graph conversion
  • CHANGED: WandB disabled by default (enable with WANDB_DISABLED="false")
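
Since WandB logging is now off by default, re-enable it by setting the environment variable from the changelog entry above before training starts:

import os

# Re-enable Weights & Biases logging (disabled by default in v2.0)
os.environ["WANDB_DISABLED"] = "false"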

📦 Installation

# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With ONNX support
pip install "quantllm[onnx]"

# With MLX support (Apple Silicon)
pip install "quantllm[mlx]"

# Full installation (all features)
pip install "quantllm[full]"

🚀 Quick Start

from quantllm import turbo

# Load with automatic 4-bit quantization
model = turbo("meta-llama/Llama-3.2-3B")

# Chat
print(model.generate("What is machine learning?"))

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Or push directly to HuggingFace
model.push("username/my-model", format="gguf", quantization="Q4_K_M")
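
Building on the Quick Start, here is a hedged fine-tuning sketch. The dataset name and the column format that finetune() expects are assumptions; only the turbo(), finetune(), and export() calls come from the examples above.

from datasets import load_dataset
from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Any HuggingFace dataset can be loaded this way; the dataset and the
# expected columns are assumptions -- check the QuantLLM docs.
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export the tuned model for local inference
model.export("gguf", "finetuned.Q4_K_M.gguf", quantization="Q4_K_M")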

📚 Documentation


🙏 Acknowledgments

Special thanks to the open-source community and all contributors who made this release possible.


Made with 🧡 by Dark Coder

⭐ Star on GitHub ·
🐛 Report Bug ·
💖 Sponsor


Happy Quantizing! 🚀

v1.0.0

08 Dec 22:48
49e8748


What's Changed

Full Changelog: https://github.com/codewithdark-git/QuantLLM/commits/v1.0.0