A comprehensive benchmarking framework for evaluating quantized GPT-OSS and Qwen3 models across multiple datasets and precision levels.
```bash
cd ./dockers
sudo ./build_docker.sh ./[dockerfile_name]
sudo ./launch_docker.sh [container_name]
```

```bash
cd bench
python3 ./infer.py eval_dataset=mmlu-redux output_dir=/app/outputs/qwen3_4B_bf16_mmlu_redux
python3 ./compute_metric.py -c /app/outputs/qwen3_4B_bf16_mmlu_redux
```

Supported models:

- GPT-OSS-20B: Original MXFP4 quantization format, plus NVFP4
  - https://huggingface.co/2imi9/gpt-oss-20B-NVFP4A16-BF16
- Qwen3-1.7B: Base model with comprehensive quantization support
  - Precision levels: BF16, INT8, FP8, NVFP4, MXFP4, INT4
  - Thinking mode variants available
  - https://huggingface.co/2imi9/Qwen3-1.7B-NVFP4A16
  - https://huggingface.co/2imi9/Qwen3-1.7b-gptq-int4
- Qwen3-4B: Mid-size model with strong performance
  - Precision levels: BF16, NVFP4
  - Thinking/non-thinking mode support
  - https://huggingface.co/2imi9/Qwen3-4B-NVFP4A16
- Qwen3-30B-A3B-Instruct: Large instruction-tuned model
  - Precision levels: BF16, NVFP4, MXFP4, INT8-W8A8
  - Optimized for instruction following and reasoning tasks
Evaluation datasets:

- MMLU-Redux (29 subjects): Multi-task language understanding
- Math-500: Mathematical reasoning with LaTeX parsing
- LiveCodeBench V5: Code generation (2024-10 to 2025-02)
- IFEval: Instruction following evaluation
- AIME 2025: Competition-level mathematics

RULER long-context tasks:

- NIAH: Needle-in-a-haystack variants
- CWE/FWE: Common/frequent words extraction tasks
- VT: Variable tracking
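The retrieval-style tasks above are typically scored by string matching: the model output is searched for the needle(s) planted in the context. A minimal sketch of such a scorer (the function name and scoring convention are illustrative assumptions, not this framework's actual API):

```python
def niah_score(prediction: str, needles: list[str]) -> float:
    """Hypothetical helper: fraction of planted needles recovered
    verbatim in the model output. Real implementations may normalize
    whitespace or case before matching."""
    if not needles:
        return 0.0
    found = sum(1 for needle in needles if needle in prediction)
    return found / len(needles)
```

For example, `niah_score("The magic number is 7421.", ["7421"])` returns `1.0`, while a miss returns `0.0`.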
```bash
# INT4 group-32 quantization
python3 quantization/qwen3/gptq_int4_group32.py

# W4A8 quantization
python3 quantization/qwen3/gptq_int4_w4a8.py
```

```bash
cd quantization/custom

# NVFP4 quantization
python3 to_mxfp.py --config-name=quant quant_type=nvfp4

# Dequantize to BF16
python3 to_bf16.py --config-name=dequant quant_type=nvfp4
```

Model configuration files:

- `qwen3_4B_bf16.yaml`: Base Qwen3-4B configuration
- `qwen3_30B_A3B_Instruct_bf16.yaml`: Instruction model
- `gptoss_20B.yaml`: GPT-OSS configuration
Each dataset supports:
- Temperature, top_p, top_k sampling
- Sequence length (4K-32K)
- Thinking mode toggle
- Multi-sequence generation
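To make the sampling knobs concrete, here is a standalone sketch of how temperature, top-k, and top-p jointly filter a next-token distribution. This is an illustration of the general technique, not this framework's code (real backends operate on tensors):

```python
import math

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature scaling, keep the top-k tokens (0 = disabled),
    then keep the smallest nucleus whose cumulative probability reaches
    top_p. Returns renormalized probabilities keyed by token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda kv: kv[1],
        reverse=True,
    )
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    return {i: p / z for i, p in kept}
```

With logits `[2.0, 1.0, 0.0]`, `top_k=2` keeps tokens 0 and 1; tightening to `top_p=0.5` leaves only the most likely token.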
```bash
python3 ./infer.py \
    eval_dataset=mmlu-redux \
    output_dir=/app/outputs/qwen3_4B_bf16_mmlu_redux \
    eval_predictor=vllm
```

```bash
# Multiple datasets
bash bench/scripts/custom.sh

# Long context evaluation
bash bench/scripts/ruler.sh
```

```bash
# Compare quantization levels
for precision in bf16 int4 int8; do
    python3 ./infer.py \
        model_config=qwen3_4B_${precision}.yaml \
        eval_dataset=mmlu-redux \
        output_dir=/app/outputs/qwen3_4B_${precision}_mmlu_redux
done
```

| Model | Precision | MMLU-Redux | Reference |
|---|---|---|---|
| Qwen3-4B | BF16 | 72.2 | 77.3 |
- MMLU-Redux: 70%+ accuracy across quantization levels
- Math-500: 90%+ with reasoning tokens
- LiveCode V5: 60%+ Pass@1
- RULER-32K: 80%+ retrieval accuracy
Inference backend features:

- Pipeline/tensor parallelism for multi-GPU
- Dynamic batching with memory optimization
- CPU offloading for large models

Quantization formats:

- NVFP4: 4-bit with E4M3 scales
- MXFP: Microscaling formats (4-bit/8-bit)
- Block-wise quantization with Triton kernels
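As a rough illustration of the block-wise idea, here is a simplified symmetric INT4 scheme with one absmax-derived scale per block. The repo's actual NVFP4/MXFP kernels use FP4 element encodings with E4M3 or shared-exponent scales, which this sketch deliberately does not reproduce:

```python
def quantize_blockwise(values, block_size=32):
    """Simplified symmetric 4-bit block-wise quantization: each block of
    `block_size` values shares one scale derived from the block absmax;
    integer codes lie in [-7, 7]."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 7 or 1.0  # avoid zero scale
        codes = [max(-7, min(7, round(v / scale))) for v in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blockwise(blocks):
    """Reconstruct approximate values from (scale, codes) blocks."""
    return [code * scale for scale, codes in blocks for code in codes]
```

Smaller blocks track local dynamic range more tightly at the cost of storing more scales, which is the core trade-off behind group-32 and microscaling formats.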
Evaluation metrics:

- Pass@k: Code generation success rates
- Math-verify: LaTeX answer verification
- String matching: Long context retrieval
- Instruction following: Strict/loose compliance
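Pass@k is conventionally computed with the unbiased estimator popularized by the HumanEval benchmark: given n generations per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). Assuming this framework follows that convention, a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c of n pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with n=4 generations and c=1 passing, pass@1 = 1 − C(3,1)/C(4,1) = 0.25, which matches the naive success rate; the estimator matters once k > 1.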
```
├── bench/          # Core framework
│   ├── conf/           # Model configurations
│   ├── dataset/        # Dataset implementations
│   ├── predictor/      # Inference backends
│   └── scripts/        # Batch evaluation
├── quantization/   # Quantization toolkit
│   ├── qwen3/          # GPTQ scripts
│   └── custom/         # NVFP4/MXFP methods
├── dockers/        # Container configs
└── outputs/        # Results storage
```
To add a new model:

- Create config in `bench/conf/model_name.yaml`
- Add chat template in `bench/chat_template/model_name.py`
- Update predictor for custom quantization

To add a new dataset:

- Implement `BaseDataset` in `bench/dataset/dataset_name.py`
- Register in `bench/dataset/__init__.py`
- Add evaluation metrics
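For orientation, here is a toy sketch of what such a dataset class might look like. The method names and the stand-in base class are assumptions made so the example is self-contained; consult an existing dataset under `bench/dataset/` for the real `BaseDataset` interface:

```python
class BaseDataset:
    """Stand-in for the framework's base class (interface assumed)."""
    def load(self):
        raise NotImplementedError

    def evaluate(self, prediction, reference):
        raise NotImplementedError

class ToyExactMatchDataset(BaseDataset):
    """Hypothetical dataset illustrating the two hooks a new dataset
    would plausibly implement: loading examples and scoring outputs."""
    def load(self):
        # A real dataset would read files or a Hugging Face dataset here.
        return [{"prompt": "2 + 2 = ?", "answer": "4"}]

    def evaluate(self, prediction, reference):
        # Exact-match scoring; real metrics are task-specific.
        return float(prediction.strip() == reference["answer"])
```

Registration in `bench/dataset/__init__.py` would then make the class discoverable by the `eval_dataset=` override.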
```bash
# Run with debug mode (uses a smaller dataset)
python3 ./infer.py eval_dataset=mmlu-redux debug=true

# Test custom quantization
cd quantization/custom
pytest tests/test_nvfp4.py -v

# Test with a different container
sudo ./launch_docker.sh test_container bench-gpu:latest
```

Minimum hardware:

- GPU: RTX 3080 (10GB VRAM)
- RAM: 32GB system memory
- Storage: 100GB for models/results
Recommended hardware:

- GPU: RTX 5090 (32GB VRAM) or 4x RTX 3080
- RAM: 128GB system memory
- Storage: 500GB NVMe SSD
- V100: CUDA 7.0 compatibility mode
- Multi-GPU: Pipeline parallelism for 30B+ models
```bash
# Reduce memory usage
predictor_conf.vllm.gpu_memory_utilization=0.8
predictor_conf.vllm.cpu_offload_gb=20

# Enable tensor parallelism
predictor_conf.vllm.tensor_parallel_size=4

# Dynamic batching
predictor_conf.vllm.max_num_batched_tokens=16384
predictor_conf.vllm.max_num_seqs=8

# Disable features for speed
predictor_conf.vllm.enable_prefix_caching=false
predictor_conf.vllm.enforce_eager=true
```

Contributing guidelines:

- Follow existing code patterns and naming
- Add tests for new quantization methods
- Validate against reference implementations
- Update documentation and benchmarks
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.