A comprehensive benchmarking framework for evaluating quantized GPT-OSS and Qwen3 models across multiple datasets and precision levels.
```bash
cd ./dockers
sudo ./build_docker.sh ./[dockerfile_name]
sudo ./launch_docker.sh [container_name]
```

```bash
cd bench
python3 ./infer.py eval_dataset=mmlu-redux output_dir=/app/outputs/qwen3_4B_bf16_mmlu_redux
python3 ./compute_metric.py -c /app/outputs/qwen3_4B_bf16_mmlu_redux
```

Supported models:

- GPT-OSS-20B: Original MXFP4 quantization format, plus NVFP4
  - https://huggingface.co/2imi9/gpt-oss-20B-NVFP4A16-BF16
- Qwen3-1.7B: Base model with comprehensive quantization support
  - Precision levels: BF16, INT8, FP8, NVFP4, MXFP4, INT4
  - Thinking mode variants available
  - https://huggingface.co/2imi9/Qwen3-1.7B-NVFP4A16
  - https://huggingface.co/2imi9/Qwen3-1.7b-gptq-int4
- Qwen3-4B: Mid-size model with strong performance
  - Precision levels: BF16, NVFP4
  - Thinking/non-thinking mode support
  - https://huggingface.co/2imi9/Qwen3-4B-NVFP4A16
- Qwen3-30B-A3B-Instruct: Large instruction-tuned model
  - Precision levels: BF16, NVFP4, MXFP4, INT8-W8A8
  - Optimized for instruction following and reasoning tasks
Evaluation datasets:

- MMLU-Redux (29 subjects): Multi-task language understanding
- Math-500: Mathematical reasoning with LaTeX parsing
- LiveCodeBench V5: Code generation (2024-10 to 2025-02)
- IFEval: Instruction following evaluation
- AIME 2025: Competition-level mathematics

RULER long-context tasks:

- NIAH: Needle-in-a-haystack variants
- CWE/FWE: Common/frequent words extraction tasks
- VT: Variable tracking
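The retrieval-style tasks above are typically scored by string matching: the model output is searched for the needle(s) planted in the context. A minimal sketch of such a scorer (the function name and scoring convention are illustrative assumptions, not this framework's actual API):

```python
def niah_score(prediction: str, needles: list[str]) -> float:
    """Hypothetical helper: fraction of planted needles recovered
    verbatim in the model output. Real implementations may normalize
    whitespace or case before matching."""
    if not needles:
        return 0.0
    found = sum(1 for needle in needles if needle in prediction)
    return found / len(needles)
```

For example, `niah_score("The magic number is 7421.", ["7421"])` returns `1.0`, while a miss returns `0.0`.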
```bash
# INT4 group-32 quantization
python3 quantization/qwen3/gptq_int4_group32.py

# W4A8 quantization
python3 quantization/qwen3/gptq_int4_w4a8.py
```

```bash
cd quantization/custom

# NVFP4 quantization
python3 to_mxfp.py --config-name=quant quant_type=nvfp4

# Dequantize to BF16
python3 to_bf16.py --config-name=dequant quant_type=nvfp4
```

Model configuration files:

- `qwen3_4B_bf16.yaml`: Base Qwen3-4B configuration
- `qwen3_30B_A3B_Instruct_bf16.yaml`: Instruction model
- `gptoss_20B.yaml`: GPT-OSS configuration
Each dataset supports:
- Temperature, top_p, top_k sampling
- Sequence length (4K-32K)
- Thinking mode toggle
- Multi-sequence generation
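To make the sampling knobs concrete, here is a standalone sketch of how temperature, top-k, and top-p jointly filter a next-token distribution. This is an illustration of the general technique, not this framework's code (real backends operate on tensors):

```python
import math

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature scaling, keep the top-k tokens (0 = disabled),
    then keep the smallest nucleus whose cumulative probability reaches
    top_p. Returns renormalized probabilities keyed by token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda kv: kv[1],
        reverse=True,
    )
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    return {i: p / z for i, p in kept}
```

With logits `[2.0, 1.0, 0.0]`, `top_k=2` keeps tokens 0 and 1; tightening to `top_p=0.5` leaves only the most likely token.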
```bash
python3 ./infer.py \
    eval_dataset=mmlu-redux \
    output_dir=/app/outputs/qwen3_4B_bf16_mmlu_redux \
    eval_predictor=vllm
```

```bash
# Multiple datasets
bash bench/scripts/custom.sh

# Long context evaluation
bash bench/scripts/ruler.sh
```

```bash
# Compare quantization levels
for precision in bf16 int4 int8; do
    python3 ./infer.py \
        model_config=qwen3_4B_${precision}.yaml \
        eval_dataset=mmlu-redux \
        output_dir=/app/outputs/qwen3_4B_${precision}_mmlu_redux
done
```

| Model | Precision | MMLU-Redux | Reference |
|---|---|---|---|
| Qwen3-4B | BF16 | 72.2 | 77.3 |
- MMLU-Redux: 70%+ accuracy across quantization levels
- Math-500: 90%+ with reasoning tokens
- LiveCode V5: 60%+ Pass@1
- RULER-32K: 80%+ retrieval accuracy
Inference backend features:

- Pipeline/tensor parallelism for multi-GPU
- Dynamic batching with memory optimization
- CPU offloading for large models

Quantization formats:

- NVFP4: 4-bit with E4M3 scales
- MXFP: Microscaling formats (4-bit/8-bit)
- Block-wise quantization with Triton kernels
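As a rough illustration of the block-wise idea, here is a simplified symmetric INT4 scheme with one absmax-derived scale per block. The repo's actual NVFP4/MXFP kernels use FP4 element encodings with E4M3 or shared-exponent scales, which this sketch deliberately does not reproduce:

```python
def quantize_blockwise(values, block_size=32):
    """Simplified symmetric 4-bit block-wise quantization: each block of
    `block_size` values shares one scale derived from the block absmax;
    integer codes lie in [-7, 7]."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 7 or 1.0  # avoid zero scale
        codes = [max(-7, min(7, round(v / scale))) for v in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blockwise(blocks):
    """Reconstruct approximate values from (scale, codes) blocks."""
    return [code * scale for scale, codes in blocks for code in codes]
```

Smaller blocks track local dynamic range more tightly at the cost of storing more scales, which is the core trade-off behind group-32 and microscaling formats.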
Evaluation metrics:

- Pass@k: Code generation success rates
- Math-verify: LaTeX answer verification
- String matching: Long context retrieval
- Instruction following: Strict/loose compliance
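Pass@k is conventionally computed with the unbiased estimator popularized by the HumanEval benchmark: given n generations per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). Assuming this framework follows that convention, a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c of n pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with n=4 generations and c=1 passing, pass@1 = 1 − C(3,1)/C(4,1) = 0.25, which matches the naive success rate; the estimator matters once k > 1.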
```
├── bench/          # Core framework
│   ├── conf/           # Model configurations
│   ├── dataset/        # Dataset implementations
│   ├── predictor/      # Inference backends
│   └── scripts/        # Batch evaluation
├── quantization/   # Quantization toolkit
│   ├── qwen3/          # GPTQ scripts
│   └── custom/         # NVFP4/MXFP methods
├── dockers/        # Container configs
└── outputs/        # Results storage
```
To add a new model:

- Create config in `bench/conf/model_name.yaml`
- Add chat template in `bench/chat_template/model_name.py`
- Update predictor for custom quantization

To add a new dataset:

- Implement `BaseDataset` in `bench/dataset/dataset_name.py`
- Register in `bench/dataset/__init__.py`
- Add evaluation metrics
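For orientation, here is a toy sketch of what such a dataset class might look like. The method names and the stand-in base class are assumptions made so the example is self-contained; consult an existing dataset under `bench/dataset/` for the real `BaseDataset` interface:

```python
class BaseDataset:
    """Stand-in for the framework's base class (interface assumed)."""
    def load(self):
        raise NotImplementedError

    def evaluate(self, prediction, reference):
        raise NotImplementedError

class ToyExactMatchDataset(BaseDataset):
    """Hypothetical dataset illustrating the two hooks a new dataset
    would plausibly implement: loading examples and scoring outputs."""
    def load(self):
        # A real dataset would read files or a Hugging Face dataset here.
        return [{"prompt": "2 + 2 = ?", "answer": "4"}]

    def evaluate(self, prediction, reference):
        # Exact-match scoring; real metrics are task-specific.
        return float(prediction.strip() == reference["answer"])
```

Registration in `bench/dataset/__init__.py` would then make the class discoverable by the `eval_dataset=` override.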
```bash
# Run with debug mode (uses a smaller dataset)
python3 ./infer.py eval_dataset=mmlu-redux debug=true

# Test custom quantization
cd quantization/custom
pytest tests/test_nvfp4.py -v

# Test with a different container
sudo ./launch_docker.sh test_container bench-gpu:latest
```

Minimum hardware:

- GPU: RTX 3080 (10GB VRAM)
- RAM: 32GB system memory
- Storage: 100GB for models/results
Recommended hardware:

- GPU: RTX 5090 (32GB VRAM) or 4x RTX 3080
- RAM: 128GB system memory
- Storage: 500GB NVMe SSD
- V100: CUDA 7.0 compatibility mode
- Multi-GPU: Pipeline parallelism for 30B+ models
```bash
# Reduce memory usage
predictor_conf.vllm.gpu_memory_utilization=0.8
predictor_conf.vllm.cpu_offload_gb=20

# Enable tensor parallelism
predictor_conf.vllm.tensor_parallel_size=4

# Dynamic batching
predictor_conf.vllm.max_num_batched_tokens=16384
predictor_conf.vllm.max_num_seqs=8

# Disable features for speed
predictor_conf.vllm.enable_prefix_caching=false
predictor_conf.vllm.enforce_eager=true
```

Contributing guidelines:

- Follow existing code patterns and naming
- Add tests for new quantization methods
- Validate against reference implementations
- Update documentation and benchmarks
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.