GPT-OSS-120B SGLang Example

This example demonstrates how to benchmark the GPT-OSS-120B model using SGLang as the inference backend with three evaluation datasets:

GPQA (Graduate-Level Google-Proof Q&A): Diamond subset for testing reasoning capabilities
AIME 2025: Mathematical problem-solving benchmark
LiveCodeBench: Coding evaluation benchmark

Prerequisites
Setup Instructions
- 1. Setting Up the SGLang Endpoint
- 2. Setting Up LiveCodeBench
Running the Benchmark
Evaluation Scripts
Configuration
Troubleshooting

Prerequisites

Python 3.12+
CUDA-capable GPU(s) with sufficient VRAM (recommended: 8x H200 or B200 GPUs)
SGLang installed (see setup instructions below)
Git
pip

Setup Instructions

1. Setting Up the SGLang Endpoint

You have two options for setting up the GPT-OSS-120B model with SGLang:

Option A: Using MLCommons MLPerf Inference Reference Implementation

The official MLPerf Inference reference implementation for GPT-OSS-120B provides detailed instructions for model setup, data preparation, and server deployment.

Clone the MLCommons Inference Repository:

git clone https://github.com/mlcommons/inference.git
cd inference/language/gpt-oss-120b

Follow the Setup Instructions:
- Review the README at https://github.com/mlcommons/inference/tree/master/language/gpt-oss-120b
- Download the model weights
- Set up the required dependencies
- Configure the environment
Launch the SGLang Server: Follow the instructions in the MLPerf reference implementation to start the SGLang server. The typical command looks like:
```
./sglang/run_server.sh \
     --model_path /path/to/gpt-oss-120b/model/ \
     --dp <Number of GPUs> \
     --stream_interval 100
```

Option B: Direct SGLang Installation

If you already have the model weights or prefer a direct approach, follow the instructions from SGLang on how to set up and deploy GPT-OSS. Make sure it is set to port 30000.

2. Setting Up LiveCodeBench

LiveCodeBench has a few security concerns and dependency conflicts, so it is recommended to run LiveCodeBench via the containerized workflow.

Follow the instructions in the LiveCodeBench README

Non-containerized run (NOT RECOMMENDED)

If you prefer to run lcb-service standalone without the docker container, do the following:

# Enter your venv for inference-endpoint
source /path/to/inference-endpoint/venv/bin/activate

# Downgrade Huggingface Datasets to 3.6.0
pip install datasets==3.6.0

# Install other dependencies
pip install fastapi==0.128.0 uvicorn[standard]==0.40.0

# Enable cli-based calling of LCBServe
export ALLOW_LCB_LOCAL_EVAL=true

After these steps, the LiveCodeBenchScorer will fallback to running lcb_serve as a subprocess on the host.

Running the Benchmark

Main Benchmark Suite

The run.py script runs all three benchmarks (GPQA, AIME25, and LiveCodeBench) in sequence:

python run.py \
    --report-dir ./results \
    --num-repeats 1 \
    --min-duration 10 \
    --max-duration 600

Arguments:

--report-dir: Directory to save benchmark results (default: sglang_accuracy_report)
--num-repeats: Number of times to repeat each dataset (default: 1)
--min-duration: Minimum benchmark duration in seconds (default: 10)
--max-duration: Maximum benchmark duration in seconds (default: 600)
--force-regenerate: Force regeneration of datasets even if they exist

Expected Output

The benchmark will display:

Progress bars for each dataset evaluation
Pass@1 scores for each benchmark:
- GPQA Diamond accuracy
- AIME25 problem-solving accuracy
- LiveCodeBench coding accuracy

Results are saved to the specified report directory with detailed event logs and metrics.

Evaluation Scripts

Individual evaluation scripts are provided for running each benchmark separately. These can be run after run.py is complete and the report directory has been generated.

GPQA Evaluation

python eval_gpqa.py \
    --dataset-path datasets/gpqa/diamond/gpqa_diamond.parquet \
    --report-dir sglang_accuracy_report

AIME25 Evaluation

python eval_aime.py \
    --dataset-path datasets/aime25/aime25.parquet \
    --report-dir sglang_accuracy_report

LiveCodeBench Evaluation

python eval_livecodebench.py \
    --dataset-path datasets/livecodebench/release_v6/livecodebench_release_v6.parquet \
    --report-dir sglang_accuracy_report \
    --lcb-version release_v6 \
    --timeout 60

Additional Arguments:

--lcb-version: LiveCodeBench version tag (default: release_v6)
--timeout: Timeout in seconds for each test execution (default: 60)

Configuration

Endpoint Configuration

The default endpoint configuration is in run.py:

SGLANG_SERVER_HOST = "localhost"
SGLANG_SERVER_PORT = 30000
SGLANG_ENDPOINT = f"http://{SGLANG_SERVER_HOST}:{SGLANG_SERVER_PORT}/generate"

To use a different endpoint, modify these constants or edit the script.

Worker Configuration

The HTTP client uses 4 workers by default:

http_config = HTTPClientConfig(
    endpoint_urls=[SGLANG_ENDPOINT],
    num_workers=4,
    api_type="sglang",
)

Adjust num_workers based on your workload and server capacity.

Troubleshooting

Server Connection Issues

Problem: Cannot connect to SGLang server

Solutions:

Verify the server is running: curl http://localhost:30000/health
Check firewall settings if using a remote server
Ensure the port number matches in both server and client configurations

Out of Memory Errors

Problem: CUDA out of memory errors

Solutions:

Increase the tensor parallelism size (--tp-size)
Reduce batch size in the generation config
Use GPUs with more VRAM
Check for memory leaks with nvidia-smi

LiveCodeBench Dependency Conflicts

Problem: Dependency version conflicts with datasets package

Solutions:

Use a separate virtual environment for LiveCodeBench
Ensure datasets==3.6.0 is installed (required by LiveCodeBench)
Check that the LiveCodeBench installation completed successfully

Slow Inference Speed

Problem: Benchmark takes too long to complete

Solutions:

Verify GPU utilization with nvidia-smi
Check server logs for bottlenecks
Increase num_workers in the HTTP client configuration
Consider using FlashInfer or other optimizations in SGLang

Dataset Generation Errors

Problem: Errors during dataset loading

Solutions:

Use --force-regenerate to regenerate datasets from scratch
Check internet connection for downloading Hugging Face datasets
Verify sufficient disk space for dataset caching
Check Hugging Face credentials if datasets are gated

Additional Resources

SGLang Documentation: https://sgl-project.github.io/
MLPerf Inference GPT-OSS-120B: https://github.com/mlcommons/inference/tree/master/language/gpt-oss-120b
LiveCodeBench: https://github.com/LiveCodeBench/LiveCodeBench
GPQA Dataset: https://huggingface.co/datasets/Idavidrein/gpqa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPT-OSS-120B SGLang Example

Table of Contents

Prerequisites

Setup Instructions

1. Setting Up the SGLang Endpoint

Option A: Using MLCommons MLPerf Inference Reference Implementation

Option B: Direct SGLang Installation

2. Setting Up LiveCodeBench

Non-containerized run (NOT RECOMMENDED)

Running the Benchmark

Main Benchmark Suite

Expected Output

Evaluation Scripts

GPQA Evaluation

AIME25 Evaluation

LiveCodeBench Evaluation

Configuration

Endpoint Configuration

Worker Configuration

Troubleshooting

Server Connection Issues

Out of Memory Errors

LiveCodeBench Dependency Conflicts

Slow Inference Speed

Dataset Generation Errors

Additional Resources

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GPT-OSS-120B SGLang Example

Table of Contents

Prerequisites

Setup Instructions

1. Setting Up the SGLang Endpoint

Option A: Using MLCommons MLPerf Inference Reference Implementation

Option B: Direct SGLang Installation

2. Setting Up LiveCodeBench

Non-containerized run (NOT RECOMMENDED)

Running the Benchmark

Main Benchmark Suite

Expected Output

Evaluation Scripts

GPQA Evaluation

AIME25 Evaluation

LiveCodeBench Evaluation

Configuration

Endpoint Configuration

Worker Configuration

Troubleshooting

Server Connection Issues

Out of Memory Errors

LiveCodeBench Dependency Conflicts

Slow Inference Speed

Dataset Generation Errors

Additional Resources