This document explains how to run distributed, data parallel evaluations on HPC clusters (which also sometimes don't have internet access on gpu nodes).
The distributed evaluation system allows you to run model evaluations in parallel across multiple nodes on a computing cluster, dramatically reducing the time needed to evaluate models on large benchmarks.
-
Basic workflow:
- Create a dataset of queries/prompts
- Distribute the dataset across multiple nodes
- Run model inference in parallel on each shard
- Collect and merge the results
- Calculate metrics and upload to database
-
Example command:
python eval/distributed/launch.py --model_name open-thoughts/OpenThinker-7B --tasks AIME24,AIME25,AMC23,MATH500,GPQADiamond,LiveCodeBench --num_shards 8 --max-job-duration 4 --watchdog
-
Launch script arguments:
--tasks: Comma-separated list of tasks to evaluate (e.g., "LiveCodeBench,AIME24")--model_name: HuggingFace model ID to evaluate--num_shards: Number of shards for distributed evaluation--max-job-duration: Maximum job duration in hours--watchdog: Monitor job progress and compute scores when done (recommended)
-
Cluster Auto-Detection
The system automatically detects which cluster you're running on by checking the hostname and configures the appropriate:
- HuggingFace cache location
- Conda environment
- SLURM batch script template
source /data/horse/ws/ryma833h-DCFT_Shared/miniconda3/bin/activate
conda activate evalchemy
cd /data/horse/ws/ryma833h-DCFT_Shared/evalchemysource /leonardo_work/EUHPC_E03_068/DCFT_shared/mamba/bin/activate /leonardo_work/EUHPC_E03_068/DCFT_shared/evalchemy/env/cpu-evalchemy
cd /leonardo_work/EUHPC_E03_068/DCFT_shared/evalchemy/Using --num_shards 8 is suggested for the best resource efficiency and job scheduling ease, however more shards can be used for "rush jobs" where you are willing to pay a premium (and there are enough GPUs available).
For instructions on setting up a new cluster environment, see the Adding a New Environment section below.
For detailed setup instructions for specific clusters, see:
-
Dataset Preparation:
- The launcher creates or reuses a cached dataset for the selected tasks
- It generates a unique task hash for each set of tasks
- This dataset is uploaded to HuggingFace Hub for distributed access
-
Job Submission:
- The launcher creates a customized sbatch script for the detected cluster
- It submits an array job with the specified number of shards
- Each shard processes a non-overlapping subset of the evaluation dataset
-
Model Inference:
- Each worker node:
- Loads the model (using vLLM for efficient inference)
- Processes its assigned data shard
- Applies the appropriate chat template
- Generates outputs with the specified sampling parameters
- Saves results to a local parquet file
- Each worker node:
-
Result Collection:
- When using
--watchdog, the launcher monitors job progress - Once all shards complete, results are uploaded to HuggingFace Hub
- Metrics are calculated and stored in the database
- When using
The distributed evaluation system creates precomputed answers that can be reused:
- Results are stored as parquet files in HuggingFace datasets
- These precomputed answers can be evaluated with different metrics later
- They can be inspected on the HuggingFace data studio
- Metrics are added to the repo README for easy viewing
Before deploying at scale, it's important to benchmark your distributed evaluation setup to determine the optimal number of shards for your workload. This helps balance execution time against resource efficiency.
Use the following command to benchmark with different shard counts:
# Replace N with the number of shards to test (e.g., 2, 4, 8, 16, 32, 64, 128)
SHARDS=N && python eval/distributed/launch.py --model_name [model-name] --tasks [task-list] --num_shards $SHARDS --watchdogWhen analyzing benchmark results, look for:
- Diminishing returns on time reduction: The point where adding more shards produces minimal time savings
- Resource efficiency: Total GPU hours used across all shards (lower is better)
- Initialization overhead: As shards increase, the time spent loading models and libraries becomes significant
- Utilization: At higher shard counts, vLLM might not fully utilize GPU resources due to smaller batch sizes per shard and increased overhead
To add support for a new HPC cluster:
-
Manual Testing:
- First ensure you can run the evaluation "manually" on a single node
- Test with an interactive session:
# Request an interactive node salloc --nodes=1 --ntasks-per-node=1 --gres=gpu:1 [other cluster-specific options] # Test model loading and inference python eval/distributed/process_shard.py --global_size 1 --rank 0 --input_dataset [dataset-id] --model_name [model-path] --output_dataset [output-dir]
-
Update Launch Script:
- Modify
eval/distributed/launch.pyto add hostname detection for your cluster - Add the appropriate paths for:
- HF_HUB_CACHE
- Python environment
- Activation commands
- Modify
-
Create SBATCH Script:
- Create a new file
eval/distributed/process_shards_[cluster].sbatch - Include cluster-specific configurations:
- Partition and account information
- Module loading commands
- Environment setup
- Path configurations
- Create a new file
-
Benchmark Performance:
- Run the benchmark command with different shard counts
- Analyze the trade-off between execution time and total GPU hours
- Set an optimal default in
launch.pybased on your results
-
Document Setup:
- Create a detailed setup guide (
SETUP_[CLUSTER].md) - Include all necessary setup steps
- Document any cluster-specific quirks or requirements
- Add benchmarking results and insights
- Create a detailed setup guide (

