A comprehensive tool for benchmarking and evaluating large language models, with a focus on instruction following capabilities.
- Multi-Provider Support: Test models from Anthropic, OpenAI, IBM WatsonX, and Ollama
- Instruction Following Evaluation: Built-in IFEval integration for standardized instruction following testing
- Concurrent Execution: Parallelize test execution for faster benchmarking
- Comprehensive Analysis: Generate detailed performance metrics and visualizations
- Flexible Configuration: Filter models, customize test cases, and control execution parameters
- Provider-Specific Optimizations: Model-specific prompt formatting for each provider
Proper prompt formatting can significantly impact model performance. Below are benchmark results showing the difference in performance for various models before and after prompt format optimization:
Our benchmark results demonstrate that:
- LLaMA Models: Show a dramatic 360% improvement with proper formatting
- Mistral Models: Benefit the most from formatters, with 500% better performance
- Granite Models: Still show significant gains with a 75% improvement
Without proper formatters, models perform significantly worse:
- LLaMA Models: 23% accuracy with formatters vs only 5% without
- Mistral Models: 20% accuracy with formatters vs 0% without
- Granite Models: 11.7% accuracy with formatters vs 6.7% without
These results underscore the importance of using the correct prompt format for each model architecture. Our benchmark uses provider-specific prompt templates tailored to each one:
- Llama Format:

  ```
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
  ```

- Granite Format:

  ```
  <|system|>\n{system}\n<|user|>\n{prompt}\n<|assistant|>\n
  ```

- Mistral Format:

  ```
  <s>[INST] {system}\n\n{prompt} [/INST]
  ```
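For illustration, the sketch below shows how the Llama template might be filled in at runtime. The helper name `format_llama_prompt` is hypothetical; in this project the formatting logic lives in `prompt_formatters.py` inside the `llm_clients` module.

```python
from typing import Optional

# Hypothetical helper (not part of the repo) showing how the Llama template
# above gets filled in; the real logic lives in llm_clients/prompt_formatters.py.
def format_llama_prompt(prompt: str, system_prompt: Optional[str] = None) -> str:
    system = system_prompt or "You are a helpful assistant."
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )

print(format_llama_prompt("List three prime numbers."))
```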
- Clone this repository

  ```bash
  git clone https://github.com/manavgup/llm-benchmark.git
  cd llm-benchmark
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Create a `.env` file in the root directory with your API keys:

  ```
  ANTHROPIC_API_KEY=your_anthropic_api_key
  OPENAI_API_KEY=your_openai_api_key
  WATSONX_API_KEY=your_watsonx_api_key
  WATSONX_PROJECT_ID=your_watsonx_project_id
  WATSONX_URL=https://us-south.ml.cloud.ibm.com
  ```
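If you want to read these keys in your own scripts, one common approach is the `python-dotenv` package; this is an assumption on our part rather than a requirement of the benchmark itself.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read API keys from the .env file in the current directory

anthropic_key = os.getenv("ANTHROPIC_API_KEY")
watsonx_url = os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com")
```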
The benchmark tool provides a full-featured command line interface:
```bash
# Run benchmark with IFEval tests on all providers
python llm_benchmark.py --use-ifeval --verbose

# Run benchmark on specific providers with concurrent execution
python llm_benchmark.py --use-ifeval --providers watsonx --max-concurrent 8

# Filter for instruction-following models
python llm_benchmark.py --use-ifeval --providers watsonx --model-filter instruct

# Use predefined model sets
python llm_benchmark.py --use-ifeval --model-set watsonx-instruct

# List all available models
python llm_benchmark.py --list-models
```
- `--use-ifeval`: Use IFEval for standardized instruction following evaluation
- `--providers`: Specify which providers to test (anthropic, openai, watsonx, ollama)
- `--models`: Specify specific models to test
- `--model-filter`: Filter models by type (instruct, chat, stable)
- `--model-set`: Use predefined sets of models (anthropic, openai, watsonx-instruct, etc.)
- `--max-concurrent`: Number of concurrent test executions (for faster benchmarking)
- `--analyze-only`: Only analyze existing results without running tests
- `--verbose`: Show detailed output during execution
Our benchmark integrates the IFEval framework for standardized instruction following evaluation, which lets us measure how well models understand and follow specific instructions using a consistent methodology.
The results demonstrate significant differences in instruction following capabilities among IBM's Granite models:
- Granite 3.2 8B Instruct: Highest accuracy at 0.61, showing strong instruction following capabilities
- Granite 3.8B Instruct: Moderate accuracy at 0.50
- Granite 34B Code Instruct: Lower accuracy at 0.34, despite being the largest model
Interestingly, these results show that model size doesn't directly correlate with instruction following ability. The smaller Granite 3.2 8B model outperforms its larger counterparts on instruction tasks.
```python
from benchmark.benchmark import LLMBenchmark
from benchmark.analyzer import BenchmarkAnalyzer
from benchmark.visualizer import BenchmarkVisualizer

# Initialize benchmark with IFEval
benchmark = LLMBenchmark(
    verbose=True,
    use_ifeval=True,
    results_dir="results/ifeval"
)

# Register specific models
benchmark.register_model(
    name="claude-3-opus",
    provider="anthropic",
    model_id="claude-3-opus-20240229",
    description="Claude 3 Opus by Anthropic"
)

# Run the benchmark with concurrent execution
results = benchmark.run_tests(max_concurrent=8)

# Analyze results
analyzer = BenchmarkAnalyzer(results_dir="results/ifeval")
summary_df, detailed_df, category_df = analyzer.analyze()

# Create visualizations
visualizer = BenchmarkVisualizer(results_dir="results/ifeval")
plot_paths = visualizer.create_all_plots(summary_df, detailed_df, category_df)
```
- benchmark module: Core benchmarking logic and framework
  - `benchmark.py`: Main benchmarking functionality
  - `analyzer.py`: Results analysis
  - `visualizer.py`: Visualization generation
  - `evaluators.py`: Response evaluation functions
  - `ifeval_integration.py`: Integration with IFEval
  - `concurrent_executor.py`: Parallel test execution
  - `cli.py`: Command line interface
- llm_clients module: Provider-specific client implementations
  - `anthropic.py`, `openai.py`, `watsonx.py`, `ollama.py`: Provider clients
  - `prompt_formatters.py`: Provider-specific prompt formatting
  - `factory.py`: Client factory for easy instantiation
Register new models using the `register_model` method:

```python
benchmark.register_model(
    name="my-new-model",
    provider="provider_name",
    model_id="specific_model_id",
    description="Description of model"
)
```
To support a new model architecture, extend the `PromptFormatter` class in `prompt_formatters.py`:
```python
from typing import Optional


class NewModelFormatter(PromptFormatter):
    """Formatter for a new model architecture."""

    @classmethod
    def format_prompt(cls, prompt: str, system_prompt: Optional[str] = None) -> str:
        """Format a prompt for the new model architecture."""
        default_system = "You are a helpful assistant."
        system = system_prompt if system_prompt else default_system

        # Create a formatted prompt using the model's expected format
        formatted_prompt = f"<<system>>{system}<<user>>{prompt}<<assistant>>"
        return formatted_prompt
```
Then update the `get_formatter_for_model` method to detect and use your new formatter.
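As a rough sketch of what that update could look like, the dispatch might key off the model identifier. This is shown as a standalone function, and the existing formatter class names used here are assumptions; check `prompt_formatters.py` for the real logic.

```python
# Hypothetical sketch; the actual get_formatter_for_model in prompt_formatters.py
# may differ, and LlamaFormatter/GraniteFormatter/MistralFormatter are assumed names.
def get_formatter_for_model(model_id: str):
    model_id = model_id.lower()
    if "newmodel" in model_id:       # match your new architecture by model name
        return NewModelFormatter
    if "llama" in model_id:
        return LlamaFormatter
    if "granite" in model_id:
        return GraniteFormatter
    if "mistral" in model_id:
        return MistralFormatter
    return PromptFormatter           # fall back to the base formatter
```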
Create custom test cases in JSON format. Example:
```json
[
  {
    "name": "Format Test",
    "instruction": "Respond with your answer in the format: '<<Answer: X>>'",
    "expected_output": "<<Answer:",
    "eval_fn": "contains"
  }
]
```
Extend the evaluator functions in `evaluators.py` for custom evaluation logic.
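For instance, evaluators along the lines of the `contains` check used above might look like the following; the exact signature the framework expects is an assumption here, so verify it against `evaluators.py` before adding your own.

```python
# Hypothetical custom evaluators; check evaluators.py for the exact signature
# the framework expects.
def contains(response: str, expected_output: str) -> bool:
    """Pass if the expected marker appears anywhere in the response."""
    return expected_output in response


def starts_line_with(response: str, expected_output: str) -> bool:
    """Stricter variant: require the marker at the start of some line."""
    return any(line.strip().startswith(expected_output) for line in response.splitlines())
```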