A comprehensive tool for benchmarking and evaluating large language models, with a focus on instruction following capabilities.
- Multi-Provider Support: Test models from Anthropic, OpenAI, IBM WatsonX, and Ollama
- Instruction Following Evaluation: Built-in IFEval integration for standardized instruction following testing
- Concurrent Execution: Parallelize test execution for faster benchmarking
- Comprehensive Analysis: Generate detailed performance metrics and visualizations
- Flexible Configuration: Filter models, customize test cases, and control execution parameters
- Provider-Specific Optimizations: Model-specific prompt formatting for each provider
Proper prompt formatting can significantly impact model performance. Below are benchmark results showing the difference in performance for various models before and after prompt format optimization:
Our benchmark results demonstrate that:
- LLaMA Models: Show a dramatic 360% improvement with proper formatting
- Mistral Models: Benefit the most from formatters, with 500% better performance
- Granite Models: Still show significant gains with a 75% improvement
Without proper formatters, models perform significantly worse:
- LLaMA Models: 23% accuracy with formatters vs only 5% without
- Mistral Models: 20% accuracy with formatters vs 0% without
- Granite Models: 11.7% accuracy with formatters vs 6.7% without
These results underscore the importance of using the correct prompt format for each model architecture. Our benchmark uses provider-specific prompt templates tailored to each one:
- Llama Format:

  ```
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
  ```

- Granite Format:

  ```
  <|system|>\n{system}\n<|user|>\n{prompt}\n<|assistant|>\n
  ```

- Mistral Format:

  ```
  <s>[INST] {system}\n\n{prompt} [/INST]
  ```
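For illustration, the sketch below shows how the Llama template might be filled in at runtime. The helper name `format_llama_prompt` is hypothetical; in this project the formatting logic lives in `prompt_formatters.py` inside the `llm_clients` module.

```python
from typing import Optional

# Hypothetical helper (not part of the repo) showing how the Llama template
# above gets filled in; the real logic lives in llm_clients/prompt_formatters.py.
def format_llama_prompt(prompt: str, system_prompt: Optional[str] = None) -> str:
    system = system_prompt or "You are a helpful assistant."
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )

print(format_llama_prompt("List three prime numbers."))
```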
- Clone this repository

  ```bash
  git clone https://github.com/manavgup/llm-benchmark.git
  cd llm-benchmark
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Create a `.env` file in the root directory with your API keys:

  ```
  ANTHROPIC_API_KEY=your_anthropic_api_key
  OPENAI_API_KEY=your_openai_api_key
  WATSONX_API_KEY=your_watsonx_api_key
  WATSONX_PROJECT_ID=your_watsonx_project_id
  WATSONX_URL=https://us-south.ml.cloud.ibm.com
  ```
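If you want to read these keys in your own scripts, one common approach is the `python-dotenv` package; this is an assumption on our part rather than a requirement of the benchmark itself.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read API keys from the .env file in the current directory

anthropic_key = os.getenv("ANTHROPIC_API_KEY")
watsonx_url = os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com")
```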
The benchmark tool provides a full-featured command line interface:
```bash
# Run benchmark with IFEval tests on all providers
python llm_benchmark.py --use-ifeval --verbose

# Run benchmark on specific providers with concurrent execution
python llm_benchmark.py --use-ifeval --providers watsonx --max-concurrent 8

# Filter for instruction-following models
python llm_benchmark.py --use-ifeval --providers watsonx --model-filter instruct

# Use predefined model sets
python llm_benchmark.py --use-ifeval --model-set watsonx-instruct

# List all available models
python llm_benchmark.py --list-models
```
- `--use-ifeval`: Use IFEval for standardized instruction following evaluation
- `--providers`: Specify which providers to test (anthropic, openai, watsonx, ollama)
- `--models`: Specify specific models to test
- `--model-filter`: Filter models by type (instruct, chat, stable)
- `--model-set`: Use predefined sets of models (anthropic, openai, watsonx-instruct, etc.)
- `--max-concurrent`: Number of concurrent test executions (for faster benchmarking)
- `--analyze-only`: Only analyze existing results without running tests
- `--verbose`: Show detailed output during execution
Our benchmark integrates the IFEval framework for standardized instruction following evaluation, which lets us measure how well models understand and follow specific instructions using a consistent methodology.
The results demonstrate significant differences in instruction following capabilities among IBM's Granite models:
- Granite 3.2 8B Instruct: Highest accuracy at 0.61, showing strong instruction following capabilities
- Granite 3.8B Instruct: Moderate accuracy at 0.50
- Granite 34B Code Instruct: Lower accuracy at 0.34, despite being the largest model
Interestingly, these results show that model size doesn't directly correlate with instruction following ability. The smaller Granite 3.2 8B model outperforms its larger counterparts on instruction tasks.
```python
from benchmark.benchmark import LLMBenchmark
from benchmark.analyzer import BenchmarkAnalyzer
from benchmark.visualizer import BenchmarkVisualizer

# Initialize benchmark with IFEval
benchmark = LLMBenchmark(
    verbose=True,
    use_ifeval=True,
    results_dir="results/ifeval"
)

# Register specific models
benchmark.register_model(
    name="claude-3-opus",
    provider="anthropic",
    model_id="claude-3-opus-20240229",
    description="Claude 3 Opus by Anthropic"
)

# Run the benchmark with concurrent execution
results = benchmark.run_tests(max_concurrent=8)

# Analyze results
analyzer = BenchmarkAnalyzer(results_dir="results/ifeval")
summary_df, detailed_df, category_df = analyzer.analyze()

# Create visualizations
visualizer = BenchmarkVisualizer(results_dir="results/ifeval")
plot_paths = visualizer.create_all_plots(summary_df, detailed_df, category_df)
```
- benchmark module: Core benchmarking logic and framework
  - `benchmark.py`: Main benchmarking functionality
  - `analyzer.py`: Results analysis
  - `visualizer.py`: Visualization generation
  - `evaluators.py`: Response evaluation functions
  - `ifeval_integration.py`: Integration with IFEval
  - `concurrent_executor.py`: Parallel test execution
  - `cli.py`: Command line interface
- llm_clients module: Provider-specific client implementations
  - `anthropic.py`, `openai.py`, `watsonx.py`, `ollama.py`: Provider clients
  - `prompt_formatters.py`: Provider-specific prompt formatting
  - `factory.py`: Client factory for easy instantiation
Register new models using the `register_model` method:

```python
benchmark.register_model(
    name="my-new-model",
    provider="provider_name",
    model_id="specific_model_id",
    description="Description of model"
)
```
To support a new model architecture, extend the `PromptFormatter` class in `prompt_formatters.py`:
```python
from typing import Optional


class NewModelFormatter(PromptFormatter):
    """Formatter for a new model architecture."""

    @classmethod
    def format_prompt(cls, prompt: str, system_prompt: Optional[str] = None) -> str:
        """Format a prompt for the new model architecture."""
        default_system = "You are a helpful assistant."
        system = system_prompt if system_prompt else default_system

        # Create a formatted prompt using the model's expected format
        formatted_prompt = f"<<system>>{system}<<user>>{prompt}<<assistant>>"
        return formatted_prompt
```
Then update the `get_formatter_for_model` method to detect and use your new formatter.
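As a rough sketch of what that update could look like, the dispatch might key off the model identifier. This is shown as a standalone function, and the existing formatter class names used here are assumptions; check `prompt_formatters.py` for the real logic.

```python
# Hypothetical sketch; the actual get_formatter_for_model in prompt_formatters.py
# may differ, and LlamaFormatter/GraniteFormatter/MistralFormatter are assumed names.
def get_formatter_for_model(model_id: str):
    model_id = model_id.lower()
    if "newmodel" in model_id:       # match your new architecture by model name
        return NewModelFormatter
    if "llama" in model_id:
        return LlamaFormatter
    if "granite" in model_id:
        return GraniteFormatter
    if "mistral" in model_id:
        return MistralFormatter
    return PromptFormatter           # fall back to the base formatter
```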
Create custom test cases in JSON format. Example:
```json
[
  {
    "name": "Format Test",
    "instruction": "Respond with your answer in the format: '<<Answer: X>>'",
    "expected_output": "<<Answer:",
    "eval_fn": "contains"
  }
]
```
Extend the evaluator functions in `evaluators.py` for custom evaluation logic.
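For instance, evaluators along the lines of the `contains` check used above might look like the following; the exact signature the framework expects is an assumption here, so verify it against `evaluators.py` before adding your own.

```python
# Hypothetical custom evaluators; check evaluators.py for the exact signature
# the framework expects.
def contains(response: str, expected_output: str) -> bool:
    """Pass if the expected marker appears anywhere in the response."""
    return expected_output in response


def starts_line_with(response: str, expected_output: str) -> bool:
    """Stricter variant: require the marker at the start of some line."""
    return any(line.strip().startswith(expected_output) for line in response.splitlines())
```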