LLM Benchmark

A comprehensive tool for benchmarking and evaluating large language models, with a focus on instruction following capabilities.

Features

  • Multi-Provider Support: Test models from Anthropic, OpenAI, IBM WatsonX, and Ollama
  • Instruction Following Evaluation: Built-in IFEval integration for standardized instruction following testing
  • Concurrent Execution: Parallelize test execution for faster benchmarking
  • Comprehensive Analysis: Generate detailed performance metrics and visualizations
  • Flexible Configuration: Filter models, customize test cases, and control execution parameters
  • Provider-Specific Optimizations: Model-specific prompt formatting for each provider

Performance Impact of Prompt Formatting

Proper prompt formatting can significantly affect model performance. Below are benchmark results for several models before and after prompt format optimization:

Performance Improvement by Model Family

(Chart: benchmark results before prompt format optimization)

Our benchmark results demonstrate that:

  • LLaMA Models: Show a dramatic 360% improvement with proper formatting
  • Mistral Models: Benefit the most from formatters, with 500% better performance
  • Granite Models: Still show significant gains with a 75% improvement

Accuracy Comparison

Without proper formatters, models perform significantly worse:

  • LLaMA Models: 23% accuracy with formatters vs only 5% without
  • Mistral Models: 20% accuracy with formatters vs 0% without
  • Granite Models: 11.7% accuracy with formatters vs 6.7% without

These results demonstrate the critical importance of using the correct prompt format for each model architecture. Our benchmark therefore uses provider-specific prompt templates tailored to each architecture:

  • Llama Format: <|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
  • Granite Format: <|system|>\n{system}\n<|user|>\n{prompt}\n<|assistant|>\n
  • Mistral Format: <s>[INST] {system}\n\n{prompt} [/INST]
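
For illustration, here is a minimal sketch of how one of these templates is applied. The helper name below is hypothetical and not the project's exact code; the real logic lives in prompt_formatters.py:

from typing import Optional

# Llama instruct template, matching the format string shown above.
LLAMA_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>"
)

def format_llama_prompt(prompt: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a raw user prompt in the Llama instruct template."""
    system = system_prompt or "You are a helpful assistant."
    return LLAMA_TEMPLATE.format(system=system, prompt=prompt)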

Installation

  1. Clone this repository:
git clone https://github.com/manavgup/llm-benchmark.git
cd llm-benchmark
  2. Install dependencies:
pip install -r requirements.txt
  3. Create a .env file in the root directory with your API keys:
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key
WATSONX_API_KEY=your_watsonx_api_key
WATSONX_PROJECT_ID=your_watsonx_project_id
WATSONX_URL=https://us-south.ml.cloud.ibm.com
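
If you want to load the .env file yourself (for example, in a notebook or a custom script), here is a minimal sketch using python-dotenv; this is optional and independent of how the benchmark itself loads configuration:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
watsonx_url = os.getenv("WATSONX_URL")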

Usage

Command Line Interface

The benchmark tool provides a full-featured command line interface:

# Run benchmark with IFEval tests on all providers
python llm_benchmark.py --use-ifeval --verbose

# Run benchmark on specific providers with concurrent execution
python llm_benchmark.py --use-ifeval --providers watsonx --max-concurrent 8

# Filter for instruction-following models
python llm_benchmark.py --use-ifeval --providers watsonx --model-filter instruct

# Use predefined model sets
python llm_benchmark.py --use-ifeval --model-set watsonx-instruct

# List all available models
python llm_benchmark.py --list-models

Key Command Line Options

  • --use-ifeval: Use IFEval for standardized instruction following evaluation
  • --providers: Specify which providers to test (anthropic, openai, watsonx, ollama)
  • --models: Specify specific models to test
  • --model-filter: Filter models by type (instruct, chat, stable)
  • --model-set: Use predefined sets of models (anthropic, openai, watsonx-instruct, etc.)
  • --max-concurrent: Number of concurrent test executions (for faster benchmarking)
  • --analyze-only: Only analyze existing results without running tests
  • --verbose: Show detailed output during execution

IFEval Integration Results

Our benchmark integrates the IFEval framework for standardized instruction following evaluation. This allows us to measure how well models understand and follow specific instructions using a consistent methodology.

IBM Granite Model Performance on IFEval

(Chart: IBM Granite model accuracy on IFEval)

The results demonstrate significant differences in instruction following capabilities among IBM's Granite models:

  • Granite 3.2 8B Instruct: Highest accuracy at 0.61, showing strong instruction following capabilities
  • Granite 3.8B Instruct: Moderate accuracy at 0.50
  • Granite 34B Code Instruct: Lower accuracy at 0.34, despite being the largest model

Interestingly, these results show that model size doesn't directly correlate with instruction following ability: the 8B Granite 3.2 model outperforms the much larger 34B Code Instruct model on instruction tasks.

Programmatic Usage

from benchmark.benchmark import LLMBenchmark
from benchmark.analyzer import BenchmarkAnalyzer
from benchmark.visualizer import BenchmarkVisualizer

# Initialize benchmark with IFEval
benchmark = LLMBenchmark(
    verbose=True,
    use_ifeval=True,
    results_dir="results/ifeval"
)

# Register specific models
benchmark.register_model(
    name="claude-3-opus",
    provider="anthropic",
    model_id="claude-3-opus-20240229",
    description="Claude 3 Opus by Anthropic"
)

# Run the benchmark with concurrent execution
results = benchmark.run_tests(max_concurrent=8)

# Analyze results
analyzer = BenchmarkAnalyzer(results_dir="results/ifeval")
summary_df, detailed_df, category_df = analyzer.analyze()

# Create visualizations
visualizer = BenchmarkVisualizer(results_dir="results/ifeval")
plot_paths = visualizer.create_all_plots(summary_df, detailed_df, category_df)

Architecture

  • benchmark module: Core benchmarking logic and framework

    • benchmark.py: Main benchmarking functionality
    • analyzer.py: Results analysis
    • visualizer.py: Visualization generation
    • evaluators.py: Response evaluation functions
    • ifeval_integration.py: Integration with IFEval
    • concurrent_executor.py: Parallel test execution
    • cli.py: Command line interface
  • llm_clients module: Provider-specific client implementations

    • anthropic.py, openai.py, watsonx.py, ollama.py: Provider clients
    • prompt_formatters.py: Provider-specific prompt formatting
    • factory.py: Client factory for easy instantiation
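
As a rough illustration of how these pieces fit together, a client can be created through the factory and queried directly. The function and method names below are assumptions for illustration only; check factory.py for the actual API:

# Hypothetical usage sketch; the real names exposed by llm_clients/factory.py may differ.
from llm_clients.factory import get_client  # function name assumed for illustration

client = get_client(provider="watsonx", model_id="ibm/granite-3-8b-instruct")
print(client.generate("List three prime numbers."))  # generate() is also an assumption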

Extending

Adding New Models

Register new models using the register_model method:

benchmark.register_model(
    name="my-new-model",
    provider="provider_name",
    model_id="specific_model_id",
    description="Description of model"
)

Adding New Prompt Formatters

To support a new model architecture, extend the PromptFormatter class in prompt_formatters.py:

from typing import Optional  # needed if not already imported in prompt_formatters.py

class NewModelFormatter(PromptFormatter):
    """Formatter for a new model architecture."""

    @classmethod
    def format_prompt(cls, prompt: str, system_prompt: Optional[str] = None) -> str:
        """Format a prompt for the new model architecture."""
        default_system = "You are a helpful assistant."
        system = system_prompt if system_prompt else default_system

        # Create a formatted prompt using the model's expected format
        formatted_prompt = f"<<system>>{system}<<user>>{prompt}<<assistant>>"

        return formatted_prompt

Then update the get_formatter_for_model method to detect and use your new formatter.
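
A hedged sketch of what that dispatch might look like (the actual signature, detection rules, and formatter class names in prompt_formatters.py may differ):

# Illustrative only: the real get_formatter_for_model may be structured differently.
def get_formatter_for_model(model_id: str):
    """Return the formatter class matching a model identifier."""
    model_id = model_id.lower()
    if "newmodel" in model_id:       # hypothetical substring for your new architecture
        return NewModelFormatter
    if "llama" in model_id:
        return LlamaFormatter        # existing formatter class names are assumptions
    if "mistral" in model_id:
        return MistralFormatter
    return PromptFormatter           # fall back to the module's default (illustrative)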

Custom Test Cases

Create custom test cases in JSON format. Example:

[
  {
    "name": "Format Test",
    "instruction": "Respond with your answer in the format: '<<Answer: X>>'",
    "expected_output": "<<Answer:",
    "eval_fn": "contains"
  }
]

Custom Evaluators

Extend the evaluator functions in evaluators.py for custom evaluation logic.
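
For example, a regex-based evaluator could look like the sketch below; how evaluator functions are named and registered in evaluators.py is not shown here, so treat this as illustrative:

import re

def regex_match(response: str, expected_output: str) -> bool:
    """Hypothetical evaluator: pass when the response matches the expected regex."""
    return re.search(expected_output, response) is not None

A test case could then reference it with "eval_fn": "regex_match", mirroring the "contains" example above.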
