This project provides a Python-based tool for benchmarking the performance and quality of various large language model (LLM) APIs. It sends sample queries to each API, measures and averages response times, and assesses response quality using multiple metrics, including BLEU and ROUGE scores.
The tool is designed to support API and model assessment for various use cases, considering both speed and output quality. Future versions will support deeper evaluation across multiple metrics and multi-modal model assessments.
- Supports multiple LLM APIs: OpenAI, Azure OpenAI, Anthropic, Custom OpenAI-compatible endpoints (e.g., LM Studio), Hugging Face, and AWS Bedrock.
- Allows users to select specific APIs and models to benchmark through a command-line interface.
- Uses predefined sample queries with reference answers for consistent evaluation.
- Constrains response length to a fixed number of tokens for fair assessment.
- Measures and reports average response times for each API and model combination.
- Calculates BLEU scores to assess the quality of generated responses.
- Calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) for a more comprehensive quality assessment.
- Provides color-coded output for easy interpretation of performance and quality metrics.
```
.
├── main.py
├── config.py
├── utils.py
├── quality_metrics.py
└── benchmarks/
    ├── __init__.py
    ├── base.py
    ├── openai_benchmark.py
    ├── azure_openai_benchmark.py
    ├── anthropic_benchmark.py
    ├── local_openai_benchmark.py
    ├── huggingface_benchmark.py
    └── aws_bedrock_benchmark.py
```
- `main.py`: The entry point of the application
- `config.py`: Contains configuration settings such as API models, sample queries, and reference answers (a rough sketch of its shape follows this list)
- `utils.py`: Utility functions for user input and output formatting
- `quality_metrics.py`: Implements the quality assessment metrics (BLEU and ROUGE scores)
- `benchmarks/`: Directory containing the benchmark classes for each API
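As a rough illustration of what `config.py` holds, the configuration might look something like the sketch below. All names and values here are assumptions for illustration only; the real file defines the actual model lists, sample queries, and reference answers.

```python
# Hypothetical sketch of config.py's shape -- not the actual contents.
MAX_TOKENS = 100  # fixed response length for fair comparison (assumed value)

API_MODELS = {
    "OpenAI": ["gpt-4o", "gpt-4o-mini"],          # placeholder model names
    "Anthropic": ["claude-3-5-sonnet-20240620"],  # placeholder model names
}

SAMPLE_QUERIES = [
    {
        "query": "Explain photosynthesis in one sentence.",
        "reference": "Photosynthesis is the process by which plants convert "
                     "light, water, and carbon dioxide into glucose and oxygen.",
    },
]
```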
To run the benchmarking tool, you need:
- Python 3.12.2 installed
- The libraries listed in `requirements.txt`
- API credentials and access tokens for the APIs you want to benchmark
- Clone the repository:

  ```
  git clone https://github.com/yourusername/llm-benchmarking-tool.git
  cd llm-benchmarking-tool
  ```

- Install the required dependencies (assuming you're using pyenv):

  ```
  pyenv install 3.12.2
  pyenv virtualenv 3.12.2 llm-benchmarking
  pyenv activate llm-benchmarking
  pip install -r requirements.txt
  ```

- Set up API credentials (a loading sketch follows these steps):
  - Obtain the necessary API keys, access tokens, or authentication credentials for each API you want to benchmark.
  - Create a `.env` file in the root directory, using the `.env-sample` file as a template, and add your API keys and endpoints as needed.

- Install the NLTK data needed for BLEU score calculation:

  ```
  python -c "import nltk; nltk.download('punkt')"
  ```
Run the benchmarking tool with:

```
python main.py
```
Follow the prompts to:
- Select the APIs you want to benchmark
- Choose specific models for each selected API
- Set the number of iterations for each benchmark
The tool will run the benchmarks and display results, showing the average response time, BLEU score, and ROUGE scores for each API and model combination.
- Load configuration settings and environment variables.
- Present available APIs and prompt user for selection.
- For each selected API, show available models and ask user to choose.
- Prompt the user for the number of iterations to run for each benchmark.
- Run the benchmarks, sending the predefined queries to each selected API and model.
- Measure response time and calculate BLEU and ROUGE scores for each query (a minimal sketch of this loop follows this list).
- Display results, with color-coding based on average response times and quality metric scores.
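To make the timing and scoring step concrete, here is a minimal, self-contained sketch of that loop. It assumes `nltk` and the `rouge-score` package are used for the metrics; the real implementation lives in `quality_metrics.py` and the benchmark classes, and the names here are illustrative.

```python
import time

from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer  # assumes the rouge-score package is installed


def benchmark_query(invoke, query: str, reference: str) -> dict:
    """Time a single API call and score its output against a reference answer."""
    start = time.perf_counter()
    output = invoke(query)                      # call the selected API/model
    elapsed = time.perf_counter() - start

    # BLEU on tokenised text; smoothing avoids zero scores on short responses.
    bleu = sentence_bleu(
        [word_tokenize(reference)],
        word_tokenize(output),
        smoothing_function=SmoothingFunction().method1,
    )

    # ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {name: score.fmeasure for name, score in scorer.score(reference, output).items()}

    return {"response_time": elapsed, "bleu": bleu, **rouge}
```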
To add a new benchmark for a different API:
- Create a new file in the `benchmarks/` directory, e.g. `new_api_benchmark.py`.
- Define a new class that inherits from `BaseBenchmark` in `benchmarks/base.py`.
- Implement the `setup_client()`, `invoke_model()`, and `extract_output()` methods for the new API.
- Add the new benchmark class to `benchmarks/__init__.py`.
- Update `config.py` to include the new API and its available models.
- Modify `main.py` to include the new benchmark in the `benchmarks` dictionary (a registration sketch follows the example class below).
Example of a new benchmark class:
```python
from .base import BaseBenchmark


class NewAPIBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__("New API")

    def setup_client(self):
        # Initialize and return the client for the new API
        pass

    def invoke_model(self, client, query, model, max_tokens):
        # Implement the API call and return the response
        pass

    def extract_output(self, response):
        # Extract and return the generated text from the API response
        pass
```
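The registration described in the last three steps might then look roughly like the following. This is a sketch under the assumption that `config.py` keeps a per-API model list and that `main.py` maps user-facing API names to benchmark instances in a `benchmarks` dictionary; the actual structures may differ.

```python
# benchmarks/__init__.py -- export the new class (sketch)
from .new_api_benchmark import NewAPIBenchmark

# config.py -- assumed shape of the per-API model list (placeholder model names)
API_MODELS = {
    "New API": ["new-model-small", "new-model-large"],
}

# main.py -- assumed shape of the benchmark registry
benchmarks = {
    "New API": NewAPIBenchmark(),
}
```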
- Add ability to select different models available on the API endpoints.
- Implement basic quality assessment functionality (BLEU score) to evaluate the responses generated by each API.
- Enhance quality assessment functionality:
  - Additional Automated Metrics:
    - Implement ROUGE metrics for assessing the quality of generated summaries or text.
    - Explore perplexity as a measure of language model performance.
    - Calculate and compare automated metric scores for each API.
  - Human Evaluation:
    - Develop a user interface for manual review and rating of response quality.
    - Define criteria for assessing response quality (e.g., relevance, coherence, accuracy).
    - Implement a scoring system for human evaluators to rate responses.
    - Calculate average quality scores for each API based on human evaluations.
  - Contextual Embedding Similarity (a minimal sketch appears after this roadmap):
    - Explore techniques for comparing generated responses with reference responses or domain knowledge.
    - Implement cosine similarity or semantic similarity measures.
    - Calculate similarity scores between generated responses and reference data.
    - Evaluate the contextual relevance of responses from each API.
  - Task-Specific Evaluation:
    - Design specific tasks or questions with known correct answers or expected outputs.
    - Implement functions to compare generated responses against expected answers.
    - Calculate accuracy or other relevant metrics for task-specific evaluation.
    - Analyse task-specific performance of each API.
- Long Context Model Benchmarking
  - Needle-in-a-haystack test
- Image Captioning
  - Provide the model with images and ask it to generate descriptive captions or summaries.
  - Evaluate the generated captions for accuracy, relevance, and level of detail.
  - Compare the model's performance with human-generated captions or existing image captioning benchmarks.
- Visual Question Answering (VQA)
  - Present the model with an image and a related question.
  - Assess the model's ability to understand the visual content and provide accurate, relevant answers.
  - Use established VQA datasets or create custom questions to cover a range of visual reasoning tasks.
- Image-to-Text Generation
  - Give the model an image as input and ask it to generate a coherent, descriptive text based on the visual content.
  - Evaluate the generated text for its quality, coherence, and alignment with the image.
  - Consider factors such as the inclusion of relevant details, the logical flow of the generated text, and the model's ability to capture the main aspects of the image.
- Text-to-Image Generation
  - Provide the model with textual descriptions or prompts and assess its ability to generate corresponding images.
  - Evaluate the generated images for their visual quality, adherence to the textual description, and creativity.
  - Compare the model's performance with existing text-to-image generation models or human-created illustrations.
- Multimodal Dialogue
  - Engage the model in a conversation that involves both text and images.
  - Assess the model's ability to understand and generate responses that integrate information from both modalities.
  - Evaluate the coherence and relevance of the model's responses, as well as its ability to maintain context across multiple turns of the conversation.
- Multimodal Sentiment Analysis
  - Present the model with a combination of text and images that convey a particular sentiment or emotion.
  - Assess the model's ability to accurately identify and classify the sentiment based on the multimodal input.
  - Compare the model's performance with existing sentiment analysis benchmarks that involve both text and images.
- Cross-Modal Retrieval
  - Provide the model with an image and ask it to retrieve relevant textual information, or vice versa.
  - Evaluate the model's ability to establish meaningful connections between the visual and textual modalities.
  - Assess the relevance and accuracy of the retrieved information based on the given multimodal query.
- A/B Testing:
  - Design controlled experiments for comparing model performance based on user preferences.
  - Develop a system for presenting responses from different APIs to users without revealing the source.
  - Collect user preferences, ratings, or other metrics during A/B testing.
  - Analyse A/B testing results to determine the relative performance of each API.
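As one possible shape for the contextual-embedding-similarity idea on the roadmap, cosine similarity between sentence embeddings could look like the sketch below. It assumes the `sentence-transformers` library and uses a placeholder embedding model name; it is not part of the current tool.

```python
from sentence_transformers import SentenceTransformer, util  # assumed future dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model


def embedding_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the embeddings of two texts, in [-1, 1]."""
    vectors = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(vectors[0], vectors[1]).item()
```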
Contributions are welcome! If you have ideas, suggestions, or bug reports, please open an issue or submit a pull request. Please follow the existing code style and provide appropriate documentation for your changes.
This project is licensed under the MIT License.
We'd like to thank the developers and maintainers of the various language models, their APIs, and the tools used in this benchmarking project. This tech continues to blow our minds on a daily basis.