An optimized benchmarking tool for comparing Large Language Model providers during prompt engineering. Test OpenAI, AWS Bedrock, and Google Gemini side-by-side to optimize performance and accuracy across any content type.
- Multi-Provider Support: Compare OpenAI, AWS Bedrock (multiple models), and Google Gemini
- Multiple Models Per Provider: Test different model variants (e.g., GPT-4 Nano vs Mini, Claude Haiku vs Sonnet)
- Vision Model Support: Full support for Llama 4, Claude, Pixtral, GPT-4V, and Gemini vision models
- Multi-Image Processing: Process ALL images in test_images/ directory when no specific image is set
- Flexible Input: Test with text prompts, images, documents, or any content type
- Structured Output Testing: Compare how well each provider follows your JSON schemas
- Multi-Tool Testing: Let AI choose the best analysis method from multiple options
- Modern API Integration: Uses each provider's optimal structured output methods:
  - OpenAI: `json_schema` and function calling
  - Claude: Tool use with structured schemas
  - Llama 4: Converse API with tool support for vision
  - Pixtral: Function calling with image analysis
  - Gemini: `responseSchema` with union types
- Raw Response Comparison: See authentic output formatting from each provider
- Production Ready: Async operations, error handling, rate limiting
- Performance Tracking: Monitor token usage and latency across providers
- Configurable: YAML-based test configuration with environment variables
- Prompt Engineering: Compare how different providers handle your prompts
- Schema Validation: Test which provider best follows your structured output requirements
- Multi-Tool Selection: Benchmark AI's ability to choose appropriate analysis methods
- Vision Analysis: Compare accuracy of image understanding across models
- Performance Testing: Measure latency and reliability across providers
- Object Detection: Test how well different models identify and categorize various objects
- Model Comparison: Find optimal cost/performance balance across model variants
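Per-provider latency numbers like those reported in the results can be captured by timing each awaited call. A minimal sketch of that pattern, assuming async provider clients; the `timed_call` helper and the fake provider below are illustrative, not the tool's actual API:

```python
import asyncio
import time

async def timed_call(provider_name: str, call):
    """Run one provider call and record its latency in milliseconds."""
    start = time.perf_counter()
    response = await call()  # the actual API request happens here
    latency_ms = (time.perf_counter() - start) * 1000
    return {"provider": provider_name, "response": response, "latency_ms": latency_ms}

async def fake_provider():
    # Stand-in for a real API call; sleeps briefly to simulate network latency.
    await asyncio.sleep(0.05)
    return "ok"

result = asyncio.run(timed_call("openai_gpt4_nano", fake_provider))
```

The same wrapper can feed a rate limiter (sleep between calls) without touching the provider code.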
    git clone https://github.com/realadeel/llm-test-bench.git
    cd llm-test-bench
    pip install -r requirements.txt

Copy the example environment file and add your API keys:

    cp .env.example .env
    # Edit .env with your actual API keys

Required keys:

- `OPENAI_API_KEY` - Get from OpenAI Platform
- `AWS_ACCESS_KEY_ID` & `AWS_SECRET_ACCESS_KEY` - Get from AWS Console
- `GEMINI_API_KEY` - Get from Google AI Studio
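A quick way to confirm your environment is set up is to check for the required keys before running. The `missing_keys` helper below is a hypothetical sketch, not part of the tool; it assumes the keys have been loaded into the environment (e.g. via python-dotenv):

```python
import os

# Keys the benchmark expects in .env (loaded into the environment before running).
REQUIRED_KEYS = ["OPENAI_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "GEMINI_API_KEY"]

def missing_keys(env=os.environ):
    """Return the names of any required API keys that are not set."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: with only OPENAI_API_KEY set, the other three are reported missing.
missing = missing_keys({"OPENAI_API_KEY": "sk-..."})
```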
Copy and customize the config:

    cp config.yaml.example config.yaml
    # Edit config.yaml with your test cases

    # Add your test images to test_images/
    cp your_content/*.jpg test_images/
    cp your_files/*.png test_images/
    # Or process ALL existing images by commenting out image_path in config.yaml

Run the benchmark:

    python llm_test_bench.py

Example output:

    Test complete!
    Test Cases: 1
    Images Processed: 3
    ✅ Successful Provider Calls: 9
    ❌ Failed Provider Calls: 0

    Production-Optimized Multi-Tool Analysis (3 images):
      📸 image1:
        ✅ openai_gpt4_nano: 1101ms
        ✅ gemini_flash_lite: 987ms
        ✅ bedrock_haiku_3: 1234ms
      📸 image2:
        ✅ openai_gpt4_nano: 987ms
        ✅ gemini_flash_lite: 876ms
        ✅ bedrock_haiku_3: 1098ms
      📸 image3:
        ✅ openai_gpt4_nano: 1045ms
        ✅ gemini_flash_lite: 934ms
        ✅ bedrock_haiku_3: 1176ms

    Results saved to results/test_results_2025-07-07_01-54-05.json
Efficient format with prompt and tools stored once per test case:
    [
      {
        "name": "Production-Optimized Multi-Tool Analysis",
        "prompt": "You are a professional appraiser...",  // STORED ONCE
        "max_tokens": 2000,
        "temperature": 0.1,
        "tools": [...],  // STORED ONCE
        "is_multi_image": true,
        "image_results": [
          {
            "image_path": "test_images/image1.jpg",
            "provider_results": [
              {
                "provider": "openai_gpt4_nano",
                "model": "gpt-4.1-nano-2025-04-14",
                "response": "{...}",
                "latency_ms": 1101.5,
                "timestamp": "2025-07-07T01:53:52.253486",
                "error": null,
                "tokens_used": 107
              }
              // ... other providers
            ]
          },
          {
            "image_path": "test_images/image2.jpg",
            "provider_results": [
              // ... provider results for image2
            ]
          }
          // ... other images
        ]
      }
    ]

Benefits:
- Maximum efficiency (prompt + tools stored once per test case)
- Perfect organization (results grouped by test case, then by image)
- Easy analysis (compare providers across all images in one test case)
- Full compatibility (all original data preserved)
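Because the prompt and tools are stored once and provider results are grouped per image, cross-provider analysis is a short script. A sketch that averages latency per provider over one test case; the helper name and the stand-in data (mirroring the stored format) are illustrative:

```python
import statistics

def mean_latency_by_provider(test_case: dict) -> dict:
    """Average latency per provider across all images in one test case."""
    samples = {}
    for image in test_case["image_results"]:
        for r in image["provider_results"]:
            if r["error"] is None:  # skip failed calls
                samples.setdefault(r["provider"], []).append(r["latency_ms"])
    return {p: statistics.mean(v) for p, v in samples.items()}

# Minimal stand-in for one test case loaded from results/*.json.
case = {
    "image_results": [
        {"provider_results": [
            {"provider": "openai_gpt4_nano", "latency_ms": 1101.0, "error": None},
            {"provider": "gemini_flash_lite", "latency_ms": 987.0, "error": None},
        ]},
        {"provider_results": [
            {"provider": "openai_gpt4_nano", "latency_ms": 987.0, "error": None},
            {"provider": "gemini_flash_lite", "latency_ms": 876.0, "error": None},
        ]},
    ]
}
means = mean_latency_by_provider(case)
```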
Process ALL images in your test directory:
    test_cases:
      - name: "Batch Analysis"
        prompt: "Analyze this object..."
        # image_path: "specific.jpg"  # Comment out to process ALL images
        tools:
          # ... your tools

Result: Automatically processes every .jpg, .png, .gif, and .webp file in test_images/.
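Directory scanning of this kind can be sketched with `pathlib`. The `discover_images` helper below is an assumption about the behavior, not the tool's actual code; the demo runs against a throwaway directory standing in for test_images/:

```python
import tempfile
from pathlib import Path

# Extensions documented as auto-processed.
SUPPORTED = {".jpg", ".png", ".gif", ".webp"}

def discover_images(directory: str) -> list:
    """Return every supported image in the directory, sorted for a stable run order."""
    return sorted(p.name for p in Path(directory).iterdir() if p.suffix.lower() in SUPPORTED)

# Demo: non-image files are ignored, extension matching is case-insensitive.
with tempfile.TemporaryDirectory() as d:
    for name in ["a.jpg", "b.PNG", "notes.txt", "c.webp"]:
        (Path(d) / name).touch()
    found = discover_images(d)
```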
Test different model variants from the same provider:
    providers:
      # Multiple OpenAI models
      - name: "openai_gpt4_nano"
        model: "gpt-4.1-nano-2025-04-14"
      - name: "openai_gpt4_mini"
        model: "gpt-4.1-mini-2025-04-14"
      # Multiple Gemini models
      - name: "gemini_flash_lite"
        model: "gemini-2.0-flash-lite"
      - name: "gemini_pro"
        model: "gemini-1.5-pro"
      # Multiple Bedrock models
      - name: "bedrock_haiku_3"
        model: "anthropic.claude-3-haiku-20240307-v1:0"
      - name: "bedrock_sonnet_4"
        model: "us.anthropic.claude-sonnet-4-20250514-v1:0"

Naming Convention: Use the pattern `{provider}_{model_family}_{variant}` for clear identification in results.
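If you want to sanity-check your own config against the naming convention, a loose three-part check might look like this; the regex and helper are illustrative assumptions, not enforced by the tool:

```python
import re

# Loose check for the suggested {provider}_{model_family}_{variant} pattern,
# e.g. "bedrock_haiku_3" or "openai_gpt4_nano".
NAME_RE = re.compile(r"^[a-z0-9]+_[a-z0-9]+_[a-z0-9]+$")

def follows_convention(name: str) -> bool:
    """True if the provider name has exactly three lowercase alphanumeric parts."""
    return bool(NAME_RE.fullmatch(name))
```

Names with extra or fewer underscores would need a looser pattern; adjust the regex to taste.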
Let the AI choose the best analysis method:
    test_cases:
      - name: "Smart Object Analysis"
        prompt: "Examine this image and choose the appropriate analysis tool."
        image_path: "test_images/sample_image.jpg"
        tools:
          - name: "analyze_media_content"
            description: "For digital media files and multimedia content"
            schema:
              type: "object"
              properties:
                title: {type: "string"}
                creator: {type: "string"}
                category: {type: "string"}
                year: {type: "string"}
              required: ["title", "creator"]
          - name: "analyze_publication"
            description: "For printed materials and publications"
            schema:
              type: "object"
              properties:
                title: {type: "string"}
                author: {type: "string"}
                publisher: {type: "string"}
              required: ["title", "author"]

Traditional structured output testing:
    test_cases:
      - name: "Object Detection"
        prompt: "List all objects in this image."
        image_path: "test_images/scene.jpg"
        schema:
          type: "object"
          properties:
            objects:
              type: "array"
              items:
                properties:
                  name: {type: "string"}
                  confidence: {type: "number"}
            total_count: {type: "integer"}
          required: ["objects", "total_count"]

Project layout:

    llm-test-bench/
    ├── llm_test_bench.py       # Main benchmarking engine
    ├── config.yaml             # Your test configuration
    ├── config.yaml.example     # Example configuration
    ├── test_images/            # Your test images (auto-processed when image_path not set)
    ├── results/                # Benchmark results (optimized JSON format)
    ├── docs/                   # Documentation
    │   └── README.md           # Comprehensive documentation
    └── requirements.txt        # Dependencies
    providers:
      # Multiple OpenAI models
      - name: "openai_gpt4_nano"
        model: "gpt-4.1-nano-2025-04-14"
      - name: "openai_gpt4_mini"
        model: "gpt-4.1-mini-2025-04-14"
      # Multiple Gemini models
      - name: "gemini_flash_lite"
        model: "gemini-2.0-flash-lite"
      - name: "gemini_pro"
        model: "gemini-1.5-pro"
      # Multiple Bedrock models
      - name: "bedrock_haiku_3"
        model: "anthropic.claude-3-haiku-20240307-v1:0"
      - name: "bedrock_sonnet_4"
        model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
      - name: "bedrock_llama_4_maverick"
        model: "us.meta.llama4-maverick-17b-instruct-v1:0"

    # Rate limiting
    delay_between_calls: 1        # seconds between API calls
    delay_between_test_cases: 2   # seconds between test cases

    test_cases:
      - name: "Custom Test"
        prompt: "Your analysis prompt..."
        image_path: "test_images/image.jpg"  # Or comment out for multi-image
        max_tokens: 2000
        temperature: 0.7
        # ... schema or tools

How it works:

- Loads Configuration: Reads your test cases and provider settings from YAML
- Processes Images: Converts images to base64 for API calls
- Smart API Selection: Uses optimal API for each model (Converse for Llama 4 vision, InvokeModel for others)
- Structured Requests: Converts your tool schemas to each provider's format
- Captures Raw Responses: Records authentic JSON output from each API
- Measures Performance: Tracks latency, tokens, and success rates
- Optimized Storage: Groups results by test case for efficient analysis
- Saves Results: Outputs organized JSON results for comparison
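The image-processing step above (converting images to base64 for API calls) can be sketched as follows; `encode_image` is an illustrative helper, not the engine's actual function, and the demo writes a tiny stand-in file rather than reading from test_images/:

```python
import base64
import os
import tempfile

def encode_image(path: str) -> str:
    """Read an image file and return the base64 text used in provider payloads."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Demo with placeholder content (the first three bytes of a JPEG header).
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
    f.write(b"\xff\xd8\xff")
    tmp = f.name
encoded = encode_image(tmp)
os.unlink(tmp)
```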
- Python 3.8+
- API keys for desired providers
- Images in supported formats (JPG, PNG, GIF, WebP)
MIT License - see LICENSE file for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Report Issues
- Request Features
- Documentation
Ready to benchmark your vision AI? Start with the Quick Start guide above!