🚀 LLM Test Bench

A benchmarking tool for comparing Large Language Model providers during prompt engineering. Test OpenAI, AWS Bedrock, and Google Gemini side by side to compare performance and accuracy on any content type.

✨ Features

  • Multi-Provider Support: Compare OpenAI, AWS Bedrock (multiple models), and Google Gemini
  • Multiple Models Per Provider: Test different model variants (e.g., GPT-4.1 Nano vs. Mini, Claude Haiku vs. Sonnet)
  • Vision Model Support: Full support for Llama 4, Claude, Pixtral, GPT-4V, and Gemini vision models
  • Multi-Image Processing: Process ALL images in test_images/ directory when no specific image is set
  • Flexible Input: Test with text prompts, images, documents, or any content type
  • Structured Output Testing: Compare how well each provider follows your JSON schemas
  • Multi-Tool Testing: Let AI choose the best analysis method from multiple options
  • Modern API Integration: Uses each provider's optimal structured output methods:
    • OpenAI: json_schema and function calling
    • Claude: Tool use with structured schemas
    • Llama 4: Converse API with tool support for vision
    • Pixtral: Function calling with image analysis
    • Gemini: responseSchema with union types
  • Raw Response Comparison: See authentic output formatting from each provider
  • Production Ready: Async operations, error handling, rate limiting
  • Performance Tracking: Monitor token usage and latency across providers
  • Configurable: YAML-based test configuration with environment variables

🎯 Use Cases

  • Prompt Engineering: Compare how different providers handle your prompts
  • Schema Validation: Test which provider best follows your structured output requirements
  • Multi-Tool Selection: Benchmark AI's ability to choose appropriate analysis methods
  • Vision Analysis: Compare accuracy of image understanding across models
  • Performance Testing: Measure latency and reliability across providers
  • Object Detection: Test how well different models identify and categorize various objects
  • Model Comparison: Find optimal cost/performance balance across model variants

🚀 Quick Start

1. Clone and Install

git clone https://github.com/realadeel/llm-test-bench.git
cd llm-test-bench
pip install -r requirements.txt

2. Set Up API Keys

Copy the example environment file and add your API keys:

cp .env.example .env
# Edit .env with your actual API keys

Required keys: API credentials for each provider you plan to test (OpenAI, AWS, and Google Gemini), as listed in .env.example.

3. Configure Your Test

Copy and customize the config:

cp config.yaml.example config.yaml
# Edit config.yaml with your test cases

4. Add Test Content

# Add your test images to test_images/
cp your_content/*.jpg test_images/
cp your_files/*.png test_images/

# Or process ALL existing images by commenting out image_path in config.yaml

5. Run Benchmark

python llm_test_bench.py

📊 Example Output

🎉 Test complete!
📊 Test Cases: 1
🖼️ Images Processed: 3
✅ Successful Provider Calls: 9
❌ Failed Provider Calls: 0

📝 Production-Optimized Multi-Tool Analysis (3 images):
  📸 image1:
    ✅ openai_gpt4_nano: 1101ms
    ✅ gemini_flash_lite: 987ms
    ✅ bedrock_haiku_3: 1234ms
  📸 image2:
    ✅ openai_gpt4_nano: 987ms
    ✅ gemini_flash_lite: 876ms
    ✅ bedrock_haiku_3: 1098ms
  📸 image3:
    ✅ openai_gpt4_nano: 1045ms
    ✅ gemini_flash_lite: 934ms
    ✅ bedrock_haiku_3: 1176ms

📊 Results saved to results/test_results_2025-07-07_01-54-05.json

Optimized JSON Structure

Efficient format with prompt and tools stored once per test case:

[
  {
    "name": "Production-Optimized Multi-Tool Analysis",
    "prompt": "You are a professional appraiser...",  // STORED ONCE
    "max_tokens": 2000,
    "temperature": 0.1,
    "tools": [...], // STORED ONCE
    "is_multi_image": true,
    "image_results": [
      {
        "image_path": "test_images/image1.jpg",
        "provider_results": [
          {
            "provider": "openai_gpt4_nano",
            "model": "gpt-4.1-nano-2025-04-14",
            "response": "{...}",
            "latency_ms": 1101.5,
            "timestamp": "2025-07-07T01:53:52.253486",
            "error": null,
            "tokens_used": 107
          }
          // ... other providers
        ]
      },
      {
        "image_path": "test_images/image2.jpg",
        "provider_results": [
          // ... provider results for image2
        ]
      }
      // ... other images
    ]
  }
]

Benefits:

  • 💾 Efficient storage (prompt and tools stored once per test case)
  • 📁 Clear organization (results grouped by test case, then by image)
  • 🔍 Easy analysis (compare providers across all images in one test case)
  • ⚙️ Full fidelity (all original data preserved)
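Because the prompt and tools appear only once per test case, downstream analysis stays simple. As an illustration (not part of the tool itself), a short script can aggregate per-provider latency from a saved results file, assuming the multi-image structure shown above; `average_latency_by_provider` is a hypothetical helper name:

```python
# Illustrative analysis script (not part of the tool): aggregate per-provider
# latency from the optimized results structure shown above.
import json
from collections import defaultdict

def average_latency_by_provider(test_cases):
    """Average latency_ms per provider across all images, skipping failed calls."""
    latencies = defaultdict(list)
    for case in test_cases:
        for image in case.get("image_results", []):
            for result in image.get("provider_results", []):
                if result.get("error") is None:
                    latencies[result["provider"]].append(result["latency_ms"])
    return {provider: sum(ms) / len(ms) for provider, ms in latencies.items()}

# Example usage against a saved results file:
# with open("results/test_results_2025-07-07_01-54-05.json") as f:
#     print(average_latency_by_provider(json.load(f)))
```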

🔧 Configuration

Multi-Image Processing

Process ALL images in your test directory:

test_cases:
  - name: "Batch Analysis"
    prompt: "Analyze this object..."
    # image_path: "specific.jpg"  # Comment out to process ALL images
    tools:
      # ... your tools

Result: Automatically processes every .jpg, .png, .gif, .webp file in test_images/
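The discovery step amounts to a simple directory scan. An illustrative sketch with `pathlib` (not the tool's actual implementation):

```python
# Illustrative sketch of the multi-image discovery step, not the tool's code.
from pathlib import Path

SUPPORTED_SUFFIXES = {".jpg", ".png", ".gif", ".webp"}

def discover_test_images(directory: str = "test_images"):
    """Return every supported image in the directory, sorted for stable ordering."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```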

Multiple Models Per Provider

Test different model variants from the same provider:

providers:
  # Multiple OpenAI models
  - name: "openai_gpt4_nano"
    model: "gpt-4.1-nano-2025-04-14"
  - name: "openai_gpt4_mini"
    model: "gpt-4.1-mini-2025-04-14"
  
  # Multiple Gemini models
  - name: "gemini_flash_lite"
    model: "gemini-2.0-flash-lite"
  - name: "gemini_pro"
    model: "gemini-1.5-pro"
    
  # Multiple Bedrock models
  - name: "bedrock_haiku_3"
    model: "anthropic.claude-3-haiku-20240307-v1:0"
  - name: "bedrock_sonnet_4"
    model: "us.anthropic.claude-sonnet-4-20250514-v1:0"

Naming Convention: Use pattern {provider}_{model_family}_{variant} for clear identification in results.

Multi-Tool Testing

Let the AI choose the best analysis method:

test_cases:
  - name: "Smart Object Analysis"
    prompt: "Examine this image and choose the appropriate analysis tool."
    image_path: "test_images/sample_image.jpg"
    tools:
      - name: "analyze_media_content"
        description: "For digital media files and multimedia content"
        schema:
          type: "object"
          properties:
            title: {type: "string"}
            creator: {type: "string"}
            category: {type: "string"}
            year: {type: "string"}
          required: ["title", "creator"]
          
      - name: "analyze_publication"
        description: "For printed materials and publications"
        schema:
          type: "object"
          properties:
            title: {type: "string"}
            author: {type: "string"}
            publisher: {type: "string"}
          required: ["title", "author"]
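To make "converting tool schemas to each provider's format" concrete, here is a hedged sketch of how a YAML tool entry like those above could map to OpenAI's function-calling `tools` parameter; `to_openai_tool` is a hypothetical helper, not the tool's real code:

```python
# Hypothetical sketch of schema conversion for one provider. The input dict
# mirrors a YAML tool entry above; the output follows the shape of OpenAI's
# Chat Completions "tools" parameter.
def to_openai_tool(tool: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["schema"],
        },
    }
```

Claude tool use and Gemini responseSchema would need their own, structurally different conversions.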

Single Schema Testing

Traditional structured output testing:

test_cases:
  - name: "Object Detection"
    prompt: "List all objects in this image."
    image_path: "test_images/scene.jpg"
    schema:
      type: "object"
      properties:
        objects:
          type: "array"
          items:
            properties:
              name: {type: "string"}
              confidence: {type: "number"}
        total_count: {type: "integer"}
      required: ["objects", "total_count"]
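One way to score schema adherence yourself is to validate each raw response against the configured schema. A sketch using the third-party `jsonschema` package (`pip install jsonschema`), with a schema mirroring the example above; the helper name is illustrative:

```python
# Sketch: validate a raw provider response against a test-case schema using the
# third-party jsonschema package. The schema mirrors the example above.
import json
from jsonschema import ValidationError, validate

OBJECT_DETECTION_SCHEMA = {
    "type": "object",
    "properties": {
        "objects": {
            "type": "array",
            "items": {
                "properties": {
                    "name": {"type": "string"},
                    "confidence": {"type": "number"},
                },
            },
        },
        "total_count": {"type": "integer"},
    },
    "required": ["objects", "total_count"],
}

def response_follows_schema(raw_response: str) -> bool:
    """True if the response is valid JSON that satisfies the schema."""
    try:
        validate(instance=json.loads(raw_response), schema=OBJECT_DETECTION_SCHEMA)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False
```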

πŸ“ Project Structure

llm-test-bench/
├── llm_test_bench.py     # Main benchmarking engine
├── config.yaml           # Your test configuration
├── config.yaml.example   # Example configuration
├── test_images/          # Your test images (auto-processed when image_path not set)
├── results/              # Benchmark results (optimized JSON format)
├── docs/                 # Documentation
│   └── README.md         # Comprehensive documentation
└── requirements.txt      # Dependencies

πŸŽ›οΈ Advanced Configuration

Provider Configurations

providers:
  # Multiple OpenAI models
  - name: "openai_gpt4_nano"
    model: "gpt-4.1-nano-2025-04-14"
  - name: "openai_gpt4_mini"
    model: "gpt-4.1-mini-2025-04-14"
  
  # Multiple Gemini models
  - name: "gemini_flash_lite"
    model: "gemini-2.0-flash-lite"
  - name: "gemini_pro"
    model: "gemini-1.5-pro"
    
  # Multiple Bedrock models
  - name: "bedrock_haiku_3"
    model: "anthropic.claude-3-haiku-20240307-v1:0"
  - name: "bedrock_sonnet_4"
    model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
  - name: "bedrock_llama_4_maverick"
    model: "us.meta.llama4-maverick-17b-instruct-v1:0"

# Rate limiting
delay_between_calls: 1      # seconds between API calls
delay_between_test_cases: 2 # seconds between test cases
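The delay settings above translate to simple sleeps between calls. A minimal async sketch, where `call_provider` is a hypothetical coroutine standing in for a real API call:

```python
# Minimal async sketch of the fixed-delay rate limiting configured above.
# call_provider is a hypothetical coroutine standing in for a real API call.
import asyncio

DELAY_BETWEEN_CALLS = 1  # seconds, mirroring delay_between_calls in config.yaml

async def run_test_case(providers, call_provider, delay=DELAY_BETWEEN_CALLS):
    """Call each provider in turn, sleeping between calls to respect rate limits."""
    results = []
    for provider in providers:
        results.append(await call_provider(provider))
        await asyncio.sleep(delay)
    return results
```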

Test Parameters

test_cases:
  - name: "Custom Test"
    prompt: "Your analysis prompt..."
    image_path: "test_images/image.jpg"  # Or comment out for multi-image
    max_tokens: 2000
    temperature: 0.7
    # ... schema or tools

πŸ” How It Works

  1. Loads Configuration: Reads your test cases and provider settings from YAML
  2. Processes Images: Converts images to base64 for API calls
  3. Smart API Selection: Uses optimal API for each model (Converse for Llama 4 vision, InvokeModel for others)
  4. Structured Requests: Converts your tool schemas to each provider's format
  5. Captures Raw Responses: Records authentic JSON output from each API
  6. Measures Performance: Tracks latency, tokens, and success rates
  7. Optimized Storage: Groups results by test case for efficient analysis
  8. Saves Results: Outputs organized JSON results for comparison
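Step 2, converting images to base64, is standard-library work; a sketch of roughly what it involves:

```python
# Sketch of step 2: read an image file and base64-encode it for an API payload.
import base64
from pathlib import Path

def encode_image(image_path: str) -> str:
    """Return the file's bytes as a base64 string, as vision APIs generally expect."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
```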

🛠️ Requirements

  • Python 3.8+
  • API keys for desired providers
  • Images in supported formats (JPG, PNG, GIF, WebP)

πŸ“ License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📧 Support

Questions or bug reports? Open an issue on the GitHub repository.

Ready to benchmark your vision AI? Start with the Quick Start guide above! 🚀
