Valyrian Games: Olympics of AI

The Ultimate AI Model Benchmarking Platform


🔗 Connect with ValyrianTech


🌟 Follow me on all platforms → linktr.ee/ValyrianTech


📊 Quick Access to Key Results


Welcome to the Valyrian Games, an advanced AI benchmarking system that evaluates Large Language Models (LLMs) through rigorous coding challenges. This platform serves as the "Olympics of AI," providing comprehensive performance analytics, cost analysis, and qualification metrics for AI models across multiple providers.

🎯 System Overview

The Valyrian Games platform is a sophisticated benchmarking ecosystem designed to:

  • Automatically generate coding challenges using AI models
  • Validate challenge quality through multi-attempt solving
  • Track comprehensive performance metrics including cost, speed, and accuracy
  • Provide executive-grade analytics with professional visualizations
  • Support automated model qualification based on performance thresholds
  • Enable cost-aware model selection for optimal resource utilization

๐Ÿ—๏ธ Architecture

The system consists of four core components working in harmony:

1. Challenge Generation & Execution

The foundation system that orchestrates the complete challenge workflow:

  • Docker Environment Management: Automatically restarts containers for clean execution
  • Challenge Creation: Generates unique coding problems using specified LLM models
  • Multi-Attempt Validation: Tests each challenge multiple times to ensure solvability
  • Performance Tracking: Records tokens, costs, timing, and accuracy metrics
  • Automated Classification: Sorts results into accepted/ or rejected/ directories
  • Statistics Integration: Automatically triggers analysis updates after completion

Key Features:

  • Configurable validation attempts (default: 3)
  • Adjustable success thresholds (default: 50%)
  • Timeout handling with graceful failure recording
  • Comprehensive conversation metrics extraction
  • Real-time progress monitoring with detailed logging
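
The workflow above can be condensed into a short sketch. This is illustrative only: solve_challenge() and the challenge dictionary are placeholders for the real solver and challenge objects, while the attempt count and threshold defaults match the values quoted above.

def validate_challenge(challenge, solve_challenge, attempts=3, success_threshold=0.5):
    # Run the solver several times and classify the challenge (illustrative sketch).
    correct = 0
    for _ in range(attempts):
        answer = solve_challenge(challenge)            # one solution attempt
        if answer == challenge["expected_answer"]:
            correct += 1
    success_rate = correct / attempts
    accepted = success_rate >= success_threshold
    return {
        "total_attempts": attempts,
        "correct_answers": correct,
        "success_rate": round(success_rate, 2),
        "accepted": accepted,
        "status": "ACCEPTED" if accepted else "REJECTED",
    }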

2. Intelligent Model Selection

Advanced orchestration system for automated testing across the entire model fleet:

  • Cost-Aware Selection: Prioritizes cheaper models and those with fewer existing challenges
  • Dynamic Disqualification: Two-tier system removes poorly performing models
    • Early Disqualification: ≥3 rejected challenges with 0 accepted
    • Statistical Disqualification: <50% acceptance rate after 10+ challenges
  • Weighted Random Selection: Balances data collection across models using cost and challenge count
  • Qualified Model Pool: Automatically loads qualified models from cost analysis data
  • Batch Execution: Supports multiple runs with configurable delays
  • Auto-Visualization: Generates updated charts after successful runs
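
As a rough illustration of the cost-aware weighted selection described above, the sketch below biases a random draw toward cheaper models and models with fewer existing challenges. The weighting formula and field names are assumptions, not the actual implementation.

import random

def pick_model(models):
    # models: list of dicts with "name", "avg_cost" ($ per challenge) and "challenge_count"
    weights = [1.0 / ((m["avg_cost"] + 1e-6) * (m["challenge_count"] + 1)) for m in models]
    return random.choices(models, weights=weights, k=1)[0]

qualified = [
    {"name": "OpenAI:gpt-4.1-2025-04-14", "avg_cost": 0.0125, "challenge_count": 12},
    {"name": "Anthropic:claude-3-5-sonnet-20241022", "avg_cost": 0.0090, "challenge_count": 4},
]
print(pick_model(qualified)["name"])   # cheaper, less-tested models are drawn more often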

Supported Model Providers:

  • OpenAI: GPT-4.1, GPT-4o, O1, O3, O4 series (62+ models)
  • Anthropic: Claude Opus, Sonnet, Haiku series
  • Google: Gemini 2.5, 2.0, 1.5 series
  • Mistral: Magistral, Codestral, Devstral, Ministral series
  • DeepSeek: Chat and reasoning models
  • Together.ai: DeepSeek-R1, Qwen, GLM, Llama variants
  • Groq: High-speed inference models

3. Comprehensive Analytics Engine

Sophisticated statistical analysis system that processes all challenge results:

  • Multi-Dimensional Metrics: Acceptance rate, success rate, cost efficiency, token usage
  • Nested Directory Support: Handles complex provider structures (Together.ai, Groq)
  • Qualification Determination: Automatic model qualification based on performance criteria
  • Cost Summary Generation: Creates model_cost_summary.json for downstream systems
  • Executive Reporting: Generates detailed markdown reports with model rankings
  • Auto-Visualization Trigger: Launches chart generation after analysis completion

Generated Outputs:

  • qualification_results.md: Comprehensive performance report
  • model_cost_summary.json: Structured data for visualizations and model selection
  • Console reports with sortable metrics and detailed breakdowns
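
A minimal sketch of the statistics pass, assuming the accepted/ and rejected/ layout shown in the Directory Structure section below; apart from the documented model_cost_summary.json filename, the field names here are illustrative.

import json
from pathlib import Path

def summarize_model(model_dir: Path) -> dict:
    # Count accepted and rejected challenge result files for one model directory.
    accepted = len(list((model_dir / "accepted").glob("*.json")))
    rejected = len(list((model_dir / "rejected").glob("*.json")))
    total = accepted + rejected
    return {
        "model": model_dir.name,
        "total_challenges": total,
        "acceptance_rate": accepted / total if total else 0.0,   # primary metric
    }

root = Path(".")
summary = [summarize_model(d) for d in root.iterdir() if (d / "accepted").is_dir()]
Path("model_cost_summary.json").write_text(json.dumps(summary, indent=2))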

4. Professional Visualization Suite

Executive-grade analytics dashboard with four distinct visualization types:

Cost vs Performance Scatter Plot

Cost Performance Chart

  • X-Axis: Acceptance Rate (%) - primary performance metric
  • Y-Axis: Average Cost per Challenge ($) - logarithmic scale for better distribution
  • Bubble Size: Total challenges completed
  • Color Coding: Green (qualified) vs Red (disqualified)
  • Smart Labeling: Collision-aware model name placement
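
The snippet below is a rough matplotlib sketch of this chart, not the project's plotting code; the data points are invented purely to show the axis scaling, bubble sizing, and color coding described above.

import matplotlib.pyplot as plt

models = ["model-a", "model-b", "model-c"]
acceptance = [82.0, 64.0, 30.0]                  # x-axis: acceptance rate (%)
avg_cost = [0.0120, 0.0015, 0.0400]              # y-axis: $ per challenge (log scale)
challenges = [30, 12, 6]                         # bubble size: total challenges completed
qualified = [True, True, False]

colors = ["green" if q else "red" for q in qualified]
plt.scatter(acceptance, avg_cost, s=[c * 20 for c in challenges], c=colors, alpha=0.6)
plt.yscale("log")
plt.xlabel("Acceptance Rate (%)")
plt.ylabel("Average Cost per Challenge ($)")
for x, y, name in zip(acceptance, avg_cost, models):
    plt.annotate(name, (x, y))
plt.savefig("valyrian_games_cost_performance.png", dpi=150)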

Model Comparison Bar Charts

Model Comparison Chart

  • Dual Charts: Acceptance rates and costs side-by-side
  • Sorted Rankings: Performance-based ordering for easy comparison
  • Value Labels: Precise metrics displayed on each bar
  • Qualification Status: Color-coded qualification indicators

Executive Dashboard

Executive Dashboard

  • Qualification Overview: Pie chart showing qualified vs disqualified models
  • Top Performers: Top 5 models by acceptance rate
  • Cost Efficiency Analysis: Acceptance rate divided by cost, shown as negative values for disqualified models
  • Key Metrics Summary: Total models, qualification rates, averages, and top performers
  • Adaptive Labeling: Smart label reduction for crowded charts
  • Professional Sizing: 20×14 inches optimized for presentations

Performance Heatmap

Performance Heatmap

  • Multi-Dimensional View: Success rate, acceptance rate, cost, challenges, efficiency
  • Dual Visualization: Raw values and normalized scores (0-1 scale)
  • Acceptance Rate Sorting: Models ordered by performance for easy pattern recognition
  • Color Mapping: Red-Yellow-Green scale for intuitive interpretation
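
The normalized-score view can be approximated with a simple min-max rescaling; this helper is an assumption about the approach rather than the project's own code.

import numpy as np

def normalize(column):
    # Rescale one metric column to the 0-1 range used by the normalized heatmap view.
    col = np.asarray(column, dtype=float)
    rng = col.max() - col.min()
    return (col - col.min()) / rng if rng else np.zeros_like(col)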

📊 Key Metrics & Terminology

Performance Metrics

  • Acceptance Rate: Percentage of challenges that meet the success threshold (primary metric)
  • Success Rate: Overall percentage of correct solution attempts across all challenges
  • Cost Efficiency: Acceptance rate divided by average cost per challenge
  • Qualification Status: Models with ≥50% acceptance rate after sufficient testing

Cost Analysis

  • Average Cost per Challenge: Total cost divided by number of challenges
  • Token Efficiency: Tokens per second during challenge execution
  • Cost-Performance Ratio: Balances model capability with economic efficiency

Challenge Classification

  • Accepted Challenges: Meet or exceed the success threshold (default: 50%)
  • Rejected Challenges: Fall below the success threshold or timeout
  • Validation Attempts: Number of solution attempts per challenge (default: 3)
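
As a quick worked example of the definitions above (all numbers invented):

accepted_challenges, total_challenges = 8, 10
correct_attempts, total_attempts = 19, 30
total_cost = 0.125                                             # $ across all challenges

acceptance_rate = accepted_challenges / total_challenges       # 0.80 -> 80% (primary metric)
success_rate = correct_attempts / total_attempts               # ~0.63 across all attempts
avg_cost_per_challenge = total_cost / total_challenges         # $0.0125
cost_efficiency = (acceptance_rate * 100) / avg_cost_per_challenge   # acceptance rate / cost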

🚀 Quick Start Guide

Running Individual Challenges

The system supports running individual challenges with specific models and configurable parameters including temperature, validation attempts, and success thresholds.

Automated Fleet Testing

The system provides automated testing across qualified models with configurable parameters including:

  • Number of runs and delays between runs
  • Disqualification thresholds
  • Option to include expensive models
  • Verbose logging capabilities

Analytics Generation

The system provides comprehensive analytics and visualization capabilities including:

  • Statistical analysis with markdown report generation
  • Sortable metrics by various criteria (acceptance rate, cost, etc.)
  • Professional visualizations in multiple formats (PNG, SVG, PDF)
  • Executive dashboards and performance heatmaps

๐Ÿ“ Directory Structure

/volumes/Serendipity/ValyrianGames/CodingChallenge/
├── README.md                           # This comprehensive guide
├── model_cost_summary.json             # Structured performance data
├── qualification_results.md            # Detailed analysis report
├── valyrian_games_cost_performance.png # Cost vs performance chart
├── valyrian_games_model_comparison.png # Model comparison bars
├── valyrian_games_dashboard.png        # Executive dashboard
├── valyrian_games_heatmap.png          # Performance heatmap
├── OpenAI:gpt-4.1-2025-04-14/          # Model-specific results
│   ├── accepted/                       # Successful challenges
│   │   ├── conversation_001.json       # Challenge result with metrics
│   │   └── conversation_002.json
│   └── rejected/                       # Failed challenges
│       ├── conversation_003.json       # Failed challenge with reason
│       └── conversation_004.json
├── Anthropic:claude-3-5-sonnet-20241022/
│   ├── accepted/
│   └── rejected/
└── [Additional model directories...]

🎮 Challenge Result Format

Each challenge result is stored as a comprehensive JSON file containing:

{
  "conversation_id": "unique_identifier",
  "timestamp": "2025-01-29T12:00:00",
  "status": "ACCEPTED|REJECTED",
  "parameters": {
    "validation_attempts": 3,
    "success_threshold": 0.5,
    "agent": "Contender"
  },
  "challenge": {
    "challenge_prompt": "Create a function that...",
    "example_code": "def solution():",
    "expected_answer": 42
  },
  "validation_results": {
    "total_attempts": 3,
    "correct_answers": 2,
    "success_rate": 0.67,
    "accepted": true
  },
  "performance_metrics": {
    "model_name": "OpenAI:gpt-4.1-2025-04-14",
    "temperature": 0.7,
    "total_completion_tokens": 1250,
    "total_cost": 0.0125,
    "total_elapsed_time": 45.2,
    "tokens_per_second": 27.6
  },
  "solution_attempts": [
    {
      "filename": "challenge_candidate_solution_1.json",
      "answer": 42,
      "python_code": "def solution(): return 42",
      "is_correct": true
    }
  ]
}
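
Because results are plain JSON, they can be read with the standard json module; the helper below is only a reading example, with the path taken from the directory layout shown earlier.

import json

def load_result(path):
    # Load one stored challenge result and print a one-line summary of it.
    with open(path) as f:
        result = json.load(f)
    metrics = result["performance_metrics"]
    print(result["status"],
          metrics["model_name"],
          f'${metrics["total_cost"]:.4f}',
          f'{result["validation_results"]["success_rate"]:.0%}')
    return result

# load_result("OpenAI:gpt-4.1-2025-04-14/accepted/conversation_001.json")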

๐Ÿ† Qualification System

Qualification Criteria

Models are automatically qualified based on:

  1. Minimum Challenges: At least 1 completed challenge
  2. Acceptance Threshold: ≥50% acceptance rate
  3. Statistical Significance: Performance maintained over multiple challenges

Disqualification Rules

Models are disqualified through a two-tier system:

  1. Early Disqualification: ≥3 rejected challenges with 0 accepted
  2. Statistical Disqualification: <50% acceptance rate after 10+ challenges
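
Expressed as code, the two rules amount to the following check (a sketch using the documented thresholds):

def is_disqualified(accepted: int, rejected: int) -> bool:
    total = accepted + rejected
    # Early disqualification: >= 3 rejected challenges with 0 accepted
    if rejected >= 3 and accepted == 0:
        return True
    # Statistical disqualification: < 50% acceptance rate after 10+ challenges
    if total >= 10 and accepted / total < 0.5:
        return True
    return False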

Re-qualification

Disqualified models can re-qualify by:

  • Achieving successful challenge completions
  • Improving acceptance rate above 50%
  • Demonstrating consistent performance over time

💡 Advanced Features

Cost-Aware Selection

The system intelligently balances:

  • Model Performance: Prioritizes higher-performing models
  • Cost Efficiency: Favors economical models for budget optimization
  • Data Balance: Ensures comprehensive testing across all models
  • Quality Control: Automatically removes consistently poor performers

Docker Integration

  • Clean Execution Environment: Containers restart before each challenge
  • Isolation: Prevents cross-contamination between challenges
  • Reliability: Ensures consistent execution conditions
  • Scalability: Supports concurrent challenge execution
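
One way to obtain a clean container before each challenge is a plain docker restart issued via subprocess; the container name below is hypothetical and the real orchestration code may differ.

import subprocess

def restart_container(name: str = "valyrian-sandbox") -> None:
    # Restart the execution container so each challenge starts from a clean state.
    subprocess.run(["docker", "restart", name], check=True)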

Automated Workflows

  • End-to-End Automation: From challenge generation to visualization
  • Failure Handling: Graceful timeout and error management
  • Progress Tracking: Real-time status updates and logging
  • Integration: Seamless data flow between all components

🔧 Configuration Options

Challenge Parameters

  • --validation-attempts: Number of solution attempts (1-10)
  • --success-threshold: Minimum success rate (0.0-1.0)
  • --temperature: Model creativity parameter (0.0-2.0)
  • --solution-timeout: Maximum time per solution (seconds)
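
These flags map naturally onto an argparse definition. The sketch below uses the defaults quoted in this README where available; the timeout default and the surrounding script are illustrative.

import argparse

parser = argparse.ArgumentParser(description="Run a single coding challenge")
parser.add_argument("--validation-attempts", type=int, default=3,
                    help="Number of solution attempts (1-10)")
parser.add_argument("--success-threshold", type=float, default=0.5,
                    help="Minimum success rate (0.0-1.0)")
parser.add_argument("--temperature", type=float, default=0.7,
                    help="Model creativity parameter (0.0-2.0)")
parser.add_argument("--solution-timeout", type=int, default=120,
                    help="Maximum time per solution in seconds (default illustrative)")
args = parser.parse_args()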

Selection Parameters

  • --disqualification-threshold: Rejection limit before disqualification
  • --include-expensive: Include high-cost models in selection
  • --category: Test specific model categories only
  • --use-static-pool: Use hardcoded model list instead of qualified models

Output Parameters

  • --save-markdown: Generate detailed markdown reports
  • --sort-by: Sort results by specific metrics
  • --verbose: Enable detailed logging and progress updates
  • --format: Chart output format (png, svg, pdf)

🎯 Use Cases

AI Research & Development

  • Model Comparison: Objective performance benchmarking
  • Cost Analysis: Budget optimization for AI deployments
  • Capability Assessment: Understanding model strengths and limitations
  • Trend Analysis: Tracking performance improvements over time

Enterprise AI Strategy

  • Vendor Selection: Data-driven model provider decisions
  • Budget Planning: Cost forecasting for AI initiatives
  • Performance Monitoring: Ongoing model evaluation
  • Risk Assessment: Identifying reliable vs unreliable models

Academic Research

  • Benchmarking Studies: Standardized model evaluation
  • Performance Analysis: Statistical model comparison
  • Cost-Benefit Research: Economic efficiency studies
  • Longitudinal Studies: Model evolution tracking

🔧 System Requirements

  • Python 3.8+ with required dependencies
  • Docker for containerized execution environment
  • Sufficient Storage for challenge results and visualizations
  • API Access to supported LLM providers

Dependencies

pip install matplotlib seaborn pandas numpy requests

๐Ÿ Conclusion

The Valyrian Games represents the pinnacle of AI model benchmarking, providing unprecedented insights into LLM performance, cost efficiency, and reliability. Through rigorous testing, comprehensive analytics, and professional visualizations, this platform empowers organizations to make informed decisions about AI model selection and deployment.

Whether you're conducting academic research, optimizing enterprise AI costs, or simply curious about the latest AI capabilities, the Valyrian Games provides the tools and insights needed to navigate the rapidly evolving landscape of artificial intelligence.

Welcome to the Olympics of AI – may the best models win! 🏆


Generated by the Valyrian Games Analytics System
Last Updated: 2025-01-29
