GitHub - andytolle/benchmark-itw-v1-plan: A proposed LLM benchmark that aims to close the gap between 'academic scores' and 'in the wild'

Problem Statement: The Disconnect in LLM Evaluation

Current Large Language Model (LLM) evaluation methods are failing businesses by creating a dangerous disconnect between benchmark scores and real-world performance. From talks with companies, I get the impression this gap seems to be driven by four critical domains, of which users and developpers alike underestimate the impact on LLMs performances In The Wild:

The Unquantified Impact of Quantization: The effect of model compression on core reasoning abilities is not being measured, leaving businesses unsure about the performance of smaller, more efficient models.
Erosion of Synthesis and Accuracy: As context windows fill with progressively growing irrelevant information ("noise"), a model's ability to synthesize information and maintain accuracy degrades. It's a common scenario in real-world document analysis, but the context size where the model starts to degrade is often unknown.
Inability to Generate Structured Output: A model can be smart, but if it fails to reliably produce the structured data formats, then this intelligence will be lost in the business processes we're trying to automate.
Data Contamination in Public Benchmarks: Widespread data contamination in public benchmarks renders many scores untrustworthy and not indicative of true performance.

Collectively, these blind spots mean organizations cannot accurately predict a model's operational effectiveness. This leads to the selection of inadequate models and the failure of high-stakes automation projects.

Proposed Solution: ITW (In The Wild) v1 Benchmark

While most benchmarks tell you if a model is "academically smart" in a lab, they don't predict its success on solving your real world business process. To bridge this critical gap, we propose ITW (In The Wild) v1, a benchmark designed to simulate the actual challenges of business automation.

The fundamental goal of ITW is to shift the evaluation focus from "How intelligent is this model?" to "To what extent can this model's intelligence redefine what's possible for aiding in business processes and unlocking new opportunities?" "How effective can I use this model's intelligence to automate a practical bunsiness process?" It achieves this by providing clear, actionable diagnostics on how a model will behave when deployed in real word tasks, focusing on three key areas:

Measures the True Cost of Quantization: ITW is built to evaluate local quantized versions of models against their full-precision counterparts. This allows users to finally see the true performance cost of quantization on their own machines, balancing the delicate blend of speed, cost, and task-solving ability.
Tests Performance in Progressively Growing Noisy Contexts: The benchmark specifically tests a model's ability to synthesize scattered information from large, noisy documents. It assesses the model's capacity to connect the dots like a real analyst would, rather than just finding a single, isolated fact. Measuring the practical breaking points of where the size of context starts to water the model's output.
Introduces a Multi-Vector Score for Automation Readiness: For any given task, ITW provides a Multi-Vector Score. This innovative approach doesn't just give you one number. It separately reports on:
- Reasoning Score: Did the model get the answer right?
- Formatting Score: Did it deliver that answer in the perfectly structured format you need for automation?

With ITW, you don't just know which model is the smartest. You know which model is the most reliable, efficient, and ready for your business processes, empowering anyone running an LLM to make an informed decision.

Potential Benchmark Components:

The "Local Intelligence" Module (Quantization and Core Capabilities):
- Task: A series of complex reasoning and instruction-following tasks that are sensitive to the precision of the model's weights. This could include logic puzzles, multi-step calculations, and nuanced text comprehension.
- Evaluation: The benchmark would run these tasks on both the full and quantized versions of a model, providing a clear comparison of the performance trade-offs. The evaluation would go beyond simple accuracy to measure the model's ability to maintain a coherent line of reasoning.
The "Workflow Simulation" Module Design
- Define a Realistic Business Scenario: Start with a central task that requires synthesis. For example: "Based on the provided project documentation, identify the top 3 risks to the project timeline and provide the name of the person responsible for mitigating each."
- Construct the Context Document Set: Create a large context by concatenating several simulated document snippets.
  - Scattered Information: The necessary information (e.g., a risk about supply chain delays, a risk about team burnout, a risk about budget cuts, and the names of the associated managers) is not in one place. It's scattered across simulated emails, meeting minutes, and project charters.
  - Plausible Noise: The context is filled with relevant-sounding but ultimately irrelevant information (e.g., details from a different project, outdated meeting notes, general company news). This tests noise filtering.
  - Positional Testing (Core of "Needle in the Haystack"): Strategically place the critical pieces of information.
    - Place one key fact near the very beginning of the context.
    - Place another key fact deep in the middle.
    - Place a third key fact near the very end. This explicitly tests if the model's attention is consistent across the entire context window.
- Implement a Progressive Difficulty Scale:
  - Level 1 (Easy): Context of 4k tokens. The 3 key risks and names are present, with minimal noise.
  - Level 2 (Medium): Context of 16k tokens. The same 3 risks are present, but now they are split into more fragments and padded with significantly more irrelevant-but-plausible noise paragraphs.
  - Level 3 (Hard): Context of 100k+ tokens (or max for the model). The information is further scattered, and the noise is more subtle and harder to distinguish from the signal.
- Evaluation Metrics:
  - Synthesis Accuracy: A score based on how many of the correct risks were identified and correctly paired with the responsible person (e.g., a score from 0 to 3).
  - Noise Intrusion: A penalty if the model includes irrelevant information in its answer.
  - Positional Score: Note which pieces of information (beginning, middle, end) were successfully retrieved to identify models prone to the "lost in the middle" problem.
  - Structured Output Compliance: As a final step, require the entire answer to be formatted into a specific JSON structure, testing this capability under cognitive load.
The "Automation" Module (Structured Output Reliability):
- Tier 1: "The Formatting Gauntlet" a comprehensive gauntlet that profiles a model's formatting capabilities in their entirety.
  - Task 1: Extraction & Population (Simple & Complex): extract data to populate simple and complex schemas. This tests the basics.
  - Task 2: Resilience & Repair: Give it broken JSON and see if it can fix it. This tests robustness.
  - Task 3: Schema Generation: Give the model a natural language description of data and ask it to create the optimal JSON schema.
    - Example: "I need to store information about users, including their name, a list of their past orders with order ID and date, and whether they are a premium member. Generate a JSON schema for this."
    - Why? This tests a higher-level business skill. Can the model design the data structures, not just fill them?
  - Task 4: Schema Interpretation: Give the model a complex JSON schema and ask it to generate a diverse set of valid examples.
    - Why? This tests if the model truly understands the constraints (required fields, data types, nesting rules) to the point where it can create mock data, a crucial skill for development and testing.
The "Unseen World" Module (Data Contamination Resistance):
- Task: This module would utilize a "retro-holdout" approach, where new test data is generated that is statistically similar to existing public benchmarks but could not have been part of the training data. This would provide a more accurate measure of a model's true generalization capabilities

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages