GDPVal RealWorks

Benchmark LLMs on real expert work — not academic toy problems.
A YAML-driven experiment pipeline + live dashboard for the GDPVal Gold Subset (220 tasks).


🌐 Live Dashboard · 🇰🇷 한국어 (Korean) · 📖 Batch Runner Docs · 📄 Paper


📊 Live Dashboard → https://hyeonsangjeon.github.io/gdpval-realworks/

Leaderboard · Trends · Execution Errors · Grading Analysis — all in one place.


The Problem

Most LLM benchmarks test academic reasoning — math, code puzzles, trivia.
None of that tells you whether a model can actually do your job.

GDPVal (GDP-level Validation) is different: 220 real-world expert tasks across 11 sectors and 55 occupations — Excel reports, legal docs, sales decks, the stuff people actually get paid for.

This repo automates the entire loop: configure → run → collect → visualize — driven by a single YAML file, executed on GitHub Actions, results on a live dashboard.

🎯 One YAML file. One button click. Full experiment lifecycle.

Leaderboard — experiment rankings, KPI cards, sector heatmap

Live Dashboard — leaderboard, success rates, QA scores across experiments

Task Detail — real-world task description, reference files, and generated deliverables


How It Works

Preparation → Execution
        ↓
Delivery → Report & Upload
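The four phases above can be sketched as a minimal Python driver. The function names and data shapes here are hypothetical stand-ins for the repo's step0–step7 shell scripts, just to show how the stages chain together:

```python
# Minimal sketch of the four-phase pipeline; each function is a
# hypothetical stand-in for one of the repo's step scripts.
def prepare(config: dict) -> list:
    """Filter the 220-task gold subset down to the configured sample."""
    return [f"task_{i}" for i in range(config.get("sample_size") or 220)]

def execute(tasks: list) -> dict:
    """Run inference for each task (stubbed here)."""
    return {t: f"deliverable for {t}" for t in tasks}

def deliver(results: dict) -> dict:
    """Format results into a submission record."""
    return {"n_tasks": len(results), "status": "ready"}

def report(submission: dict) -> str:
    """Summarize and (in the real pipeline) upload to HuggingFace."""
    return f"{submission['n_tasks']} tasks, status={submission['status']}"

summary = report(deliver(execute(prepare({"sample_size": 3}))))
print(summary)  # 3 tasks, status=ready
```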

⚡ Quick Start

1. Fork & Clone

git clone https://github.com/hyeonsangjeon/gdpval-realworks.git
cd gdpval-realworks

2. Configure GitHub Repository Settings

🔑 Secrets

Go to Settings → Secrets and variables → Actions → New repository secret and add the secrets you need:

| Secret Name | Value | Required? |
|---|---|---|
| AZURE_OPENAI_API_KEY | Azure OpenAI API key | ✅ If using Azure |
| AZURE_OPENAI_ENDPOINT | https://your-resource.openai.azure.com/ | ✅ If using Azure |
| OPENAI_API_KEY | OpenAI API key | If using OpenAI |
| ANTHROPIC_API_KEY | Anthropic API key | If using Anthropic |
| HF_TOKEN | HuggingFace write token (get one here) | ✅ For upload |

💡 You don't need all of them — just the provider you'll actually use.
For Azure users: AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT + HF_TOKEN is the minimum.
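Before a run, it can save a failed workflow to sanity-check which provider's credentials are actually set. The script below is not part of the repo — it just mirrors the minimum-set logic from the table above:

```python
import os

# Hypothetical pre-flight check: a provider is "ready" only when all
# of its required environment variables are non-empty.
PROVIDER_VARS = {
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}

def ready_providers(env: dict) -> list:
    """Return providers whose required variables are all set."""
    return [p for p, keys in PROVIDER_VARS.items()
            if all(env.get(k) for k in keys)]

# Example with fake values; in practice pass dict(os.environ).
env = {
    "AZURE_OPENAI_API_KEY": "xxx",
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com/",
    "HF_TOKEN": "hf_xxx",
}
providers = ready_providers(env)
can_upload = bool(env.get("HF_TOKEN"))
print(providers, can_upload)  # ['azure'] True
```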

📄 GitHub Pages

Settings → Pages → Source → change to "GitHub Actions" (not "Deploy from a branch")

🔓 Workflow Permissions

Settings → Actions → General → Workflow permissions:

  • ✅ Select "Read and write permissions"
  • ✅ Check "Allow GitHub Actions to create and approve pull requests"
  • Save

🧹 Auto-cleanup (recommended)

Settings → General → ✅ Check "Automatically delete head branches"

This cleans up experiment branches automatically after PR merge.


3. Run Your First Experiment

  1. Go to Actions tab → "Run GDPVal Batch Experiment"
  2. Click "Run workflow"
  3. Fill in:
    • experiment_yaml: exp998_smoke_baseline_sample (smoke test, 3 tasks)
    • dry_run: ✅ checked (first time — skip upload)
  4. Click Run workflow 🚀
✅ Step 0: Bootstrap        → HF repo ready
✅ Step 1: Prepare tasks    → 3 tasks filtered
✅ Step 2: Run inference    → LLM called for each task
✅ Step 3: Format results   → JSON + Markdown generated
✅ Step 4: Fill parquet     → Submission parquet ready
⏭️ Step 5: Validate        → Skipped (smoke test)
⏭️ Step 6: Upload          → Skipped (dry run)

🎉 If this passes, uncheck dry_run and run a full experiment!


πŸ“ Write Your Own Experiment

Create a YAML file in batch-runner/experiments/:

experiment:
  id: "exp001_GPT52Chat_baseline"
  name: "GPT-5.2 Chat Baseline (Full 220 tasks)"
  description: "Full baseline run with code_interpreter and Self-QA."

data:
  source: "HyeonSang/exp001_GPT52Chat_baseline"
  filter:
    sector: null          # null = all sectors
    sample_size: null     # null = all 220 tasks

condition_a:
  name: "Baseline"
  model:
    provider: "azure"
    deployment: "gpt-5.2-chat"
    temperature: 0.0
    seed: 42
  prompt:
    system: "You are a helpful assistant that completes professional tasks."
    suffix: "Generate actual files, not descriptions."
  qa:
    enabled: true
    min_score: 6
    max_retries: 3

# condition_b:            ← Add for A/B comparison (optional)

execution:
  mode: "code_interpreter"
  max_retries: 5
  resume_max_rounds: 3

Then trigger it from Actions → Run workflow with experiment_yaml: exp001_GPT52Chat_baseline.
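The `null = all` filter convention in the `data.filter` block can be illustrated with plain Python. The task records below are fabricated, and this helper is an illustration of the semantics, not the repo's actual filtering code:

```python
# Illustrative filter: sector=None and sample_size=None select everything,
# mirroring the YAML's "null = all" convention. Task data is made up.
tasks = [
    {"id": 1, "sector": "Finance"},
    {"id": 2, "sector": "Legal"},
    {"id": 3, "sector": "Finance"},
]

def apply_filter(tasks, sector=None, sample_size=None):
    """Select tasks by sector, then cap at sample_size; None disables a filter."""
    selected = [t for t in tasks if sector is None or t["sector"] == sector]
    return selected if sample_size is None else selected[:sample_size]

print(len(apply_filter(tasks)))                    # 3  (null/null = all tasks)
print(len(apply_filter(tasks, sector="Finance")))  # 2
print(len(apply_filter(tasks, sample_size=1)))     # 1
```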


🧠 Execution Modes

| Mode | How It Works | Best For |
|---|---|---|
| code_interpreter | LLM writes + runs code inside Azure/OpenAI's secure sandbox. Files generated in the cloud. | ✅ Production — safe, powerful |
| subprocess | LLM generates code → executed locally in an isolated temp directory. | Non-OpenAI models (Anthropic, etc.) |
| json_renderer | LLM outputs a JSON spec → a fixed renderer creates files. Same renderer for all models. | Fair A/B comparison across models |

🐳 subprocess mode is planned to evolve into a container-based execution mode — if time permits and coffee supply holds.
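To make the json_renderer idea concrete, here is a toy renderer. The spec schema (`{"files": [{"path", "content"}]}`) is an assumption for illustration, not the repo's actual format — the point is that the model only emits data, and one fixed piece of code writes the files for every model:

```python
import json
import pathlib
import tempfile

def render(spec_json: str, out_dir: str) -> list:
    """Write every file described in the JSON spec; the model never runs code."""
    spec = json.loads(spec_json)
    written = []
    for f in spec["files"]:
        path = pathlib.Path(out_dir) / f["path"]
        path.parent.mkdir(parents=True, exist_ok=True)  # allow nested paths
        path.write_text(f["content"])
        written.append(f["path"])
    return written

# Fabricated model output: a JSON spec instead of executable code.
spec = json.dumps({"files": [{"path": "report.md", "content": "# Q3 Report"}]})
with tempfile.TemporaryDirectory() as d:
    names = render(spec, d)
print(names)  # ['report.md']
```

Because the renderer is identical across models, differences in output quality come from the spec alone, which is what makes this mode suited for A/B comparison.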


🔬 Self-QA: Built-in Quality Reflection Gate

Before acceptance, the same LLM working on the task inspects its own output: Self-QA scores each output on a 0-10 scale using rubric-based self-evaluation. If the score is below the configured threshold (default: 6), it enters a reflection loop and retries.

Self-QA Flow

Self-QA checks: Are all requirements met? Are files actually produced? Is the output professional?
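The gate amounts to a score-and-retry loop. A minimal sketch with stubbed functions follows — `generate` and `score` are placeholders for the real LLM calls, and the loop shape is an assumption based on the description above, not the repo's implementation:

```python
def self_qa_loop(generate, score, min_score=6, max_retries=3):
    """Retry generation until the self-assigned score clears the gate."""
    output = generate(None)  # first attempt, no feedback yet
    for attempt in range(max_retries + 1):
        s = score(output)
        if s >= min_score:
            return output, s, attempt
        if attempt < max_retries:  # reflect and retry with feedback
            output = generate(f"score {s} < {min_score}; revise")
    return output, s, attempt  # gate never cleared: return best effort

# Stub: quality improves on each reflection round (fabricated scores).
drafts = iter([("draft v1", 4), ("draft v2", 5), ("draft v3", 8)])
scores = {}
def generate(feedback):
    text, q = next(drafts)
    scores[text] = q
    return text
def score(output):
    return scores[output]

out, s, attempts = self_qa_loop(generate, score)
print(out, s, attempts)  # draft v3 8 2
```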


πŸ—οΈ Architecture

Architecture


🔄 GitHub Actions Workflows

batch-run.yml — Run Experiments

| Feature | Detail |
|---|---|
| Trigger | Manual (workflow_dispatch) from Actions tab |
| Input | Experiment YAML filename + optional dry_run flag |
| Pipeline | Step 0 → Step 7 (bootstrap → upload) |
| Smart skips | Smoke tests skip validation; dry_run skips upload + PR |
| Auto PR | Creates a Pull Request with experiment summary |
| Artifacts | Full workspace uploaded for 30 days |
| Timeout | 5 hours max |

deploy.yml — Deploy Dashboard

| Feature | Detail |
|---|---|
| Trigger | Push to main (auto) or manual |
| Build | Aggregate test/grade data → React build → GitHub Pages |
| Scope | Only runs when data/, src/, or scripts/ change |

🖥️ Dashboard

→ Live Dashboard

Interactive experiment analytics — leaderboard, sector heatmaps, error analysis, prompt architecture viewer.

| Feature | Description |
|---|---|
| Leaderboard | Ranked experiments with strategy, success rate, QA scores |
| Sector Heatmap | 9 sectors × N experiments success rate matrix |
| Trends | Success rate / QA / latency trend lines across experiments |
| Execution Errors | Error distribution, recovery funnel, CONFIDENCE NameError tracking |
| Prompt Viewer | See exactly what prompt was sent to the LLM — system, user, QA, config |
| Grading | External evaluation scores (OpenAI Evals) |
| Experiment Detail | Drill into 220 tasks — filter by sector, status, search |

Built with React 18 + TypeScript + Vite + Tailwind + Recharts + Framer Motion.
Deployed automatically to GitHub Pages on every push to main.

📖 Dashboard Documentation → · 🇰🇷 한국어 (Korean) →


🧪 Testing

cd batch-runner
pip install -r requirements.txt

# Unit tests only (no API keys needed)
pytest

# Integration tests (requires real credentials)
pytest -m integration

# With coverage
pytest --cov=core --cov-report=html

🖥️ Run Locally (step by step)

cd batch-runner
export HF_TOKEN="hf_xxx"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_KEY="xxx"

./step0_bootstrap.sh experiments/exp998_smoke_baseline_sample.yaml
./step1_prepare_tasks.sh experiments/exp998_smoke_baseline_sample.yaml
./step2_run_inference.sh condition_a
./step3_format_results.sh
./step4_fill_parquet.sh
./step5_validate.sh
./step6_report.sh
./step7_upload_hf.sh --test

💡 Local execution works, but for full 220-task runs we recommend GitHub Actions.
The batch workflow parallelizes as fast as your TPM (Tokens Per Minute) quota allows — let the cloud do the heavy lifting while you grab a coffee. ☕



👤 Author

Hyeonsang Jeon
Sr. Solution Engineer · Global Black Belt — AI Apps | Microsoft Asia, Korea
GitHub · Dashboard


📄 License

MIT — See LICENSE for details.
